Show word occurrence in PDFs on the command line

Do you want to know how often each word occurs in a PDF file, with the most frequent words listed first? This one-liner does it:

pdftotext mypdf.pdf - | sed "s/[[:cntrl:][:digit:][:punct:]]//g" | tr '[:space:]' '[\n*]' | sort | uniq -c | sort -bnr

Let’s break it down step by step:

pdftotext mypdf.pdf -

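extracts the text from the PDF and writes it to standard output (the trailing - takes the place of an output filename), so it can be piped into the next command.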


sed "s/[[:cntrl:][:digit:][:punct:]]//g"

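replaces all control characters (cntrl), all digits (digit) and all punctuation characters (punct) with an empty string. Newlines survive this step because sed processes the text line by line.
The bracketed names are POSIX character classes; see their documentation for the full list.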

tr '[:space:]' '[\n*]'

replaces every whitespace character (spaces, tabs, newlines) with a newline, so that each word ends up on its own line.


sort | uniq -c | sort -bnr

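The last part sorts the words so that identical ones are adjacent, collapses each group into a single line prefixed with its count (uniq -c), and finally sorts the result again:
ignoring leading blanks (-b), comparing numerically (-n), in reverse order (-r), so the most frequent word comes first.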
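
The one-liner counts "The" and "the" as different words and also counts the empty lines left behind by removed digits and repeated spaces. A minimal sketch of a variant that avoids both, assuming that is what you want: the function name count_words, the lowercase step and the grep filter are additions, not part of the original command.

```shell
# count_words: read plain text on stdin and print "count word" lines,
# most frequent word first. Hypothetical helper, not from the original post.
count_words() {
    sed 's/[[:cntrl:][:digit:][:punct:]]//g' |  # drop control chars, digits, punctuation
        tr '[:upper:]' '[:lower:]' |             # addition: fold case so "The" == "the"
        tr '[:space:]' '[\n*]' |                 # one word per line
        grep -v '^$' |                           # addition: drop empty lines
        sort | uniq -c | sort -bnr               # count and rank
}
```

It can be used like the original pipeline, for example: pdftotext mypdf.pdf - | count_words | head -n 20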

PDF front page to JPG

First find all the files you want to convert and store their filenames in a file

find my-pdf-folder/ -type f -name "*.pdf" > pdffiles.txt

Iterate over those files and create a JPEG from the first page of each

while IFS= read -r LINE; do pdftoppm -f 1 -singlefile -jpeg "$LINE" "$LINE"; done < pdffiles.txt
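
Because the PDF path is also passed as the output root, pdftoppm names the image after the full filename, so report.pdf yields report.pdf.jpg. If you would rather get report.jpg, the following sketch strips the .pdf suffix first. The helper names jpg_root and convert_first_pages are made up for this example, and it assumes all listed files end in a lowercase .pdf.

```shell
# jpg_root: hypothetical helper that strips a trailing ".pdf" from a path,
# giving the output root that pdftoppm should append ".jpg" to.
jpg_root() {
    printf '%s\n' "${1%.pdf}"
}

# convert_first_pages: read PDF paths from stdin, one per line, and
# render page 1 of each as <path without .pdf>.jpg
convert_first_pages() {
    while IFS= read -r pdf; do
        pdftoppm -f 1 -singlefile -jpeg "$pdf" "$(jpg_root "$pdf")"
    done
}
```

Usage with the file list from the previous step: convert_first_pages < pdffiles.txt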

Finally, move all JPGs to a separate location if necessary (the target directory must already exist)

find my-pdf-folder/ -type f -name "*.jpg" -exec mv {} jpgs/ \;
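
Two things can go wrong with the move: the target directory may not exist yet, and images with the same name from different subfolders would overwrite each other. A small sketch that guards against both, assuming the target directory lies outside the folder being searched. The function name collect_jpgs is hypothetical, and mv -n (do not overwrite) is available in GNU and BSD mv but is not required by POSIX.

```shell
# collect_jpgs SRC DEST: move every *.jpg below SRC into DEST.
# Hypothetical helper; DEST is assumed to be outside SRC.
collect_jpgs() {
    mkdir -p "$2"    # create the target directory if it is missing
    # -n keeps an existing file in DEST instead of overwriting it;
    # a skipped file simply stays where it was.
    find "$1" -type f -name '*.jpg' -exec mv -n {} "$2"/ \;
}
```

For the folders used above this would be: collect_jpgs my-pdf-folder jpgs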