OCR using Tesseract on multipage PDFs

Tesseract is a cracking piece of code for doing OCR. Taking the sources below as inspiration, the following script takes a PDF that is x pages long and turns it into x text files, one per page. These can then be combined into a single file after some cleansing.

#!/bin/sh
PAGES=4                  # set to the number of pages in the PDF
SOURCE=pamphlet-low.pdf  # set to the file name of the PDF
OUTPUT=pamphlet-low      # set to the prefix of the per-page text files (tesseract appends .txt)
RESOLUTION=300           # set to the resolution the scanner used, in DPI (the higher, the better)

# The page count can be read from the PDF instead of being set by hand:
#xpdf-pdfinfo "$SOURCE" | grep Pages: | awk '{print $2}' | tail -n 1

for i in $(seq 1 "$PAGES"); do
    # ImageMagick counts pages from 0, so page $i is index $i - 1
    convert -density "$RESOLUTION" -depth 8 "${SOURCE}[$((i - 1))]" "page$i.png"
    # tesseract writes the recognised text to ${OUTPUT}${i}.txt
    tesseract "page$i.png" "${OUTPUT}${i}"
done
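
Once the loop has finished, the per-page text files can be joined into one. What follows is only a minimal sketch: it assumes the files are named the way tesseract names them in the script (pamphlet-low1.txt, pamphlet-low2.txt and so on), the function name combine_pages is made up for illustration, and the cleansing step is left out because it depends on the document.

```shell
#!/bin/sh
# combine_pages PREFIX PAGES OUTFILE
# Hypothetical helper: joins PREFIX1.txt .. PREFIXn.txt, in page order, into OUTFILE.
combine_pages() {
    prefix=$1
    pages=$2
    outfile=$3

    : > "$outfile"    # create the combined file, or empty it if it already exists
    i=1
    while [ "$i" -le "$pages" ]; do
        if [ -f "${prefix}${i}.txt" ]; then
            cat "${prefix}${i}.txt" >> "$outfile"
        else
            # a page that failed to convert is reported, not silently dropped
            echo "warning: ${prefix}${i}.txt is missing, skipping" >&2
        fi
        i=$((i + 1))
    done
}
```

For the four-page example this would be called as: combine_pages pamphlet-low 4 pamphlet-low.txt. Counting through the pages rather than using a wildcard such as pamphlet-low*.txt keeps page 10 from being sorted ahead of page 2.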