OCR using Tesseract on multipage PDFs
Tesseract is a cracking piece of code for doing OCR. Taking the sources below as inspiration, the script that follows them takes a PDF that is x pages long and turns it into x text files, one per page. These can then be combined into a single file after some cleansing.
- http://ubuntuforums.org/showthread.php?t=880471
- http://fransdejonge.com/2012/04/ocr-text-in-pdf-with-tesseract/
- http://code.google.com/p/tesseract-ocr/wiki/ReadMe
- https://help.ubuntu.com/community/OCR
- http://elmargol.wordpress.com/2011/01/27/howto-scan-multiple-pages-to-a-pdf-file-and-ocr-using-tesseract-on-archlinux/

    #!/bin/sh
    PAGES=4 # set to the number of pages in the PDF
    SOURCE=pamphlet-low.pdf # set to the file name of the PDF
    OUTPUT=pamphlet-low # set to the final output file
    RESOLUTION=300 # set to the resolution the scanner used (the higher, the better)

    #xpdf-pdfinfo pamphlet-low.pdf | grep Pages: | awk '{print $2}' | tail -n 1
    #touch $OUTPUT
    for i in `seq 1 $PAGES`; do
        convert -density $RESOLUTION -depth 8 $SOURCE\[$(($i - 1 ))\] page$i.png
        # tesseract page$i.tif >> $OUTPUT
        tesseract page$i.png $OUTPUT$i
    done
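The script leaves two jobs manual: setting PAGES by hand and combining the per-page text files afterwards. The following is a rough sketch of how both might be automated, not a tested part of the original script. The function names page_count and combine_pages are made up for illustration, and the cleansing steps (dropping form feeds, trimming trailing whitespace, squeezing blank lines) are only assumptions about what "cleansing" could involve. It relies on tesseract appending .txt to the output base name, and on pdfinfo (from xpdf or poppler) printing a "Pages:" line.

```shell
#!/bin/sh

# Hypothetical helper: read pdfinfo output on stdin and print the page count.
# Assumes a line of the form "Pages:          4".
page_count() {
    awk '/^Pages:/ { print $2; exit }'
}

# Hypothetical helper: combine_pages BASE PAGES
# Concatenates BASE1.txt .. BASEn.txt into BASE.txt with light cleansing.
combine_pages() {
    base=$1
    pages=$2
    i=1
    # Emit the pages in order, then:
    #   tr  - drop form feed characters (sometimes used as page separators)
    #   sed - strip trailing whitespace from every line
    #   awk - squeeze runs of blank lines down to a single blank line
    while [ "$i" -le "$pages" ]; do
        cat "$base$i.txt"
        i=$((i + 1))
    done |
        tr -d '\f' |
        sed 's/[[:space:]]*$//' |
        awk 'NF { blank = 0; print; next } !blank++ { print }' > "$base.txt"
}
```

With these in place, the script could set PAGES with something like PAGES=$(pdfinfo "$SOURCE" | page_count) and, after the loop has finished, call combine_pages "$OUTPUT" "$PAGES" to end up with pamphlet-low.txt. Real OCR output will probably need more cleansing than this, depending on the scan.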