Tesseract (OCR) and Pandoc
Tesseract (OCR)
An easy way to OCR a PDF:
convert -density 300 file.pdf -depth 8 file.tiff
tesseract file.tiff output
Where convert is an ImageMagick command that renders the PDF as a TIFF image, and tesseract is a well-known open-source OCR engine, long sponsored by Google. Thanks to this post.
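If you do this often, the two commands above can be wrapped in a small shell function. This is only a sketch: the names pdf_stem and ocr_pdf are made up for illustration, and it assumes convert and tesseract are both on your PATH.

```shell
# Strip the directory and a trailing .pdf extension: "docs/scan.pdf" -> "scan".
pdf_stem() {
    basename "$1" .pdf
}

# Hypothetical helper: OCR one PDF via an intermediate TIFF.
ocr_pdf() {
    pdf=$1
    stem=$(pdf_stem "$pdf")
    # Render the PDF at 300 DPI with 8 bits per channel, as in the commands above.
    convert -density 300 "$pdf" -depth 8 "$stem.tiff" || return 1
    # tesseract appends .txt to the output base name by itself.
    tesseract "$stem.tiff" "$stem"
}
```

Calling ocr_pdf scan.pdf should leave scan.tiff and scan.txt in the current directory.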
Troubleshooting
If the TIFF image has an alpha channel, remove it before running tesseract:
convert -density 300 source.pdf -depth 8 file.tiff
convert file.tiff -fill white -draw 'rectangle 10,10 20,20' -background white -flatten +matte output.tiff
convert file.tiff -fill white -draw 'rectangle 10,10 20,20' -background white +matte output.tiff
tesseract output.tiff output
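On newer ImageMagick versions, +matte is a deprecated spelling of -alpha off, and -alpha remove composites each image onto the background colour. As far as I can tell that also avoids a side effect of -flatten, which merges every page of a multi-page TIFF into one image. A sketch of the same fix with those options (flatten_alpha is a made-up name, and this assumes an ImageMagick recent enough to know -alpha remove):

```shell
# Hypothetical helper: put each page on a white background and drop the alpha channel.
flatten_alpha() {
    src=$1
    dst=$2
    # -alpha remove blends transparency into the background colour;
    # -alpha off then switches the alpha channel off in the output.
    convert "$src" -background white -alpha remove -alpha off "$dst"
}
```

Usage would be flatten_alpha file.tiff output.tiff, followed by tesseract output.tiff output as before.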
Pandoc
pandoc can convert between a wide range of document formats, for example from Markdown to PDF:
pandoc -s text.md -o output.pdf
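For PDF output pandoc hands the work to a separate engine, pdflatex by default, so a LaTeX installation has to be present. A small sketch that makes the engine explicit (md_to_pdf is a made-up helper name; --pdf-engine needs pandoc 2.0 or later):

```shell
# Hypothetical helper: convert notes.md to notes.pdf with a selectable PDF engine.
md_to_pdf() {
    src=$1
    engine=${2:-pdflatex}      # fall back to pandoc's default engine
    out="${src%.md}.pdf"       # text.md -> text.pdf
    pandoc -s "$src" --pdf-engine="$engine" -o "$out"
}
```

For example, md_to_pdf text.md xelatex would build text.pdf with XeLaTeX, which tends to cope better with Unicode text and system fonts.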
Can’t wait to have it on Termux, but unfortunately it’s not available there yet. In the meantime I use groff for text-to-PDF conversion.