Tesseract (OCR) and Pandoc

Tesseract (OCR)

An easy way to OCR a PDF:

convert -density 300 file.pdf -depth 8 file.tiff 
tesseract file.tiff output

Here convert is the ImageMagick command that renders the PDF as a 300 DPI, 8-bit TIFF, and tesseract is the well-known open-source OCR engine long sponsored by Google; it writes the recognized text to output.txt. Thanks to this post.
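The two steps can be wrapped in a small shell function. This is only a sketch: the name ocr_pdf and the choice to derive the output name from the PDF name are my own assumptions, not part of either tool.

```shell
#!/bin/sh
# ocr_pdf: hypothetical helper wrapping the two commands above.
# Usage: ocr_pdf input.pdf [output-base]
ocr_pdf() {
    pdf=$1
    base=${2:-${pdf%.pdf}}   # default output base: input name without .pdf
    tiff="$base.tiff"

    [ -f "$pdf" ] || { echo "ocr_pdf: no such file: $pdf" >&2; return 1; }

    # Render the PDF at 300 DPI, 8 bits per channel.
    convert -density 300 "$pdf" -depth 8 "$tiff" || return 1
    # tesseract appends .txt to the output base by itself.
    tesseract "$tiff" "$base" || return 1
    # Print the name of the text file that was produced.
    echo "$base.txt"
}
```

Calling ocr_pdf file.pdf should leave file.tiff and file.txt next to the PDF and print the name of the text file.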

Troubleshooting

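If the TIFF image has an alpha channel, Tesseract may fail to read it. Remove the alpha channel before running OCR; the two middle commands below are alternatives (with and without -flatten), and either one should be enough. Note that -flatten merges all pages into a single image, so use the variant without it for multi-page PDFs: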

convert -density 300 source.pdf -depth 8 file.tiff
convert file.tiff -fill white -draw 'rectangle 10,10 20,20' -background white -flatten +matte output.tiff
convert file.tiff -fill white -draw 'rectangle 10,10 20,20' -background white +matte output.tiff
tesseract output.tiff output
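In current ImageMagick releases +matte is a deprecated spelling of -alpha off. A sketch of the same fix with the newer options follows; the function name flatten_alpha is hypothetical, and I am assuming an ImageMagick recent enough to support -alpha remove.

```shell
#!/bin/sh
# flatten_alpha: hypothetical helper, not part of ImageMagick.
# Usage: flatten_alpha input.tiff output.tiff
flatten_alpha() {
    # -alpha remove composites each page over the background colour,
    # -alpha off then drops the (now fully opaque) alpha channel.
    convert "$1" -background white -alpha remove -alpha off "$2"
}
```

After flatten_alpha file.tiff output.tiff, run tesseract output.tiff output as before.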

Pandoc

Pandoc converts between many document formats, for example from Markdown to PDF (PDF output needs a PDF engine such as LaTeX installed):

pandoc -s text.md -o output.pdf
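To convert several files at once, the same command can go in a loop. A minimal sketch, with md2pdf being a name I made up for illustration:

```shell
#!/bin/sh
# md2pdf: hypothetical wrapper around the pandoc command above.
# Usage: md2pdf notes.md other.md ...
md2pdf() {
    for src in "$@"; do
        out="${src%.md}.pdf"              # notes.md -> notes.pdf
        pandoc -s "$src" -o "$out" || return 1
        echo "$out"                       # report each PDF written
    done
}
```

For example, md2pdf *.md writes one PDF next to each Markdown file in the current directory.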

I can’t wait to have it on Termux, but unfortunately it is not available there yet. In the meantime I use groff for text-to-PDF conversion.
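For reference, a rough sketch of what the groff route can look like. It assumes a groff build that ships the pdf output device; the function name text2pdf is hypothetical, and the exact options I use may differ from setup to setup.

```shell
#!/bin/sh
# text2pdf: hypothetical helper; assumes groff has the pdf device (-Tpdf).
# Usage: text2pdf input.txt output.pdf
text2pdf() {
    # groff writes the PDF to standard output, so redirect it to the target file.
    groff -Tpdf "$1" > "$2"
}
```

If the pdf device is missing, piping groff -Tps output through ps2pdf is a common fallback.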