vrasneur/ocr

ocr.sh: a bash script to OCR PDF files easily

Author

Vincent Rasneur <vrasneur@free.fr>

Required programs

  • pdftk
  • ghostscript
  • imagemagick
  • tesseract
  • aspell (optional)
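
Before running the script, it can help to check that these programs are installed. The following is a minimal sketch and is not part of ocr.sh: the function name missing_programs is hypothetical, and the executable names gs (ghostscript) and convert (imagemagick) are assumptions about how those packages are installed on your system.

```shell
#!/usr/bin/env bash
# Print the name of every listed command that is not found on PATH.
# Hypothetical helper, not part of ocr.sh.
missing_programs() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Assumed executable names: ghostscript usually installs "gs" and
# imagemagick usually installs "convert" (adjust if yours differ).
missing=$(missing_programs pdftk gs convert tesseract)
if [ -n "$missing" ]; then
    printf 'Missing required programs:\n%s\n' "$missing" >&2
fi

# aspell is optional; it is only needed for spell-checking.
if [ -n "$(missing_programs aspell)" ]; then
    echo 'aspell not found: spell-checking will be unavailable.' >&2
fi
```

The check only reports what is missing; it does not stop or install anything.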

Remarks

By default, the script uses the French dictionaries of tesseract and aspell. Use the -t argument to change the tesseract dictionary and the -a argument to change the aspell dictionary.

By default, the script does not spell-check the output text. To enable spell-checking, add -s (or use the -a argument, which also enables it).

Usage

To OCR a PDF file

ocr.sh document.pdf

To OCR a PDF file and spell-check each page

ocr.sh -s document.pdf

To OCR an English PDF and spell-check it

ocr.sh -t eng -a en document.pdf
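
The commands above each process one file. To process several PDF files in one go, a small wrapper loop is one option. This is a sketch under assumptions, not a feature of ocr.sh: the function name ocr_all and the OCR_CMD variable are hypothetical, and the loop assumes ocr.sh is on your PATH and takes a single PDF file per invocation, as in the examples above.

```shell
#!/usr/bin/env bash
# Run the OCR command on every PDF file in a directory.
# OCR_CMD is a hypothetical variable; it defaults to "ocr.sh" on PATH.
OCR_CMD=${OCR_CMD:-ocr.sh}

# Usage: ocr_all DIRECTORY [ocr.sh arguments...]
ocr_all() {
    local dir="$1" pdf
    shift
    for pdf in "$dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory has no PDF files.
        [ -e "$pdf" ] || continue
        "$OCR_CMD" "$@" "$pdf"
    done
}

# Example: OCR every English PDF in ./scans and spell-check it.
# ocr_all ./scans -t eng -a en
```

Any arguments after the directory are passed through unchanged, so the -s, -t and -a arguments work as in the single-file examples.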

Output files

For a PDF file named doc1.pdf, the script:

  • creates a directory named doc1
  • creates a file named pg_<number>.txt inside this directory for each PDF page

Or, if the -c argument is used, the script:

  • creates a directory named doc1
  • creates a single file named doc1/doc1.txt
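
If a document was processed without -c, the per-page files can still be joined afterwards. The sketch below is not part of ocr.sh: the function name merge_pages is hypothetical, and it assumes only the pg_<number>.txt naming described above. It sorts by the numeric part of the name, so pg_10.txt comes after pg_9.txt whether or not the numbers are zero-padded.

```shell
#!/usr/bin/env bash
# Print the per-page files of an output directory in page order.
# Hypothetical helper, not part of ocr.sh.
merge_pages() {
    local dir="$1" f n
    for f in "$dir"/pg_*.txt; do
        # Skip the unexpanded pattern when no page files exist.
        [ -e "$f" ] || continue
        n=${f##*/pg_}   # strip the directory and the "pg_" prefix
        n=${n%.txt}     # strip the ".txt" suffix
        printf '%s\n' "$n"
    done | sort -n | while read -r n; do
        cat "$dir/pg_$n.txt"
    done
}

# Example for doc1.pdf: the output directory is the file name
# without its .pdf extension.
# out=$(basename doc1.pdf .pdf)
# merge_pages "$out" > "$out/all_pages.txt"
```

The example writes to all_pages.txt, a made-up name chosen so that it does not collide with the pg_<number>.txt files or with the doc1.txt file produced by -c.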
