Thoughts / approaches to processing 990s

Included in this toybox:

pdftk (used below to pull single pages out of a PDF)
Some tools from Homebrew (make sure to run brew update first):
- brew install imagemagick
- brew install tesseract

Get the first page of a PDF:
pdftk pdfs/52-6078041_990PF_200706.pdf shuffle 1 output 52-607.pdf

Turn that first page into an image:
./pdf-splitter 52-607.pdf 'img/%.d.jpg' 1200px

Get a section of the page to process:
convert img/1.jpg -crop 490x20+185+225 img/1.crop.jpg
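
The -crop argument uses ImageMagick's geometry format, WIDTHxHEIGHT+X+Y. Here that means a box 490 pixels wide and 20 pixels tall, with its top-left corner 185 pixels from the left edge and 225 pixels from the top of the page image. If you end up cropping several fields, a small helper can make the numbers easier to read. This is only a sketch; crop_geometry is a made-up name, not something in this toybox:

```shell
# Hypothetical helper: build an ImageMagick geometry string
# from width, height, x offset and y offset (in that order).
crop_geometry() {
    printf '%sx%s+%s+%s\n' "$1" "$2" "$3" "$4"
}

# Example (same crop as above):
# convert img/1.jpg -crop "$(crop_geometry 490 20 185 225)" img/1.crop.jpg
```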

OCR the image (you'll get a file named 1.txt):
tesseract img/1.crop.jpg 1
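
The four steps above could be strung together for one PDF roughly like this. It is a sketch, not a tested pipeline: it assumes pdftk, convert and tesseract are on your PATH, that ./pdf-splitter writes img/1.jpg for a one-page input (as in the steps above), and that the field you want sits at the same crop coordinates on every form, which may well not hold. The names output_base and process_pdf are invented for illustration.

```shell
# Hypothetical helper: derive an output name from the PDF path,
# e.g. pdfs/52-6078041_990PF_200706.pdf -> 52-6078041_990PF_200706
output_base() {
    basename "$1" .pdf
}

# Run the first-page / image / crop / OCR steps on a single PDF.
# Stops at the first step that fails.
process_pdf() {
    pdf="$1"
    base="$(output_base "$pdf")"
    mkdir -p img || return 1
    # First page only
    pdftk "$pdf" shuffle 1 output "$base.page1.pdf" || return 1
    # Page -> image (assumed to land in img/1.jpg)
    ./pdf-splitter "$base.page1.pdf" 'img/%.d.jpg' 1200px || return 1
    # Same crop region as above; adjust per form layout
    convert img/1.jpg -crop 490x20+185+225 img/1.crop.jpg || return 1
    # OCR; tesseract adds the .txt extension itself
    tesseract img/1.crop.jpg "$base" || return 1
}

# Usage:
# process_pdf pdfs/52-6078041_990PF_200706.pdf
```

Naming the text output after the source PDF, rather than 1.txt, keeps results from overwriting each other if you run this over a whole directory.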
