lambda-OCRmyPDF

Adapting the Python library OCRmyPDF to run as an AWS Lambda function.

From the OCRmyPDF readme:

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted.

Purpose

This application adapts the OCRmyPDF application/library to run on AWS as a dynamic document OCR processing service. We loved the implementation of OCRmyPDF and felt it would work well as an AWS Lambda function.

What's in the repo?

This repository contains all external libraries required by OCRmyPDF, compiled on and extracted from an Amazon Linux EC2 instance. It also contains all Python packages, compiled on and extracted from the same kind of instance. Lastly, it includes some minor changes to the OCRmyPDF source itself to make it Lambda friendly. The resulting zip file extracts to around 261,459,415 bytes, which comes in just under the 262,144,000-byte (250 MB) limit that AWS sets for an unzipped deployment package.
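If you rebuild the archive yourself, it may help to confirm that it still fits before uploading. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and not part of this repository, and the limit is the figure quoted above.

```python
import zipfile

# Unzipped deployment package limit quoted above (250 MB).
LAMBDA_UNZIPPED_LIMIT = 262_144_000


def unzipped_size(zip_path):
    """Return the total uncompressed size, in bytes, of all entries in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sum(info.file_size for info in archive.infolist())


def fits_in_lambda(zip_path, limit=LAMBDA_UNZIPPED_LIMIT):
    """Return True if the extracted archive stays within the given limit."""
    return unzipped_size(zip_path) <= limit
```

For example, `fits_in_lambda("lambda-ocrtopdf.zip")` reports whether the downloaded release would still extract within the limit.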

Calling the Event

The event currently supports only a few basic parameters, which we intend to expand. The parameters are:

Key	Description
awsRegion	The region where the S3 bucket is located
s3.bucket.name	The name of the S3 bucket
s3.object.key	The key for the object in the S3 bucket
pages	The pages parameter for OCRmyPDF. Ex: "1", "3-5", "1,3-5,8"
doBackup	Creates a mirror of the S3 object appended with `.bak`
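The dotted keys in the table describe a nested structure (`s3.bucket.name` is `event["s3"]["bucket"]["name"]`). The sketch below assembles such an event in Python. `build_event` is a hypothetical helper, not part of this repository. It assumes that `pages` may be omitted and that `doBackup` is a top-level boolean; the table does not confirm either point.

```python
import json


def build_event(region, bucket, key, pages=None, do_backup=None):
    """Assemble an event dict in the nested shape described by the table."""
    event = {
        "awsRegion": region,
        "s3": {
            "bucket": {"name": bucket},
            "object": {"key": key},
        },
    }
    if pages is not None:
        # Passed through as a string, e.g. "1", "3-5" or "1,3-5,8".
        event["pages"] = pages
    if do_backup is not None:
        # Assumption: doBackup is a top-level boolean flag.
        event["doBackup"] = bool(do_backup)
    return event


if __name__ == "__main__":
    # Placeholder region, bucket and key; replace with your own values.
    print(json.dumps(build_event("us-east-1", "my-bucket", "input.pdf", pages="1,3-5,8"), indent=2))
```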

Installation

Download Latest Release

Download the latest release from this repository's releases page.

[Optional]: You can also download the source here and compress it into your own .zip file. However, you may need to update the permissions of the files within the /bin and /lib folders, as discovered in this issue.
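The exact permissions required are not stated here, so the following is only a sketch under the assumption that the files need to be readable and executable by everyone (for example, 755 for a file that starts as 600 or 644). `add_read_execute_bits` is a hypothetical helper, not part of this repository.

```python
import os


def add_read_execute_bits(root):
    """Add r-x for user, group and other to every file under root.

    Returns the list of paths whose mode was changed.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.stat(path).st_mode
            new_mode = mode | 0o555  # assumption: read + execute for all is sufficient
            if new_mode != mode:
                os.chmod(path, new_mode)
                changed.append(path)
    return changed
```

You would run it on the `bin` and `lib` folders before compressing, and then build the archive with a tool that preserves Unix permissions.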

Create the Function

Go to S3 and upload the downloaded zip file to an S3 bucket of your choosing. Keep track of the URL of the file for later.
Access the AWS Lambda dashboard and click on Create Function.
Ensure Author from scratch is selected.
Name your function as you please.
Make sure that your Runtime is set to Python 3.6.
If you do not have an IAM Role set up for S3 access, set one up with read and write access on S3. I used AWS's AWSLambdaExecute policy as a base.

Set Up the Function

Manually

Under the Function code section:
- Set the Code entry type field to Upload a file from Amazon S3
- Set the Runtime field to Python 3.6
- Set the handler field to apply-ocr-to-s3-object.apply_ocr_to_document_handler
- Copy and paste the S3 object URL from above into the Amazon S3 link URL field
- Increase the timeout to the maximum (900 seconds)
- Set the memory size to your desired value
  - OCRmyPDF requires roughly 256 MB; the Lambda maximum is 3008 MB
  - Note: increasing the memory size may reduce the runtime drastically, but it also increases the Lambda's per-time cost (see AWS pricing)
Under the Environment Variables section:
- Set the environment variables below to their respective paths
  - PATH : /var/task/bin
  - PYTHONPATH : /var/task/python
  - TESSDATA_PREFIX : /var/task/tessdata
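A mistyped environment variable tends to surface only as a failure at OCR time, so a quick check can save a debugging round trip. The sketch below compares an environment against the three values listed above. `misconfigured_variables` is a hypothetical helper, not part of this repository.

```python
import os

# Values taken from the list above.
EXPECTED_ENV = {
    "PATH": "/var/task/bin",
    "PYTHONPATH": "/var/task/python",
    "TESSDATA_PREFIX": "/var/task/tessdata",
}


def misconfigured_variables(env=None, expected=EXPECTED_ENV):
    """Return the names of variables that are unset or lack the expected path.

    A variable passes if the expected path is one of its os.pathsep-separated
    entries, so a PATH with additional directories is accepted.
    """
    env = os.environ if env is None else env
    problems = []
    for name, path in expected.items():
        entries = env.get(name, "").split(os.pathsep)
        if path not in entries:
            problems.append(name)
    return problems
```

Calling `misconfigured_variables()` inside the function, for instance at the top of a test handler, returns an empty list when all three variables are set as described.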

Or by script

# Fill in these names before running
lambda_name="your_lambda_name"
s3_bucket="your_bucket"
s3_file_key="your_s3_file_key.zip"

zip_file_name="lambda-ocrtopdf.zip"
download_url="https://github.com/chronograph-pe/lambda-OCRmyPDF/releases/download/v1.0-alpha/lambda-ocrtopdf.zip"

# Download the release archive (wget follows redirects by default)
wget -O "$zip_file_name" "$download_url"

# Upload the archive to S3 and point the existing function at it
aws s3 cp "$zip_file_name" "s3://$s3_bucket/$s3_file_key"
aws lambda update-function-code --function-name "$lambda_name" --s3-bucket "$s3_bucket" --s3-key "$s3_file_key"

# Wait for the code update to finish before changing the configuration
aws lambda wait function-updated --function-name "$lambda_name"

aws lambda update-function-configuration --function-name "$lambda_name" \
                                        --handler apply-ocr-to-s3-object.apply_ocr_to_document_handler \
                                        --environment "Variables={PATH=/var/task/bin,PYTHONPATH=/var/task/python,TESSDATA_PREFIX=/var/task/tessdata}" \
                                        --timeout 900 \
                                        --memory-size 3008 \
                                        --runtime "python3.6"

Test the Function

The following test configuration can be added to Lambda to test the function. Upload any PDF named input.pdf to an S3 bucket and run this test configuration:

input_test_configuration.json

{
  "pages": "1",
  "awsRegion": "[YOUR AWS REGION HERE]",
  "s3": {
    "bucket": {
      "name": "[YOUR BUCKET NAME HERE]"
    },
    "object": {
      "key": "input.pdf"
    }
  }
}
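The same test can also be run from outside the console. The sketch below loads the configuration file shown above, refuses to continue if the bracketed placeholders are still present, and invokes the function synchronously. It assumes that boto3 is installed, that AWS credentials and a default region are configured, and that the file is saved as `input_test_configuration.json`. `load_event` and `invoke` are hypothetical helpers, not part of this repository.

```python
import json


def load_event(path):
    """Load the test event and raise ValueError if a placeholder was left in."""
    with open(path, encoding="utf-8") as handle:
        event = json.load(handle)
    values = [
        event["awsRegion"],
        event["s3"]["bucket"]["name"],
        event["s3"]["object"]["key"],
    ]
    for value in values:
        if value.startswith("[YOUR"):
            raise ValueError(f"placeholder not replaced: {value}")
    return event


def invoke(function_name, event):
    """Invoke the Lambda function synchronously and return the raw response body."""
    import boto3  # third-party; imported here so load_event works without it

    client = boto3.client("lambda")  # uses your default configured region
    response = client.invoke(
        FunctionName=function_name,
        Payload=json.dumps(event).encode("utf-8"),
    )
    return response["Payload"].read().decode("utf-8")
```

A typical call would be `invoke("your_lambda_name", load_event("input_test_configuration.json"))`.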

To Do:

Add additional language support
Continue to trim down Python packages
