Extract-Transform-Load Using AWS Glue

This solution is a reference architecture for serverless processing of unstructured data using AWS Glue. The primary objective is to demonstrate how to discover structure (fields) in unstructured data, transform it according to business requirements, and present it back to the user through SQL queries. It is based on a sample solution proposed by AWS.

Solution Components

The CloudFormation template contains the following:

A REST API, provided by Amazon API Gateway, through which a user can upload unstructured data for processing. Supported formats include JSON and CSV, among others
An uploader Lambda function which saves the input data to an S3 bucket
An event handler Lambda function which triggers the AWS Glue service to process the new data
An AWS Glue Crawler which scans the data, identifies its structure (fields) and updates tables in the database (Data Catalog)
An AWS Glue Job Python script which transforms the data and saves it to the S3 bucket in Parquet format
An S3 bucket which is created by the CloudFormation stack. This bucket stores input and output data
An AWS Custom Resource which ensures that the S3 data bucket is empty when the CloudFormation stack is deleted
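
As an illustration of the uploader component, the following is a minimal sketch of a Lambda-style handler that decodes the uploaded document and writes it to S3. It is not the repository's actual code. It assumes an API Gateway proxy integration (the request body arrives as a JSON string in `event["body"]`), and the names `make_uploader`, `bucket` and `input_folder` are hypothetical. The S3 client is passed in so that the handler can be exercised without AWS access.

```python
import base64
import json


def make_uploader(s3_client, bucket, input_folder):
    """Build a Lambda-style handler that stores an uploaded document in S3.

    s3_client is expected to behave like boto3.client("s3").
    """

    def handler(event, context=None):
        # Assumption: proxy integration, so the body is a JSON string.
        body = json.loads(event["body"])
        # "docname" and "docdata" are the request fields described in this README.
        key = f"{input_folder}/{body['docname']}"
        data = base64.b64decode(body["docdata"])
        s3_client.put_object(Bucket=bucket, Key=key, Body=data)
        return {"statusCode": 200, "body": json.dumps({"key": key})}

    return handler
```

In a deployed function, the handler could be created once at module load, for example `handler = make_uploader(boto3.client("s3"), bucket_name, folder_name)`, with the bucket and folder names read from environment variables.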

Input Parameters

The CloudFormation template requires the following inputs:

dataBucketName: Name of the new S3 bucket which will be created as part of the CloudFormation stack. It will store input and output data
inputDataFolderName: Folder name where input data should be stored
outputDataFolderName: Folder name where output data should be stored
tempDataFolderName: Folder name where temp data should be stored
etlScriptPath: Path of the ETL script in S3 bucket. For example: s3://<bucket-name>/glue-etl-transform.py
databaseName: The database name where metadata will be stored
lambdaS3BucketName: S3 bucket name where lambda code resides
inputDataUploaderLambdaZipFilename: Input data uploader lambda code zipfile name
inputDataUploaderLambdaHandler: Input data uploader lambda entry point name
eventHanderLambdaZipFilename: Event handler lambda code zipfile name
eventHandlerLambdaHandler: Event handler lambda entry point name
customResLambdaZipFilename: Custom resource lambda code zipfile name
customResLambdaHandler: Custom resource lambda entry point name
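
The parameters above can also be supplied programmatically. The sketch below shows one possible way to pass them to CloudFormation through a boto3-style client. The helper names `to_cfn_parameters` and `deploy_stack` are hypothetical, and `CAPABILITY_IAM` is an assumption based on the template creating an IAM role; a template that assigns explicit names to IAM resources would need `CAPABILITY_NAMED_IAM` instead.

```python
def to_cfn_parameters(values):
    """Convert a plain dict into the Parameters list expected by create_stack."""
    return [
        {"ParameterKey": key, "ParameterValue": str(value)}
        for key, value in values.items()
    ]


def deploy_stack(cfn_client, stack_name, template_body, values):
    """Create the stack; cfn_client should behave like boto3.client("cloudformation")."""
    return cfn_client.create_stack(
        StackName=stack_name,
        TemplateBody=template_body,
        Parameters=to_cfn_parameters(values),
        # Assumption: the template creates IAM resources without custom names.
        Capabilities=["CAPABILITY_IAM"],
    )
```

A call might look like `deploy_stack(boto3.client("cloudformation"), "glue-etl", template_text, {"dataBucketName": "...", "databaseName": "..."})`, with one dictionary entry for each input listed above.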

Glue Job ETL Script

The Python script applies the following transformations to the input data:

Remove any duplicate records
Convert all column names to lowercase
Convert the data to Parquet format and save it to the S3 bucket
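
The three transformations can be sketched with the PySpark DataFrame API as shown below. This is an illustration rather than the actual script in this repository: the function names are hypothetical, the input is assumed to already be a Spark DataFrame (for example one obtained from a Glue DynamicFrame with `toDF()`), and the `overwrite` write mode is an assumption.

```python
def transform(df):
    """Drop duplicate rows, then rename every column to its lowercase form."""
    deduped = df.dropDuplicates()
    # toDF(*names) returns a new DataFrame with the columns renamed in order.
    return deduped.toDF(*[name.lower() for name in deduped.columns])


def write_parquet(df, output_path):
    """Write the DataFrame as Parquet files under output_path (e.g. an s3:// URI)."""
    # Assumption: existing output at this path may be replaced.
    df.write.mode("overwrite").parquet(output_path)
```

Note that lowercasing can make two column names collide if they differ only in case; this sketch does not handle that situation.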

Instructions

Deploy the CloudFormation template in AWS
Note the API endpoint URL shown in the Outputs tab of the CloudFormation stack
Encode an input document as base64
Use a tool like Postman to invoke the REST API by providing the following data: { "docname": "inputDocName", "docdata": "<base64 data>" }
The document will be saved in the new S3 bucket and the Glue service will be invoked
Open the AWS Glue console and verify that the Glue Job has completed. The output data, in Parquet format, will be stored in the S3 bucket
Finally, to view the output data via SQL queries, use the Amazon Athena service
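
As an alternative to Postman, the base64 encoding and API call can be scripted. The following is a minimal sketch using only the Python standard library. The function names are hypothetical, the endpoint URL must be taken from the stack outputs, and the use of the POST method is an assumption, since the method is not stated here.

```python
import base64
import json
import urllib.request


def build_payload(docname, raw_bytes):
    """Return the JSON request body with the document encoded as base64."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return json.dumps({"docname": docname, "docdata": encoded})


def upload(api_url, docname, raw_bytes):
    """Send the document to the REST API and return (status, response text)."""
    request = urllib.request.Request(
        api_url,
        data=build_payload(docname, raw_bytes).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # Assumption: the API accepts uploads via POST.
    )
    with urllib.request.urlopen(request) as response:
        return response.status, response.read().decode("utf-8")
```

For example, `upload(api_url, "inputDocName", open("data.csv", "rb").read())` would send a local file, where `api_url` is the endpoint noted from the Outputs tab.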

Cleanup

Delete the CloudFormation stack
Delete the Glue database (Data Catalog)
Delete the Athena database

Known Limitations and Future Improvements

The commonRole IAM role has multiple permissions merged into one. This can be avoided by using AWS Serverless Application Model
During CloudFormation stack teardown, the Custom Resource sometimes fails to delete the contents of the S3 bucket because of timing issues. This can be fixed by adding appropriate dependencies (CloudFormation DependsOn attribute)
In the current architecture, the name of the table created by the AWS Glue Crawler is not known in advance. The Glue Job script guesses it from the input folder name
The Glue service can be configured to scan input data incrementally by using job bookmarks
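
For the job bookmark improvement, one possible approach is for the event handler to start the Glue Job with the `--job-bookmark-option` argument set to `job-bookmark-enable`. The sketch below assumes a boto3-style Glue client; the function name and the job name are hypothetical. Bookmarks also require the ETL script to call `job.init()` and `job.commit()` and to pass a `transformation_ctx` to its sources, which is not shown here.

```python
# Special Glue job parameter that turns on bookmark tracking for a run.
BOOKMARK_ARGUMENTS = {"--job-bookmark-option": "job-bookmark-enable"}


def start_incremental_run(glue_client, job_name):
    """Start a Glue Job run with bookmarks enabled and return its run ID.

    glue_client should behave like boto3.client("glue").
    """
    response = glue_client.start_job_run(
        JobName=job_name,
        Arguments=dict(BOOKMARK_ARGUMENTS),
    )
    return response["JobRunId"]
```

The same argument can instead be set once as a default argument on the job definition, so that every run uses bookmarks without the caller passing it.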

oyenamit/aws-glue-etl-transform