Reusable utilities for working with Glue PySpark jobs
pip install glue-utils
This library does not list pyspark and aws-glue-libs as dependencies because both are already pre-installed in Glue's runtime environment.
To develop your Glue jobs locally in your IDE, it helps to install pyspark and aws-glue-libs. Unfortunately, aws-glue-libs is not available on PyPI, so it can only be installed from its git repository.
pip install pyspark==3.3.0
pip install git+https://github.com/awslabs/aws-glue-libs.git@master
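After installing, a quick check like the following can confirm that both packages are importable in the local environment. This is only a sketch: missing_modules is a hypothetical helper, not part of glue-utils, and it assumes the packages expose the top-level module names pyspark and awsglue.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found locally.

    Hypothetical helper for illustration; it is not part of glue-utils.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means both packages are visible to this interpreter.
print(missing_modules(["pyspark", "awsglue"]))
```

The check only locates the modules without importing them, so it stays fast and does not start a Spark session.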
To make your local environment as close to Glue's runtime as possible, use the versions specified in this document:
glue_utils currently has the following features:
ManagedGlueContext
ManagedGlueContext instantiates a GlueContext (initializing a Job in the process) and wraps it in a context manager, ensuring that the Job is committed when the with block exits.
See the example below.
import sys

from awsglue.utils import getResolvedOptions
from glue_utils.context import ManagedGlueContext

options = getResolvedOptions(sys.argv, [])

with ManagedGlueContext(job_options=options) as glue_context:
    dynamicframe = glue_context.create_dynamic_frame_from_options(
        connection_type="s3",
        connection_options={
            "paths": ["s3://awsglue-datasets/examples/us-legislators/all/persons.json"],
            "recurse": True,
        },
        format="json",
    )
    dynamicframe.printSchema()
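
The commit-on-exit behaviour described above can be illustrated with a small, Glue-free sketch. This is not the actual glue-utils implementation: StubJob is a hypothetical stand-in for Glue's Job, and managed_job is a hypothetical helper. The sketch also assumes the commit happens only when the block finishes without an exception; whether ManagedGlueContext behaves the same way on errors is not stated here.

```python
from contextlib import contextmanager


class StubJob:
    """Hypothetical stand-in for Glue's Job so the sketch runs without Glue."""

    def __init__(self):
        self.initialized = False
        self.committed = False

    def init(self, name, options):
        self.initialized = True

    def commit(self):
        self.committed = True


@contextmanager
def managed_job(job, name, options):
    """Initialize the job on entry and commit it on a clean exit."""
    job.init(name, options)
    yield job
    # Reached only if the body of the with block raised no exception
    # (an assumption of this sketch, not a documented glue-utils guarantee).
    job.commit()


job = StubJob()
with managed_job(job, "example-job", {}) as active_job:
    # Inside the block the job is initialized but not yet committed.
    assert active_job.initialized and not active_job.committed
```

Tying the commit to the end of the with block means the caller cannot forget it, which is the convenience the context manager pattern provides.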