Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

glue_table_properties only accepts and returns values as strings #139

Closed
davidbridgwood opened this issue Feb 15, 2024 · 1 comment
Closed
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@davidbridgwood
Copy link

davidbridgwood commented Feb 15, 2024

this is an extension of changes discussed here and implemented here.

current situation

The current implementation allows a user to provide an additional key (glue_table_properties), which itself must be a dict of key-value pairs to the Metadata object which when passed to GlueTable().generate_from_meta with update_table_properties set to True will add the key-value pairs to to the table-properties in the glue data catalog.

i.e.

>>> meta = {
... "name": "some_table",
... ...,
... "glue_table_properties": {"key_1": "value_1", "key_2": "value_2"}
... }

>>> _ = GlueTable().generate_from_meta(
...     metadata=meta,
...     database_name="some_database",
...     table_location="s3://some/location",
...     update_table_properties=True
... )

The same is true in reverse when using GlueTable().generate_to_meta with get_table_properties set to True, a metadata object is returned with the table-properties from the glue-catalog available within the glue_table_properties key.

i.e.

>>> meta = GlueTable().generate_to_meta(
...     database="some_database",
...     table="some_table",
...     get_table_properties=True
... )

>>> meta
{
"name": "some_table",
...,
"glue_table_properties": {"key_1": "value_1", "key_2": "value_2"}
}

why we need more than strings

The current implementation requires the values in the glue_table_properties to be strings, however there are many cases where different or more complex data-types could be required. For example primary-keys are often stored as lists, and indeed this is a requirement of model-generation within create-a-derived-table which implements its own extraction of table properties and transforms certain values to lists as required (see here).

If we were to migrate model-generation to make use of generate_to_meta (which i'd very much like to do!), then we would ideally need a way for mojap-metadata to handle those data-type transformation internally, rather than having to add something custom within our own code.

Similarly when using generate_from_meta to add the table-properties to the glue-catalog it would provide a simpler user experience if the glue_table_properties dictionary could be passed with these more complex types with mojap-metadata handling the conversion to string internally before writing to the glue-catalog.

proposal

Would probably need to discuss implementation to come up with the simplest, most user-friendly approach. One possible suggestion is to provide an optional schema for glue_table_properties which mojap-metadata can then use to do any conversions it needs and raise meaningful exceptions if those transformations fail.

i.e.

>>> meta = {
... "name": "some_table",
... ...,
... "glue_table_properties": {
... "key_1": "a string",
... "key_2": 10,
... "key_3": ["a", "list"],
... "key_4": True}
... }

>>> glue_table_properties_schema = {
... "key_1":  str,
... "key_2": int,
... "key_3": list,
... "key_4": bool
... }
... }

# internally this converts the values in `glue_table_properties` into strings based on the
# `glue_table_properties_schema` schema before writing to glue
>>> _ = GlueTable().generate_from_meta(
...     metadata=meta,
...     database_name="some_database",
...     table_location="s3://some/location",
...     update_table_properties=True,
...     table_properties_schema=glue_table_properties_schema
... )

# internally this converts the strings returned from glue back to the datatypes as defined in
# `glue_table_properties_schema` schema
>>> meta = GlueTable().generate_to_meta(
...     database="some_database",
...     table="some_table",
...     get_table_properties=True,
...     table_properties_schema=glue_table_properties_schema
... )

>>> meta
{
"name": "some_table",
...,
"glue_table_properties": {
    "key_1": "a string",
    "key_2": 10,
    "key_3": ["a", "list"],
    "key_4": True
}

it seems like there is a fair amount of functionality in metadata.py which handle data-type transformations with it may be possible to use/extend for these purposes.

@davidbridgwood davidbridgwood added enhancement New feature or request good first issue Good for newcomers labels Feb 15, 2024
@davidbridgwood
Copy link
Author

closing as not planned based on discussions - we will leave type-conversions to users and enforce strings for glue-table-properties written via mojap-metadata

@davidbridgwood davidbridgwood closed this as not planned Won't fix, can't repro, duplicate, stale May 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

1 participant