[MAINTENANCE] Remove legacy DataConnectors #9923

Merged
merged 28 commits into develop from m/v1/remove_file_path_dc on May 14, 2024

Conversation

joshua-stauffer
Member

  • Description of PR changes above includes a link to an existing GitHub issue
  • PR title is prefixed with one of: [BUGFIX], [FEATURE], [DOCS], [MAINTENANCE], [CONTRIB]
  • Code is linted - run invoke lint (uses ruff format + ruff check)
  • Appropriate tests and docs have been updated

For more information about contributing, see Contribute.

After you submit your PR, keep the page open and monitor the statuses of the various checks made by our continuous integration process at the bottom of the page. Please fix any issues that come up and reach out on Slack if you need help. Thanks for contributing!


netlify bot commented May 13, 2024

Deploy Preview for niobium-lead-7998 canceled.

🔨 Latest commit: 8708885
🔍 Latest deploy log: https://app.netlify.com/sites/niobium-lead-7998/deploys/6643a771531e62000862fa82

Copy link

codecov bot commented May 13, 2024

Codecov Report

Attention: Patch coverage is 81.42857%, with 26 lines in your changes missing coverage. Please review.

Project coverage is 77.79%. Comparing base (9c9d551) to head (8708885).
Report is 2 commits behind head on develop.

Files Patch % Lines
...asource/fluent/data_connector/s3_data_connector.py 76.92% 6 Missing ⚠️
...a_connector/google_cloud_storage_data_connector.py 70.58% 5 Missing ⚠️
.../fluent/data_connector/file_path_data_connector.py 92.59% 4 Missing ⚠️
...fluent/data_connector/filesystem_data_connector.py 78.57% 3 Missing ⚠️
...ata_connector/azure_blob_storage_data_connector.py 90.00% 2 Missing ⚠️
great_expectations/core/batch.py 0.00% 1 Missing ⚠️
...tasource/fluent/data_asset/path/directory_asset.py 0.00% 1 Missing ⚠️
...ns/datasource/fluent/data_asset/path/file_asset.py 0.00% 1 Missing ⚠️
great_expectations/datasource/fluent/interfaces.py 0.00% 1 Missing ⚠️
...xpectations/datasource/fluent/pandas_datasource.py 0.00% 1 Missing ⚠️
... and 1 more
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9923      +/-   ##
===========================================
+ Coverage    77.33%   77.79%   +0.46%     
===========================================
  Files          490      456      -34     
  Lines        41659    39630    -2029     
===========================================
- Hits         32218    30832    -1386     
+ Misses        9441     8798     -643     
Flag Coverage Δ
3.10 64.36% <64.28%> (+0.79%) ⬆️
3.10 athena or clickhouse or openpyxl or pyarrow or project or sqlite or aws_creds ?
3.10 aws_deps ?
3.10 big ?
3.10 databricks ?
3.10 filesystem ?
3.10 mssql ?
3.10 mysql ?
3.10 postgresql ?
3.10 snowflake ?
3.10 spark ?
3.10 trino ?
3.11 64.36% <64.28%> (+0.79%) ⬆️
3.11 athena or clickhouse or openpyxl or pyarrow or project or sqlite or aws_creds 53.62% <22.14%> (-0.07%) ⬇️
3.11 aws_deps 44.81% <50.71%> (+0.09%) ⬆️
3.11 big 53.84% <34.28%> (-0.42%) ⬇️
3.11 databricks 46.09% <17.85%> (+0.19%) ⬆️
3.11 filesystem 59.32% <46.42%> (+0.25%) ⬆️
3.11 mssql 49.08% <17.85%> (+0.29%) ⬆️
3.11 mysql 49.14% <17.85%> (+0.30%) ⬆️
3.11 postgresql 53.25% <17.85%> (+0.53%) ⬆️
3.11 snowflake 46.76% <17.85%> (+0.22%) ⬆️
3.11 spark 56.51% <52.14%> (+0.19%) ⬆️
3.11 trino 51.17% <17.85%> (+0.40%) ⬆️
3.8 64.40% <64.28%> (+0.81%) ⬆️
3.8 athena or clickhouse or openpyxl or pyarrow or project or sqlite or aws_creds 53.63% <22.14%> (-0.07%) ⬇️
3.8 aws_deps 44.83% <50.71%> (+0.10%) ⬆️
3.8 big 53.85% <34.28%> (-0.42%) ⬇️
3.8 databricks 46.11% <17.85%> (+0.19%) ⬆️
3.8 filesystem 59.33% <46.42%> (+0.26%) ⬆️
3.8 mssql 49.06% <17.85%> (+0.30%) ⬆️
3.8 mysql 49.12% <17.85%> (+0.30%) ⬆️
3.8 postgresql 53.24% <17.85%> (+0.54%) ⬆️
3.8 snowflake 46.78% <17.85%> (+0.22%) ⬆️
3.8 spark 56.48% <52.14%> (+0.19%) ⬆️
3.8 trino 51.16% <17.85%> (+0.40%) ⬆️
3.9 64.40% <64.28%> (+0.81%) ⬆️
3.9 athena or clickhouse or openpyxl or pyarrow or project or sqlite or aws_creds ?
3.9 aws_deps ?
3.9 big ?
3.9 databricks ?
3.9 filesystem ?
3.9 mssql ?
3.9 mysql ?
3.9 postgresql ?
3.9 snowflake ?
3.9 spark ?
3.9 trino ?
cloud 0.00% <0.00%> (ø)
docs-basic 47.50% <42.14%> (+0.06%) ⬆️
docs-creds-needed 48.61% <70.71%> (+0.11%) ⬆️
docs-spark 47.35% <42.14%> (+0.18%) ⬆️

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

@joshua-stauffer marked this pull request as ready for review on May 13, 2024, 18:56.
Comment on lines -32 to -540
        warnings.warn(
            "In order to access blobs with an InferredAssetGCSDataConnector, \
or enabling recursive file discovery with a Fluent datasource, \
the delimiter that has been passed to gcs_options in your config must be empty; \
please note that the value is being set to None in order to work with the Google SDK."
        )
        query_options["delimiter"] = None

    keys: List[str] = []
    for blob in gcs_client.list_blobs(**query_options):
        name: str = blob.name
        if name.endswith("/"):  # GCS includes directories in blob output
            continue

        keys.append(name)

    return keys


def list_s3_keys(  # noqa: C901 - too complex
    s3, query_options: dict, iterator_dict: dict, recursive: bool = False
) -> Generator[str, None, None]:
    """
    For InferredAssetS3DataConnector, we take bucket and prefix and search for files using RegEx at and below the level
    specified by that bucket and prefix. However, for ConfiguredAssetS3DataConnector, we take bucket and prefix and
    search for files using RegEx only at the level specified by that bucket and prefix. This restriction for the
    ConfiguredAssetS3DataConnector is needed, because paths on S3 are comprised not only the leaf file name but the
    full path that includes both the prefix and the file name. Otherwise, in the situations where multiple data assets
    share levels of a directory tree, matching files to data assets will not be possible, due to the path ambiguity.
    :param s3: s3 client connection
    :param query_options: s3 query attributes ("Bucket", "Prefix", "Delimiter", "MaxKeys")
    :param iterator_dict: dictionary to manage "NextContinuationToken" (if "IsTruncated" is returned from S3)
    :param recursive: True for InferredAssetS3DataConnector and False for ConfiguredAssetS3DataConnector (see above)
    :return: string valued key representing file path on S3 (full prefix and leaf file name)
    """  # noqa: E501
    if iterator_dict is None:
        iterator_dict = {}

    if "continuation_token" in iterator_dict:
        query_options.update({"ContinuationToken": iterator_dict["continuation_token"]})

    logger.debug(f"Fetching objects from S3 with query options: {query_options}")

    s3_objects_info: dict = s3.list_objects_v2(**query_options)

    if not any(key in s3_objects_info for key in ["Contents", "CommonPrefixes"]):
        raise ValueError("S3 query may not have been configured correctly.")  # noqa: TRY003

    if "Contents" in s3_objects_info:
        keys: List[str] = [item["Key"] for item in s3_objects_info["Contents"] if item["Size"] > 0]
        yield from keys

    if recursive and "CommonPrefixes" in s3_objects_info:
        common_prefixes: List[Dict[str, Any]] = s3_objects_info["CommonPrefixes"]
        for prefix_info in common_prefixes:
            query_options_tmp: dict = copy.deepcopy(query_options)
            query_options_tmp.update({"Prefix": prefix_info["Prefix"]})
            # Recursively fetch from updated prefix
            yield from list_s3_keys(
                s3=s3,
                query_options=query_options_tmp,
                iterator_dict={},
                recursive=recursive,
            )

    if s3_objects_info["IsTruncated"]:
        iterator_dict["continuation_token"] = s3_objects_info["NextContinuationToken"]
        # Recursively fetch more
        yield from list_s3_keys(
            s3=s3,
            query_options=query_options,
            iterator_dict=iterator_dict,
            recursive=recursive,
        )

    if "continuation_token" in iterator_dict:
        # Make sure we clear the token once we've gotten fully through
        del iterator_dict["continuation_token"]


# TODO: <Alex>We need to move sorters and _validate_sorters_configuration() to DataConnector</Alex>
# As a rule, this method should not be in "util", but in the specific high-level "DataConnector" class, where it is  # noqa: E501
# called (and declared as private in that class). Currently, this is "FilePathDataConnector". However, since this  # noqa: E501
# method is also used in tests, it can remain in the present "util" module (as an exception to the above stated rule).  # noqa: E501
def build_sorters_from_config(config_list: List[Dict[str, Any]]) -> Optional[dict]:
    sorter_dict: Dict[str, Sorter] = {}
    if config_list is not None:
        for sorter_config in config_list:
            # if sorters were not configured
            if sorter_config is None:
                return None

            if "name" not in sorter_config:
                raise ValueError("Sorter config should have a name")  # noqa: TRY003

            sorter_name: str = sorter_config["name"]
            new_sorter: Sorter = _build_sorter_from_config(sorter_config=sorter_config)
            sorter_dict[sorter_name] = new_sorter

    return sorter_dict

joshua-stauffer (Member Author) commented:

Where these utils were still in use, they were migrated to the fluent data_connector module, where they are used.

Comment on lines 915 to +922
    if batch_definition is None:
        batch_definition = (
            my_data_connector.get_batch_definition_list_from_batch_request(
                batch_request=BatchRequest(
                    datasource_name="my_test_datasource",
                    data_connector_name="my_sql_data_connector",
                    data_asset_name="my_asset",
                )
            )
        )[0]
        # maintain legacy behavior - standup a dummy LegacyBatchDefinition
        batch_definition = LegacyBatchDefinition(
            datasource_name="my_test_datasource",
            data_connector_name="my_sql_data_connector",
            data_asset_name="my_asset",
            batch_identifiers=IDDict(),
        )

joshua-stauffer (Member Author) commented:

Previously, this function used the ConfiguredAssetSqlDataConnector to build a LegacyBatchDefinition. After tracing through the code, it appears that the batch definition never has batch_identifiers, and all of the other fields are dummy values that don't correspond to real data. The solution is to just stand up the LegacyBatchDefinition directly, and tests appear to be satisfied.

        config_defaults={"module_name": "great_expectations.datasource.data_connector"},
    )
    introspection_output = data_connector._introspect_db()
    introspection_output = introspect_db(execution_engine=execution_engine)

joshua-stauffer (Member Author) commented:

This code previously used an InferredAssetSqlDataConnector to introspect the database and obtain table names for the test teardown process. data_connector._introspect_db() had no dependencies on the data connector class outside of the SqlAlchemyExecutionEngine, so I refactored that method into a standalone introspect_db() function that takes the execution engine as an argument.
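
A minimal sketch of what such a standalone helper could look like. Only the name introspect_db and the execution_engine argument come from the diff above; the body, return shape, and lack of filtering are hypothetical, and it assumes the SqlAlchemyExecutionEngine exposes its underlying SQLAlchemy engine as .engine.

from typing import Dict, List

import sqlalchemy as sa

from great_expectations.execution_engine import SqlAlchemyExecutionEngine


def introspect_db(execution_engine: SqlAlchemyExecutionEngine) -> List[Dict[str, str]]:
    """Hypothetical sketch: list the tables visible to the engine, e.g. for test teardown."""
    # Assumption: the execution engine wraps a SQLAlchemy engine as .engine, so we can
    # inspect it directly instead of going through a data connector class.
    inspector = sa.inspect(execution_engine.engine)
    tables: List[Dict[str, str]] = []
    for schema_name in inspector.get_schema_names():
        for table_name in inspector.get_table_names(schema=schema_name):
            tables.append(
                {"schema_name": schema_name, "table_name": table_name, "type": "table"}
            )
    return tables

The point of the sketch is only that the SqlAlchemyExecutionEngine is the sole dependency; whatever filtering the old data_connector._introspect_db() performed would carry over into the real function.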


joshua-stauffer (Member Author) commented:

While doing this work, I discovered that test_sql_asset_workflows.py depends on test data stood up by test_expectations_v3_api.py. It's unrelated to this PR, but it will need to be cleaned up as part of https://greatexpectations.atlassian.net/browse/V1-347.

@joshua-stauffer added this pull request to the merge queue on May 14, 2024.
Merged via the queue into develop with commit 27eb8d2 on May 14, 2024; 68 checks passed.
@joshua-stauffer deleted the m/v1/remove_file_path_dc branch on May 14, 2024, 20:43.