[Bug]: RuntimeError: exception in Python subject: KeyError: 'data' #13

Closed
vikassinghvi2007 opened this issue Mar 15, 2024 · 17 comments
Labels: bug, confirmed, priority:high

Comments

@vikassinghvi2007

Steps to reproduce

Getting an error when trying to run the Airbyte showcase example from here: https://pathway.com/developers/showcases/etl-python-airbyte

Relevant log output

Traceback (most recent call last):
  File "/Users/vikassinghvi/Documents/GitHub/veloraapp/services/github_issues.py", line 53, in <module>
    pw.run()
  File "/Users/vikassinghvi/Documents/GitHub/veloraapp/services/venv/lib/python3.11/site-packages/pathway/internals/runtime_type_check.py", line 19, in with_type_validation
    return beartype.beartype(f)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<@beartype(pathway.internals.run.run) at 0x1478f8360>", line 129, in run
  File "/Users/vikassinghvi/Documents/GitHub/veloraapp/services/venv/lib/python3.11/site-packages/pathway/internals/run.py", line 47, in run
    ).run_outputs()
      ^^^^^^^^^^^^^
  File "/Users/vikassinghvi/Documents/GitHub/veloraapp/services/venv/lib/python3.11/site-packages/pathway/internals/graph_runner/__init__.py", line 103, in run_outputs
    self.run_nodes(self._graph.global_scope.output_nodes, after_build=after_build)
  File "/Users/vikassinghvi/Documents/GitHub/veloraapp/services/venv/lib/python3.11/site-packages/pathway/internals/graph_runner/__init__.py", line 79, in run_nodes
    self._run(all_nodes, after_build=after_build)
  File "/Users/vikassinghvi/Documents/GitHub/veloraapp/services/venv/lib/python3.11/site-packages/pathway/internals/graph_runner/__init__.py", line 179, in _run
    return api.run_with_new_graph(
           ^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: exception in Python subject: KeyError: 'data'

What did you expect to happen?

Expected to pull the commits into a jsonlines file, as demonstrated.

Version

0.8.3

Docker Versions (if used)

No response

OS

MacOS

On which CPU architecture did you run Pathway?

None

vikassinghvi2007 added the bug label Mar 15, 2024

@zxqfd555-pw
Contributor

Hi Vikas!

Thank you for reporting this issue!

I wasn't able to reproduce it on Linux, but it does reproduce on macOS. We're currently investigating why it happens on macOS and how to make this showcase work on both platforms. We will keep you updated and get back to you once we understand the platform-specific difference.

For now, please consider running it on Linux if you can. Another option is to run it in Docker, but keep in mind that you may need to deal with the Docker-in-Docker (DinD) issue, because the airbyte-serverless connector uses Docker to access Airbyte connectors.

@vikassinghvi2007
Author

vikassinghvi2007 commented Mar 15, 2024 via email

@zxqfd555-pw
Contributor

After further investigation, I see that the main cause is the updated protocol version used in the GitHub connector. Could you please try pinning the connector version in the ./connections/github.yaml config to airbyte/source-github:1.6.0 and check whether that helps? Of course, this is only a stopgap to make it possible to run the showcase right now.

I'll shortly commit a fix with compatible state processing, and we will release a new version of Pathway that works correctly with the newer protocol versions. This release will happen at the beginning of next week, most likely on Monday.
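
For anyone applying the pinning workaround from a script, here is a minimal sketch of what it amounts to once the config is loaded into a dict. The helper name `pin_connector_image` is hypothetical, and the `source.docker_image` layout is assumed from the github.yaml example used with this showcase; reading and writing the YAML file itself is left out.

```python
import copy


def pin_connector_image(config: dict, tag: str) -> dict:
    """Return a copy of `config` whose source.docker_image uses `tag`."""
    pinned = copy.deepcopy(config)
    image = pinned["source"]["docker_image"]
    # Only treat ":" as a tag separator if it appears after the last "/",
    # so a registry port such as "localhost:5000/img" is left alone.
    has_tag = ":" in image.rsplit("/", 1)[-1]
    name = image.rsplit(":", 1)[0] if has_tag else image
    pinned["source"]["docker_image"] = f"{name}:{tag}"
    return pinned


# Hypothetical in-memory config mirroring the layout of connections/github.yaml.
config = {"source": {"docker_image": "airbyte/source-github:latest", "config": {}}}
print(pin_connector_image(config, "1.6.0")["source"]["docker_image"])
# prints: airbyte/source-github:1.6.0
```

The helper returns a copy, so the original config stays untouched and you can switch back to the latest tag once the fixed release is installed.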

@vikassinghvi2007
Author

vikassinghvi2007 commented Mar 16, 2024 via email

@vikassinghvi2007
Author

@zxqfd555-pw :

Although the workaround works, the pipeline breaks after some time, with the following error:

RuntimeError: exception in Python subject: AirbyteSourceException: {"message": "Something went wrong in the connector. See the logs for more details.", "internal_message": "Could not read json file /mnt/temp/catalog.json: Expecting ':' delimiter: line 1 column 8192 (char 8191). Please ensure that it is a valid JSON.", "stack_trace": "Traceback (most recent call last):\n File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/connector.py", line 65, in _read_json_file\n return json.loads(contents)\n File "/usr/local/lib/python3.9/json/__init__.py", line 346, in loads\n return _default_decoder.decode(s)\n File "/usr/local/lib/python3.9/json/decoder.py", line 337, in decode\n obj, end = self.raw_decode(s, idx=_w(s, 0).end())\n File "/usr/local/lib/python3.9/json/decoder.py", line 353, in raw_decode\n obj, end = self.scan_once(s, idx)\njson.decoder.JSONDecodeError: Expecting ':' delimiter: line 1 column 8192 (char 8191)\n\nDuring handling of the above exception, another exception occurred:\n\nTraceback (most recent call last):\n File "/airbyte/integration_code/main.py", line 8, in <module>\n run()\n File "/airbyte/integration_code/source_github/run.py", line 17, in run\n launch(source, sys.argv[1:])\n File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/entrypoint.py", line 214, in launch\n for message in source_entrypoint.run(parsed_args):\n File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/entrypoint.py", line 118, in run\n config_catalog = self.source.read_catalog(parsed_args.catalog)\n File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/source.py", line 91, in read_catalog\n return ConfiguredAirbyteCatalog.parse_obj(cls._read_json_file(catalog_path))\n File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/connector.py", line 67, in _read_json_file\n raise ValueError(f"Could not read json file {file_path}: {error}. Please ensure that it is a valid JSON.")\nValueError: Could not read json file /mnt/temp/catalog.json: Expecting ':' delimiter: line 1 column 8192 (char 8191). Please ensure that it is a valid JSON.\n", "failure_type": "system_error"}

@zxqfd555-pw
Contributor

zxqfd555-pw commented Mar 16, 2024

Hi Vikas!

Thank you for another piece of valuable feedback!

It looks like those failures are spurious and the reason is connected to one of the libraries we used to implement the connector. I did some research and created an issue in the related repo - you can see it linked.

Besides, we can handle this gracefully on our side by implementing retries for these cases. I've opened a PR for that internally, and the retries will also be in the release I announced yesterday.
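
To illustrate the idea of retrying spurious failures, here is a generic sketch with exponential backoff. This is not Pathway's actual implementation; `call_with_retries`, its parameters, and the `flaky_read` stand-in are all hypothetical names chosen for this example.

```python
import time


def call_with_retries(operation, attempts=3, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call `operation()`; on a listed exception, wait and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))


state = {"calls": 0}


def flaky_read():
    # Stand-in for a connector call that fails spuriously on the first two tries.
    state["calls"] += 1
    if state["calls"] < 3:
        raise ValueError("could not read json file")
    return "ok"


print(call_with_retries(flaky_read, attempts=3, base_delay=0.01, retry_on=(ValueError,)))
# prints: ok
```

Restricting `retry_on` to the exception types known to be spurious keeps genuine errors, such as a bad config, from being retried pointlessly.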

@vikassinghvi2007
Author

vikassinghvi2007 commented Mar 16, 2024 via email

@zxqfd555-pw
Contributor

Hi Vikas!

We've released the version with an update! It contains a compatibility fix and retries for the spurious errors I've mentioned above.

Feel free to test it!

@vikassinghvi2007
Author

vikassinghvi2007 commented Mar 19, 2024 via email

@vikassinghvi2007
Author

Hi @zxqfd555-pw, I am now getting an error with both the 1.6.0 and the latest connector versions:

Traceback (most recent call last):
  File "/Users/vikassinghvi/Documents/GitHub/veloraapp/services/github_issues.py", line 79, in <module>
    issues_table = pw.io.airbyte.read(
                   ^^^^^^^^^^^^^^^^^^^
  File "/Users/vikassinghvi/Documents/GitHub/veloraapp/services/venv/lib/python3.11/site-packages/pathway/io/airbyte/__init__.py", line 205, in read
    for stream in source.configured_catalog["streams"]:
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vikassinghvi/Documents/GitHub/veloraapp/services/venv/lib/python3.11/site-packages/airbyte_serverless/sources.py", line 169, in __getattr__
    return getattr(self.source, name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vikassinghvi/Documents/GitHub/veloraapp/services/venv/lib/python3.11/site-packages/airbyte_serverless/sources.py", line 102, in configured_catalog
    configured_catalog = self.catalog
                         ^^^^^^^^^^^^
  File "/Users/vikassinghvi/Documents/GitHub/veloraapp/services/venv/lib/python3.11/site-packages/airbyte_serverless/sources.py", line 97, in catalog
    message = self._run_and_return_first_message('discover')
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vikassinghvi/Documents/GitHub/veloraapp/services/venv/lib/python3.11/site-packages/airbyte_serverless/sources.py", line 73, in _run_and_return_first_message
    message = next(
              ^^^^^
  File "/Users/vikassinghvi/Documents/GitHub/veloraapp/services/venv/lib/python3.11/site-packages/airbyte_serverless/sources.py", line 74, in <genexpr>
    (message for message in messages if message['type'] not in ['LOG', 'TRACE']),
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vikassinghvi/Documents/GitHub/veloraapp/services/venv/lib/python3.11/site-packages/airbyte_serverless/sources.py", line 68, in _run
    raise AirbyteSourceException(json.dumps(message['trace']['error']))
airbyte_serverless.sources.AirbyteSourceException: {"message": "Config validation error: None is not of type 'string'", "internal_message": "None is not of type 'string'", "stack_trace": "Traceback (most recent call last):\n File "/airbyte/integration_code/main.py", line 8, in <module>\n run()\n File "/airbyte/integration_code/source_github/run.py", line 17, in run\n launch(source, sys.argv[1:])\n File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/entrypoint.py", line 214, in launch\n for message in source_entrypoint.run(parsed_args):\n File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/entrypoint.py", line 116, in run\n yield from map(AirbyteEntrypoint.airbyte_message_to_string, self.discover(source_spec, config))\n File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/entrypoint.py", line 150, in discover\n self.validate_connection(source_spec, config)\n File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/entrypoint.py", line 171, in validate_connection\n check_config_against_spec_or_exit(connector_config, source_spec)\n File "/usr/local/lib/python3.9/site-packages/airbyte_cdk/sources/utils/schema_helpers.py", line 175, in check_config_against_spec_or_exit\n raise AirbyteTracedException(\nairbyte_cdk.utils.traced_exception.AirbyteTracedException: None is not of type 'string'\n", "failure_type": "config_error"}

@zxqfd555-pw
Contributor

Hi Vikas!

I suspect that your config has either an optional field left empty or a required field left unspecified. At least, this part of the error makes me think so:

None is not of type 'string'

Could you please ensure that your config doesn't contain them? If there are optional fields you don't use, please just delete them. If there are required fields that are not filled, please fill them in. You can also refer to the config example that we give in the tutorial.

Also, for reference, here is the config I used to run it:

source:
  docker_image: "airbyte/source-github:latest"
  config:
    credentials:
      option_title: "PAT Credentials"
      personal_access_token: <github token here>
    repositories:
      - pathwaycom/pathway
    api_url: "https://api.github.com/"

It's stored in connections/github.yaml and I specify this path in the path parameter of the connector.
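
A quick way to spot such fields is to walk the loaded config and list every value that is None (in YAML, a key with no value loads as None). The sketch below assumes the config has already been parsed into nested dicts and lists; `find_none_fields` is a hypothetical helper, and the `start_date` key is only an illustration of an optional field left empty.

```python
def find_none_fields(node, path=""):
    """Return dotted paths of every None value in a nested dict/list config."""
    if node is None:
        return [path or "<root>"]
    if isinstance(node, dict):
        children = [(f"{path}.{key}" if path else str(key), value)
                    for key, value in node.items()]
    elif isinstance(node, list):
        children = [(f"{path}[{index}]", value) for index, value in enumerate(node)]
    else:
        return []  # a filled scalar value: nothing to report
    found = []
    for child_path, value in children:
        found.extend(find_none_fields(value, child_path))
    return found


# Hypothetical parsed config with two unfilled fields.
config = {
    "credentials": {"option_title": "PAT Credentials", "personal_access_token": None},
    "repositories": ["pathwaycom/pathway"],
    "start_date": None,
}
print(find_none_fields(config))
# prints: ['credentials.personal_access_token', 'start_date']
```

Any path it reports is a candidate for the "None is not of type 'string'" validation error: either fill the field in or delete the key.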

@vikassinghvi2007
Author

Thanks, I tried after correcting my config file, with the latest airbyte connector version. I still get this:

Traceback (most recent call last):
  File "/Users/vikassinghvi/Documents/GitHub/veloraapp/services/github_issues.py", line 89, in <module>
    pw.run()
  File "/Users/vikassinghvi/Documents/GitHub/veloraapp/services/venv/lib/python3.11/site-packages/pathway/internals/runtime_type_check.py", line 19, in with_type_validation
    return beartype.beartype(f)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<@beartype(pathway.internals.run.run) at 0x13fafc5e0>", line 129, in run
  File "/Users/vikassinghvi/Documents/GitHub/veloraapp/services/venv/lib/python3.11/site-packages/pathway/internals/run.py", line 47, in run
    ).run_outputs()
      ^^^^^^^^^^^^^
  File "/Users/vikassinghvi/Documents/GitHub/veloraapp/services/venv/lib/python3.11/site-packages/pathway/internals/graph_runner/__init__.py", line 103, in run_outputs
    self.run_nodes(self._graph.global_scope.output_nodes, after_build=after_build)
  File "/Users/vikassinghvi/Documents/GitHub/veloraapp/services/venv/lib/python3.11/site-packages/pathway/internals/graph_runner/__init__.py", line 79, in run_nodes
    self._run(all_nodes, after_build=after_build)
  File "/Users/vikassinghvi/Documents/GitHub/veloraapp/services/venv/lib/python3.11/site-packages/pathway/internals/graph_runner/__init__.py", line 179, in _run
    return api.run_with_new_graph(
           ^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: exception in Python subject: KeyError: 'data'

Extraction still works with the older Airbyte connector version, 1.6.0.

@zxqfd555-pw
Contributor

Please make sure you have the latest Pathway version. If you use pip, you can check it by running pip show pathway. The required version is 0.8.4.

If the version is below 0.8.4, you need to upgrade. You can do that, for example, by reinstalling Pathway with pip uninstall pathway && pip install pathway.
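
The same check can be done from Python, which helps confirm that the interpreter running the pipeline sees the upgraded package. This is a minimal sketch using the standard-library `importlib.metadata`; the helpers `version_tuple` and `is_at_least` are hypothetical and only compare the leading numeric parts of a version string.

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text: str) -> tuple:
    """Turn "0.8.4" into (0, 8, 4); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_at_least(installed: str, required: str) -> bool:
    """Compare two dotted version strings numerically."""
    return version_tuple(installed) >= version_tuple(required)


try:
    installed = version("pathway")
    status = "ok" if is_at_least(installed, "0.8.4") else "needs upgrade"
    print(f"pathway {installed}: {status}")
except PackageNotFoundError:
    print("pathway is not installed in this environment")
```

Run it with the same virtual environment as the pipeline; a mismatch between the two is a common reason an upgrade appears not to take effect.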

@umarbasha007

Hi,

Thanks, I tried the suggestions below.

Suggestion 1:

source:
  docker_image: "airbyte/source-github:latest"
  config:
    credentials:
      option_title: "PAT Credentials"
      personal_access_token: <github token here>
    repositories:
      - pathwaycom/pathway
    api_url: "https://api.github.com/"

in connections/github.yaml

Suggestion 2:

pip uninstall pathway && pip install pathway

to update Pathway to version 0.8.4, which I verified using pip show pathway.

However, the run breaks after 1-2 minutes with the error below. Please help!

Traceback (most recent call last):
  File "/Users/techmonk/Documents/Hustle/Velora_ai/veloraapp-main/services/github_issues.py", line 99, in <module>
    pw.run()
  File "/Users/techmonk/Documents/Hustle/Velora_ai/veloraapp-main/services/venv/lib/python3.11/site-packages/pathway/internals/runtime_type_check.py", line 19, in with_type_validation
    return beartype.beartype(f)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<@beartype(pathway.internals.run.run) at 0x16df04ae0>", line 129, in run
  File "/Users/techmonk/Documents/Hustle/Velora_ai/veloraapp-main/services/venv/lib/python3.11/site-packages/pathway/internals/run.py", line 47, in run
    ).run_outputs()
      ^^^^^^^^^^^^^
  File "/Users/techmonk/Documents/Hustle/Velora_ai/veloraapp-main/services/venv/lib/python3.11/site-packages/pathway/internals/graph_runner/__init__.py", line 107, in run_outputs
    self.run_nodes(self._graph.global_scope.output_nodes, after_build=after_build)
  File "/Users/techmonk/Documents/Hustle/Velora_ai/veloraapp-main/services/venv/lib/python3.11/site-packages/pathway/internals/graph_runner/__init__.py", line 83, in run_nodes
    self._run(all_nodes, after_build=after_build)
  File "/Users/techmonk/Documents/Hustle/Velora_ai/veloraapp-main/services/venv/lib/python3.11/site-packages/pathway/internals/graph_runner/__init__.py", line 203, in _run
    raise error from None
  File "/Users/techmonk/Documents/Hustle/Velora_ai/veloraapp-main/services/github_issues.py", line 85, in process_github_issue
    return extract_info_from_github_issues(issue)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/techmonk/Documents/Hustle/Velora_ai/veloraapp-main/services/github_issues.py", line 63, in extract_info_from_github_issues
    category, feature, summary = categorize_and_summarize_github_issue(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/techmonk/Documents/Hustle/Velora_ai/veloraapp-main/services/github_issues.py", line 33, in categorize_and_summarize_github_issue
    if keyword in issue_title or keyword in issue_body:
                                 ^^^^^^^^^^^^^^^^^^^^^
TypeError: argument of type 'NoneType' is not iterable
Occurred here:
    Line: issues_info = issues_table.select(data=pw.apply(process_github_issue, pw.this.data))
    File: /Users/techmonk/Documents/Hustle/Velora_ai/veloraapp-main/services/github_issues.py:95

@zxqfd555-pw
Contributor

Hi @umarbasha007

I am not sure this is related to the Pathway<->Airbyte connector. As far as I can see, the variable issue_body in your code was set to None. The condition keyword in issue_body then fails, because NoneType is indeed not iterable. So, code-wise, I would recommend checking whether issue_body is None; you can replace it with an empty string if it is. Another piece of advice is to log such entries and see how they look in the GitHub interface.

The main point to consider is that we don't control the data that comes from any of the Airbyte connectors, so it can be a good idea to refer to the connector docs and the format, for example, here. You can follow the link for "Issues" and check the schema for the fields you're interested in. For the body field, the relevant part of the schema looks as follows:

      "body": {
        "description": "Contents of the issue",
        "type": [
          "string",
          "null"
        ],

So, null is indeed a possible variant.
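
A minimal sketch of the None-safe check suggested above. The names `keyword`, `issue_title`, and `issue_body` are taken from the traceback; the helper `contains_keyword` is hypothetical and stands in for the condition inside categorize_and_summarize_github_issue.

```python
def contains_keyword(keyword: str, issue_title, issue_body) -> bool:
    """Check title and body for `keyword`, treating a missing field as empty."""
    title = issue_title or ""  # the schema allows null, so None may arrive here
    body = issue_body or ""
    return keyword in title or keyword in body


print(contains_keyword("crash", "App crash on start", None))  # prints: True
print(contains_keyword("crash", "Feature request", None))     # prints: False
```

Normalizing the fields once at the top of the function keeps every later `in` check safe, instead of guarding each comparison separately.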

@dxtrous
Member

dxtrous commented Mar 29, 2024

Hi @umarbasha007, have you been able to verify/resolve the issue on your side? Do any further problems persist, or should we close the issue?

@dxtrous
Member

dxtrous commented May 9, 2024

@umarbasha007 we will be closing this issue as resolved on May 13 unless we hear from you by then.

dxtrous closed this as completed May 21, 2024