Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Beam SQL Extension raised an error when the input row contained iterable fields #31117

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

yyfhust
Copy link

@yyfhust yyfhust commented Apr 26, 2024

Please add a meaningful description for your change here

Related to : #31118

Upon utilizing the Beam SQL within our pipeline, we encountered a noteworthy exception. It appears that when the input row encompasses fields of iterable types, the process fails irrespective of whether the iterable fields are included in the SQL filter condition or not. This issue stems from the fact that the Beam SQL extension attempts to construct an output row schema based on the input schema, and unfortunately, it currently lacks support for iterable types.

Consider the following example:

Given an inputRow in the schema:

field1: String
field2: Integer
field3: Array<String>
field4: ITERABLE

And ANY Beam SQL condition such as :
field2 > 1
or Even
1 = 1

The pipeline will invariably fail, yielding the following error: Exception in thread "main" java.lang.UnsupportedOperationException: Unable to get ITERABLE at org.apache.beam.sdk.extensions.sql.impl.rel.BeamCalcRel$InputGetterImpl.getBeamField(BeamCalcRel.java:603).

(My first contribution to beam , kindly advise how to test lol)


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

Copy link
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@Abacn
Copy link
Contributor

Abacn commented May 8, 2024

thanks for the fix. While this sounds a valid fix, would you mind sharing the full stack trace. My understanding here is that there are two issues

(1) Beam SQL does not filter out fields not used

(2) Iterable isn't supported by Beam SQL

A full stack will be helpful to investigate (1), and possible optimization

The Iterable field type, introduced in #10003 was meant to be different than ARRAY. However the fix here treats it the same as ARRAY. It may have performance implications, and not work for large iterables? Maybe add a comment here or a TODO.

until it's optimized for Iterable, one can just write

case ARRAY:
case ITERABLE:
    return ....

so no need duplicate the line.

Also, there are switch (fieldType.getTypeName()) branches in several places in BeamCalRel, could it be all fixed for consistency?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants