
Pass original message down through conversion for storage write api #31106

Open
wants to merge 16 commits into base: master

Conversation

@johnjcasey (Contributor) commented Apr 25, 2024

Enable users to specify an alternate way to generate the table row for the error output of BQIO's Storage Write API.

The user passes in a function of ElementT -> TableRow, and we maintain an index of the original elements passed in to BQIO. If the function is provided, we use it to generate the error row instead of the default behavior of emitting the failure directly.
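A rough sketch of how a user might wire this up is below. The setter name withFormatRecordOnFailureFunction and the MyElement type are assumptions made only for illustration; the ElementT -> TableRow shape of the function comes from the description above.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;

// MyElement is a hypothetical user type; the failure-formatter setter name is
// also an assumption, used only to illustrate the ElementT -> TableRow hook.
BigQueryIO.Write<MyElement> write =
    BigQueryIO.<MyElement>write()
        .to("my-project:my_dataset.my_table")
        .withMethod(BigQueryIO.Write.Method.STORAGE_WRITE_API)
        // Normal conversion path used for successful writes.
        .withFormatFunction(
            e -> new TableRow().set("id", e.getId()).set("payload", e.getPayload()))
        // Alternate conversion used only when building the failed-rows output,
        // so the error row is derived from the original element rather than
        // from the already-converted payload.
        .withFormatRecordOnFailureFunction(
            e -> new TableRow().set("original", e.toString()));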


Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:

  • Mention the appropriate issue in your description (for example: addresses #123), if applicable. This will automatically add a link to the pull request in the issue. If you would like the issue to automatically close on merging the pull request, comment fixes #<ISSUE NUMBER> instead.
  • Update CHANGES.md with noteworthy changes.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

See the Contributor Guide for more tips on how to make the review process smoother.

To check the build health, please visit https://github.com/apache/beam/blob/master/.test-infra/BUILD_STATUS.md

GitHub Actions Tests Status (on master branch):

  • Build python source distribution and wheels
  • Python tests
  • Java tests
  • Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @robertwb for label java.
R: @Abacn for label io.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

@johnjcasey johnjcasey marked this pull request as draft April 29, 2024 15:50
@johnjcasey johnjcasey marked this pull request as ready for review May 1, 2024 18:49
@johnjcasey (Contributor Author)

@Abacn @ahmedabu98 could you take a look at this?

@Abacn (Contributor) left a comment


Thanks, the change lgtm. Have one thing to confirm (cc'd below)

@@ -52,16 +52,18 @@ StorageApiLoads.java

/** This {@link PTransform} manages loads into BigQuery using the Storage API. */
public class StorageApiLoads<DestinationT, ElementT>
    extends PTransform<PCollection<KV<DestinationT, ElementT>>, WriteResult> {
  final TupleTag<KV<DestinationT, StorageApiWritePayload>> successfulConvertedRowsTag =
      new TupleTag<>("successfulRows");
  final TupleTag<KV<DestinationT, KV<ElementT, StorageApiWritePayload>>>
Contributor


This changes the PTransform output element type. Do we need some change in BigQueryTranslation to preserve upgrade compatibility? cc: @chamikaramj

Or is there a plan to set up a precommit test for BigQuery pipeline upgrades, so tests can auto-detect this (like the Kafka upgrade project)?

Contributor Author


In theory, it's within the overall BQ transform, so it might work?

Contributor


I think changing output element type and the coder here could break streaming update compatibility in general.

Have you considered using the updateCompatibilityVersion option?
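A minimal sketch of setting that option, assuming it is exposed as updateCompatibilityVersion on StreamingOptions; the version string is illustrative:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.options.StreamingOptions;

// Keep the pre-change graph/coder shape so a --update replacement job stays
// compatible with the already-running pipeline.
StreamingOptions options = PipelineOptionsFactory.create().as(StreamingOptions.class);
options.setUpdateCompatibilityVersion("2.55.0");  // illustrative version
Pipeline pipeline = Pipeline.create(options);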

Contributor Author


That would require us to maintain two implementations of Streaming Inserts, one with this change and one without, right? I think that would be prohibitive in general for Beam IOs.

Collaborator


This should be called out in CHANGES.md if we have to make these breaking changes. But I recommend updateCompatibilityVersion here.

Contributor


I think you can manually test by running a streaming job using the old version and running a replacement job that includes your changes with the --update option.

https://cloud.google.com/dataflow/docs/guides/updating-a-pipeline#Launching

Dataflow streaming team might also have internal tests for update compatibility (but I haven't looked into this).

With the updateCompatibilityVersion option you could fork based on the Beam version to preserve update compatibility. See the following for an example.
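The referenced example is not reproduced here; the snippet below is only a rough sketch of what such a version-gated fork could look like. compareVersions, expandLegacy, and expandWithOriginalElement are hypothetical helpers, and 2.56.0 is an illustrative cutoff.

// Hypothetical sketch of a version-gated expansion inside the sink. The
// helpers named here do not exist in Beam; they stand in for whatever the
// real fork would call. "options" is the pipeline's StreamingOptions.
String compatVersion = options.getUpdateCompatibilityVersion();
boolean useLegacyShape =
    compatVersion != null && compareVersions(compatVersion, "2.56.0") <= 0;

if (useLegacyShape) {
  // Pre-change element shape: KV<DestinationT, StorageApiWritePayload>,
  // preserving the coder that already-running pipelines were started with.
  return expandLegacy(input);
} else {
  // New element shape: KV<DestinationT, KV<ElementT, StorageApiWritePayload>>,
  // which carries the original element for error-row formatting.
  return expandWithOriginalElement(input);
}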

Contributor Author


I think we should have a broader conversation about updateCompatibilityVersion, because it looks like we will, over time, start accruing legacy code tied to specific Beam versions. That strikes me as eventually unmaintainable.

Contributor Author


Because we will need to do this whenever we change the shape of a PCollection

Contributor


I think one way to do that might be to use schema'd PCollections everywhere and evolve the schema when we want to change the PCollection structure. But for the time being we have to fork the code to preserve compatibility. BTW, not all structure changes break compatibility, so some changes can be done without forking (the Dataflow link above has some details), but coder changes generally can break update compatibility (and the Dataflow compatibility check will probably fail).
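To illustrate the schema'd-PCollection idea in isolation (a sketch only; the field names are made up and this is not how StorageApiLoads is structured): additive changes can be expressed as new nullable fields, so the schema can evolve without introducing a wholly new element type and coder.

import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.schemas.Schema.FieldType;
import org.apache.beam.sdk.values.Row;

// Original intermediate schema.
Schema v1 =
    Schema.builder().addStringField("destination").addByteArrayField("payload").build();

// Evolved schema: the original element is carried as an optional extra field,
// so the structural change is additive rather than a brand-new element type.
Schema v2 =
    Schema.builder()
        .addStringField("destination")
        .addByteArrayField("payload")
        .addNullableField("originalElement", FieldType.BYTES)
        .build();

// A row shaped like the old code produced still fits the evolved schema.
Row legacyShaped =
    Row.withSchema(v2).addValues("project:dataset.table", new byte[] {}, null).build();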

Contributor Author


Ok. I'll make the forking change for this when I have a chance

@Abacn (Contributor) commented May 2, 2024

Also going to run some load tests to see if this has performance implications.

Update:

"AvgInputThroughputElementsPerSec": 51674.92

This is comparable to 2.55.0 (51205) and 2.56.0 (47579).

@ahmedabu98 (Contributor) left a comment


LGTM as well. Just one suggestion for performance.

P.S. I see @Abacn's load test results, feel free to ignore
