[Bug]: Reshuffle.viaRandomKey timeout since version 2.54.0 #31095
Comments
Likely due to #28853. Could you please open a support ticket so that support can take a look at your job?
Another change is that since Beam 2.54.0, batch jobs use Dataflow Runner V2 by default. Did your 2.53.0 job show "runner v2: disabled" and your 2.54.0 job "runner v2: enabled" in the Dataflow UI?
Hello @Abacn, thank you for the reply. Yes, the pipeline with 2.54.0 shows "runner v2: enabled" and the one with 2.53.0 shows "runner v2: disabled".
I'm not sure how to open a support ticket.
@kennknowles FYI. For the support ticket, please check https://cloud.google.com/dataflow/docs/support/getting-support For 2.54.0, can you try to disable Runner V2 (
Thanks for reporting this. Those four error messages are from the service's point of view (after four such failures it fails a batch job). I wonder if we are seeing a crash loop or something similar on the worker.
Is it possible for you to run your job on 2.53.0 with Runner V2 enabled?
So to summarize, we could check these configurations:
If the problem is with "new reshuffle", then we in fact have a flag that will allow you to use the old expansion while still upgrading to Beam 2.54.0:
I ran the pipeline with version 2.54.0 with Runner V2 disabled, as suggested, and it worked.
I also ran the pipeline with version 2.53.0 with Runner V2 enabled, and it failed!
So apparently the problem is Runner V2 and not the Reshuffle, am I right?
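The reasoning behind that conclusion can be written out as a small check. The sketch below is only an illustration, not part of Beam or Dataflow: it records the four runs reported so far in this thread (the original 2.53.0 and 2.54.0 runs plus the two swapped configurations) and asks which single factor is consistent with every outcome. The names `observed` and `factor_explains_outcome` are hypothetical helpers introduced here.

```python
# Outcomes reported in this thread:
# (beam_version, runner_v2_enabled) -> True if the job succeeded.
observed = {
    ("2.53.0", False): True,   # original working setup
    ("2.54.0", True): False,   # original failing setup
    ("2.54.0", False): True,   # 2.54.0 with Runner V2 disabled
    ("2.53.0", True): False,   # 2.53.0 with Runner V2 enabled
}


def factor_explains_outcome(results, index):
    """Return True if the factor at position `index` alone determines success."""
    outcomes_by_value = {}
    for config, succeeded in results.items():
        outcomes_by_value.setdefault(config[index], set()).add(succeeded)
    # The factor explains the results only if each of its values
    # always led to the same outcome.
    return all(len(outcomes) == 1 for outcomes in outcomes_by_value.values())


print("Beam version alone explains outcome:", factor_explains_outcome(observed, 0))
print("Runner V2 alone explains outcome:", factor_explains_outcome(observed, 1))
```

With these four runs, only the Runner V2 setting lines up with success or failure. This shows a correlation with Runner V2, not a root cause: other factors that were not varied in these runs could still be involved.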
Alright, thank you so much for isolating this! We would love to be able to minimally reproduce it and fix it. Given the lack of an error, it may be a pathological performance problem causing the work item to time out. Can you share the size of the elements and the overall size of the data shuffled? (I don't think any other factors could impact this transform.)
One thing occurred to me: the entire stage after the shuffle failed. From the error message, this stage includes a fusion of many steps that are not part of the reshuffle. The issue could be in the subsequent steps as well, or an interaction between them. Specifically, the step name This is the fusion of the steps:
The input of the
@kennknowles so the problem might be in the BigQueryIO write? This step currently is:
Easy to test: you can comment out the BQ write and see if you can still reproduce the error.
Through Google's Cloud Support channels you can get someone who can really dig into the details of the logs and the metrics. But from an outside perspective, some trial and error to see whether this reproduces with different things removed from the pipeline and with different data sizes could be productive. My thoughts after your last comment:
I did as you suggested and isolated the pipeline by removing the BigQuery write step, using Beam version 2.55.1, and the pipeline succeeded!
And just to check: it succeeded on Runner V2? Then it could be the BigQuery write, or it could be an interaction between the fused steps.
yields many changes
Did removing the BigQuery write step also work on Runner V2 for Beam versions older than 2.55.1?
Yes, with Runner V2.
This is to identify whether the change was on the Beam side or the Dataflow Runner V2 side.
Yes, I tested version 2.54.0 with Runner V2 and it worked.
Is it possible for you to share the job IDs or create a Dataflow support ticket so that we can examine this more closely?
What happened?
I have a batch pipeline that reads data from Firestore and writes it to BigQuery.
The pipeline was working with Apache Beam version 2.52.0.
When I updated to 2.55.1, it failed. I then tested 2.55.0 and 2.54.0, which also failed; it only worked again with 2.53.0.
It fails when reading from Firestore. This PTransform internally uses Reshuffle.viaRandomKey(), and that is where it fails. The error message is not very informative; it contains only timeout errors:
I noticed that the implementation of Reshuffle changed:
Graph using version 2.54.0
Graph using version 2.53.0
This is the commit that changed the Reshuffle implementation from @kennknowles
Could someone investigate or give me some ideas of what might be the problem?
Thanks
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components