Optimise View.asList() side inputs for iterating rather than for indexing. #31087

robertwb · 2024-04-23T23:27:49Z

The current implementation is, essentially, a distributed hashmap from integer keys to the list contents, mediated by each upstream worker starting at a random value to minimize overlaps and emitting sufficient metadata to map this onto the contiguous range [0, N). This provides optimal random-access performance but very poor iteration performance: every advance requires a key lookup, and because the keys are hashed and distributed rather than clustered numerically, there is little to no amortization of these lookups across adjacent items.
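
To make the cost difference concrete, here is a toy model in plain Java. It is only a sketch and not the Beam implementation: the class name, the `lookup` helper, and the in-memory `Map` standing in for a remote, hash-distributed store are all assumptions made for illustration. It counts keyed lookups to show that iterating through an index-keyed structure pays one lookup per element, while iterating a plain sequence pays none.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: contrasts index-keyed lookups with plain sequential iteration. */
class SideInputAccessSketch {
  /** Number of keyed lookups performed so far, so the cost is visible. */
  static int lookups = 0;

  /** Hypothetical stand-in for one lookup against a hash-distributed store. */
  static <T> T lookup(Map<Long, T> store, long key) {
    lookups++;
    return store.get(key);
  }

  /** Iterating an index-keyed list costs one keyed lookup per element. */
  static <T> List<T> iterateViaIndex(Map<Long, T> store, long size) {
    List<T> out = new ArrayList<>();
    for (long i = 0; i < size; i++) {
      out.add(lookup(store, i));
    }
    return out;
  }

  /** Iterating a plain sequence needs no keyed lookups at all. */
  static <T> List<T> iterateSequentially(Iterable<T> values) {
    List<T> out = new ArrayList<>();
    for (T value : values) {
      out.add(value);
    }
    return out;
  }

  public static void main(String[] args) {
    Map<Long, String> store = new HashMap<>();
    store.put(0L, "a");
    store.put(1L, "b");
    store.put(2L, "c");
    System.out.println(iterateViaIndex(store, 3) + " lookups=" + lookups);
    System.out.println(iterateSequentially(Arrays.asList("a", "b", "c")));
  }
}
```

In a real runner each lookup may be far more expensive than a local `HashMap.get`, which is why the per-element cost matters.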

Given that most uses of List side inputs merely gather a collection of values (the user has no control over the ordering when materialized), and given the high cost of providing random access, this is probably the wrong tradeoff for most pipelines.

This is an update-incompatible change and so has been guarded by the update compatibility version flag. The old behavior can be explicitly requested via a new AsList#withRandomAccess() method.
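
The selection logic described above can be sketched roughly as follows. This is not the code in this PR: the class and method names, the dotted-version comparison, and the idea of passing the version that introduced the new default as a parameter are all hypothetical, chosen only to illustrate how an explicit opt-in and a compatibility version might combine.

```java
/** Sketch of gating a behavior change on a compatibility version (hypothetical names). */
class ListViewSelectionSketch {
  /** Compares dotted numeric versions such as "1.2.3"; missing parts count as 0. */
  static int compareVersions(String a, String b) {
    String[] partsA = a.split("\\.");
    String[] partsB = b.split("\\.");
    int length = Math.max(partsA.length, partsB.length);
    for (int i = 0; i < length; i++) {
      int x = i < partsA.length ? Integer.parseInt(partsA[i]) : 0;
      int y = i < partsB.length ? Integer.parseInt(partsB[i]) : 0;
      if (x != y) {
        return Integer.compare(x, y);
      }
    }
    return 0;
  }

  /**
   * Returns true when the old random-access representation should be used:
   * either it was requested explicitly, or the pipeline asks to stay compatible
   * with a version older than the one that changed the default.
   *
   * @param requested whether random access was explicitly requested
   * @param compatVersion requested compatibility version, or null if unset
   * @param changeVersion version that introduced the new default (placeholder value)
   */
  static boolean useRandomAccess(boolean requested, String compatVersion, String changeVersion) {
    if (requested) {
      return true;
    }
    return compatVersion != null && compareVersions(compatVersion, changeVersion) < 0;
  }
}
```

With no compatibility version set and no explicit request, the sketch falls through to the new iteration-friendly default, which matches the intent stated in the description.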

robertwb · 2024-04-23T23:28:11Z

R: @kennknowles @damccorm

github-actions · 2024-04-23T23:29:23Z

Stopping reviewer notifications for this pull request: review requested by someone other than the bot, ceding control

kennknowles · 2024-04-29T15:01:03Z

Totally agree. I do know that this was actually an explicit decision. The history as I understand it:

We already had View.asIterable, which was a simple iterator, but windowed side inputs had awful performance because it was just a filter on the whole side input.
We added View.asList primarily as an indicator that the per-window value could be cached in memory after the first read.
We added ISM format and for some reason emphasized the random access behavior of the Java List class.

TBH I would be perfectly happy if we had never allowed random access for list side inputs, leaving that to map side inputs.

kennknowles

Does this make it identical to View.asIterable?

Is it possible to retain it as a window --> iterable map for efficient access by window? (I don't know if that is already implied by how this interacts with code elsewhere or what)

kennknowles · 2024-04-29T15:02:55Z

sdks/java/core/src/main/java/org/apache/beam/sdk/values/PCollectionViews.java

+   *
+   * <p>For internal use only.
+   */
+  public static class ListViewFn3<T> extends ViewFn<IterableView<T>, List<T>> {


I don't love that these are just named 1, 2, 3...

Yes, this is equivalent to View.asIterable() plus implementing the List methods so nothing breaks. Good question about the Window -> Iterable map; this is handled at a lower level, but I don't know all the details there (though in that case I can see that constructing the mapping would be more worthwhile). In the interest of being conservative while capturing the most important gains I'll restrict this to the global window case.
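
As an illustration of "an iterable plus the List methods", here is a minimal sketch in plain Java. It is not the class added in this PR: `IterableBackedList` is a hypothetical name, and the lazy-copy strategy for `get` and `size` is an assumption made for the example. Iteration streams directly from the underlying iterable, and indexing only works after the contents have been copied into memory once.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Sketch: a List view over an Iterable (hypothetical name, not Beam's class). */
class IterableBackedList<T> extends AbstractList<T> {
  private final Iterable<T> source;
  /** Filled lazily, only when indexing or size is first needed. Not thread-safe. */
  private List<T> materialized;

  IterableBackedList(Iterable<T> source) {
    this.source = source;
  }

  /** Cheap path: iterate the source directly unless a copy already exists. */
  @Override
  public Iterator<T> iterator() {
    return materialized != null ? materialized.iterator() : source.iterator();
  }

  /** Indexing forces a one-time copy of the source into memory. */
  @Override
  public T get(int index) {
    return materialize().get(index);
  }

  /** Size also forces the one-time copy, since an Iterable has no length. */
  @Override
  public int size() {
    return materialize().size();
  }

  private List<T> materialize() {
    if (materialized == null) {
      List<T> copy = new ArrayList<>();
      for (T value : source) {
        copy.add(value);
      }
      materialized = copy;
    }
    return materialized;
  }
}
```

Because `add`, `set`, and `remove` are inherited from `AbstractList`, they throw `UnsupportedOperationException`, so existing code that only reads the list keeps working.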

(I kept the name for ListViewFn2 just in case there are pipelines serializing it as data.)

kennknowles · 2024-05-01T17:05:25Z

This seems a likely culprit for the failure at https://ge.apache.org/s/6b22rnlopcdxk/tests/overview?outcome=FAILED

robertwb · 2024-05-01T17:22:36Z

I think you're right. This was masked by

/Users/robertwb/Work/beam/incubator-beam/sdks/java/extensions/avro/build/generated-test-avro-java/org/apache/beam/sdk/extensions/avro/schemas/TestAvro.java:135: error: cannot find symbol
  protected static final org.apache.avro.data.TimeConversions.TimestampConversion TIMESTAMP_CONVERSION = new org.apache.avro.data.TimeConversions.TimestampConversion();

robertwb · 2024-05-01T17:34:10Z

This seems a likely culprit for the failure at https://ge.apache.org/s/6b22rnlopcdxk/tests/overview?outcome=FAILED

#31149

Abacn · 2024-05-02T15:52:44Z

This also breaks several Java PostCommit: https://github.com/apache/beam/actions/workflows/beam_PostCommit_Java_DataflowV1.yml?query=event%3Aschedule

Failing tests:

testSequentialWrite (org.apache.beam.sdk.io.gcp.spanner.SpannerWriteIT) failed

java.lang.NullPointerException: Unknown producer for value SimplePCollectionView{tag=Tag<org.apache.beam.sdk.values.PCollectionViews$SimplePCollectionView.<init>:1443#e1789f47d74ca86c>, viewFn=org.apache.beam.sdk.values.PCollectionViews$IterableBackedListViewFn@b96f9be, coder=VoidCoder, windowMappingFn=GlobalWindowMappingFn{}, pCollection=wait/To wait view 0/Sample.Any/Flatten.Iterables/FlattenIterables/FlatMap/ParMultiDo(Anonymous).output [PCollection@1669047811]} while translating step wait/Wait/Map
	at org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:1269)
	at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator.getProducer(DataflowPipelineTranslator.java:571)

testFhirIO_Import[DSTU2] (org.apache.beam.sdk.io.gcp.healthcare.FhirIOWriteIT) failed
testFhirIO_Import[STU3] (org.apache.beam.sdk.io.gcp.healthcare.FhirIOWriteIT) failed

java.lang.NullPointerException: Unknown producer for value SimplePCollectionView{tag=Tag<org.apache.beam.sdk.values.PCollectionViews$SimplePCollectionView.<init>:1443#53e357e8e9199108>, viewFn=org.apache.beam.sdk.values.PCollectionViews$IterableBackedListViewFn@7bc0659a, coder=VoidCoder, windowMappingFn=GlobalWindowMappingFn{}, pCollection=FhirIO.Write/FhirIO.Import/Wait On File Writing/To wait view 0/Sample.Any/Flatten.Iterables/FlattenIterables/FlatMap/ParMultiDo(Anonymous).output [PCollection@876681835]} while translating step FhirIO.Write/FhirIO.Import/Wait On File Writing/Wait/Map
	at org.apache.beam.vendor.guava.v32_1_2_jre.com.google.common.base.Preconditions.checkNotNull(Preconditions.java:1269)
	at org.apache.beam.runners.dataflow.DataflowPipelineTranslator$Translator.getProducer(DataflowPipelineTranslator.java:569)
	at org.apache.beam.runners.dataflow.DataflowPipelineTranslator.translateSideInputs(DataflowPipelineTranslator.java:1205)

The same failure is seen in https://github.com/apache/beam/actions/workflows/beam_PostCommit_Java_ValidatesRunner_Dataflow.yml?query=event%3Aschedule

robertwb · 2024-05-02T23:21:12Z

Hopefully #31163 should fix it. Otherwise we can submit the rollback.

github-actions bot added the java label Apr 23, 2024

robertwb force-pushed the list-side-input-iter branch from d0a859f to b163a54 Compare April 23, 2024 23:34

robertwb added 5 commits April 23, 2024 17:17

Apparently pipeline options can't have arbitrary methods, even if def…

03fc0c4

…ault...

checkstyle

f38690c

fix the fix

bf3eae5

Merge branch 'master' into list-side-input-iter

295b440

Merge branch 'master' into list-side-input-iter

9ab2906

kennknowles approved these changes Apr 29, 2024

robertwb added 3 commits April 29, 2024 09:39

Better naming for ListViewFn3, restrict to global windows.

d64e194

(I kept the name for ListViewFn2 just in case there are pipelines serializing it as data.)

fix

7582e80

Merge branch 'master' into list-side-input-iter

3673ee6

robertwb merged commit 7f7bc3e into apache:master Apr 30, 2024
28 of 30 checks passed

robertwb added a commit to robertwb/incubator-beam that referenced this pull request May 2, 2024

Revert "Optimise View.asList() side inputs for iterating rather than …

d67bc74

…for indexing. (apache#31087)" This reverts commit 7f7bc3e.

robertwb mentioned this pull request May 7, 2024

Raise upper bound for jinja2. #31214
