-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add smaller synthetic data set for in-process tests. #1618
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 22 of 31 files at r1, 9 of 9 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @renjiezh and @SanjayVas)
src/main/kotlin/org/wfanet/measurement/integration/common/InProcessEdpSimulator.kt
line 81 at r2 (raw file):
) // TODO(@ple13): Use the actual vidToIndexMap instead of an empty map when a smaller dataset is
nit: remove this TODO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 5 of 31 files at r1, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @renjiezh)
a discussion (no related file):
The current tests all use the same large population size, but a smaller number of impressions for local LLv2 tests since the full impression set takes too long and would consume too much memory (i.e. wouldn't fit in a standard GHA runner).
Is the concern here for HMSS time or memory? I'm assuming that we can generate the VID to index map at build time since we know the population spec (e.g. via a Bazel rule), so that shouldn't affect the amount of time it takes to run the test.
src/main/k8s/dev/synthetic_generator_edp_simulator_gke.cue
line 28 at r2 (raw file):
_populationSpec: "/etc/\(#AppName)/config-files/synthetic_population_spec_small.textproto" _eventGroupSpecs: [
I believe we want to continue to use the large sets with a realistic population size for cloud tests. @kungfucraig can confirm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @renjiezh and @SanjayVas)
src/main/k8s/dev/synthetic_generator_edp_simulator_gke.cue
line 28 at r2 (raw file):
Previously, SanjayVas (Sanjay Vasandani) wrote…
I believe we want to continue to use the large sets with a realistic population size for cloud tests. @kungfucraig can confirm.
+1 - by default we should have the large population available for the cloud
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 29 of 31 files reviewed, 2 unresolved discussions (waiting on @ple13 and @SanjayVas)
a discussion (no related file):
Previously, SanjayVas (Sanjay Vasandani) wrote…
The current tests all use the same large population size, but a smaller number of impressions for local LLv2 tests since the full impression set takes too long and would consume too much memory (i.e. wouldn't fit in a standard GHA runner).
Is the concern here for HMSS time or memory? I'm assuming that we can generate the VID to index map at build time since we know the population spec (e.g. via a Bazel rule), so that shouldn't affect the amount of time it takes to run the test.
It is because of the memory. So we will use small population for in-process tests.
src/main/k8s/dev/synthetic_generator_edp_simulator_gke.cue
line 28 at r2 (raw file):
Previously, kungfucraig (Craig Wright) wrote…
+1 - by default we should have the large population available for the cloud
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 17 of 31 files at r1, 8 of 9 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @renjiezh)
src/main/kotlin/org/wfanet/measurement/integration/common/InProcessEdpSimulator.kt
line 82 at r3 (raw file):
val vidRange = requireNotNull(syntheticPopulationSpec.vidRange) {
This is a protobuf field, right? Those cannot be null.
Code quote:
syntheticPopulationSpec.vidRange
src/main/kotlin/org/wfanet/measurement/integration/common/InProcessEdpSimulator.kt
line 87 at r3 (raw file):
val vidIndexMap = VidToIndexMapGenerator.generateMapping( salt = "".toByteStringUtf8(),
nit
Suggestion:
ByteString.EMPTY
src/main/kotlin/org/wfanet/measurement/integration/common/InProcessEdpSimulator.kt
line 88 at r3 (raw file):
VidToIndexMapGenerator.generateMapping( salt = "".toByteStringUtf8(), vidUniverse = (vidRange.start until vidRange.endExclusive).toList(),
You can probably save on memory for cases where the VID universe is a contiguous range the vidUniverse param type be Sequence instead of a List. Even for cases where it's multiple ranges with gaps, a Sequence can be much more compact. If it's truly well-distributed across the numerical space, a List can be trivially wrapped in a Sequence.
@kungfucraig WDYT?
Leaving this as optional in case you want to address in a separate PR.
src/test/kotlin/org/wfanet/measurement/integration/deploy/gcloud/GCloudPostgresInProcessLifeOfAMeasurementIntegrationTest.kt
line 43 at r3 (raw file):
* TODO(Kotlin/kotlinx.coroutines#3865): Switch back to CoroutinesTimeout when fixed. */ @get:Rule val timeout: Timeout = Timeout.seconds(180)
Shouldn't this either stay the same or be shorter if it's using a smaller set?
Code quote:
180
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 30 of 31 files reviewed, 3 unresolved discussions (waiting on @kungfucraig and @SanjayVas)
src/main/kotlin/org/wfanet/measurement/integration/common/InProcessEdpSimulator.kt
line 82 at r3 (raw file):
Previously, SanjayVas (Sanjay Vasandani) wrote…
This is a protobuf field, right? Those cannot be null.
Done.
src/main/kotlin/org/wfanet/measurement/integration/common/InProcessEdpSimulator.kt
line 88 at r3 (raw file):
Previously, SanjayVas (Sanjay Vasandani) wrote…
You can probably save on memory for cases where the VID universe is a contiguous range the vidUniverse param type be Sequence instead of a List. Even for cases where it's multiple ranges with gaps, a Sequence can be much more compact. If it's truly well-distributed across the numerical space, a List can be trivially wrapped in a Sequence.
@kungfucraig WDYT?
Leaving this as optional in case you want to address in a separate PR.
For in-process tests, I think it is fine to be a list as they are intermediate variables. The sizes of them are not large.
For synthetic data test, it will be handled by craig's VidIndexMap class that takes in the population spec.
src/test/kotlin/org/wfanet/measurement/integration/deploy/gcloud/GCloudPostgresInProcessLifeOfAMeasurementIntegrationTest.kt
line 43 at r3 (raw file):
Previously, SanjayVas (Sanjay Vasandani) wrote…
Shouldn't this either stay the same or be shorter if it's using a smaller set?
In-process EDPs need to pre-compute the vidIndexMap which take some time, so the time is extended a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @kungfucraig)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r3.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @renjiezh, @SanjayVas, and @stevenwarejones)
src/main/kotlin/org/wfanet/measurement/integration/common/InProcessEdpSimulator.kt
line 88 at r3 (raw file):
Previously, renjiezh wrote…
For in-process tests, I think it is fine to be a list as they are intermediate variables. The sizes of them are not large.
For synthetic data test, it will be handled by craig's VidIndexMap class that takes in the population spec.
I'd like to deprecate SyntheticPopulationSpec for PopulationSpec and swap in the EDP support libs. That should be coming soon and will replace this. As such I don't feel too strongly about this so happy to see it left as is so we can move on with more important things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this!
Reviewed 21 of 31 files at r1, 8 of 9 files at r2, 1 of 2 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @SanjayVas)
No description provided.