Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add smaller synthetic data set for in-process tests. #1618

Merged
merged 8 commits into from May 24, 2024

Conversation

renjiezh
Copy link
Contributor

No description provided.

@wfa-reviewable
Copy link

This change is Reviewable

@renjiezh renjiezh changed the title Add smaller synthetic data set for in process tests. Add smaller synthetic data set for in-process tests. May 13, 2024
@renjiezh renjiezh requested review from SanjayVas and ple13 May 16, 2024 01:11
Copy link
Contributor

@ple13 ple13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 22 of 31 files at r1, 9 of 9 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @renjiezh and @SanjayVas)


src/main/kotlin/org/wfanet/measurement/integration/common/InProcessEdpSimulator.kt line 81 at r2 (raw file):

    )

  // TODO(@ple13): Use the actual vidToIndexMap instead of an empty map when a smaller dataset is

nit: remove this TODO.

Copy link
Member

@SanjayVas SanjayVas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 5 of 31 files at r1, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @renjiezh)

a discussion (no related file):
The current tests all use the same large population size, but a smaller number of impressions for local LLv2 tests since the full impression set takes too long and would consume too much memory (i.e. wouldn't fit in a standard GHA runner).

Is the concern here for HMSS time or memory? I'm assuming that we can generate the VID to index map at build time since we know the population spec (e.g. via a Bazel rule), so that shouldn't affect the amount of time it takes to run the test.



src/main/k8s/dev/synthetic_generator_edp_simulator_gke.cue line 28 at r2 (raw file):

_populationSpec: "/etc/\(#AppName)/config-files/synthetic_population_spec_small.textproto"
_eventGroupSpecs: [

I believe we want to continue to use the large sets with a realistic population size for cloud tests. @kungfucraig can confirm.

Copy link
Member

@kungfucraig kungfucraig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @renjiezh and @SanjayVas)


src/main/k8s/dev/synthetic_generator_edp_simulator_gke.cue line 28 at r2 (raw file):

Previously, SanjayVas (Sanjay Vasandani) wrote…

I believe we want to continue to use the large sets with a realistic population size for cloud tests. @kungfucraig can confirm.

+1 - by default we should have the large population available for the cloud

Copy link
Contributor Author

@renjiezh renjiezh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 29 of 31 files reviewed, 2 unresolved discussions (waiting on @ple13 and @SanjayVas)

a discussion (no related file):

Previously, SanjayVas (Sanjay Vasandani) wrote…

The current tests all use the same large population size, but a smaller number of impressions for local LLv2 tests since the full impression set takes too long and would consume too much memory (i.e. wouldn't fit in a standard GHA runner).

Is the concern here for HMSS time or memory? I'm assuming that we can generate the VID to index map at build time since we know the population spec (e.g. via a Bazel rule), so that shouldn't affect the amount of time it takes to run the test.

It is because of the memory. So we will use small population for in-process tests.



src/main/k8s/dev/synthetic_generator_edp_simulator_gke.cue line 28 at r2 (raw file):

Previously, kungfucraig (Craig Wright) wrote…

+1 - by default we should have the large population available for the cloud

Done.

Copy link
Member

@SanjayVas SanjayVas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 17 of 31 files at r1, 8 of 9 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @renjiezh)


src/main/kotlin/org/wfanet/measurement/integration/common/InProcessEdpSimulator.kt line 82 at r3 (raw file):

  val vidRange =
    requireNotNull(syntheticPopulationSpec.vidRange) {

This is a protobuf field, right? Those cannot be null.

Code quote:

syntheticPopulationSpec.vidRange

src/main/kotlin/org/wfanet/measurement/integration/common/InProcessEdpSimulator.kt line 87 at r3 (raw file):

  val vidIndexMap =
    VidToIndexMapGenerator.generateMapping(
      salt = "".toByteStringUtf8(),

nit

Suggestion:

ByteString.EMPTY

src/main/kotlin/org/wfanet/measurement/integration/common/InProcessEdpSimulator.kt line 88 at r3 (raw file):

    VidToIndexMapGenerator.generateMapping(
      salt = "".toByteStringUtf8(),
      vidUniverse = (vidRange.start until vidRange.endExclusive).toList(),

You can probably save on memory for cases where the VID universe is a contiguous range the vidUniverse param type be Sequence instead of a List. Even for cases where it's multiple ranges with gaps, a Sequence can be much more compact. If it's truly well-distributed across the numerical space, a List can be trivially wrapped in a Sequence.

@kungfucraig WDYT?

Leaving this as optional in case you want to address in a separate PR.


src/test/kotlin/org/wfanet/measurement/integration/deploy/gcloud/GCloudPostgresInProcessLifeOfAMeasurementIntegrationTest.kt line 43 at r3 (raw file):

   * TODO(Kotlin/kotlinx.coroutines#3865): Switch back to CoroutinesTimeout when fixed.
   */
  @get:Rule val timeout: Timeout = Timeout.seconds(180)

Shouldn't this either stay the same or be shorter if it's using a smaller set?

Code quote:

180

Copy link
Contributor Author

@renjiezh renjiezh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 30 of 31 files reviewed, 3 unresolved discussions (waiting on @kungfucraig and @SanjayVas)


src/main/kotlin/org/wfanet/measurement/integration/common/InProcessEdpSimulator.kt line 82 at r3 (raw file):

Previously, SanjayVas (Sanjay Vasandani) wrote…

This is a protobuf field, right? Those cannot be null.

Done.


src/main/kotlin/org/wfanet/measurement/integration/common/InProcessEdpSimulator.kt line 88 at r3 (raw file):

Previously, SanjayVas (Sanjay Vasandani) wrote…

You can probably save on memory for cases where the VID universe is a contiguous range the vidUniverse param type be Sequence instead of a List. Even for cases where it's multiple ranges with gaps, a Sequence can be much more compact. If it's truly well-distributed across the numerical space, a List can be trivially wrapped in a Sequence.

@kungfucraig WDYT?

Leaving this as optional in case you want to address in a separate PR.

For in-process tests, I think it is fine to be a list as they are intermediate variables. The sizes of them are not large.
For synthetic data test, it will be handled by craig's VidIndexMap class that takes in the population spec.


src/test/kotlin/org/wfanet/measurement/integration/deploy/gcloud/GCloudPostgresInProcessLifeOfAMeasurementIntegrationTest.kt line 43 at r3 (raw file):

Previously, SanjayVas (Sanjay Vasandani) wrote…

Shouldn't this either stay the same or be shorter if it's using a smaller set?

In-process EDPs need to pre-compute the vidIndexMap which take some time, so the time is extended a bit.

Copy link
Member

@SanjayVas SanjayVas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @kungfucraig)

Copy link
Member

@kungfucraig kungfucraig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r3.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @renjiezh, @SanjayVas, and @stevenwarejones)


src/main/kotlin/org/wfanet/measurement/integration/common/InProcessEdpSimulator.kt line 88 at r3 (raw file):

Previously, renjiezh wrote…

For in-process tests, I think it is fine to be a list as they are intermediate variables. The sizes of them are not large.
For synthetic data test, it will be handled by craig's VidIndexMap class that takes in the population spec.

I'd like to deprecate SyntheticPopulationSpec for PopulationSpec and swap in the EDP support libs. That should be coming soon and will replace this. As such I don't feel too strongly about this so happy to see it left as is so we can move on with more important things.

Copy link
Collaborator

@stevenwarejones stevenwarejones left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this!

Reviewed 21 of 31 files at r1, 8 of 9 files at r2, 1 of 2 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @SanjayVas)

@renjiezh renjiezh enabled auto-merge (squash) May 24, 2024 18:03
@renjiezh renjiezh merged commit a4ca481 into main May 24, 2024
4 checks passed
@renjiezh renjiezh deleted the renjiez-hmss-dataset branch May 24, 2024 18:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants