Add smaller synthetic data set for in-process tests. #1618

renjiezh · 2024-05-13T16:58:54Z

No description provided.

wfa-reviewable · 2024-05-13T16:59:00Z

This change is

ple13

Reviewed 22 of 31 files at r1, 9 of 9 files at r2, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @renjiezh and @SanjayVas)

src/main/kotlin/org/wfanet/measurement/integration/common/InProcessEdpSimulator.kt line 81 at r2 (raw file):

    )

  // TODO(@ple13): Use the actual vidToIndexMap instead of an empty map when a smaller dataset is

nit: remove this TODO.

SanjayVas

Reviewed 5 of 31 files at r1, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @renjiezh)

a discussion (no related file):
The current tests all use the same large population size, but a smaller number of impressions for local LLv2 tests since the full impression set takes too long and would consume too much memory (i.e. wouldn't fit in a standard GHA runner).

Is the concern here for HMSS time or memory? I'm assuming that we can generate the VID to index map at build time since we know the population spec (e.g. via a Bazel rule), so that shouldn't affect the amount of time it takes to run the test.

src/main/k8s/dev/synthetic_generator_edp_simulator_gke.cue line 28 at r2 (raw file):

_populationSpec: "/etc/\(#AppName)/config-files/synthetic_population_spec_small.textproto"
_eventGroupSpecs: [

I believe we want to continue to use the large sets with a realistic population size for cloud tests. @kungfucraig can confirm.

kungfucraig

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @renjiezh and @SanjayVas)

src/main/k8s/dev/synthetic_generator_edp_simulator_gke.cue line 28 at r2 (raw file):

Previously, SanjayVas (Sanjay Vasandani) wrote…

I believe we want to continue to use the large sets with a realistic population size for cloud tests. @kungfucraig can confirm.

+1 - by default we should have the large population available for the cloud

renjiezh

Reviewable status: 29 of 31 files reviewed, 2 unresolved discussions (waiting on @ple13 and @SanjayVas)

a discussion (no related file):

Previously, SanjayVas (Sanjay Vasandani) wrote…

The current tests all use the same large population size, but a smaller number of impressions for local LLv2 tests since the full impression set takes too long and would consume too much memory (i.e. wouldn't fit in a standard GHA runner).

Is the concern here for HMSS time or memory? I'm assuming that we can generate the VID to index map at build time since we know the population spec (e.g. via a Bazel rule), so that shouldn't affect the amount of time it takes to run the test.

It is because of the memory. So we will use small population for in-process tests.

src/main/k8s/dev/synthetic_generator_edp_simulator_gke.cue line 28 at r2 (raw file):

Previously, kungfucraig (Craig Wright) wrote…

+1 - by default we should have the large population available for the cloud

Done.

SanjayVas

Reviewed 17 of 31 files at r1, 8 of 9 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @renjiezh)

src/main/kotlin/org/wfanet/measurement/integration/common/InProcessEdpSimulator.kt line 82 at r3 (raw file):

  val vidRange =
    requireNotNull(syntheticPopulationSpec.vidRange) {

This is a protobuf field, right? Those cannot be null.

Code quote:

syntheticPopulationSpec.vidRange

src/main/kotlin/org/wfanet/measurement/integration/common/InProcessEdpSimulator.kt line 87 at r3 (raw file):

  val vidIndexMap =
    VidToIndexMapGenerator.generateMapping(
      salt = "".toByteStringUtf8(),

nit

Suggestion:

ByteString.EMPTY

src/main/kotlin/org/wfanet/measurement/integration/common/InProcessEdpSimulator.kt line 88 at r3 (raw file):

    VidToIndexMapGenerator.generateMapping(
      salt = "".toByteStringUtf8(),
      vidUniverse = (vidRange.start until vidRange.endExclusive).toList(),

You can probably save on memory for cases where the VID universe is a contiguous range the vidUniverse param type be Sequence instead of a List. Even for cases where it's multiple ranges with gaps, a Sequence can be much more compact. If it's truly well-distributed across the numerical space, a List can be trivially wrapped in a Sequence.

@kungfucraig WDYT?

Leaving this as optional in case you want to address in a separate PR.

src/test/kotlin/org/wfanet/measurement/integration/deploy/gcloud/GCloudPostgresInProcessLifeOfAMeasurementIntegrationTest.kt line 43 at r3 (raw file):

   * TODO(Kotlin/kotlinx.coroutines#3865): Switch back to CoroutinesTimeout when fixed.
   */
  @get:Rule val timeout: Timeout = Timeout.seconds(180)

Shouldn't this either stay the same or be shorter if it's using a smaller set?

Code quote:

renjiezh

Reviewable status: 30 of 31 files reviewed, 3 unresolved discussions (waiting on @kungfucraig and @SanjayVas)

src/main/kotlin/org/wfanet/measurement/integration/common/InProcessEdpSimulator.kt line 82 at r3 (raw file):

Previously, SanjayVas (Sanjay Vasandani) wrote…

This is a protobuf field, right? Those cannot be null.

Done.

src/main/kotlin/org/wfanet/measurement/integration/common/InProcessEdpSimulator.kt line 88 at r3 (raw file):

Previously, SanjayVas (Sanjay Vasandani) wrote…

You can probably save on memory for cases where the VID universe is a contiguous range the vidUniverse param type be Sequence instead of a List. Even for cases where it's multiple ranges with gaps, a Sequence can be much more compact. If it's truly well-distributed across the numerical space, a List can be trivially wrapped in a Sequence.

@kungfucraig WDYT?

Leaving this as optional in case you want to address in a separate PR.

For in-process tests, I think it is fine to be a list as they are intermediate variables. The sizes of them are not large.
For synthetic data test, it will be handled by craig's VidIndexMap class that takes in the population spec.

src/test/kotlin/org/wfanet/measurement/integration/deploy/gcloud/GCloudPostgresInProcessLifeOfAMeasurementIntegrationTest.kt line 43 at r3 (raw file):

Previously, SanjayVas (Sanjay Vasandani) wrote…

Shouldn't this either stay the same or be shorter if it's using a smaller set?

In-process EDPs need to pre-compute the vidIndexMap which take some time, so the time is extended a bit.

SanjayVas

Reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @kungfucraig)

kungfucraig

Reviewed 1 of 2 files at r3.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @renjiezh, @SanjayVas, and @stevenwarejones)

src/main/kotlin/org/wfanet/measurement/integration/common/InProcessEdpSimulator.kt line 88 at r3 (raw file):

Previously, renjiezh wrote…

For in-process tests, I think it is fine to be a list as they are intermediate variables. The sizes of them are not large.
For synthetic data test, it will be handled by craig's VidIndexMap class that takes in the population spec.

I'd like to deprecate SyntheticPopulationSpec for PopulationSpec and swap in the EDP support libs. That should be coming soon and will replace this. As such I don't feel too strongly about this so happy to see it left as is so we can move on with more important things.

stevenwarejones

Thanks for doing this!

Reviewed 21 of 31 files at r1, 8 of 9 files at r2, 1 of 2 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @SanjayVas)

Add smaller synthetic data set for in process tests.

cd5ef6c

renjiezh changed the title ~~Add smaller synthetic data set for in process tests.~~ Add smaller synthetic data set for in-process tests. May 13, 2024

renjiezh added 4 commits May 15, 2024 19:09

Use small synthetic dataset for in-process tests

9857db8

Update config_files data

962a15f

Update AccuracyTest

d1c7177

Update InProcessEdpSimulator to have VidIndexMap

35c223b

renjiezh requested review from SanjayVas and ple13 May 16, 2024 01:11

ple13 approved these changes May 16, 2024

View reviewed changes

SanjayVas requested changes May 20, 2024

View reviewed changes

kungfucraig reviewed May 20, 2024

View reviewed changes

Use large dataset for dev edp simulators.

5c0011d

renjiezh commented May 23, 2024

View reviewed changes

SanjayVas requested changes May 23, 2024

View reviewed changes

Update based on PR comments.

b47fecd

renjiezh commented May 23, 2024

View reviewed changes

SanjayVas requested a review from kungfucraig May 23, 2024 19:35

SanjayVas approved these changes May 23, 2024

View reviewed changes

renjiezh requested a review from stevenwarejones May 23, 2024 20:32

kungfucraig reviewed May 23, 2024

View reviewed changes

kungfucraig requested a review from SanjayVas May 23, 2024 20:38

stevenwarejones approved these changes May 24, 2024

View reviewed changes

Merge branch 'main' into renjiez-hmss-dataset

fa740fd

renjiezh enabled auto-merge (squash) May 24, 2024 18:03

renjiezh merged commit a4ca481 into main May 24, 2024
4 checks passed

renjiezh deleted the renjiez-hmss-dataset branch May 24, 2024 18:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add smaller synthetic data set for in-process tests. #1618

Add smaller synthetic data set for in-process tests. #1618

renjiezh commented May 13, 2024

wfa-reviewable commented May 13, 2024

ple13 left a comment

SanjayVas left a comment

kungfucraig left a comment

renjiezh left a comment

SanjayVas left a comment

renjiezh left a comment

SanjayVas left a comment

kungfucraig left a comment

stevenwarejones left a comment

Add smaller synthetic data set for in-process tests. #1618

Add smaller synthetic data set for in-process tests. #1618

Conversation

renjiezh commented May 13, 2024

wfa-reviewable commented May 13, 2024

ple13 left a comment

Choose a reason for hiding this comment

SanjayVas left a comment

Choose a reason for hiding this comment

kungfucraig left a comment

Choose a reason for hiding this comment

renjiezh left a comment

Choose a reason for hiding this comment

SanjayVas left a comment

Choose a reason for hiding this comment

renjiezh left a comment

Choose a reason for hiding this comment

SanjayVas left a comment

Choose a reason for hiding this comment

kungfucraig left a comment

Choose a reason for hiding this comment

stevenwarejones left a comment

Choose a reason for hiding this comment