Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF-5441 Adds the time_series_lastpoint_high_cardinality workload. #1218

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

yun-soo
Copy link
Contributor

@yun-soo yun-soo commented May 10, 2024

Jira Ticket: PERF-5441

Whats Changed

Adds the time_series_lastpoint_high_cardinality workload.

Patch Testing Results

https://spruce.mongodb.com/version/663f01f365c3b80007df9990/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC

@yun-soo yun-soo requested a review from a team as a code owner May 10, 2024 23:25
@yun-soo yun-soo requested a review from alicedoherty May 10, 2024 23:25
@yun-soo yun-soo changed the title PERF-5441 Creates the time_series_lastpoint_high_cardinality workload. PERF-5441 Adds the time_series_lastpoint_high_cardinality workload. May 11, 2024
@alicedoherty
Copy link
Contributor

Thanks @yun-soo! Would you be able to fill out the New Workload Request form? Also just want to confirm that for this PR only the TimeSeriesLastpointHighCardinality workload file needs to be reviewed, and the changes to the TimeSeriesLastpoint workload/phase file will be dealt with in #1219?

Do you have an idea of how noisy this workload is? We'd generally recommend running x5 patches to get a sense of this. You can use the performance analyzer to automate this (see the third method here).

@yun-soo
Copy link
Contributor Author

yun-soo commented May 13, 2024

Thanks a lot for the review, @alicedoherty!

Would you be able to fill out the New Workload Request form?

Filed the request. I noticed that build variant list is not consistent with the up-to-date one.

Also just want to confirm that for this PR only the TimeSeriesLastpointHighCardinality workload file needs to be reviewed, and the changes to the TimeSeriesLastpoint workload/phase file will be dealt with in #1219?

Yes, only TimeSeriesLastpointHighCardinality needs to be reviewed and they will in #1219

Do you have an idea of how noisy this workload is? We'd generally recommend running x5 patches to get a sense of this. You can use the performance analyzer to automate this (see the third method here).

Just scheduled one here. Will update this as soon as I will get results.
Update:
Got perf analyzer results.
OperationThroughput, Latency50thPercentile, Latency95Percentil's min/max values against mean fall into [-4.86%, 7.41%] except two outliers. Two outliers are:

  • MetaDescTimeDesc.RunLastPointQueryWithSmallPredicate's min/max on perf-3-node-replSet-intel.intel.aws.2023-11: [-6.50%, 25.38%]
  • MetaDescTimeDesc.RunLastPointQueryWithSmallPredicate's min/max on perf-3-node-replSet.arm.aws.2023-11: [-30.92%, 148.77%]

Copy link
Contributor

@alicedoherty alicedoherty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yun-soo! Sorry for the delay, was focusing on skunkworks. Thanks for calling out that the variants in the form are outdated, I've opened PERF-5488.

Do you have any idea why the Latency95thPercentile for MetaDescTimeDesc.RunLastPointQueryWithSmallPredicate is so noisy? Since this seems like one of the key metrics you care about this would make triaging changepoints pretty hard for this measurement.

Also as an FYI in the perf analyzer if you select to show the "CoV (%)" it will show the calculated coefficient of variation (the main measure of noise our team uses) for each measurement in your patches to save you some time when doing a noise analysis.

$gte: v7.0
mongodb_setup:
$eq:
- standalone
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've gotten rid of our regular standalone variants. Having this line will just mean it will only be running on the ARM and Intel 3-Node ReplSet 2023-11 variants.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies but I don't understand what you're suggesting here. Could you clarify on it so that I know what should be changed? I could not exactly correlate these names with the updated build variants.

These lines are exactly same as in the companion workload TimeSeriesLastpoint.yml. And I was able to see that this workload is enabled on perf-3-node-replSet-intel.intel.aws.2023-11 and perf-3-node-replSet.arm.aws.2023-11 and perf-3-node-replSet-all-feature-flags.arm.2023-11. I think these would be enough.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry the way I worded that was confusing 😅 We've removed our regular standalone variants (SERVER-89499). Having $eq: standalone, standalone-XXX won't cause any issues, it just means it won't match an existing variant, and therefore won't run on any standalone variants. I just wanted to call this out so you weren't surprised that the test isn't running on standalone variants despite having this line in your code.

For clarity it's probably best to remove the standalone lines here.

(JFYI WRT the correlation between these lines and our variants - our variants are defined in perf.yml and perf_branching.yml. Each variant has a mongodb_setup: XXX line. In your Genny file if you write AutoRun: When: mongodb_setup: $eq: replica this means it will run on any variant that has mongodb_setup: replica)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah~ makes sense now!! Thanks for detailed explanation. Really appreciate it! Removed standalone(-*) for clarity just like you suggested.

Copy link
Contributor Author

@yun-soo yun-soo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for review! Addressed comments or asked follow up questions.

Sorry for the delay, was focusing on skunkworks. Thanks for calling out that the variants in the form are outdated, I've opened PERF-5488.

No worries and I was also focusing on skunkworks as many others were. Thanks for filing the ticket.

Do you have any idea why the Latency95thPercentile for MetaDescTimeDesc.RunLastPointQueryWithSmallPredicate is so noisy? Since this seems like one of the key metrics you care about this would make triaging changepoints pretty hard for this measurement.

I don't have a good idea on why yet. I'll spend some time to figure out why but it'll take some time. I'll get back to this PR when I have a good idea. Do you happen to know what could cause such noisiness in other workloads?

Also as an FYI in the perf analyzer if you select to show the "CoV (%)" it will show the calculated coefficient of variation (the main measure of noise our team uses) for each measurement in your patches to save you some time when doing a noise analysis.

Thanks for letting me know. What's the threshold which the team decides a metric is too noisy on? I noticed that CoVs were there, but I didn't know how to interpret those values.

$gte: v7.0
mongodb_setup:
$eq:
- standalone
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies but I don't understand what you're suggesting here. Could you clarify on it so that I know what should be changed? I could not exactly correlate these names with the updated build variants.

These lines are exactly same as in the companion workload TimeSeriesLastpoint.yml. And I was able to see that this workload is enabled on perf-3-node-replSet-intel.intel.aws.2023-11 and perf-3-node-replSet.arm.aws.2023-11 and perf-3-node-replSet-all-feature-flags.arm.2023-11. I think these would be enough.

Copy link
Contributor

@alicedoherty alicedoherty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you happen to know what could cause such noisiness in other workloads?

Looking at the existing time_series_lastpoint test it seems much of the underlying test design/queries are noisy. If you're getting useful signal from that existing test I think it would be okay to merge this test with similar levels, however, I definitely think it's worth looking into the noise of Latency95thPercentile for MetaDescTimeDesc.RunLastPointQueryWithSmallPredicate as a CV of 73% (!) and 12% seems like an indication that something is not testing what you'd expect. Since the really bad noise is limited to this specific phase and (potentially) ARM it might be a good idea to do some profiling, t2 analysis, etc. on that specific part of the test (looking at one of the really good vs bad runs). There are some general resources documented here, but feel free to ask in #ask-devprod-performance if you need more guidance.

What's the threshold which the team decides a metric is too noisy on?

We've struggled to define a specific threshold. Ideally we would like it <1% but at the moment with our current testing infrastructure this seems a little unrealistic. Personally, for tests like this I'd be concerned about metrics with CV >5%.

If this is a time-sensitive feature you want to test - it could also be an option to tell the build barons to ignore the really noisy metric (since it's not the whole task) and open a follow-on ticket to investigate this further.

$gte: v7.0
mongodb_setup:
$eq:
- standalone
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry the way I worded that was confusing 😅 We've removed our regular standalone variants (SERVER-89499). Having $eq: standalone, standalone-XXX won't cause any issues, it just means it won't match an existing variant, and therefore won't run on any standalone variants. I just wanted to call this out so you weren't surprised that the test isn't running on standalone variants despite having this line in your code.

For clarity it's probably best to remove the standalone lines here.

(JFYI WRT the correlation between these lines and our variants - our variants are defined in perf.yml and perf_branching.yml. Each variant has a mongodb_setup: XXX line. In your Genny file if you write AutoRun: When: mongodb_setup: $eq: replica this means it will run on any variant that has mongodb_setup: replica)

Copy link
Contributor Author

@yun-soo yun-soo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely think it's worth looking into the noise of Latency95thPercentile for MetaDescTimeDesc.RunLastPointQueryWithSmallPredicate as a CV of 73% (!) and 12% seems like an indication that something is not testing what you'd expect. Since the really bad noise is limited to this specific phase and (potentially) ARM it might be a good idea to do some profiling, t2 analysis, etc. on that specific part of the test (looking at one of the really good vs bad runs). There are some general resources documented here, but feel free to ask in #ask-devprod-performance if you need more guidance.

Thanks for detailed guidance and really appreciate it! I'll definitely look into those two metrics. I wanted to measure improved performance for the higher cardinality workload thanks to a new optimization because TimeSeriesLastpoint.yml had a bug (so a noise) and also didn't give my team a good signal about the new optimization but don't want to get too much noise from this new workload either.

If this is a time-sensitive feature you want to test - it could also be an option to tell the build barons to ignore the really noisy metric (since it's not the whole task) and open a follow-on ticket to investigate this further.

Actually, that's not the case. I confirmed that the feature actually improves the performance in other workloads and this workload is for future performance regression testing, which is not urgent. I wanted to give a closure to the feature. I'm in the middle of something for the other project and so will get back to you with my analysis and any changes necessary to this workload.

$gte: v7.0
mongodb_setup:
$eq:
- standalone
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah~ makes sense now!! Thanks for detailed explanation. Really appreciate it! Removed standalone(-*) for clarity just like you suggested.


- LoadConfig:
Path: *phasePath
Key: QuiesceActor
Parameters:
MaxPhases: 23
MaxPhases: *MaxPhases
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It might be clearer if the active phases for the quiesce actor are defined here rather than inheriting the defaults.
That way all the phase info is in this file, and it lines up with what is done for TimeSeriesLastpointHighCardinality.yml

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done in #1219.


##################################################################################################
# Following 4 actors including the hidden QuiesceActor in phase 20 run a lastpoint query with a
# sort and group on meta subfield ascending and time descending.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this say meta subfield descending?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! done.


- LoadConfig:
Path: *phasePath
Key: QuiesceActor
Parameters:
MaxPhases: 23
MaxPhases: *MaxPhases

AutoRun:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are deprecating support for standalone variants.
Can you remove the AutoRun conditions for the standalone setups?
This would also then match when the HighCardinality workload gets scheduled.

Also should the branch conditional be "$gte: v7.0" like it is for the HighCardinality one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove the AutoRun conditions for the standalone setups?

This is done in #1219.

should the branch conditional be "$gte: v7.0" like it is for the HighCardinality one?

This is answered in #1219.

Copy link
Contributor Author

@yun-soo yun-soo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jawwadasghar, addressed your comments


##################################################################################################
# Following 4 actors including the hidden QuiesceActor in phase 20 run a lastpoint query with a
# sort and group on meta subfield ascending and time descending.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! done.


- LoadConfig:
Path: *phasePath
Key: QuiesceActor
Parameters:
MaxPhases: 23
MaxPhases: *MaxPhases
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done in #1219.


- LoadConfig:
Path: *phasePath
Key: QuiesceActor
Parameters:
MaxPhases: 23
MaxPhases: *MaxPhases

AutoRun:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you remove the AutoRun conditions for the standalone setups?

This is done in #1219.

should the branch conditional be "$gte: v7.0" like it is for the HighCardinality one?

This is answered in #1219.

Copy link
Contributor

@jawwadasghar jawwadasghar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants