
Define an official performance validation suite for etcd #16467

Open
1 of 2 tasks
jmhbnz opened this issue Aug 24, 2023 · 7 comments
Assignees
Labels
area/performance area/tooling priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. type/feature

Comments

@jmhbnz
Member

jmhbnz commented Aug 24, 2023

What would you like to be added?

The current performance validation process for etcd relies heavily on the Kubernetes scalability tests. While this approach has been valuable, we need to create an official performance validation suite for etcd that is maintained within the project and is therefore more accessible and integrated into regular project activity.

In my mind this will include developing a comprehensive suite of performance tests that cover various real-world usage scenarios, integrating these tests into some form of on-demand or scheduled etcd CI pipeline, and making the results accessible to ongoing work, for example so that a pull request proposing a Go version upgrade can be validated for performance regressions.

With this issue I would like to capture the recent discussion in #16463 (comment) and the intent to progress towards an independent and dedicated performance validation mechanism for etcd, and to ensure we do not lose sight of this work. We can use this issue to track ideas and further conversation before starting any work.

References:

Why is this needed?

  • Reduce reliance on external testing suites that are less accessible.
  • Establish an official project perspective on performance.
  • Create the mechanism to track and drive future performance improvements.
  • Reduce cognitive burden for future etcd contributors and maintainers.

Sub task tracking

@serathius
Member

serathius commented Aug 24, 2023

Talked with @mborsz, who is a member of Kubernetes SIG Scalability, about how we should approach performance testing of etcd. We came to the conclusion that we need 3 things:

  • etcd SLIs - the important dimensions whose performance we want to measure and protect against regressions. It's best that we have dedicated benchmark scenarios, each analysing an independent dimension, as that makes it easier to reason about and analyse the results. For that we can cherry-pick existing etcd benchmarks and use current performance as the baseline (a sketch of one such single-dimension scenario follows this list).
  • Reproducibility - benchmark results need to be repeatable, so we need to run them in the same environment. Benchmark execution should happen not locally but remotely on a dedicated machine; ideally we run the benchmarks periodically on a large GitHub runner. We should also try to reduce noise: scenarios should not run sequentially on one machine but each on a separate machine, so that bursts do not impact results. Of course cloud VMs don't have the most stable performance, but it should suffice for now.
  • Visualization - to spot regressions we need to be able to observe trends and compare performance. Aside from per-run reports, we should have a dashboard that aggregates results. At Google we use an internal version of https://github.com/google/mako, which is great, but unfortunately it looks like the project has been archived. Kubernetes uses http://perf-dash.k8s.io/, which is pretty limited and would require code changes to support etcd. Please let me know if you have better suggestions.
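
To make the "one scenario per dimension" idea concrete, below is a minimal sketch of what a single-dimension put-latency scenario could look like using the Go clientv3 API. The endpoint, request count, and value size are illustrative placeholders rather than agreed benchmark parameters; the real suite would more likely cherry-pick scenarios from the existing tools/benchmark tool.

```go
// sli_put_latency.go - a minimal sketch of a single-dimension benchmark
// scenario (put latency) against a locally reachable etcd endpoint.
// Endpoint, request count, and value size are assumptions for illustration.
package main

import (
	"context"
	"fmt"
	"sort"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://127.0.0.1:2379"}, // assumed local test cluster
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	const total = 10000
	latencies := make([]time.Duration, 0, total)
	value := string(make([]byte, 256)) // fixed value size so only latency varies

	for i := 0; i < total; i++ {
		key := fmt.Sprintf("/bench/%08d", i)
		start := time.Now()
		if _, err := cli.Put(context.Background(), key, value); err != nil {
			panic(err)
		}
		latencies = append(latencies, time.Since(start))
	}

	// Report percentiles for this one dimension only.
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	fmt.Printf("put latency p50=%v p99=%v\n",
		latencies[total/2], latencies[total*99/100])
}
```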

Based on the above points, the work is:

  • Propose a list of benchmarks to run.
  • Set up a periodic job that executes the benchmarks.
  • Pick a dashboard for visualization and integrate the benchmark reports (a sketch of the kind of baseline comparison such a job could run appears after this list).
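
As a rough sketch of how the periodic job could flag regressions even before a dashboard exists, the snippet below compares a fresh benchmark report against a stored baseline. The baseline.json/result.json file names, the p99_ms field, and the 10% tolerance are all hypothetical and would need to match whatever report format the chosen tooling actually emits.

```go
// checkregression.go - hedged sketch of a baseline comparison step for a
// periodic benchmark job. The report format and threshold are assumptions.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type report struct {
	P99Ms float64 `json:"p99_ms"` // hypothetical field emitted by the benchmark run
}

func load(path string) (report, error) {
	var r report
	b, err := os.ReadFile(path)
	if err != nil {
		return r, err
	}
	return r, json.Unmarshal(b, &r)
}

func main() {
	baseline, err := load("baseline.json")
	if err != nil {
		panic(err)
	}
	current, err := load("result.json")
	if err != nil {
		panic(err)
	}

	// Allow 10% noise before flagging a regression; the threshold would need
	// tuning once the variance of the dedicated runners is known.
	const tolerance = 1.10
	if current.P99Ms > baseline.P99Ms*tolerance {
		fmt.Printf("possible regression: p99 %.2fms > baseline %.2fms\n",
			current.P99Ms, baseline.P99Ms)
		os.Exit(1)
	}
	fmt.Println("within tolerance")
}
```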

@geetasg

geetasg commented Aug 31, 2023

@jmhbnz
Member Author

jmhbnz commented Sep 6, 2023

Should the etcd SLIs be part of the contract? Ref: https://docs.google.com/document/d/1NUZDiJeiIH5vo_FMaTWf0JtrQKCx0kpEaIIuPoj9P6A/edit#heading=h.tlkin1a8b8bl

Potentially - let's try to get some SLIs proposed initially and see how they fit in relation to the current contract. I have been meaning to sit down and list out potential SLIs here that we can cherry-pick from; feel free to do the same 🙏🏻

@jmhbnz
Member Author

jmhbnz commented Nov 9, 2023

Recording a discussion during KubeCon NA - along with identifying service level indicators as a starting point for this work, we can also take lessons from Kubernetes sig-scalability and identify a set of dimensions within which our new performance validation suite will operate as an envelope: https://github.com/kubernetes/community/blob/master/sig-scalability/configs-and-limits/thresholds.md

We can review the older benchmark tooling to get a starting point on dimensions and iterate from there.
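
As a purely illustrative sketch of what such an envelope could look like in code, loosely modeled on the sig-scalability thresholds document, something along these lines could bound the conditions under which benchmark results are considered comparable. The dimension names and numbers below are placeholders, not values anyone has agreed on.

```go
// envelope.go - hypothetical sketch of an "envelope" declaring the bounds
// within which performance scenarios run; all values are placeholders.
package main

import "fmt"

// Envelope bounds the conditions under which benchmark results are considered valid.
type Envelope struct {
	MaxDBSizeGB    int // total backend size the scenarios stay under
	MaxKeyCount    int // number of keys written across a scenario
	MaxValueSizeKB int // largest value size exercised
	MaxWatchers    int // concurrent watch streams
	MaxWriteQPS    int // sustained write rate applied by the load generator
}

func main() {
	e := Envelope{
		MaxDBSizeGB:    2,
		MaxKeyCount:    1_000_000,
		MaxValueSizeKB: 16,
		MaxWatchers:    500,
		MaxWriteQPS:    5_000,
	}
	fmt.Printf("benchmark envelope: %+v\n", e)
}
```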

@jmhbnz jmhbnz self-assigned this Dec 30, 2023
@jmhbnz jmhbnz added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Feb 11, 2024
@chaochn47
Member

I expect the performance test suite (or the Kubernetes traffic in the robustness tests) should help detect/prevent issues like #17529.

Do we think there is a gap in general on performance testing? I can help address it.

@jmhbnz @serathius @ahrtr

@jmhbnz jmhbnz pinned this issue Apr 3, 2024
@jmhbnz
Member Author

jmhbnz commented Apr 3, 2024

I expect the performance test suite (or the Kubernetes traffic in the robustness tests) should help detect/prevent issues like #17529.

Do we think there is a gap in general on performance testing? I can help address it.

Thanks @chaochn47 - yes, my expectation is that the updated performance validation suite, once complete, will catch issues like the one linked earlier. @ivanvc is currently getting some basic prow jobs running that will exercise existing tools like tools/benchmark and tools/rw-heatmaps. We will need to think about whether any additional tooling, or further updates to the existing tooling, is required. If you have any ideas on that please feel free to draft a feature issue so we can discuss 🙏🏻

@serathius
Member

serathius commented Apr 18, 2024

I expect the performance test suite (or the Kubernetes traffic in the robustness tests) should help detect/prevent issues like #17529.

Do we think there is a gap in general on performance testing? I can help address it.

Don't think so; performance and correctness are pretty different beasts that need different approaches. Checking correctness requires a lot of overhead, while performance measurement wants as little noise as possible to produce reproducible results.

What failed in #17529 was an unknown throughput breaking point that was hiding a correctness issue under it. I think we can use performance testing to discover more such breaking points, and then try to simulate them during correctness testing. This was already done in the e2e test that you provided in #17555. The failpoint beforeSendWatchResponse can be used to simulate slow response writing, which reproduces the same performance breaking point. Please see https://github.com/etcd-io/etcd/pull/17680/files where I managed to reproduce the issue using that breaking point.
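
For reference, a hedged sketch of how such a failpoint can be driven: assuming an etcd binary built with gofail and exposing the failpoint HTTP listener (e.g. started with GOFAIL_HTTP=127.0.0.1:22381; the address, the exact failpoint path, and the sleep duration below are illustrative), a PUT to the failpoint path arms it so that every watch response send is delayed:

```go
// slowwatch.go - hedged sketch of arming the beforeSendWatchResponse
// failpoint over gofail's HTTP endpoint to simulate slow watch response
// writes. The listener address is an assumption, and the exact HTTP path
// may include a package prefix depending on how the failpoint is registered.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// Delay every watch response send by 100ms, mimicking a slow client link.
	req, err := http.NewRequest(http.MethodPut,
		"http://127.0.0.1:22381/beforeSendWatchResponse",
		strings.NewReader(`sleep(100)`))
	if err != nil {
		panic(err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("failpoint armed: %s %s\n", resp.Status, body)
}
```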
