Handle Extremely High Throughput by holding back requests to etcd until the throughput decreases. #16837

Open · Sharpz7 opened this issue Oct 27, 2023 · 7 comments

@Sharpz7

Sharpz7 commented Oct 27, 2023

What would you like to be added?

Find the original k8s ticket here: kubernetes/kubernetes#120781.

Essentially, what the title says. If someone wants to try and enter an incredibly high number of key-value pairs all at once, there should be a way to hold them back.

As I said in my last comment in the original k8s ticket, I am not convinced this belongs here. But it is something I am very, very interested in, and I would be happy to pivot to whatever is needed and work on this personally.

Thanks!

Why is this needed?

For people dealing with extremely high-throughput batch work (i.e. thousands of jobs per second, each lasting 1-2 minutes), etcd starts to become a real problem.

Links to back up this point:

  • https://etcd.io/docs/v3.5/op-guide/performance/
  • https://github.com/armadaproject/armada: A scheduling solution partially designed around this problem.

In the original ticket (kubernetes/kubernetes#120781) it was agreed:

  • Using a non-etcd backend is not the desired end state (etcd is great!)
  • It should be in-tree for etcd
@serathius
Member

> Essentially, what the title says. If someone wants to try and enter an incredibly high number of key-value pairs all at once, there should be a way to hold them back.

Yes, there is, at least in Kubernetes: https://kubernetes.io/docs/concepts/cluster-administration/flow-control/
For the single-tenant case (etcd used by a single Kubernetes cluster), there is no motivation to have APF on the etcd side; Kubernetes has better awareness of what kind of requests it is sending to etcd.
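
For clients that talk to etcd directly (i.e. not through the kube-apiserver), the same holding-back can be done on the client side rather than inside etcd. A rough, untested sketch using the Go clientv3 client and golang.org/x/time/rate; the 500 writes/s limit and burst size are just illustrative numbers:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"golang.org/x/time/rate"
)

// throttledPut blocks until the limiter allows another write, then issues the Put.
// This keeps the aggregate write rate below a chosen ceiling instead of letting a
// burst of requests pile up inside etcd.
func throttledPut(ctx context.Context, cli *clientv3.Client, lim *rate.Limiter, key, val string) error {
	if err := lim.Wait(ctx); err != nil {
		return err
	}
	_, err := cli.Put(ctx, key, val)
	return err
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Illustrative limit: at most 500 writes/s with bursts of 100.
	lim := rate.NewLimiter(rate.Limit(500), 100)
	for i := 0; i < 1000; i++ {
		if err := throttledPut(context.Background(), cli, lim, fmt.Sprintf("/jobs/%d", i), "manifest"); err != nil {
			panic(err)
		}
	}
}
```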

> Essentially, what the title says. If someone wants to try and enter an incredibly high number of key-value pairs all at once, there should be a way to hold them back.
> ...
> For people dealing with extremely high-throughput batch work (i.e. thousands of jobs per second, each lasting 1-2 minutes), etcd starts to become a real problem.

What you are describing is an issue with write throughput, is that correct?

I haven't heard of real-world cases where writes alone could topple etcd. Write throughput depends mostly on disk performance and is not especially resource-intensive on memory or CPU (this would need a test to confirm). Also, etcd limits the number of pending proposals, so at some point no new proposals should be accepted.

Are you sure there are no other accompanying requests, beyond just the writes, that could be the cause of the problem? For example, the cost of writes scales with the number of watchers, so if you have many watches established this would make more sense. I think we need more concrete data points than just saying etcd becomes a problem: the exact traffic going into etcd, performance metrics, and profiles, to be able to answer what the problem is in your case.

Overall I think this is a scalability problem; in such cases there is no single fix that would allow us to scale to 1000 qps. Fixing one bottleneck will just surface another issue. The solution is defining the exact scenario we want to improve, picking a success metric, and making progressive improvements towards the goal, such as #16467.
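
To put a concrete number on the write-throughput question, even a crude probe like the sketch below, pointed at a test cluster, would help. The value size and concurrency here are made-up parameters, and the benchmark tool under tools/benchmark in the etcd repository is the more complete option:

```go
package main

import (
	"context"
	"fmt"
	"strings"
	"sync"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// A crude write-throughput probe: issue `total` puts from `workers` goroutines
// and report the achieved rate. Only intended to produce concrete numbers to
// attach to an issue, not a substitute for the etcd benchmark tool.
func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	const (
		total   = 10000
		workers = 16
	)
	val := strings.Repeat("x", 30*1024) // ~30 kB value; the size is arbitrary for this sketch

	start := time.Now()
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := w; i < total; i += workers {
				if _, err := cli.Put(context.Background(), fmt.Sprintf("/probe/%d", i), val); err != nil {
					panic(err)
				}
			}
		}(w)
	}
	wg.Wait()

	elapsed := time.Since(start)
	fmt.Printf("%d puts in %v => %.0f puts/s\n", total, elapsed, float64(total)/elapsed.Seconds())
}
```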

@serathius
Member

serathius commented Oct 27, 2023

I remembered one case where testing high throughput affected etcd; however, it only caused high memory usage, due to the increased number of allocations required by the PrevKV watch option used by Kubernetes: #16839.

The issue was easily mitigated by making the GC more aggressive.

@Sharpz7
Author

Sharpz7 commented Oct 30, 2023

Really appreciate you getting back to me. This gives me lots of research and ideas to throw around with my team. If we still have questions or issues, I will raise them more formally, as you suggested.

Thanks!

@Sharpz7
Author

Sharpz7 commented Nov 29, 2023

Hey @serathius, so after talking with people more, it seems that we suffer from very high pod churn, with the pods having very large manifests (20-50 kB), so we constantly have to defragment etcd. OpenShift has an operator that handles this most of the time (https://docs.openshift.com/container-platform/4.14/scalability_and_performance/recommended-performance-scale-practices/recommended-etcd-practices.html#manual-defrag-etcd-data_recommended-etcd-practices), but it still needs some manual work.

Is there a way of doing this natively with an operator that etcd provides? Or is that something that could be created? Thanks again.

@jmhbnz
Member

jmhbnz commented Nov 30, 2023

Hey @Sharpz7 - One of the etcd maintainers, @ahrtr, has put together an etcd defrag helper utility: https://github.com/ahrtr/etcd-defrag.

This can be run via a Kubernetes CronJob, with rules applied to ensure defrag is only run if actually required.

It might be a helpful approach; however, please bear in mind that this is not an official etcd subproject at this point.
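
If you end up rolling your own instead, the same conditional check can be sketched directly against the etcd maintenance API. This is a rough, untested sketch, and the 200 MiB threshold is just an illustrative number:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// defragIfFragmented defragments a single endpoint only when the gap between
// the on-disk db size and the size actually in use exceeds the given threshold.
func defragIfFragmented(ctx context.Context, cli *clientv3.Client, endpoint string, thresholdBytes int64) error {
	st, err := cli.Status(ctx, endpoint)
	if err != nil {
		return err
	}
	wasted := st.DbSize - st.DbSizeInUse
	if wasted < thresholdBytes {
		fmt.Printf("%s: only %d bytes reclaimable, skipping defrag\n", endpoint, wasted)
		return nil
	}
	fmt.Printf("%s: %d bytes reclaimable, defragmenting\n", endpoint, wasted)
	_, err = cli.Defragment(ctx, endpoint)
	return err
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Illustrative threshold: 200 MiB of reclaimable space.
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	if err := defragIfFragmented(ctx, cli, "localhost:2379", 200*1024*1024); err != nil {
		panic(err)
	}
}
```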

@Sharpz7
Author

Sharpz7 commented Dec 4, 2023

Appreciate you getting back to me - this is really cool! Thanks


stale bot commented Mar 17, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label on Mar 17, 2024.