Apiserver latency metrics create enormous amount of time-series #105346
I believe this should go to sig/instrumentation. Please correct me if I'm wrong |
/sig instrumentation |
/assign @logicalhan |
/triage accepted |
/sig scalability These buckets were added quite deliberately, and this is quite possibly the most important metric served by the apiserver. The fine granularity is useful for diagnosing a number of scaling issues, so it is unlikely we'll be able to make the changes you are suggesting. If you are having issues with ingestion (i.e. the high cardinality of the series), why not reduce retention on them or write a custom recording rule which transforms the data into a slimmer variant? |
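For reference, a sketch of the recording-rule approach mentioned above might look like the following; the rule name, the 5m window, and the choice to keep only the verb label are illustrative assumptions, not an official recommendation:

```yaml
groups:
  - name: apiserver-request-duration-slim
    rules:
      # Pre-aggregate the high-cardinality histogram into a per-verb p99 so
      # dashboards and alerts can query the slim series instead of the raw
      # *_bucket series, whose retention can then be kept short.
      - record: apiserver:request_duration_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (le, verb) (
              rate(apiserver_request_duration_seconds_bucket[5m])
            )
          )
```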
+1 to all of that
I don't understand this - how do they grow with cluster size? The buckets are constant. [FWIW - we're monitoring it for every GKE cluster and it works for us...] |
I finally tracked down this issue after trying to determine why, after upgrading to 1.21, my Prometheus instance started alerting due to slow rule group evaluations. After doing some digging, it turned out the problem is that simply scraping the metrics endpoint for the apiserver takes around 5-10s on a regular basis, which ends up causing rule groups that scrape those endpoints to fall behind, hence the alerts. My cluster is running in GKE, with 8 nodes, and I'm at a bit of a loss as to how I'm supposed to make sure that scraping this endpoint takes a reasonable amount of time. Are the series reset after every scrape, so scraping more frequently will actually be faster? Regardless, 5-10s for a small cluster like mine seems outrageously expensive. If there is a recommended approach to deal with this, I'd love to know what that is, as the issue for me isn't storage or retention of high-cardinality series, it's that the metrics endpoint itself is very slow to respond due to all of the time series. @wojtek-t Since you are also running on GKE, perhaps you have some idea what I've missed?
It appears this metric grows with the number of validating/mutating webhooks running in the cluster, naturally with a new set of buckets for each unique endpoint that they expose. Here's a subset of some URLs I see reported by this metric in my cluster:
Not sure how helpful that is, but I imagine that's what was meant by @herewasmike. EDIT: For some additional information, running a query on |
Prometheus uses memory mainly for ingesting time-series into the head block. Memory usage in Prometheus grows roughly linearly with the number of time-series in the head.
As @bitwalker already mentioned, adding new resources multiplies the cardinality of the apiserver's metrics, and it seems this volume of metrics can affect the apiserver itself, causing scrapes to be painfully slow. As for
A summary will always provide you with more precise data than a histogram |
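To make the precision tradeoff concrete, here is a rough query-side comparison; the summary series shown is hypothetical, since the metric is currently exposed as a histogram:

```promql
# Histogram: the quantile is estimated by interpolating between bucket
# boundaries, so precision is limited by how fine the buckets are.
histogram_quantile(0.99,
  sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket[5m]))
)

# Summary (hypothetical): the client would export a pre-computed quantile
# directly, e.g. apiserver_request_duration_seconds{quantile="0.99"},
# which is exact but cannot be re-aggregated across instances or labels.
```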
I think summaries have their own issues; they are more expensive to calculate, which is why histograms were preferred for this metric, at least as I understand the context. Of course, it may be that the tradeoff would have been better in this case, I don't know what kind of testing/benchmarking was done. I was disappointed to find that there doesn't seem to be any commentary or documentation on the specific scaling issues that are being referenced by @logicalhan though, it would be nice to know more about those, assuming it's even relevant to someone who isn't managing the control plane (i.e. those of us on GKE). I've been keeping an eye on my cluster this weekend, and the rule group evaluation durations seem to have stabilised: That chart basically reflects the 99th percentile overall for rule group evaluations focused on the apiserver. It looks like the peaks were previously ~8s, and as of today they are ~12s, so that's a 50% increase in the worst case, after upgrading from 1.20 to 1.21. It does appear that the 90th percentile is roughly equivalent to where it was before the upgrade now, discounting the weird peak right after the upgrade. Speaking of, I'm not sure why there was such a long drawn out period right after the upgrade where those rule groups were taking much, much longer (30s+), but I'll assume that is the cluster stabilising after the upgrade. Anyway, hope this additional follow-up info is helpful! |
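One Prometheus-side mitigation (my own suggestion, not an upstream recommendation) is to drop or trim the per-bucket series at ingestion time with metric_relabel_configs. Note that relabeling is applied after the scrape, so it reduces head series and rule-evaluation cost but does not make the scrape response itself faster:

```yaml
scrape_configs:
  - job_name: apiserver
    # ... existing kubernetes_sd_configs, TLS and authorization settings ...
    metric_relabel_configs:
      # Drop the per-bucket series entirely; the _sum and _count series are
      # kept and still allow average-latency and request-rate queries.
      - source_labels: [__name__]
        regex: apiserver_request_duration_seconds_bucket
        action: drop
```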
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the project's staleness rules.
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
We reduced the number of time-series in #106306 |
Background:
In the scope of #73638 and kubernetes-sigs/controller-runtime#1273, the number of buckets for this histogram was increased to 40(!):
Buckets: []float64{0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50, 60},
Problem
This forces anyone who still wants to monitor the apiserver to handle a huge number of metrics.
Because these metrics grow with the size of the cluster, they lead to a cardinality explosion and dramatically affect the performance and memory usage of Prometheus (or any other time-series DB, such as VictoriaMetrics).
E.g. from one of my clusters:
The apiserver_request_duration_seconds_bucket metric name has 7 times more values than any other.
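For anyone who wants to verify this on their own cluster, a common cardinality query (assuming direct query access to the Prometheus instance) is:

```promql
# Top 10 metric names by number of time-series currently in the TSDB;
# on the cluster described above, apiserver_request_duration_seconds_bucket
# comes out far ahead of everything else.
topk(10, count by (__name__) ({__name__=~".+"}))
```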
Proposal
There are a few possible solutions to this issue.
The first would be to allow the end user to define the buckets for the apiserver.
Pros: We still use histograms, which are cheap for the apiserver (though I'm not sure how well this holds up in the 40-bucket case 😃)
Cons:
The second one is to use a summary for this purpose.
Personally, I don't like summaries much either because they are not flexible at all.
That said, histograms require one to define buckets suitable for the use case; adding all possible options (as was done in the commits referenced above) is not a solution either.
Pros:
Cons:
Appreciate any feedback on this request.