Apiserver latency metrics create enormous amount of time-series #105346

Closed
herewasmike opened this issue Sep 29, 2021 · 11 comments
Labels
  • lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
  • sig/instrumentation: Categorizes an issue or PR as relevant to SIG Instrumentation.
  • sig/scalability: Categorizes an issue or PR as relevant to SIG Scalability.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@herewasmike

Background:

In the scope of #73638 and kubernetes-sigs/controller-runtime#1273, the number of buckets for this histogram was increased to 40(!):
Buckets: []float64{0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50, 60},

Problem

This forces anyone who still wants to monitor the apiserver to handle a huge number of metrics.
Because these metrics grow with the size of the cluster, this leads to a cardinality explosion and dramatically affects the performance and memory usage of Prometheus (or any other time-series database, such as VictoriaMetrics).

E.g., from one of my clusters:

apiserver_request_duration_seconds_bucket has 7 times more series than any other metric name.

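A query along these lines (an illustrative sketch, not the exact query behind the numbers above) shows which metric names contribute the most series:

  # top 10 metric names exposed by the apiserver, by series count
  topk(10, count by (__name__) ({__name__=~"apiserver_.+"}))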

Proposal

There are a few possible solutions to this issue.

One would be to allow the end user to define the buckets for the apiserver.

Pros: We still use histograms, which are cheap for the apiserver (though I'm not sure how well this works in the 40-bucket case 😃)
Cons:

  • Requires the end user to understand what is going on
  • Adds another moving part to the system (violates the KISS principle)
  • Doesn't work well when the load is not homogeneous (e.g. requests to some APIs are served within hundreds of milliseconds and others in 10-20 seconds)

The second option is to use a summary for this purpose.
Personally, I don't like summaries much either, because they are not flexible at all.
Histograms, though, require one to define buckets suitable for the case, and adding all possible options (as was done in the commits pointed to above) is not a solution.

Pros:

  • Significantly reduces the number of time series returned by the apiserver's metrics endpoint, since a summary uses one time series per defined percentile plus 2 (_sum and _count); see the query sketch below
  • Solves this issue entirely
  • Still simple and stupid

Cons:

  • Requires slightly more resources on the apiserver's side to calculate the percentiles
  • Percentiles have to be defined in code and can't be changed at runtime (though most use cases are covered by the 0.5, 0.95 and 0.99 percentiles, so personally I would just hardcode them)
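To make the difference concrete, here is a query sketch (the summary form is hypothetical; the apiserver does not export this metric as a summary today). With a summary, the pre-computed percentile is read directly; with the current histogram, it has to be estimated from the buckets at query time:

  # hypothetical summary: one series per configured quantile
  apiserver_request_duration_seconds{quantile="0.99"}

  # current histogram: estimate derived from the buckets
  histogram_quantile(0.99, sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket[5m])))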

Appreciate any feedback on this request.

@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 29, 2021
@herewasmike (Author)

I believe this should go to
/sig api-machinery

Please correct me if I'm wrong

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 29, 2021
@jpbetz (Contributor) commented Sep 30, 2021

/sig instrumentation

@k8s-ci-robot k8s-ci-robot added the sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. label Sep 30, 2021
@fedebongio (Contributor)

/assign @logicalhan
(assigning to sig instrumentation)
/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Oct 5, 2021
@logicalhan (Member)

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 7, 2021
@logicalhan (Member)

/sig scalability
/cc @wojtek-t

These buckets were added quite deliberately, and this is quite possibly the most important metric served by the apiserver. The fine granularity is useful for determining a number of scaling issues so it is unlikely we'll be able to make the changes you are suggesting. If you are having issues with ingestion (i.e. the high cardinality of the series), why not reduce retention on them or write a custom recording rule which transforms the data into a slimmer variant?
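For the recording-rule route, a sketch along these lines (the rule name and label grouping are illustrative, not an official recommendation) pre-aggregates the histogram into a handful of series that dashboards and alerts can use instead of the raw buckets:

  groups:
    - name: apiserver-latency
      rules:
        # per-verb p99 over 5m windows, computed once at rule-evaluation time
        - record: verb:apiserver_request_duration_seconds:p99_5m
          expr: |
            histogram_quantile(0.99,
              sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket[5m])))

Note that this only slims down what downstream queries touch; the raw buckets are still scraped and ingested, which is the part of the cost discussed further below.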

@k8s-ci-robot k8s-ci-robot added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Oct 7, 2021
@wojtek-t (Member) commented Oct 8, 2021

These buckets were added quite deliberately, and this is quite possibly the most important metric served by the apiserver. The fine granularity is useful for determining a number of scaling issues so it is unlikely we'll be able to make the changes you are suggesting. If you are having issues with ingestion (i.e. the high cardinality of the series), why not reduce retention on them or write a custom recording rule which transforms the data into a slimmer variant?

+1 to all of that

Because these metrics grow with the size of the cluster, this leads to a cardinality explosion and dramatically affects the performance and memory usage of Prometheus (or any other time-series database, such as VictoriaMetrics).

I don't understand this - how do they grow with cluster size? The buckets are constant.

[FWIW - we're monitoring it for every GKE cluster and it works for us...]

@bitwalker commented Oct 9, 2021

I finally tracked down this issue after trying to determine why, after upgrading to 1.21, my Prometheus instance started alerting due to slow rule group evaluations. After some digging, it turned out that the problem is that simply scraping the apiserver's metrics endpoint regularly takes around 5-10s, which ends up causing the rule groups that use those metrics to fall behind, hence the alerts.

My cluster is running in GKE, with 8 nodes, and I'm at a bit of a loss as to how I'm supposed to make sure that scraping this endpoint takes a reasonable amount of time. Are the series reset after every scrape, so that scraping more frequently will actually be faster? Regardless, 5-10s for a small cluster like mine seems outrageously expensive. If there is a recommended approach to deal with this, I'd love to know what that is, as the issue for me isn't storage or retention of high-cardinality series; it's that the metrics endpoint itself is very slow to respond due to all of the time series.

@wojtek-t Since you are also running on GKE, perhaps you have some idea what I've missed?

I don't understand this - how do they grow with cluster size? The buckets are constant.

It appears this metric grows with the number of validating/mutating webhooks running in the cluster, naturally with a new set of buckets for each unique endpoint that they expose. Here's a subset of some URLs I see reported by this metric in my cluster:

https://[::1]:443/<snip>
https://actions-runner-controller-webhook.actions-runner-system.svc:443/mutate-actions-summerwind-dev-v1alpha1-runner?timeout=30s
https://actions-runner-controller-webhook.actions-runner-system.svc:443/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment?timeout=30s
https://actions-runner-controller-webhook.actions-runner-system.svc:443/mutate-actions-summerwind-dev-v1alpha1-runnerreplicaset?timeout=30s
https://cert-manager-webhook.cert-manager.svc:443/convert?timeout=30s
https://cnrm-validating-webhook.cnrm-system.svc:443/deny-immutable-field-updates?timeout=30s
https://cnrm-validating-webhook.cnrm-system.svc:443/deny-unknown-fields?timeout=30s
https://container.googleapis.com/%7Bprefix%7D
https://istiod.istio-system.svc:443/inject?timeout=10s

Not sure how helpful that is, but I imagine that's what was meant by @herewasmike

EDIT: For some additional information, an unfiltered query on apiserver_request_duration_seconds_bucket returns 17420 series.
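For anyone who wants to check their own cluster, a query like the following (my guess at how to reproduce this figure, not necessarily the exact query used above) returns the series count behind the metric:

  count(apiserver_request_duration_seconds_bucket)
  # or broken down per apiserver instance:
  count by (instance) (apiserver_request_duration_seconds_bucket)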

@herewasmike (Author)

@logicalhan

If you are having issues with ingestion (i.e. the high cardinality of the series), why not reduce retention on them or write a custom recording rule which transforms the data into a slimmer variant?

Prometheus uses memory mainly for ingesting time series into the head block.
And retention only affects disk usage, after metrics have already been flushed, not before.
Changing the scrape interval won't help much either, because it's really cheap to ingest a new point into an existing time series (it's essentially just a value and a timestamp), while a lot of memory (~8 KB per series) is required to store the time series itself (name, labels, etc.).
It would be possible to set up federation and some recording rules, though this looks like unwanted complexity to me and won't solve the original issue of RAM usage.

Memory usage in Prometheus grows roughly linearly with the number of time series in the head block.
For now I worked around this by simply dropping more than half of the buckets (you can do so at the price of precision in your histogram_quantile calculations, as described in https://www.robustperception.io/why-are-prometheus-histograms-cumulative).
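For anyone who wants to do the same, a metric_relabel_configs fragment along these lines (a sketch of the approach; the exact set of le values to drop is up to you, and if you use the prometheus-operator the equivalent field is metricRelabelings) drops a chosen subset of buckets at scrape time:

  metric_relabel_configs:
    # Drop a subset of the 'le' buckets for this one metric. The histogram
    # stays cumulative, so histogram_quantile() still works, just more coarsely.
    - source_labels: [__name__, le]
      regex: 'apiserver_request_duration_seconds_bucket;(0\.15|0\.25|0\.35|0\.45|0\.6|0\.7|0\.8|0\.9|1\.25|1\.75|2\.5|3\.5|4\.5|6|8|9|15|25|40|50)'
      action: drop

The remaining buckets, together with _sum and _count, are untouched, so rate() and histogram_quantile() keep working, just with coarser resolution.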

@wojtek-t

I don't understand this - how do they grow with cluster size? The buckets are constant.

As @bitwalker already mentioned, adding new resources multiplies the cardinality of the apiserver's metrics.
And as the cluster grows you add more of them, introducing more and more time series (this is an indirect dependency, but still a pain point).

And it seems like this volume of metrics can affect the apiserver itself, causing scrapes to be painfully slow.

As for

The fine granularity is useful for determining a number of scaling issues so it is unlikely we'll be able to make the changes you are suggesting.

A summary will always provide you with more precise data than a histogram:
https://prometheus.io/docs/practices/histograms/#errors-of-quantile-estimation

@bitwalker

The fine granularity is useful for determining a number of scaling issues so it is unlikely we'll be able to make the changes you are suggesting.

A summary will always provide you with more precise data than a histogram:
https://prometheus.io/docs/practices/histograms/#errors-of-quantile-estimation

I think summaries have their own issues; they are more expensive to calculate, which is why histograms were preferred for this metric, at least as I understand the context. Of course, it may be that the tradeoff would have been better in this case; I don't know what kind of testing/benchmarking was done.

I was disappointed to find that there doesn't seem to be any commentary or documentation on the specific scaling issues referenced by @logicalhan, though; it would be nice to know more about those, assuming it's even relevant to someone who isn't managing the control plane (i.e. those of us on GKE).

I've been keeping an eye on my cluster this weekend, and the rule group evaluation durations seem to have stabilised:

[screenshot: rule group evaluation durations, Oct 10, 2021]

That chart basically reflects the overall 99th percentile for rule group evaluations focused on the apiserver. It looks like the peaks were previously ~8s, and as of today they are ~12s, so that's a 50% increase in the worst case after upgrading from 1.20 to 1.21. The 90th percentile now does appear to be roughly equivalent to where it was before the upgrade, discounting the weird peak right after it. Speaking of which, I'm not sure why there was such a long drawn-out period right after the upgrade where those rule groups were taking much, much longer (30s+), but I'll assume that was the cluster stabilizing after the upgrade.

Anyway, hope this additional follow up info is helpful!

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 9, 2022
@wojtek-t (Member)

We reduced the number of time series in #106306.
At this point, we're not able to go visibly lower than that.
