Apiserver latency metrics create enormous amount of time-series #105346

Closed
herewasmike opened this issue Sep 29, 2021 · 11 comments
Labels
  • lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
  • sig/instrumentation: Categorizes an issue or PR as relevant to SIG Instrumentation.
  • sig/scalability: Categorizes an issue or PR as relevant to SIG Scalability.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@herewasmike

Background:

In the scope of #73638 and kubernetes-sigs/controller-runtime#1273, the number of buckets for this histogram was increased to 40(!):
Buckets: []float64{0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 40, 50, 60},

Problem

This forces anyone who still wants to monitor the apiserver to handle a huge number of metrics.
Because these metrics grow with the size of the cluster, this leads to a cardinality explosion and dramatically affects the performance and memory usage of Prometheus (or any other time-series database, such as VictoriaMetrics).

E.g., from one of my clusters:

apiserver_request_duration_seconds_bucket has 7 times more series than any other metric name.

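A query along these lines (an illustrative sketch, not the exact query behind the numbers above) shows which metric names contribute the most series:

  # top 10 metric names exposed by the apiserver, by series count
  topk(10, count by (__name__) ({__name__=~"apiserver_.+"}))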

Proposal

There are a few possible solutions to this issue.

One would be to allow the end user to define the buckets for the apiserver.

Pros: We still use histograms, which are cheap for the apiserver (though I'm not sure how well this works in the 40-bucket case 😃)
Cons:

  • Requires the end user to understand what is going on
  • Adds another moving part to the system (violates the KISS principle)
  • Doesn't work well when the load is not homogeneous (e.g. requests to some APIs are served within hundreds of milliseconds and others in 10-20 seconds)

The second option is to use a summary for this purpose.
Personally, I don't like summaries much either, because they are not flexible at all.
Histograms, though, require one to define buckets suitable for the case, and adding all possible options (as was done in the commits pointed to above) is not a solution.

Pros:

  • Significantly reduces the number of time series returned by the apiserver's metrics endpoint, since a summary uses one time series per defined percentile plus 2 (_sum and _count); see the query sketch below
  • Solves this issue entirely
  • Still simple and stupid

Cons:

  • Requires slightly more resources on the apiserver's side to calculate the percentiles
  • Percentiles have to be defined in code and can't be changed at runtime (though most use cases are covered by the 0.5, 0.95 and 0.99 percentiles, so personally I would just hardcode them)
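To make the difference concrete, here is a query sketch (the summary form is hypothetical; the apiserver does not export this metric as a summary today). With a summary, the pre-computed percentile is read directly; with the current histogram, it has to be estimated from the buckets at query time:

  # hypothetical summary: one series per configured quantile
  apiserver_request_duration_seconds{quantile="0.99"}

  # current histogram: estimate derived from the buckets
  histogram_quantile(0.99, sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket[5m])))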

Appreciate any feedback on this request.

@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Sep 29, 2021
@herewasmike (Author)

I believe this should go to
/sig api-machinery

Please correct me if I'm wrong

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 29, 2021
@jpbetz (Contributor) commented Sep 30, 2021

/sig instrumentation

@k8s-ci-robot k8s-ci-robot added the sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. label Sep 30, 2021
@fedebongio (Contributor)

/assign @logicalhan
(assigning to sig instrumentation)
/remove-sig api-machinery

@k8s-ci-robot k8s-ci-robot removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Oct 5, 2021
@logicalhan (Member)

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 7, 2021
@logicalhan (Member)

/sig scalability
/cc @wojtek-t

These buckets were added quite deliberately, and this is quite possibly the most important metric served by the apiserver. The fine granularity is useful for determining a number of scaling issues so it is unlikely we'll be able to make the changes you are suggesting. If you are having issues with ingestion (i.e. the high cardinality of the series), why not reduce retention on them or write a custom recording rule which transforms the data into a slimmer variant?
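For the recording-rule route, a sketch along these lines (the rule name and label grouping are illustrative, not an official recommendation) pre-aggregates the histogram into a handful of series that dashboards and alerts can use instead of the raw buckets:

  groups:
    - name: apiserver-latency
      rules:
        # per-verb p99 over 5m windows, computed once at rule-evaluation time
        - record: verb:apiserver_request_duration_seconds:p99_5m
          expr: |
            histogram_quantile(0.99,
              sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket[5m])))

Note that this only slims down what downstream queries touch; the raw buckets are still scraped and ingested, which is the part of the cost discussed further below.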

@k8s-ci-robot k8s-ci-robot added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Oct 7, 2021
@wojtek-t (Member) commented Oct 8, 2021

These buckets were added quite deliberately, and this is quite possibly the most important metric served by the apiserver. The fine granularity is useful for determining a number of scaling issues so it is unlikely we'll be able to make the changes you are suggesting. If you are having issues with ingestion (i.e. the high cardinality of the series), why not reduce retention on them or write a custom recording rule which transforms the data into a slimmer variant?

+1 to all of that

Because these metrics grow with the size of the cluster, this leads to a cardinality explosion and dramatically affects the performance and memory usage of Prometheus (or any other time-series database, such as VictoriaMetrics).

I don't understand this - how do they grow with cluster size? The buckets are constant.

[FWIW - we're monitoring it for every GKE cluster and it works for us...]

@bitwalker commented Oct 9, 2021

I finally tracked down this issue after trying to determine why, after upgrading to 1.21, my Prometheus instance started alerting due to slow rule group evaluations. After some digging, it turned out that the problem is that simply scraping the apiserver's metrics endpoint regularly takes around 5-10s, which ends up causing the rule groups that use those metrics to fall behind, hence the alerts.

My cluster is running in GKE, with 8 nodes, and I'm at a bit of a loss as to how I'm supposed to make sure that scraping this endpoint takes a reasonable amount of time. Are the series reset after every scrape, so that scraping more frequently will actually be faster? Regardless, 5-10s for a small cluster like mine seems outrageously expensive. If there is a recommended approach to deal with this, I'd love to know what that is, as the issue for me isn't storage or retention of high-cardinality series; it's that the metrics endpoint itself is very slow to respond due to all of the time series.

@wojtek-t Since you are also running on GKE, perhaps you have some idea what I've missed?

I don't understand this - how do they grow with cluster size? The buckets are constant.

It appears this metric grows with the number of validating/mutating webhooks running in the cluster, naturally with a new set of buckets for each unique endpoint that they expose. Here's a subset of some URLs I see reported by this metric in my cluster:

https://[::1]:443/<snip>
https://actions-runner-controller-webhook.actions-runner-system.svc:443/mutate-actions-summerwind-dev-v1alpha1-runner?timeout=30s
https://actions-runner-controller-webhook.actions-runner-system.svc:443/mutate-actions-summerwind-dev-v1alpha1-runnerdeployment?timeout=30s
https://actions-runner-controller-webhook.actions-runner-system.svc:443/mutate-actions-summerwind-dev-v1alpha1-runnerreplicaset?timeout=30s
https://cert-manager-webhook.cert-manager.svc:443/convert?timeout=30s
https://cnrm-validating-webhook.cnrm-system.svc:443/deny-immutable-field-updates?timeout=30s
https://cnrm-validating-webhook.cnrm-system.svc:443/deny-unknown-fields?timeout=30s
https://container.googleapis.com/%7Bprefix%7D
https://istiod.istio-system.svc:443/inject?timeout=10s

Not sure how helpful that is, but I imagine that's what was meant by @herewasmike

EDIT: For some additional information, an unfiltered query on apiserver_request_duration_seconds_bucket returns 17420 series.
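For anyone who wants to check their own cluster, a query like the following (my guess at how to reproduce this figure, not necessarily the exact query used above) returns the series count behind the metric:

  count(apiserver_request_duration_seconds_bucket)
  # or broken down per apiserver instance:
  count by (instance) (apiserver_request_duration_seconds_bucket)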

@herewasmike (Author)

@logicalhan

If you are having issues with ingestion (i.e. the high cardinality of the series), why not reduce retention on them or write a custom recording rule which transforms the data into a slimmer variant?

Prometheus uses memory mainly for ingesting time series into the head block.
And retention only affects disk usage, after metrics have already been flushed, not before.
Changing the scrape interval won't help much either, because it's really cheap to ingest a new point into an existing time series (it's essentially just a value and a timestamp), while a lot of memory (~8 KB per series) is required to store the time series itself (name, labels, etc.).
It would be possible to set up federation and some recording rules, though this looks like unwanted complexity to me and won't solve the original issue of RAM usage.

Memory usage in Prometheus grows roughly linearly with the number of time series in the head block.
For now I worked around this by simply dropping more than half of the buckets (you can do so at the price of precision in your histogram_quantile calculations, as described in https://www.robustperception.io/why-are-prometheus-histograms-cumulative).
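For anyone who wants to do the same, a metric_relabel_configs fragment along these lines (a sketch of the approach; the exact set of le values to drop is up to you, and if you use the prometheus-operator the equivalent field is metricRelabelings) drops a chosen subset of buckets at scrape time:

  metric_relabel_configs:
    # Drop a subset of the 'le' buckets for this one metric. The histogram
    # stays cumulative, so histogram_quantile() still works, just more coarsely.
    - source_labels: [__name__, le]
      regex: 'apiserver_request_duration_seconds_bucket;(0\.15|0\.25|0\.35|0\.45|0\.6|0\.7|0\.8|0\.9|1\.25|1\.75|2\.5|3\.5|4\.5|6|8|9|15|25|40|50)'
      action: drop

The remaining buckets, together with _sum and _count, are untouched, so rate() and histogram_quantile() keep working, just with coarser resolution.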

@wojtek-t

I don't understand this - how do they grow with cluster size? The buckets are constant.

As @bitwalker already mentioned, adding new resources multiplies the cardinality of the apiserver's metrics.
And as the cluster grows you add more of them, introducing more and more time series (this is an indirect dependency, but still a pain point).

And it seems like this volume of metrics can affect the apiserver itself, causing scrapes to be painfully slow.

As for

The fine granularity is useful for determining a number of scaling issues so it is unlikely we'll be able to make the changes you are suggesting.

A summary will always provide you with more precise data than a histogram:
https://prometheus.io/docs/practices/histograms/#errors-of-quantile-estimation

@bitwalker

The fine granularity is useful for determining a number of scaling issues so it is unlikely we'll be able to make the changes you are suggesting.

A summary will always provide you with more precise data than a histogram:
https://prometheus.io/docs/practices/histograms/#errors-of-quantile-estimation

I think summaries have their own issues; they are more expensive to calculate, which is why histograms were preferred for this metric, at least as I understand the context. Of course, it may be that the tradeoff would have been better in this case; I don't know what kind of testing/benchmarking was done.

I was disappointed to find that there doesn't seem to be any commentary or documentation on the specific scaling issues referenced by @logicalhan, though; it would be nice to know more about those, assuming it's even relevant to someone who isn't managing the control plane (i.e. those of us on GKE).

I've been keeping an eye on my cluster this weekend, and the rule group evaluation durations seem to have stabilised:

[screenshot: rule group evaluation durations, Oct 10, 2021]

That chart basically reflects the overall 99th percentile for rule group evaluations focused on the apiserver. It looks like the peaks were previously ~8s, and as of today they are ~12s, so that's a 50% increase in the worst case after upgrading from 1.20 to 1.21. The 90th percentile now does appear to be roughly equivalent to where it was before the upgrade, discounting the weird peak right after it. Speaking of which, I'm not sure why there was such a long drawn-out period right after the upgrade where those rule groups were taking much, much longer (30s+), but I'll assume that was the cluster stabilizing after the upgrade.

Anyway, hope this additional follow up info is helpful!

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 9, 2022
@wojtek-t (Member)

We reduced the number of time series in #106306.
At this point, we're not able to go visibly lower than that.
