
go metrics are tightly coupled with used go version #994

Open
prymitive opened this issue Mar 4, 2022 · 12 comments

@prymitive
Contributor

I've upgraded from Go 1.17.7 to 1.17.8 and noticed that the exported metric labels changed as a result.
The le label on some histograms now uses different values than it used to:

--- a/metrics.txt
+++ b/metrics.txt
@@ -105,13 +105,13 @@ go_gc_heap_tiny_allocs_objects_total
 go_gc_pauses_seconds_total_bucket{le="-5e-324"}
 go_gc_pauses_seconds_total_bucket{le="9.999999999999999e-10"}
 go_gc_pauses_seconds_total_bucket{le="9.999999999999999e-09"}
-go_gc_pauses_seconds_total_bucket{le="1.2799999999999998e-07"}
-go_gc_pauses_seconds_total_bucket{le="1.2799999999999998e-06"}
-go_gc_pauses_seconds_total_bucket{le="1.6383999999999998e-05"}
-go_gc_pauses_seconds_total_bucket{le="0.00016383999999999998"}
-go_gc_pauses_seconds_total_bucket{le="0.0020971519999999997"}
-go_gc_pauses_seconds_total_bucket{le="0.020971519999999997"}
-go_gc_pauses_seconds_total_bucket{le="0.26843545599999996"}
+go_gc_pauses_seconds_total_bucket{le="9.999999999999998e-08"}
+go_gc_pauses_seconds_total_bucket{le="1.0239999999999999e-06"}
+go_gc_pauses_seconds_total_bucket{le="1.0239999999999999e-05"}
+go_gc_pauses_seconds_total_bucket{le="0.00010239999999999998"}
+go_gc_pauses_seconds_total_bucket{le="0.0010485759999999998"}
+go_gc_pauses_seconds_total_bucket{le="0.010485759999999998"}
+go_gc_pauses_seconds_total_bucket{le="0.10485759999999998"}
 go_gc_pauses_seconds_total_bucket{le="+Inf"}
 go_gc_pauses_seconds_total_sum NaN
 go_gc_pauses_seconds_total_count
@@ -240,13 +240,13 @@ go_sched_goroutines_goroutines
 go_sched_latencies_seconds_bucket{le="-5e-324"}
 go_sched_latencies_seconds_bucket{le="9.999999999999999e-10"}
 go_sched_latencies_seconds_bucket{le="9.999999999999999e-09"}
-go_sched_latencies_seconds_bucket{le="1.2799999999999998e-07"}
-go_sched_latencies_seconds_bucket{le="1.2799999999999998e-06"}
-go_sched_latencies_seconds_bucket{le="1.6383999999999998e-05"}
-go_sched_latencies_seconds_bucket{le="0.00016383999999999998"}
-go_sched_latencies_seconds_bucket{le="0.0020971519999999997"}
-go_sched_latencies_seconds_bucket{le="0.020971519999999997"}
-go_sched_latencies_seconds_bucket{le="0.26843545599999996"}
+go_sched_latencies_seconds_bucket{le="9.999999999999998e-08"}
+go_sched_latencies_seconds_bucket{le="1.0239999999999999e-06"}
+go_sched_latencies_seconds_bucket{le="1.0239999999999999e-05"}
+go_sched_latencies_seconds_bucket{le="0.00010239999999999998"}
+go_sched_latencies_seconds_bucket{le="0.0010485759999999998"}
+go_sched_latencies_seconds_bucket{le="0.010485759999999998"}
+go_sched_latencies_seconds_bucket{le="0.10485759999999998"}
 go_sched_latencies_seconds_bucket{le="+Inf"}
 go_sched_latencies_seconds_sum NaN
 go_sched_latencies_seconds_count

It's not a huge problem, but it makes it harder to compare metrics between two instances of the same service compiled with different Go versions. This is likely related to #967.

@kakkoyun
Member

@prymitive Thanks for opening the issue.

The raw metric values of a histogram aren't designed to be compared directly; the expectation is to use PromQL queries to handle the calculations. The buckets are dynamically generated, and this is expected.

I don't see any action items here, do you have any suggestions?

@prymitive
Contributor Author

If buckets are dynamically generated, is it possible to get a different set of buckets on each scrape?
Is this going to introduce metrics churn if there's enough variance between GC runs?

@08d2

08d2 commented Mar 26, 2022

I think this metric comes from runtime/metrics, and the Float64Histogram type states

For a given metric name, the value of Buckets is guaranteed not to change between calls until program exit

Which I guess implies the buckets can change from instance to instance, and therefore that this isn't actually appropriate to model as a Prometheus histogram, which expects bucket definitions to be more or less static?
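
For illustration, here is a minimal sketch (not part of client_golang) that reads the GC pause histogram straight from runtime/metrics and prints the bucket boundaries the runtime reports; "/gc/pauses:seconds" is the metric name as of Go 1.17 and may differ in other releases:

package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// The boundaries come from the runtime itself, which is why they
	// change when the toolchain changes them.
	samples := []metrics.Sample{{Name: "/gc/pauses:seconds"}}
	metrics.Read(samples)

	if samples[0].Value.Kind() != metrics.KindFloat64Histogram {
		fmt.Println("histogram not available in this Go version")
		return
	}
	h := samples[0].Value.Float64Histogram()
	// Buckets holds len(Counts)+1 boundaries; they are fixed for the
	// lifetime of the process, but not across Go versions.
	for _, b := range h.Buckets {
		fmt.Println(b)
	}
}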

@bwplotka
Member

bwplotka commented Mar 28, 2022

I think this is indeed problematic. Let's find a solution to this - it would be ideal if we could incorporate static buckets. This can impact cardinality a lot here.

@bwplotka
Member

cc @mknyszek for awareness.

@mknyszek
Contributor

The buckets changed because I discovered they were very incorrect. I don't expect this to happen often (or again, even). See golang/go#50732.

@mknyszek
Contributor

But, yes, the API is specified such that the buckets can change. If that doesn't work for Prometheus, then the Prometheus side needs its own bucket-choosing algorithm and has to aggregate the runtime/metrics buckets into those buckets, with some error.

This is somewhat surprising to me because these histograms (at least on the runtime/metrics side) are intended to be used to look at and compare distributions. You can do that even if the buckets are different.
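
As an illustration of that idea (this is not client_golang's actual code), one could fold the runtime's version-dependent buckets into a fixed set of upper bounds, accepting some error wherever a source bucket straddles a target bound:

package rebucket

import "runtime/metrics"

// Rebucket folds a runtime histogram into fixed, version-independent
// cumulative "le" buckets. bounds must be sorted in ascending order.
// Counts whose source bucket straddles a target bound are attributed to
// the target bucket containing the source bucket's upper edge, which is
// where the error comes from.
func Rebucket(h *metrics.Float64Histogram, bounds []float64) (cum []uint64, total uint64) {
	cum = make([]uint64, len(bounds))
	for i, count := range h.Counts {
		upper := h.Buckets[i+1] // source bucket i spans (Buckets[i], Buckets[i+1]]
		total += count
		for j, b := range bounds {
			if upper <= b {
				cum[j] += count
				break
			}
		}
	}
	// Turn per-bucket counts into cumulative counts, as Prometheus expects;
	// anything above the last bound only shows up in total (the +Inf bucket).
	for j := 1; j < len(cum); j++ {
		cum[j] += cum[j-1]
	}
	return cum, total
}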

@08d2

08d2 commented Mar 28, 2022

This is somewhat surprising to me because these histograms (at least on the runtime/metrics side) are intended to be used to look at and compare distributions. You can do that even if the buckets are different.

How can histograms with disparate bucketing be meaningfully aggregated and compared?

@mknyszek
Contributor

There are many ways to do this, depending on how you want things visualized. Fundamentally, histograms approximate some underlying distribution; a histogram is just a reasonably efficient way to represent one. So you can extract and look at the distribution instead of the buckets directly.

For example, you could take two histograms with different bucketing, compute a CDF from both, and plot them on the same chart. If you are willing to make some distributional assumptions, such as a normal or geometric distribution, you can also summarize the histogram with percentiles.
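
A minimal sketch of the CDF approach (the names here are illustrative, not from any library): convert each histogram into (upper bound, cumulative fraction) points, which can then be drawn on one chart even when the two histograms use different bucket boundaries:

package cdf

import "runtime/metrics"

// Point is one step of an empirical CDF: the estimated fraction of
// observations at or below X.
type Point struct {
	X, P float64
}

func CDF(h *metrics.Float64Histogram) []Point {
	var total uint64
	for _, c := range h.Counts {
		total += c
	}
	if total == 0 {
		return nil
	}
	pts := make([]Point, 0, len(h.Counts))
	var running uint64
	for i, c := range h.Counts {
		running += c
		// Buckets[i+1] is the upper edge of bucket i (may be +Inf).
		pts = append(pts, Point{X: h.Buckets[i+1], P: float64(running) / float64(total)})
	}
	return pts
}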

In this sense, histograms are somewhat low-level. The Go runtime exposes a time histogram with a pretty fine granularity (which is currently squashed in the Prometheus code to reduce metric cardinality) as a stepping-stone toward producing more robust analyses.

There have been enough issues about this that perhaps this is just the wrong Prometheus data structure to represent the Go runtime histograms. We could use a summary instead, or some other collection of metrics.

@bwplotka
Member

Yeah, I think we already merge buckets to have fewer of them, so the easiest thing is to merge them into a stable set of buckets (A). Moving to summaries is not bad either, it just has some instrumentation overhead, plus it is harder to use later on (not aggregatable) (B).

We could do A and hope for better histograms in Prometheus (a feature that should land soon).

@beorn7
Member

beorn7 commented Apr 12, 2022

Summaries are not aggregatable, indeed, but that might not be so bad in the case of GC (where I assume you are mostly interested in individual processes, not necessarily many in aggregate). But Summaries also have static precalculated quantiles, which you cannot change in retrospect. (E.g. they are always the 99th perc, 90th perc, and median over the last 10m. If you later decide you would like to see the 75th percentile over the last 5m, you are lost.)
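
For context, this is roughly what that looks like with client_golang's Summary: the quantiles (with their error tolerances) and the sliding window are fixed at creation time; the metric name below is made up:

package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var gcPauses = prometheus.NewSummary(prometheus.SummaryOpts{
	Name: "example_gc_pause_seconds", // illustrative name only
	Help: "GC pause durations.",
	// Median, 90th and 99th percentile over a 10m sliding window are
	// baked in here; a 75th percentile or a 5m window cannot be
	// recovered later from the exported data.
	Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
	MaxAge:     10 * time.Minute,
})

func init() {
	prometheus.MustRegister(gcPauses)
}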

In summary (no pun intended), I find Histograms quite interesting here, but they are also very expensive and inaccurate. The new histograms will help with that.

Maybe you can make all of this configurable for now (I guess making these expensive histograms optional was considered anyway). Or just decide what the least bad solution is.

@stale

stale bot commented Oct 16, 2022

Hello 👋 Looks like there was no activity on this issue for the last 3 months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next 4 weeks, this issue will be closed (we can always reopen an issue if we need!).
