
go metrics are tightly coupled with used go version #994

Open
prymitive opened this issue Mar 4, 2022 · 12 comments

@prymitive
Contributor

I've upgraded from Go 1.17.7 to 1.17.8 and noticed that the exported metric labels changed as a result.
The le label on some histograms now uses different values than it used to:

--- a/metrics.txt
+++ b/metrics.txt
@@ -105,13 +105,13 @@ go_gc_heap_tiny_allocs_objects_total
 go_gc_pauses_seconds_total_bucket{le="-5e-324"}
 go_gc_pauses_seconds_total_bucket{le="9.999999999999999e-10"}
 go_gc_pauses_seconds_total_bucket{le="9.999999999999999e-09"}
-go_gc_pauses_seconds_total_bucket{le="1.2799999999999998e-07"}
-go_gc_pauses_seconds_total_bucket{le="1.2799999999999998e-06"}
-go_gc_pauses_seconds_total_bucket{le="1.6383999999999998e-05"}
-go_gc_pauses_seconds_total_bucket{le="0.00016383999999999998"}
-go_gc_pauses_seconds_total_bucket{le="0.0020971519999999997"}
-go_gc_pauses_seconds_total_bucket{le="0.020971519999999997"}
-go_gc_pauses_seconds_total_bucket{le="0.26843545599999996"}
+go_gc_pauses_seconds_total_bucket{le="9.999999999999998e-08"}
+go_gc_pauses_seconds_total_bucket{le="1.0239999999999999e-06"}
+go_gc_pauses_seconds_total_bucket{le="1.0239999999999999e-05"}
+go_gc_pauses_seconds_total_bucket{le="0.00010239999999999998"}
+go_gc_pauses_seconds_total_bucket{le="0.0010485759999999998"}
+go_gc_pauses_seconds_total_bucket{le="0.010485759999999998"}
+go_gc_pauses_seconds_total_bucket{le="0.10485759999999998"}
 go_gc_pauses_seconds_total_bucket{le="+Inf"}
 go_gc_pauses_seconds_total_sum NaN
 go_gc_pauses_seconds_total_count
@@ -240,13 +240,13 @@ go_sched_goroutines_goroutines
 go_sched_latencies_seconds_bucket{le="-5e-324"}
 go_sched_latencies_seconds_bucket{le="9.999999999999999e-10"}
 go_sched_latencies_seconds_bucket{le="9.999999999999999e-09"}
-go_sched_latencies_seconds_bucket{le="1.2799999999999998e-07"}
-go_sched_latencies_seconds_bucket{le="1.2799999999999998e-06"}
-go_sched_latencies_seconds_bucket{le="1.6383999999999998e-05"}
-go_sched_latencies_seconds_bucket{le="0.00016383999999999998"}
-go_sched_latencies_seconds_bucket{le="0.0020971519999999997"}
-go_sched_latencies_seconds_bucket{le="0.020971519999999997"}
-go_sched_latencies_seconds_bucket{le="0.26843545599999996"}
+go_sched_latencies_seconds_bucket{le="9.999999999999998e-08"}
+go_sched_latencies_seconds_bucket{le="1.0239999999999999e-06"}
+go_sched_latencies_seconds_bucket{le="1.0239999999999999e-05"}
+go_sched_latencies_seconds_bucket{le="0.00010239999999999998"}
+go_sched_latencies_seconds_bucket{le="0.0010485759999999998"}
+go_sched_latencies_seconds_bucket{le="0.010485759999999998"}
+go_sched_latencies_seconds_bucket{le="0.10485759999999998"}
 go_sched_latencies_seconds_bucket{le="+Inf"}
 go_sched_latencies_seconds_sum NaN
 go_sched_latencies_seconds_count

It's not a huge problem, but it makes it harder to compare metrics between two instances of the same service compiled with different Go versions. This is likely related to #967.

@kakkoyun
Member

@prymitive Thanks for opening the issue.

The raw metric values of a histogram aren't designed to be compared directly; the expectation is to use PromQL queries to handle the calculations. The buckets are dynamically generated, and this is expected.

I don't see any action items here, do you have any suggestions?

@prymitive
Contributor Author

If buckets are dynamically generated, is it possible to get a different set of buckets on each scrape?
Is this going to introduce metrics churn if there's enough variance between GC runs?

@08d2

08d2 commented Mar 26, 2022

I think this metric comes from runtime/metrics, and the Float64Histogram type states

For a given metric name, the value of Buckets is guaranteed not to change between calls until program exit

Which I guess implies the buckets can change from instance to instance, and therefore that this isn't actually appropriate to model as a Prometheus histogram, which expects bucket definitions to be more or less static?
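
For illustration, here is a minimal sketch (not part of client_golang) that reads the GC pause histogram straight from runtime/metrics and prints the bucket boundaries the runtime reports; "/gc/pauses:seconds" is the metric name as of Go 1.17 and may differ in other releases:

package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// The boundaries come from the runtime itself, which is why they
	// change when the toolchain changes them.
	samples := []metrics.Sample{{Name: "/gc/pauses:seconds"}}
	metrics.Read(samples)

	if samples[0].Value.Kind() != metrics.KindFloat64Histogram {
		fmt.Println("histogram not available in this Go version")
		return
	}
	h := samples[0].Value.Float64Histogram()
	// Buckets holds len(Counts)+1 boundaries; they are fixed for the
	// lifetime of the process, but not across Go versions.
	for _, b := range h.Buckets {
		fmt.Println(b)
	}
}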

@bwplotka
Member

bwplotka commented Mar 28, 2022

I think this is indeed problematic. Let's find a solution to this - it would be ideal if we could incorporate static buckets. This can impact cardinality a lot here.

@bwplotka
Member

cc @mknyszek for awareness.

@mknyszek
Contributor

The buckets changed because I discovered they were very incorrect. I don't expect this to happen often (or again, even). See golang/go#50732.

@mknyszek
Contributor

But, yes, the API is specified such that the buckets can change. If that doesn't work for Prometheus, then the Prometheus side needs its own bucket-choosing algorithm and has to aggregate the runtime/metrics buckets into those buckets, with some error.

This is somewhat surprising to me because these histograms (at least on the runtime/metrics side) are intended to be used to look at and compare distributions. You can do that even if the buckets are different.
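
As an illustration of that idea (this is not client_golang's actual code), one could fold the runtime's version-dependent buckets into a fixed set of upper bounds, accepting some error wherever a source bucket straddles a target bound:

package rebucket

import "runtime/metrics"

// Rebucket folds a runtime histogram into fixed, version-independent
// cumulative "le" buckets. bounds must be sorted in ascending order.
// Counts whose source bucket straddles a target bound are attributed to
// the target bucket containing the source bucket's upper edge, which is
// where the error comes from.
func Rebucket(h *metrics.Float64Histogram, bounds []float64) (cum []uint64, total uint64) {
	cum = make([]uint64, len(bounds))
	for i, count := range h.Counts {
		upper := h.Buckets[i+1] // source bucket i spans (Buckets[i], Buckets[i+1]]
		total += count
		for j, b := range bounds {
			if upper <= b {
				cum[j] += count
				break
			}
		}
	}
	// Turn per-bucket counts into cumulative counts, as Prometheus expects;
	// anything above the last bound only shows up in total (the +Inf bucket).
	for j := 1; j < len(cum); j++ {
		cum[j] += cum[j-1]
	}
	return cum, total
}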

@08d2

08d2 commented Mar 28, 2022

This is somewhat surprising to me because these histograms (at least on the runtime/metrics side) are intended to be used to look at and compare distributions. You can do that even if the buckets are different.

How can histograms with disparate bucketing be meaningfully aggregated and compared?

@mknyszek
Contributor

There are many ways to do this, depending on how you want things visualized. Fundamentally, histograms approximate some underlying distribution; a histogram is just a reasonably efficient way to represent one. So you can extract and look at the distribution instead of the buckets directly.

For example, you could take two histograms with different bucketing, compute a CDF from both, and plot them on the same chart. If you are willing to make some distributional assumptions, such as a normal or geometric distribution, you can also summarize the histogram with percentiles.
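
A minimal sketch of the CDF approach (the names here are illustrative, not from any library): convert each histogram into (upper bound, cumulative fraction) points, which can then be drawn on one chart even when the two histograms use different bucket boundaries:

package cdf

import "runtime/metrics"

// Point is one step of an empirical CDF: the estimated fraction of
// observations at or below X.
type Point struct {
	X, P float64
}

func CDF(h *metrics.Float64Histogram) []Point {
	var total uint64
	for _, c := range h.Counts {
		total += c
	}
	if total == 0 {
		return nil
	}
	pts := make([]Point, 0, len(h.Counts))
	var running uint64
	for i, c := range h.Counts {
		running += c
		// Buckets[i+1] is the upper edge of bucket i (may be +Inf).
		pts = append(pts, Point{X: h.Buckets[i+1], P: float64(running) / float64(total)})
	}
	return pts
}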

In this sense, histograms are somewhat low-level. The Go runtime exposes a time histogram with a pretty fine granularity (which is currently squashed in the Prometheus code to reduce metric cardinality) as a stepping-stone toward producing more robust analyses.

There have been enough issues about this that perhaps this is just the wrong Prometheus data structure to represent the Go runtime histograms. We could use a summary instead, or some other collection of metrics.

@bwplotka
Member

Yeah, I think we already merge buckets to have fewer of them, so the easiest thing is to merge them into a stable set of buckets (A). Moving to summaries is not bad either, it just has some instrumentation overhead, plus it is harder to use later on (not aggregatable) (B).

We could do A and hope for better histograms in Prometheus (a feature that should land soon).

@beorn7
Member

beorn7 commented Apr 12, 2022

Summaries are not aggregatable, indeed, but that might not be so bad in the case of GC (where I assume you are mostly interested in individual processes, not necessarily many in aggregate). But Summaries also have static precalculated quantiles, which you cannot change in retrospect. (E.g. they are always the 99th perc, 90th perc, and median over the last 10m. If you later decide you would like to see the 75th percentile over the last 5m, you are lost.)
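
For context, this is roughly what that looks like with client_golang's Summary: the quantiles (with their error tolerances) and the sliding window are fixed at creation time; the metric name below is made up:

package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var gcPauses = prometheus.NewSummary(prometheus.SummaryOpts{
	Name: "example_gc_pause_seconds", // illustrative name only
	Help: "GC pause durations.",
	// Median, 90th and 99th percentile over a 10m sliding window are
	// baked in here; a 75th percentile or a 5m window cannot be
	// recovered later from the exported data.
	Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
	MaxAge:     10 * time.Minute,
})

func init() {
	prometheus.MustRegister(gcPauses)
}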

In summary (no pun intended), I find Histograms quite interesting here, but they are also very expensive and inaccurate. The new histograms will help with that.

Maybe you can make all of this configurable for now (I guess making these expensive histograms optional was considered anyway). Or just decide what the least bad solution is.

@stale

stale bot commented Oct 16, 2022

Hello 👋 Looks like there was no activity on this issue for the last 3 months. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity in the next 4 weeks, this issue will be closed (we can always reopen an issue if we need!).
