
Use the runtime/metrics package for the Go collector for 1.17+ #955

Merged (1 commit) on Jan 16, 2022

Conversation

@mknyszek (Contributor)

This change introduces use of the runtime/metrics package in place of
runtime.MemStats for Go 1.17 or later. The runtime/metrics package was
introduced in Go 1.16, but not all the old metrics were accounted for
until 1.17.

The runtime/metrics package offers several advantages over using
runtime.MemStats:

  • The list of metrics and their descriptions are machine-readable,
    allowing new metrics to get added without any additional work
    (see the read-loop sketch after this list).
  • Detailed histogram-based metrics are now available, offering much
    deeper insights into the Go runtime.
  • The runtime/metrics API is significantly more efficient than
    runtime.MemStats, even with the additional metrics added, because
    it does not require any stop-the-world events.
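
For context, a minimal sketch of what consuming that machine-readable list looks like using nothing but the standard library (plain runtime/metrics usage, not code from this PR):

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// Discover every metric supported by the running Go version.
	descs := metrics.All()

	samples := make([]metrics.Sample, len(descs))
	for i := range samples {
		samples[i].Name = descs[i].Name
	}
	// One call reads all requested metrics.
	metrics.Read(samples)

	for _, s := range samples {
		switch s.Value.Kind() {
		case metrics.KindUint64:
			fmt.Printf("%s = %d\n", s.Name, s.Value.Uint64())
		case metrics.KindFloat64:
			fmt.Printf("%s = %g\n", s.Name, s.Value.Float64())
		case metrics.KindFloat64Histogram:
			fmt.Printf("%s = histogram with %d buckets\n", s.Name, len(s.Value.Float64Histogram().Counts))
		}
	}
}
```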

That being said, integrating the package comes with some caveats, some
of which were discussed in #842. Namely:

  • The old MemStats-based metrics need to continue working, so they're
    exported under their old names backed by equivalent runtime/metrics
    metrics.
  • Earlier versions of Go need to continue working, so the old code
    remains, but behind a build tag (see the build-tag sketch after
    this list).
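
The build-tag split looks roughly like the following; the file name matches the one visible in this review, while the body is only a placeholder:

```go
// go_collector_go117.go: compiled only with Go 1.17 and newer. A sibling
// file carrying the inverse constraint (//go:build !go1.17) keeps the old
// runtime.MemStats-based collector for earlier Go versions.
//go:build go1.17
// +build go1.17

package prometheus
```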

Finally, a few notes about the implementation:

  • This change includes a whole bunch of refactoring to avoid significant
    code duplication.
  • This change adds a new histogram metric type specifically optimized
    for runtime/metrics histograms. This type's methods also include
    additional logic to deal with differences in bounds conventions
    (see the bucket-conversion sketch after this list).
  • This change makes a whole bunch of decisions about how runtime/metrics
    names are translated.
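
On the bounds conventions: runtime/metrics histogram buckets are lower-bound-inclusive intervals [Buckets[i], Buckets[i+1]), while Prometheus buckets are cumulative counts with inclusive upper bounds. A rough sketch of that translation (illustrative only, not the new type added in this change):

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// toCumulative reinterprets a runtime/metrics histogram as Prometheus-style
// cumulative buckets, reusing each interval's exclusive upper bound as the
// inclusive "le" bound (the final bound may be +Inf).
func toCumulative(h *metrics.Float64Histogram) (bounds []float64, counts []uint64) {
	var cum uint64
	for i, c := range h.Counts {
		cum += c
		bounds = append(bounds, h.Buckets[i+1])
		counts = append(counts, cum)
	}
	return bounds, counts
}

func main() {
	s := []metrics.Sample{{Name: "/gc/pauses:seconds"}}
	metrics.Read(s)
	if s[0].Value.Kind() == metrics.KindFloat64Histogram {
		bounds, counts := toCumulative(s[0].Value.Float64Histogram())
		fmt.Println(len(bounds), len(counts))
	}
}
```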

The biggest question I have is how to surface a few situations. Namely:

  1. When there's a new metric value type that isn't yet supported, and
  2. When a new metric can't have its name automatically translated to a Prometheus metric name.

(1) may happen, but it'll be very rare. (2) should basically never happen, other than by mistake. It would still be good to catch it somehow instead of having it get silently ignored.

Apologies for the absolutely enormous diff. Because of the build tagging I had to do a whole bunch of refactoring. If you think it would be better to merge the 1.16 code back into the original files, I can do that to make the diff smaller and easier to review.

@kakkoyun @bwplotka

@kakkoyun (Member) left a comment

This looks good to me. Overall I'm happy with the decisions made here. Thanks, @mknyszek.

When it comes to your questions: I think we need to fail as early as possible if we come across a metric name that we can't transform or a new metric type. At least these should be caught by our tests and CI. Assuming these would only happen with a new Go version, these failures would be bearable. We have Go version-specific test runs, so it should be easy to catch, fix and release.

What do you think @bwplotka?

P.S.: We have updated our CI pipeline. If you can rebase, we can run it against Go 1.17.

@kakkoyun requested a review from bwplotka on January 5, 2022
@mknyszek force-pushed the runtime-metrics branch 2 times, most recently from d54ca17 to 384ec16, on January 6, 2022
@bwplotka (Member) left a comment

Thanks for this!

I generally like this, but I think we might want to take a more careful approach in three areas:

  • How the new mechanism is initiated. I believe we should have something like `NewGoCollector(type, errReporter)`, where one can still run the MemStats-based implementation. The reason is the unknowns in terms of behaviour, metric output, errors and performance of the new implementation.
  • We should indeed catch errors in some way. We could ignore them by default, but especially during tests we should fail fast on those cases (to know about them!)
  • I am not a fan of dynamic renaming in this situation. This can go very wrong and break the majority of alerting/monitoring setups in a hidden way if go_memory_allocs_bytes_total suddenly becomes go_mem_allocs_bytes_total. I believe, since we established those namings in the first place, that as client_golang we should be an adapter and guarantor of those names. I feel keeping those names static and fetching the expected path might be better. Yes, it will be more manual, but we would control the output, cardinality and exact naming; we don't leave it to the unknown. I expect those paths to be changed/adjusted regularly by the Go Team. WDYT? Happy to discuss, I know it's a game of trade-offs here, but I would put metric output compatibility above LOC.

d := &descriptions[i]
name, ok := nameFromRuntimeMetrics(d)
if !ok {
// TODO: Log this somewhere?
Member

We might need to return this error, I think. We could do that in the form of a "warning": return the metrics that were skipped due to a certain error, so users can report them to us and we can fix it.

Contributor Author

Thanks for this!

I generally like this, but I think we might want to take a more careful approach in three areas:

  • How the new mechanism is initiated. I believe we should have something like `NewGoCollector(type, errReporter)`, where one can still run the MemStats-based implementation. The reason is the unknowns in terms of behaviour, metric output, errors and performance of the new implementation.

So the old behavior should be opt-in? I can make that work. However, I am wary about a new API for the new collector because I really want this update to be transparent. Personally, I don't see a reason why it couldn't be, and it would provide a serious benefit to users down the line: when something goes wrong, they'll have many more tools to investigate and fix it. runtime/metrics was born out of the frustration that updating runtime.MemStats is very high-stakes, so we as the Go team simply have not done so.

  • We should indeed catch errors in some way. We could ignore them by default, but especially during tests we should fail fast on those cases (to know about them!)

Right. The only case where I wasn't sure how to surface errors was in coming across metric types (that should be added extremely infrequently; think of it like reflect.Type's Kind) that Prometheus doesn't understand.

  • I am not a fan of dynamic renaming in this situation. This can go very wrong and break the majority of alerting/monitoring setups in a hidden way if go_memory_allocs_bytes_total suddenly becomes go_mem_allocs_bytes_total. I believe, since we established those namings in the first place, that as client_golang we should be an adapter and guarantor of those names. I feel keeping those names static and fetching the expected path might be better. Yes, it will be more manual, but we would control the output, cardinality and exact naming; we don't leave it to the unknown. I expect those paths to be changed/adjusted regularly by the Go Team. WDYT? Happy to discuss, I know it's a game of trade-offs here, but I would put metric output compatibility above LOC.

My goal here (and the goal of the runtime/metrics package) is transparent inclusion of new metrics, just by upgrading your Go version. The API is carefully designed so that adding or removing metrics is possible, but in practice will only happen slowly and rarely (down the line, probably one new metric each release on average; removing a metric will generally be exceptional). This escape hatch exists to eventually phase out metrics that really just aren't meaningful anymore and add noise. This is definitely not about lines of code, but about giving users (and Go maintainers) more visibility into their application. Manually listing each one runs counter to that. Renamings like memory to mem should be assumed to never happen.
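
To make the translation concrete, here is an illustrative sketch of the kind of rule being discussed, turning a runtime/metrics name such as /memory/live:bytes into go_memory_live_bytes (the helper name is made up, and the real logic in the PR may differ in its details):

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// validPromName is the Prometheus metric-name pattern.
var validPromName = regexp.MustCompile(`^[a-zA-Z_:][a-zA-Z0-9_:]*$`)

// runtimeNameToProm is an illustrative translation: strip the leading
// slash, map path separators, the unit colon and dashes to underscores,
// and prefix "go_". ok is false if the result is not a valid name.
func runtimeNameToProm(name string) (prom string, ok bool) {
	prom = strings.TrimPrefix(name, "/")
	prom = strings.NewReplacer("/", "_", ":", "_", "-", "_").Replace(prom)
	prom = "go_" + prom
	return prom, validPromName.MatchString(prom)
}

func main() {
	fmt.Println(runtimeNameToProm("/memory/live:bytes"))             // go_memory_live_bytes true
	fmt.Println(runtimeNameToProm("/gc/cycles/automatic:gc-cycles")) // go_gc_cycles_automatic_gc_cycles true
}
```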

Member

Hey, sorry for the lag, I had knee surgery last week ): Let's get back to this!

when something goes wrong, they'll have many more tools to investigate and fix it. runtime/metrics was born out of the frustration that updating runtime.MemStats is very high-stakes, so we as the Go team simply have not done so.

What tools do users have if our new implementation breaks their setup "accidentally", given we assume this change needs only a minor version bump? I guess revert only. I am not worried too much about switching to runtime/metrics, but rather about our new implementation that has to stabilize (e.g. we need to ensure it has the right perf etc.). I think I would be ok with the new solution being the default and having the old one available as an opt-in, possibly removed in the next major version. Really just for the sake of iterating on it and allowing users to have a smooth experience: if there is a problem, they can still use the new version of client_golang and just explicitly use the old collector for now. After your points I am happy to discuss this still, not a blocker (:

Right. The only case where I wasn't sure how to surface errors was in coming across metric types (that should be added extremely infrequently; think of it like reflect.Type's Kind) that Prometheus doesn't understand.

Yup. Some optional ErrorHandler func(error) pattern would do IMO.
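
A hypothetical shape for that pattern; none of these identifiers exist in client_golang, this is only to pin down what "optional ErrorHandler, ignore by default" could look like:

```go
package prometheus

// goCollectorOptions and all names below are hypothetical sketches.
type goCollectorOptions struct {
	errorHandler func(error)
}

// GoCollectorOption configures the Go collector.
type GoCollectorOption func(*goCollectorOptions)

// WithErrorHandler surfaces non-fatal collection problems, such as an
// unsupported runtime/metrics value kind or an untranslatable name.
func WithErrorHandler(h func(error)) GoCollectorOption {
	return func(o *goCollectorOptions) { o.errorHandler = h }
}

func applyGoCollectorOptions(opts ...GoCollectorOption) goCollectorOptions {
	o := goCollectorOptions{errorHandler: func(error) {}} // ignore by default
	for _, opt := range opts {
		opt(&o)
	}
	return o
}
```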

Member

My goal here (and the goal of the runtime/metrics package) is transparent inclusion of new metrics, just by upgrading your Go version. The API is carefully designed so that adding or removing metrics is possible, but in practice will only happen slowly and rarely (down the line, probably one new metric each release on average; removing a metric will generally be exceptional). This escape hatch exists to eventually phase out metrics that really just aren't meaningful anymore and add noise. This is definitely not about lines of code, but about giving users (and Go maintainers) more visibility into their application. Manually listing each one runs counter to that. Renamings like memory to mem should be assumed to never happen.

Got it, I think it's really about trust here - we delegate API stability to Go itself. I think that's fair to assume and do, but we need to document it. I think I could agree with this approach, although for this PR and any iteration on the renaming logic I would argue we need a test suite that tests the FULL list of metrics for each Go version. Maybe we can add that cumulatively and start with a small list in this PR. WDYT? @mknyszek @kakkoyun @beorn7

Member

What tools do users have if our new implementation breaks their setup "accidentally", given we assume this change needs only a minor version bump? I guess revert only. I am not worried too much about switching to runtime/metrics, but rather about our new implementation that has to stabilize (e.g. we need to ensure it has the right perf etc.). I think I would be ok with the new solution being the default and having the old one available as an opt-in, possibly removed in the next major version. Really just for the sake of iterating on it and allowing users to have a smooth experience: if there is a problem, they can still use the new version of client_golang and just explicitly use the old collector for now. After your points I am happy to discuss this still, not a blocker (:

Ok, I think your latest version is actually even more different. We kind of have the old metrics, with almost precisely the same values, taken from runtime/metrics, PLUS runtime/metrics-specific ones with new names. I think it is ok to keep it this way 👍🏽

}
ch <- h
case metrics.KindBad:
// TODO: Unsupported metric. Log this somewhere?
Member

Hm.. this is scary. We might want to introduce some errReporter logic so the user can choose how to be informed (ignore by default).

Member

... and in tests we should check that for sure! (not ignore, but panic or return an error)

Contributor Author

I added a panic here, skipped collection of these metrics (so the panic shouldn't fire), and added a test that allows for auditing of these cases for a new Go version.
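
For the record, the shape being described is roughly the following sketch (not the PR's exact code): unsupported kinds are filtered out when the collector is built, so the panic documents an impossible case and trips loudly in tests against new Go versions.

```go
package main

import (
	"fmt"
	"runtime/metrics"
)

// scalarValue converts a supported scalar sample to float64 and panics
// loudly on anything unexpected.
func scalarValue(s metrics.Sample) float64 {
	switch s.Value.Kind() {
	case metrics.KindUint64:
		return float64(s.Value.Uint64())
	case metrics.KindFloat64:
		return s.Value.Float64()
	case metrics.KindFloat64Histogram:
		// Histograms are not scalars; the imagined collector handles
		// them on a separate path.
		panic("histogram passed to scalarValue: " + s.Name)
	case metrics.KindBad:
		// Metrics of this kind are skipped when the collector is built,
		// so reaching this branch indicates a bug in the collector itself.
		panic("unexpected metrics.KindBad for " + s.Name)
	default:
		// A value kind introduced by a future Go release; a
		// version-specific test should catch this before users do.
		panic(fmt.Sprintf("unsupported value kind %v for %s", s.Value.Kind(), s.Name))
	}
}

func main() {
	s := []metrics.Sample{{Name: "/gc/heap/allocs:bytes"}}
	metrics.Read(s)
	fmt.Println(scalarValue(s[0]))
}
```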

got metrics.Description
expect string
}{
{metrics.Description{Name: "/memory/live:bytes"}, "go_memory_live_bytes"},
Member

Should we check all metrics, to ensure we know what changes in terms of metric naming across Go versions? Essentially it's quite scary if Go suddenly changes the internal naming of one important metric, e.g. go_memory_allocs_bytes_total, which would make our code produce e.g. go_memory_alloc_bytes_total instead. That would break many setups in a hidden way.

Contributor Author

I elaborated on this on the main comment thread, but I'll note again for posterity:

  1. The Go team takes backwards compatibility very seriously, and metric names effectively never change meaning (see the package documentation at https://pkg.go.dev/runtime/metrics). New metrics will be added with some regularity, and old ones eventually removed once they are no longer useful (less regularly). Changes will always be announced ahead of time, and will happen at most once every six months (far, far less often than that in practice).
  2. I added a test that, for a given Go release, checks for the set of expected metrics. The new set of expected metrics for a Go release can be regenerated with a go generate script I added to the source tree (see the sketch after this comment). This way, at least for the Prometheus team, it reduces the risk of anything breaking silently, and differences can be more easily audited.

Do you feel like this alleviates your concerns? I'm happy to discuss more.
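
A sketch of what such a go generate helper can look like: it enumerates the metrics of the running Go toolchain and prints translated names so they can be checked in and diffed when a new Go release lands (the translation rule is the illustrative one from earlier, not the PR's):

```go
// Run manually or via a go:generate directive to refresh the checked-in list.
package main

import (
	"fmt"
	"runtime"
	"runtime/metrics"
	"strings"
)

func main() {
	fmt.Printf("// Expected Go runtime metrics for %s\n", runtime.Version())
	repl := strings.NewReplacer("/", "_", ":", "_", "-", "_")
	for _, d := range metrics.All() {
		fmt.Println("go_" + repl.Replace(strings.TrimPrefix(d.Name, "/")))
	}
}
```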

Member

With a test, I think I am happy, commented more above 🤗 Thanks!

@bwplotka (Member) commented Jan 6, 2022

I think I would also be fine with the runtime transform implementation as long as it's within another constructor, not trying to be a 1:1 replacement with a build tag. WDYT? (:

@mknyszek (Contributor Author) commented Jan 6, 2022

What are your concerns about having it be the same API? If it's catching new kinds of values supported by runtime/metrics, that can be done in testing, while any such metrics will not show up for users (i.e. replacing those TODOs with setting some kind of flag that is checked at test time, but otherwise ignored).

If you have concerns about backwards compatibility and actually not having existing metrics spontaneously disappear, I will note that the Go team takes backwards compatibility extremely seriously. Metrics will not be removed lightly or without careful consideration for downstream users (the API also carefully lays out guarantees on what can be relied upon -- for instance, for a given metric key, its corresponding Kind can never change).

I'm a little surprised by the pushback given that I thought I was on the same page with @beorn7 when we discussed this around a year ago, but I recognize that the project has new maintainers and maybe that means a different direction. I'd really like to try to make the transparency work for you, and I'm happy to work on alleviating whatever concerns you might have about it.

@kakkoyun (Member) commented Jan 7, 2022

I'm a little surprised by the pushback given that I thought I was on the same page with @beorn7 when we discussed this around a year ago, but I recognize that the project has new maintainers and maybe that means a different direction. I'd really like to try to make the transparency work for you, and I'm happy to work on alleviating whatever concerns you might have about it.

I don't think we're pushing back here. I read all the previous discussions with @beorn7 and I agree with the decisions. As @bwplotka iterated, we just need to come up with a reporting mechanism and fail the test if something unexpected happens; otherwise, just print out some warnings. Personally, I'm not that concerned about name and type changes. These won't be random changes and they will seldom change, worst case every 6 months. We will be testing against new Go versions anyway and I believe we can catch those. One thing we can do is maintain an expected metric list if it'd make you feel safer, Bartek.

I'm happy to accept the changeset with additional testing.

@mknyszek (Contributor Author) commented Jan 7, 2022

I don't think we're pushing back here. I read all the previous discussions with @beorn7 and I agree with the decisions. As @bwplotka iterated, we just need to come up with a reporting mechanism and fail the test if something unexpected happens; otherwise, just print out some warnings. Personally, I'm not that concerned about name and type changes. These won't be random changes and they will seldom change, worst case every 6 months. We will be testing against new Go versions anyway and I believe we can catch those. One thing we can do is maintain an expected metric list if it'd make you feel safer, Bartek.

Understood, thanks! I also want to be clear that if either of you did want to move in a different direction that would be OK (you are the maintainers of this project after all!), but I would like to discuss it a bit more. :)

I can add an expected metrics list test and a generator to update that for new Go versions.

It occurs to me that the tests do not currently check that the transformed names remain unchanged (and for that you need an expected metrics list anyway). Maybe this is what you (@bwplotka) were getting at. If so, my apologies for misunderstanding.

This change introduces use of the runtime/metrics package in place of
runtime.MemStats for Go 1.17 or later. The runtime/metrics package was
introduced in Go 1.16, but not all the old metrics were accounted for
until 1.17.

The runtime/metrics package offers several advantages over using
runtime.MemStats:
* The list of metrics and their descriptions are machine-readable,
  allowing new metrics to get added without any additional work.
* Detailed histogram-based metrics are now available, offering much
  deeper insights into the Go runtime.
* The runtime/metrics API is significantly more efficient than
  runtime.MemStats, even with the additional metrics added, because
  it does not require any stop-the-world events.

That being said, integrating the package comes with some caveats, some
of which were discussed in prometheus#842. Namely:
* The old MemStats-based metrics need to continue working, so they're
  exported under their old names backed by equivalent runtime/metrics
  metrics.
* Earlier versions of Go need to continue working, so the old code
  remains, but behind a build tag.

Finally, a few notes about the implementation:
* This change includes a whole bunch of refactoring to avoid significant
  code duplication.
* This change adds a new histogram metric type specifically optimized
  for runtime/metrics histograms. This type's methods also include
  additional logic to deal with differences in bounds conventions.
* This change makes a whole bunch of decisions about how runtime/metrics
  names are translated.
* This change adds a `go generate` script to generate a list of expected
  runtime/metrics names for a given Go version for auditing. Users of
  new versions of Go will transparently be allowed to use new metrics,
  however.

Signed-off-by: Michael Anthony Knyszek <mknyszek@google.com>
@mknyszek (Contributor Author)

I think I addressed all testing and efficiency concerns, so please take another look @bwplotka! Thanks y'all for taking the time.

// this case, we control all the types here.
switch m := c.rmMetrics[i].(type) {
case *counter:
// Guard against decreases. This should never happen, but a failure
Contributor Author

This guard is really working around a monotonicity bug in the runtime. :(

In practice it shouldn't be necessary. I have a fix for the runtime, but I don't have a timeline for getting that landed, nor for whether it'll be backported. Ideally this wouldn't be necessary and just the "Add" would be sufficient.

That being said, maybe this m.get() thing isn't the right way to do this. Maybe the better way is to have this be a Gauge and use Set, but somehow indicate to the rest of Prometheus that it should be treated like a counter. Any feedback would be much appreciated.

Contributor Author

I discussed this a little bit with the team and they noted that the behavior is only very transiently non-monotonic. If you're sampling from runtime/metrics really frequently (like, in a loop) you'll see it, but on time scales of seconds in an active application, it's unlikely. An existing MemStats metric that is reported as a CounterValue already suffers from this in theory, but I don't think it's ever come up as a problem in practice. I think this guard might be sufficient? I'll also consider just moving this guard into the runtime.

Member

This guard is perfect, and an explanation around it would do. I don't see any harm if we truly believe it's monotonic and only transiently might not be. And indeed, if we want this metric to be used as a counter (people using rate on it), we need the correct type in the metric exposition.

I don't like the fact that we need to transform a counter into another counter (m.Add(unwrapScalarRMValue(sample.Value) - m.get())) only because our interface is missing the use case of storing the metric state in a different variable instead of reporting it through Add or Inc. I guess we need to live with this for now.
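
For completeness, the guard under discussion boils down to something like this sketch (counterLike and addWithGuard are illustrative names, not client_golang API):

```go
package prometheus

// counterLike is the minimal surface this sketch assumes: an internal
// counter that can report its current value and be moved forward.
type counterLike interface {
	Add(delta float64)
	get() float64
}

// addWithGuard advances the counter by the positive delta between the
// freshly read runtime value and the counter's current value. A transient
// decrease (the rare runtime monotonicity quirk discussed above) is
// dropped, and the counter simply catches up on the next scrape.
func addWithGuard(m counterLike, runtimeValue float64) {
	if delta := runtimeValue - m.get(); delta > 0 {
		m.Add(delta)
	}
}
```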

@kakkoyun (Member)

Hey @mknyszek, FYI @bwplotka is off for a bit. Please be patient :)

@mknyszek (Contributor Author)

Hey @mknyszek, FYI @bwplotka is off for a bit. Please be patient :)

Yes! Of course. I didn't mean to rush anyone.

"github.com/prometheus/client_golang/prometheus/internal"
)

func main() {
Member

👍🏽

nil,
),
hasSum,
)
Member

nit: We could do continue in those blocks instead of else for (opinionated) readability


return 0
}

// Currently, MemStats adds tiny alloc count to both Mallocs AND Frees.
Member

What is "tiny alloc count"? 🤔

Contributor Author

It's supposed to be an implementation detail in the allocator for objects <16 bytes in size without pointers, but it's unfortunately a little leaky when it comes to runtime.MemStats, since it does get counted there. It also gets added to both allocs and frees because we can't efficiently count frees of these objects. This accounting is incorrect, because the blocks that hold these tiny objects are already accounted for (and thus the tiny objects themselves are, too), but with MemStats there was no other way to surface this information.

Also, as a result it's perfectly fine to ignore the tiny allocator in the vast majority of situations, and the additional metric is there if you really want to know (it can sometimes cause liveness issues, which is useful to debug). https://pkg.go.dev/runtime/metrics has a few more details.


@bwplotka (Member) left a comment

Looks like the latest version is as follows:

We now have metrics from runtime/metrics with new, non-colliding naming, as well as the old metrics with the old naming but with content actually coming from the new source.

I think this is a great balance. Thanks for all the tests and the extra tooling that let us verify changes across different Go versions, epic work.

I would be happy to merge this and release, thanks for this work!

@mknyszek (Contributor Author) commented Jan 19, 2022

Looks like the latest version is as follows:

We now have metrics from runtime/metrics with new, non-colliding naming, as well as the old metrics with the old naming but with content actually coming from the new source.

I think this is a great balance. Thanks for all the tests and the extra tooling that let us verify changes across different Go versions, epic work.

I would be happy to merge this and release, thanks for this work!

Thanks for merging it! :) I'm glad you like where it ended up, I'm happy with it too.

vadimberezniker added a commit to buildbuddy-io/buildbuddy that referenced this pull request Jun 14, 2022
According to prometheus/client_golang#955, the newer version uses a more efficient API to get data from the Go runtime.
vadimberezniker added a commit to buildbuddy-io/buildbuddy that referenced this pull request Jun 16, 2022
According to prometheus/client_golang#955, the newer version uses a more efficient API to get data from the Go runtime.
@dnwe mentioned this pull request on Jun 22, 2022