Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

monitoring: migrate from OpenCensus to OpenTelemetry #45341

Merged
merged 1 commit into from
Jul 17, 2023

Conversation

howardjohn
Copy link
Member

@howardjohn howardjohn commented Jun 8, 2023

Blocked by open-telemetry/opentelemetry-go#4306. I have a hold until that is ready; should be fairly quick

For #44743

Fixes #44513

This PR migrates from OpenCensus to OpenTelemetry for Go metrics. This
doesn't impact proxy metrics, or non-metrics observability.

These changes are intended to be 100% transparent to end users.

The changes are mostly transparent from pkg/monitoring public
interface, with a few changes:

  • Register is no longer needed. Creating the metric registers it. This
    makes no real difference in any of our usage
  • MustCreateLabel -> CreateLabel; it is infallable
  • No need to define the set of keys upfront with the metric.
  • RegisterIf -> WithEnabled
  • monitortest behaves slightly different; only metrics recorded after
    its creation are counted. A few tests are updated to handle this
    change.
  • Only one exporter can be defined; this only impacts tests, which are
    updated to handle this.

There are a number of changes, of course, to the internals. Most of this
is a fairly mechanical translation. A few quirks:

  • There is no real Gauge like in OC. Instead, we can only do a callback.
    So we need to handle a bit of this on our end; not a huge deal, and
    looks like it will be added eventually.
  • Setting histogram buckets is a bit awkward currently. This will be
    addressed eventually upstream.
  • monitortest previously was based on OC internals. This has been re-written to prometheus instead of OTEL. This is because prometheus is our primary usage, so its a bit more realistic.

This results in a 1% increase in the binary size, which is due to importing both OC and Otel. This is addressed in #46010, which introduces a 1% decrease in binary size when combined with this PR.

Recording metrics is now much cheaper. This may allow us to start enabling metrics that were previously disabled (TBD, I am planning a more thorough design of our overall observability, would prefer to avoid changes until then):

name                      old time/op    new time/op    delta
Counter/no_labels-7          295ns ± 3%      56ns ±10%   -80.91%  (p=0.008 n=5+5)
Counter/dynamic_labels-7     870ns ±26%     434ns ± 4%   -50.16%  (p=0.008 n=5+5)
Counter/static_labels-7      407ns ± 2%     122ns ± 1%   -69.98%  (p=0.029 n=4+4)

name                      old alloc/op   new alloc/op   delta
Counter/no_labels-7          68.0B ± 0%      0.0B       -100.00%  (p=0.008 n=5+5)
Counter/dynamic_labels-7      760B ± 0%      320B ± 0%      ~     (p=0.079 n=4+5)
Counter/static_labels-7      80.0B ± 0%      0.0B       -100.00%  (p=0.008 n=5+5)

name                      old allocs/op  new allocs/op  delta
Counter/no_labels-7           3.00 ± 0%      0.00       -100.00%  (p=0.008 n=5+5)
Counter/dynamic_labels-7      13.0 ± 0%       6.0 ± 0%   -53.85%  (p=0.008 n=5+5)
Counter/static_labels-7       3.00 ± 0%      0.00       -100.00%  (p=0.008 n=5+5)

@howardjohn howardjohn added the release-notes-none Indicates a PR that does not require release notes. label Jun 8, 2023
@istio-testing istio-testing added the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Jun 8, 2023
@istio-testing
Copy link
Collaborator

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@istio-testing istio-testing added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jun 8, 2023
@zirain
Copy link
Member

zirain commented Jun 8, 2023

FYI: I test with opentelemetry-go offline, it's will add _total suffix fro counter, which maybe a regression

https://github.com/open-telemetry/opentelemetry-go/blob/4d473d105a5f59a58d153e7c492a4ea0ee027871/exporters/prometheus/exporter.go#L216

@istio-testing istio-testing added the needs-rebase Indicates a PR needs to be rebased before being merged label Jun 11, 2023
@zirain zirain mentioned this pull request Jun 13, 2023
@istio-testing istio-testing added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed needs-rebase Indicates a PR needs to be rebased before being merged size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 10, 2023
howardjohn added a commit to howardjohn/istio that referenced this pull request Jul 12, 2023
Currently this isn't really a big difference, just a minor refactoring
to use best practices and avoid global/init().

In future (istio#45341) it may be more
important.
istio-testing pushed a commit that referenced this pull request Jul 13, 2023
* agent: avoid global metrics initialization

Currently this isn't really a big difference, just a minor refactoring
to use best practices and avoid global/init().

In future (#45341) it may be more
important.

* fix tests
@howardjohn howardjohn changed the title wip: Swap out for otel SDK monitoring: migrate from OpenCensus to OpenTelemetry Jul 13, 2023
@howardjohn howardjohn added the do-not-merge/hold Block automatic merging of a PR. label Jul 13, 2023
@howardjohn howardjohn marked this pull request as ready for review July 13, 2023 21:51
@howardjohn howardjohn requested review from a team, costinm, linsun and stevenctl as code owners July 13, 2023 21:51
@istio-testing istio-testing removed the do-not-merge/work-in-progress Block merging of a PR because it isn't ready yet. label Jul 13, 2023
@howardjohn howardjohn force-pushed the exp/otel-sdk branch 2 times, most recently from 6ecac3b to 31fc60b Compare July 14, 2023 17:04
@howardjohn howardjohn removed the do-not-merge/hold Block automatic merging of a PR. label Jul 14, 2023
Copy link
Member

@zirain zirain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Thanks @howardjohn

@@ -75,6 +75,7 @@ func TestGetFederatedToken(t *testing.T) {
tag := map[string]string{
"request_type": monitoring.TokenExchange,
}
// TODO: this is broken due to prometheus auto adding "_total"
mt.Assert("num_outgoing_retries", tag, monitortest.AtLeast(1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this worth a release notes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stale comment - its not broken anymore. Let me clean this up

For istio#44743

Fixes istio#44513

This PR migrates from OpenCensus to OpenTelemetry for Go metrics. This
doesn't impact proxy metrics, or non-metrics observability.

These changes are intended to be 100% transparent to end users.

The changes are *mostly* transparent from `pkg/monitoring` public
interface, with a few changes:
* `Register` is no longer needed. Creating the metric registers it. This
  makes no real difference in any of our usage
* MustCreateLabel -> CreateLabel; it is infallable
* No need to define the set of keys upfront with the metric.
* RegisterIf -> WithEnabled
* monitortest behaves slightly different; only metrics recorded after
  its creation are counted. A few tests are updated to handle this
change.
* Only one exporter can be defined; this only impacts tests, which are
  updated to handle this.

There are a number of changes, of course, to the internals. Most of this
is a fairly mechanical translation. A few quirks:

* There is no real Gauge like in OC. Instead, we can only do a callback.
  So we need to handle a bit of this on our end; not a huge deal, and
looks like it will be added eventually.
* Setting histogram buckets is a bit awkward currently. This will be
  addressed eventually upstream.
@costinm
Copy link
Contributor

costinm commented Jul 17, 2023

Looks very good.

One concern - not blocking this PR - is that I would prefer if labels are declared at metric creation, even if otel doesn't require it.

Copy link
Contributor

@costinm costinm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am approving the PR - not entirely happy, but it is clearly a huge improvement (opencensus is long deprecated and unsupported).

I think we need to follow up - and remove the wrappers, if we use otel let's use the otel API without an istio abstraction that just confuses people. And add back some info about the labels for each metris - in the description or somewhere, we also probably need to generate docs for each metric.

All of this can be done in separate PRs and incrementally.

I would also add some doc - and make it clear if indeed the intent is to start using otel API and if it is ok for new metrics ( and gradually for the old ones ) to replace the use of the wrapper with the proper API.

@@ -26,16 +26,10 @@ var (
cniInstalls = monitoring.NewSum(
"istio_cni_installs_total",
"Total number of CNI plugins installed by the Istio CNI installer",
monitoring.WithLabels(resultLabel),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not remove this part ( maybe refactor it to avoid the MustCreateLabel - but having a list of labels is very informative, and will still be required if we want monitoring to have a prometheus implementation.

If we want OTel only ( which I don't mind ) - we don't need an abstraction/wrapper on top of the SDK. But even then it would be good to keep track of what labels are generated, it is probably the most critical aspect of a metric.


// ReconcileRequestReasonLabel describes reason of reconcile request.
ReconcileRequestReasonLabel = monitoring.MustCreateLabel("reason")
ReconcileRequestReasonLabel = monitoring.CreateLabel("reason")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are they used - I see WithLabels is removed.


ReconcileRequestTotal,
)

initOperatorCrdResourceMetrics()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it do - isn't it another must register ?

}

func initializeMonitoring() (*prometheus.Registry, error) {
func initializeMonitoring() (prometheus.Gatherer, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're doing this - shouldn't agent also use otel sdk ? Not in this PR, of course - but a TODO and comment explaining why prometheus is used directly would help other developers.

@@ -19,20 +19,15 @@ import (
)

var (
typeTag = monitoring.MustCreateLabel("type")
eventTag = monitoring.MustCreateLabel("event")
typeTag = monitoring.CreateLabel("type")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not entirely sure why this was renamed - either remove or leave it as is, we can live with 'Must' for a while...

@istio-testing istio-testing merged commit eb0f84c into istio:master Jul 17, 2023
27 checks passed
@ra-grover ra-grover mentioned this pull request Jul 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes-none Indicates a PR that does not require release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

telemetry: move to OpenTelemetry lib
6 participants