Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[exporterhelper] fix missed metric aggregations #9048

Merged

Conversation

codeboten
Copy link
Contributor

This PR ensures that context cancellation in the exporter doesn't interfere with metric aggregation. The OTel SDK currently returns if there's an error in the context used in Add (see https://github.com/open-telemetry/opentelemetry-go/blob/6cee2b4a4c76b581115d0d0ca150ad8b2e683db6/sdk/metric/instrument.go#L241-L243). This means that if there's a cancelled context in an export, the metrics are not recorded.

This PR ensures that context cancellation in the exporter doesn't interfere with metric aggregation. The OTel SDK currently returns if there's an error in the context used in `Add` (see https://github.com/open-telemetry/opentelemetry-go/blob/6cee2b4a4c76b581115d0d0ca150ad8b2e683db6/sdk/metric/instrument.go#L241-L243). This means that if there's a cancelled context in an export, the metrics are not recorded.

Signed-off-by: Alex Boten <aboten@lightstep.com>
@codeboten codeboten requested a review from a team as a code owner December 5, 2023 19:05
@codeboten codeboten requested a review from mx-psi December 5, 2023 19:05
…g this when 1.22 is out

Signed-off-by: Alex Boten <aboten@lightstep.com>
Copy link

codecov bot commented Dec 5, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (f3a241a) 91.57% compared to head (f398d19) 91.57%.
Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9048   +/-   ##
=======================================
  Coverage   91.57%   91.57%           
=======================================
  Files         316      316           
  Lines       17147    17147           
=======================================
  Hits        15702    15702           
  Misses       1150     1150           
  Partials      295      295           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -154,7 +154,7 @@ func (or *ObsReport) StartTracesOp(ctx context.Context) context.Context {
// EndTracesOp completes the export operation that was started with StartTracesOp.
func (or *ObsReport) EndTracesOp(ctx context.Context, numSpans int, err error) {
numSent, numFailedToSend := toNumItems(numSpans, err)
or.recordMetrics(ctx, component.DataTypeTraces, numSent, numFailedToSend)
or.recordMetrics(context.Background(), component.DataTypeTraces, numSent, numFailedToSend)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it true that we'd never need the original context inside recordMetrics?

Copy link
Contributor Author

@codeboten codeboten Dec 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

go 1.21 has a WithoutCancel func I would suggest we move to in the future (whenever 1.22 is out). Thischange removes any context (including baggage), which likely you'd want to keep

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we go forward with this can we add a TODO or an issue to track that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah i was planning on filing an issue to track this change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added an issue #9049

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest use of context.TODO() here. I understand this solves a short-term problem, but I'd like to ask a bigger question about why we would bypass metric events on canceled contexts. It seems like a recipe for lost time and user confusion, as happened here. If we could convince the Otel-Go SDK to modify this behavior, perhaps restore the context.TODO() to the original context.

As you suggested, an alternate SDK with an extension for interpreting context will lose access to baggage, something we could fix in the long run. We have a few internal uses where exporterhelper metrics such as these are being extended from context with header values from the gRPC metadata. (Using this.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmacd i followed @fatsheep9146's suggestion and implemented withoutCancel temporarily in this code

@fatsheep9146
Copy link
Contributor

I also have met this problem when I implemented the grpc instrumentation library, here is the comment

open-telemetry/opentelemetry-go-contrib#4356 (comment)

and @pellared from sig go sdk have created an issue talked about this behavior in go sdk sig meeting open-telemetry/opentelemetry-go#4671

The conclusion is go sdk will keep the behavior. And suggest user to use WithoutCancel to escape this, and before go 1.20 we could solve like we do in grpc instrumentation library , which basically is a copy of WithCancel in go 1.21
https://github.com/open-telemetry/opentelemetry-go-contrib/blob/6009b3188da4f25eb8d2e65ffeefb3bf4b7c3415/instrumentation/google.golang.org/grpc/otelgrpc/stats_handler.go#L216-L245

@codeboten

Signed-off-by: Alex Boten <aboten@lightstep.com>
@codeboten
Copy link
Contributor Author

thanks @fatsheep9146, copied the implementation of withoutCancel in this PR

Signed-off-by: Alex Boten <aboten@lightstep.com>
@pellared
Copy link
Member

pellared commented Dec 6, 2023

@pellared from sig go sdk have created an issue talked about this behavior in go sdk sig meeting open-telemetry/opentelemetry-go#4671

I am thinking about "reviving" the issue. Especially in the context of Logs API+SDK where I do not want the same behavior for Logs SDK. I want to be inline with e.g. https://pkg.go.dev/log/slog#Handler which says:

Canceling the context should not affect record processing. (Among other things, log messages may be necessary to debug a cancellation-related problem.)

Changing the metrics SDK behavior would be a breaking change, but I think it would help 90% of the use cases and I doubt it will cause any harm to anyone. Even if so I think it may be an acceptable breaking behavioral change.

Please let me know what you think. I think that OTel Go "consumers" point of view is very important here.

Because of the current behavior we also have to deal with issues/PRs like this one: open-telemetry/opentelemetry-go-contrib#4634.

EDIT: I recreated open-telemetry/opentelemetry-go#4671. Feel free to review and leave comments. I also added the PR to the agenda for the next Go SIG meeting.

@jmacd
Copy link
Contributor

jmacd commented Dec 6, 2023

I think it would help 90% of the use cases and I doubt it will cause any harm to anyone.

Personally, I can't imagine a user who expects to skip metric events because of context cancellation. Has the group heard any user stories about that? Mainly, I agree with the comment quoted above:

(Among other things, log messages may be necessary to debug a cancellation-related problem.)

The way I think about it, context cancellation is a mechanism to stop blocking, avoid waiting, or skip useless work. So, I expect a user to inspect the context and avoid instrumented code, not to avoid instrumentation events. It is true that the aggregator could acquire a mutex, and that could briefly block the application. However, so does a call to ctx.Err() imply taking a mutex that could briefly block the application.

In my implementation, the context is last used and processed in the synchronous instrument path, before an aggregator sees the event. If the SDK wants to inspect the baggage or other contextual information--to me--it probably wants to be done at a higher level, so before the aggregators, so that context-related behavior is aggregator-independent.

@bogdandrutu bogdandrutu merged commit 034ec43 into open-telemetry:main Dec 6, 2023
32 checks passed
@github-actions github-actions bot added this to the next release milestone Dec 6, 2023
@bogdandrutu
Copy link
Member

@jmacd @pellared please move the discussion in the otel-go repo :)

@codeboten codeboten deleted the codeboten/fix-missing-metrics branch December 6, 2023 18:46
pantuza pushed a commit to pantuza/opentelemetry-collector that referenced this pull request Dec 8, 2023
This PR ensures that context cancellation in the exporter doesn't
interfere with metric aggregation. The OTel SDK currently returns if
there's an error in the context used in `Add` (see
https://github.com/open-telemetry/opentelemetry-go/blob/6cee2b4a4c76b581115d0d0ca150ad8b2e683db6/sdk/metric/instrument.go#L241-L243).
This means that if there's a cancelled context in an export, the metrics
are not recorded.

---------

Signed-off-by: Alex Boten <aboten@lightstep.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants