Add warning log on dropped spans in batch span processor #3289

Open · prodion23 wants to merge 13 commits into main
Conversation

prodion23

This PR adds a warning log when the batch span processor drops spans.

#3264

One thing I'm unsure about: the logger doesn't already offer a Warn method, so I added one here; however, I'm not sure whether we want to keep that change.

https://github.com/open-telemetry/opentelemetry-go/blob/main/internal/global/internal_logging.go#L45-L63

Maybe the Info log should use a lower-priority verbosity level so that there is room to fit Warn in between?
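For context, here is a minimal sketch of what such a Warn helper could look like, written against a plain logr.Logger rather than the SDK's unexported global logger (which stores its logger atomically). The verbosity layout (Info at V(4), Debug at V(8)) mirrors the linked internal_logging.go; the V(1) level used for Warn here is only an assumption for illustration, not a settled API.

```go
// A minimal sketch of a Warn helper, assuming the global logger wraps a
// logr.Logger and that Info logs at V(4) and Debug at V(8). The V(1) level
// chosen for Warn is an assumption.
package global

import "github.com/go-logr/logr"

// globalLogger stands in for the SDK's package-level logger; the real one is
// stored atomically and swapped via otel.SetLogger.
var globalLogger logr.Logger = logr.Discard()

// Info logs routine messages at verbosity 4.
func Info(msg string, keysAndValues ...interface{}) {
	globalLogger.V(4).Info(msg, keysAndValues...)
}

// Warn logs non-fatal problems (such as dropped spans) at verbosity 1, so a
// logger configured to show more than errors, but less than the routine Info
// output, still surfaces it.
func Warn(msg string, keysAndValues ...interface{}) {
	globalLogger.V(1).Info(msg, keysAndValues...)
}

// Error always logs through the logr error path, regardless of verbosity.
func Error(err error, msg string, keysAndValues ...interface{}) {
	globalLogger.Error(err, msg, keysAndValues...)
}
```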

linux-foundation-easycla bot commented Oct 14, 2022

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: prodion23 / name: Donnie (c97924f)

@Aneurysm9 (Member) left a comment:

Can you please add changelog entries for the new log level and the dropped-span handling? Tests for the new log entries will also be required.

Review thread on sdk/trace/batch_span_processor.go (outdated, resolved).
@prodion23 (Author)

@Aneurysm9 thank you for the feedback here.

One thing to draw specific scrutiny to: unfortunately, we can't use the WithBlocking setting for this test because that code path never increments the drop count.

https://github.com/prodion23/opentelemetry-go/blob/0052a23c290951aeeae8c6adef5c6219d34d8616/sdk/trace/batch_span_processor.go#L362-L363

During testing, the test flakes consistently with a small genNumSpans argument; essentially anything less than 10 would fail about 50% of the time.

Setting it to 100 has passed consistently, but I understand if there is concern about using a known flaky configuration.
With genNumSpans set to 100, the log statement shows that between ~80 and ~90 spans are dropped.

Happy to revisit this test if you have suggestions.
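To make the shape of the test being discussed concrete, here is a rough, self-contained sketch of one way to force drops deterministically: a non-blocking processor with a tiny queue in front of an exporter that blocks until released, with the SDK's self-diagnostic logging captured through otel.SetLogger. The test name, the blockingExporter type, and the "dropped spans" message text are illustrative assumptions, not the test actually added in this PR.

```go
package trace_test

import (
	"context"
	"strings"
	"sync"
	"testing"

	"github.com/go-logr/logr/funcr"
	"go.opentelemetry.io/otel"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// blockingExporter blocks every export until release is closed, so the
// processor's queue fills up and later spans have to be dropped.
type blockingExporter struct {
	release chan struct{}
}

func (e *blockingExporter) ExportSpans(ctx context.Context, _ []sdktrace.ReadOnlySpan) error {
	select {
	case <-e.release:
	case <-ctx.Done():
	}
	return nil
}

func (e *blockingExporter) Shutdown(context.Context) error { return nil }

func TestBSPWarnsOnDroppedSpans(t *testing.T) {
	var mu sync.Mutex
	var logs strings.Builder
	// Capture SDK self-diagnostics; Verbosity must be at least the level the
	// dropped-span message is logged at.
	otel.SetLogger(funcr.New(func(prefix, args string) {
		mu.Lock()
		defer mu.Unlock()
		logs.WriteString(prefix + args + "\n")
	}, funcr.Options{Verbosity: 8}))

	exp := &blockingExporter{release: make(chan struct{})}
	bsp := sdktrace.NewBatchSpanProcessor(exp,
		sdktrace.WithMaxQueueSize(1),
		sdktrace.WithMaxExportBatchSize(1),
	)
	tp := sdktrace.NewTracerProvider(sdktrace.WithSpanProcessor(bsp))
	tracer := tp.Tracer("bsp-drop-test")

	// Far more spans than the queue can hold; most must be dropped because
	// the exporter is stuck and the processor is in non-blocking mode.
	for i := 0; i < 100; i++ {
		_, span := tracer.Start(context.Background(), "span")
		span.End()
	}

	close(exp.release)
	_ = tp.Shutdown(context.Background())

	mu.Lock()
	defer mu.Unlock()
	// The message text is taken from the diff under review; adjust if it changes.
	if !strings.Contains(logs.String(), "dropped spans") {
		t.Errorf("expected a dropped-spans warning, got logs:\n%s", logs.String())
	}
}
```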

@Aneurysm9 Aneurysm9 added pkg:SDK Related to an SDK package area:trace Part of OpenTelemetry tracing blocked:CLA Waiting on CLA to be signed before progress can be made labels Oct 18, 2022
@Aneurysm9 (Member)

I think using the non-blocking-on-full-queue mode for this test is probably appropriate, but would love to get the opinion of @open-telemetry/go-approvers. This component has a long history of flaky tests, but it looks like the test you have proposed should be fairly reliable.

Please ensure that all commits are associated with a user that has signed the CLA; we will be unable to proceed further (and some approvers may not review the PR) until it is signed.

@Aneurysm9 Aneurysm9 removed the blocked:CLA Waiting on CLA to be signed before progress can be made label Oct 21, 2022
Resolved (outdated) review threads on CHANGELOG.md, sdk/trace/batch_span_processor.go, and sdk/trace/batch_span_processor_test.go.
@MrAlias (Contributor) commented Nov 4, 2022

It looks like the added test is quite flaky or does not succeed on Windows: https://github.com/open-telemetry/opentelemetry-go/actions/runs/3380083340/jobs/5613037464

We need our testing system to be solid, so this needs to be refactored or redesigned to ensure the tests pass.

Signed-off-by: Aaron Clawson <3766680+MadVikingGod@users.noreply.github.com>
@MadVikingGod (Contributor)

I updated the test to make it more robust. @MrAlias and @Aneurysm9 can you please review to see if this would be acceptable?

I didn't test on Windows, but I found the old version would often fail when run with `go test -run ^TestBatchSpanProcessorLogsWarningOnNewDropSpans$ -count 50`, while this version doesn't.

Signed-off-by: Aaron Clawson <3766680+MadVikingGod@users.noreply.github.com>
codecov bot commented Nov 30, 2022

Codecov Report

Merging #3289 (9af553a) into main (291aaa0) will increase coverage by 0.2%.
The diff coverage is 100.0%.


@@           Coverage Diff           @@
##            main   #3289     +/-   ##
=======================================
+ Coverage   78.0%   78.2%   +0.2%     
=======================================
  Files        165     165             
  Lines      11755   11762      +7     
=======================================
+ Hits        9177    9207     +30     
+ Misses      2382    2359     -23     
  Partials     196     196             
Impacted Files | Coverage Δ
sdk/trace/batch_span_processor.go | 82.0% <100.0%> (+0.9%) ⬆️
sdk/resource/resource.go | 83.5% <0.0%> (+5.7%) ⬆️
sdk/trace/provider.go | 95.5% <0.0%> (+6.2%) ⬆️

@MrAlias (Contributor) commented Dec 1, 2022

Looks like the tests are still failing: https://github.com/open-telemetry/opentelemetry-go/actions/runs/3586714280/jobs/6036245937

I'm more in favor of stable testing than accepting these changes as is.

Moved the functionality of the testing stuckExporter into the blockingExporter.
@MadVikingGod (Contributor)

> I'm more in favor of stable testing than accepting these changes as is.

Me too. The problem is that, to observe dropped spans, we can't use the blocking BSP, because it never drops any spans.
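For readers following along, the difference between the two enqueue paths boils down to roughly the following simplified model (illustrative names and shapes, not the SDK source): the blocking path waits for queue space and never touches the dropped counter, while the default path gives up on a full queue and increments it.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
)

// A simplified model of the batch span processor's enqueue paths.
type span struct{ name string }

type processor struct {
	queue   chan span
	dropped uint32
}

// Blocking mode (WithBlocking): wait for room in the queue. The dropped
// counter is never touched on this path, which is why the blocking BSP
// cannot be used to exercise the new warning.
func (p *processor) enqueueBlocking(ctx context.Context, s span) bool {
	select {
	case p.queue <- s:
		return true
	case <-ctx.Done():
		return false
	}
}

// Default (non-blocking) mode: if the queue is full, drop the span and count
// it. This counter feeds the warning added in this PR.
func (p *processor) enqueueDrop(s span) bool {
	select {
	case p.queue <- s:
		return true
	default:
		atomic.AddUint32(&p.dropped, 1)
		return false
	}
}

func main() {
	p := &processor{queue: make(chan span, 1)}
	for i := 0; i < 5; i++ {
		p.enqueueDrop(span{name: "s"}) // 1 enqueued, 4 dropped
	}

	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	ok := p.enqueueBlocking(ctx, span{name: "s"}) // queue full, ctx done: false, no drop counted
	fmt.Println("blocking enqueue succeeded:", ok)

	fmt.Println("dropped:", atomic.LoadUint32(&p.dropped)) // dropped: 4
}
```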

@alawrenc

> Looks like the tests are still failing: https://github.com/open-telemetry/opentelemetry-go/actions/runs/3586714280/jobs/6036245937
>
> I'm more in favor of stable testing than accepting these changes as is.

The test race appears to be fixed by 9af553a. Are there more changes required before this can merge?

// Swap in the latest total and read back the count reported previously;
// only log when new drops have occurred since the last report.
reportedDropped := atomic.SwapUint32(&bsp.reportedDropped, dropped)
if dropped > reportedDropped {
	droppedThisBatch := dropped - reportedDropped
	global.Info("dropped spans", "total", dropped, "this_batch", droppedThisBatch)
}
A reviewer (Member) commented on this hunk:

Should we not call global.Warn?

Suggested change:
- global.Info("dropped spans", "total", dropped, "this_batch", droppedThisBatch)
+ global.Warn("dropped spans", "total", dropped, "this_batch", droppedThisBatch)
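As a side note on why the level matters to operators: assuming the level layout sketched earlier (Warn at V(1), Info at V(4)), a logger configured at verbosity 1 would surface the dropped-span message only if it is emitted via Warn. A small illustration with a funcr logger:

```go
package main

import (
	"fmt"

	"github.com/go-logr/logr/funcr"
)

func main() {
	// A logger that only shows verbosity levels <= 1 (errors and warnings).
	logger := funcr.New(func(prefix, args string) {
		fmt.Println(prefix, args)
	}, funcr.Options{Verbosity: 1})

	// Assuming the level layout sketched earlier: Info at V(4), Warn at V(1).
	logger.V(4).Info("dropped spans", "total", 90) // suppressed at this verbosity
	logger.V(1).Info("dropped spans", "total", 90) // printed
}
```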

@pellared (Member)

> Are there more changes required before this can merge?

"This branch has conflicts that must be resolved" 😉

@pellared (Member)

@prodion23 Are you planning to refresh the PR and address the comment(s)?
