Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

span composite count must be > 1 #3265

Open
kruskall opened this issue Aug 2, 2023 · 2 comments
Open

span composite count must be > 1 #3265

kruskall opened this issue Aug 2, 2023 · 2 comments

Comments

@kruskall
Copy link
Member

kruskall commented Aug 2, 2023

Describe the bug

From the APM Agents spec (https://github.com/elastic/apm/blob/main/specs/agents/handling-huge-traces/tracing-spans-compress.md?plain=1#L137-L138):

  • composite
    - count: The number of compressed spans this composite span represents.
    The minimum count is 2 as a composite span represents at least two spans.

It looks like the java agent is reporting composite spans with a lower count. The following error was observed in ecs logs:

validation error: span: composite: 'count': validation rule 'min(2)' violated

Steps to reproduce

I don't have those unfortunately :(

This is part of the ops KPI review

Expected behavior

Composite count must be > 1

Debug logs

validation error: span: composite: 'count': validation rule 'min(2)' violated

@kruskall kruskall changed the title span composite count should be > 1 span composite count must be > 1 Aug 2, 2023
@JonasKunz
Copy link
Contributor

JonasKunz commented Aug 3, 2023

By looking through the code, this seems to only be possible by a very specific race-condition.
Given the following trace scenario:

  • Transaction parent
    • Span DB1 (compression egligable)
    • Span DB2 (compression egligable)
    • Span Async1 (not compression egligable)

Now the following things need to happen:

  1. DB1 is ended. It is not sent to the APM-server but buffered on parent for future arriving compressed spans
  2. DB2 and Async1 are ended in parallel at the same time
    a. DB2 initializes the compression-datastruction on DB1 this causes count=1
    b. Async1 sees that it cannot compress with DB1 and causes DB1 to be reported immediately
    c. DB2 increments the count on DB1 due to compression. This doesn't have any effect because DB1 is already sent.

I think what we could do is guard the compression process with a WriterReaderPhaser: During compression DB2 would register itself as Writer on DB1 before performing any mutations. The reporting-thread would swap the phase before reporting a span, waiting for pending compressions to end.
This however opens up additional things to consider: what do we do when a span attempts compression, but the phase of the WriterReaderPhaser indicates that the span he wants to compress with has already been sent? E.g. retry a fixed amount of times?

I'd love to have some input from @elastic/apm-agent-java (and maybe @felixbarny and @eyalkoren as original authors) here. Is it known (and accepted) that the span-compression is prone to lost-update race conditions?

As a bandaid fix to prevent spans from being rejected by the APM-server I'll add a guard to the serialization logic so that we don't report spans with count=1 as compressed.

@jackshirazi
Copy link
Contributor

Span compression is purely for improved UI and I don't think it's worth a lot of effort to handle edge cases in high contention cases other than to just drop the odd compression rather than add load. So I think the proposed band aid is sufficient

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants