Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(tracing): make spans resilient to performance clock drift #3434

Merged
merged 31 commits into from
Jan 11, 2023

Conversation

dyladan
Copy link
Member

@dyladan dyladan commented Nov 21, 2022

This is a simpler alternative to #3384

Described by the comment here: #3279 (comment)

Fixes #3355

@dyladan dyladan requested a review from a team as a code owner November 21, 2022 18:14
@dyladan dyladan force-pushed the simple-fix-timestamps branch 2 times, most recently from abef3d2 to 5d8dd46 Compare November 21, 2022 18:18
@codecov
Copy link

codecov bot commented Nov 21, 2022

Codecov Report

Merging #3434 (830c799) into main (2dcc898) will increase coverage by 0.02%.
The diff coverage is 96.96%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3434      +/-   ##
==========================================
+ Coverage   93.78%   93.80%   +0.02%     
==========================================
  Files         249      249              
  Lines        7621     7637      +16     
  Branches     1588     1589       +1     
==========================================
+ Hits         7147     7164      +17     
+ Misses        474      473       -1     
Impacted Files Coverage Δ
packages/opentelemetry-sdk-trace-base/src/Span.ts 98.58% <95.65%> (-1.42%) ⬇️
...s/opentelemetry-instrumentation-fetch/src/fetch.ts 97.02% <100.00%> (+0.01%) ⬆️
...emetry-instrumentation-xml-http-request/src/xhr.ts 97.59% <100.00%> (+0.01%) ⬆️
packages/opentelemetry-core/src/common/time.ts 98.55% <100.00%> (+2.96%) ⬆️
...-trace-base/src/platform/node/RandomIdGenerator.ts 93.75% <0.00%> (+6.25%) ⬆️

return epochMillisToHrTime(inp.valueOf());
}

if (isTimeInputHrTime(inp)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This version seems to be susceptible to drift in fetch instrumentation (undefined start + hrTime end):

image

Because hrTime() does getTimeOrigin() + performance.now() it would generate drifted HrTime

Copy link
Member Author

@dyladan dyladan Nov 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is because an HrTime formatted timestamp is not corrected in any case. Currently, the start time also uses the performance timing API indirectly because a start time is not provided and the performance clock is used. This means that in the current state the whole span is shifted anyway.

My recommendation would be to:

  1. Change core.hrTime to return an HrTime generated using Date.now
  2. Update fetch instrumentation to not pass a time manually

Either of these changes would fix the issue here, but I actually recommend we do both.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternative to (2) would be to update fetch instrumentation to get start and end times from the performance timing API

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@t2t2 does the updated version address your concern?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fetch and xhr have both been updated. I think this is resolved but I'll wait for @t2t2 to confirm.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep

@t2t2
Copy link
Contributor

t2t2 commented Nov 22, 2022

like previous pr, codesandbox with the changes, artificial drift and most common instrumentations (with already fixed the _performanceOffset issue in first comment)

@legendecas
Copy link
Member

The test (Span # should have an entered time for event) is unstable with the change.

@dyladan
Copy link
Member Author

dyladan commented Jan 9, 2023

The test (Span # should have an entered time for event) is unstable with the change.

Yeah I see that. I'm trying to work on it now. I think it was related to changing span end to be a Date

edit: I see the test relates to events not end. Still looking into it

@dyladan
Copy link
Member Author

dyladan commented Jan 9, 2023

Looks like the test was providing time 0 to startTime for the span which caused the span start to use Date.now to generate a start time. It then provided 123 to the event time which caused a drift correction. The test was flaky because it depended on the drift correction being millisecond accurate.

@dyladan
Copy link
Member Author

dyladan commented Jan 9, 2023

@legendecas should be stable now

@dyladan
Copy link
Member Author

dyladan commented Jan 9, 2023

@t2t2 I updated the web instrumentations to use Date instead of performance.now to get span end times which I think should address your concerns. I would appreciate your review because you've been very helpful. Hopefully we can get this merged soon.

Copy link
Member

@blumamir blumamir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic LGTM. Added a few non-blocking comments.

It seems there are so many logical branching here and was wondering if the test coverage is sufficient and covers all the potential permutations.

Each time can be of 5 types: undefined, Date.now(), number epoch, number performance now and hrTime. A span receives times via span constructor, event, and end. I think some of the common combinations or those that are specifically addressed by code can benefit from a dedicated unit test.

packages/opentelemetry-sdk-trace-base/src/Span.ts Outdated Show resolved Hide resolved
@@ -203,15 +227,41 @@ export class Span implements api.Span, ReadableSpan {
this._spanProcessor.onEnd(this);
}

private _getTime(inp?: TimeInput): HrTime {
if (typeof inp === 'number' && inp < otperformance.now()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in core we have similar logic where we use origin for this check, and not now

    // Must be a performance.now() if it's smaller than process start time.
    if (time < getTimeOrigin()) {
      return hrTime(time);

I believe they both work fine, but better to have consistency in the project, even maybe extract this logic into a function isTimeInputPerformanceNow(time: TimeInput) similar to the existing function isTimeInputHrTime, this is also self documenting and will make the comment below redundant

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is actually very important to the functioning of the PR. I think the function in core should be deprecated.

The reason to use performance.now is that we can be much more sure that a given number is a performance timestamp. A number before timeOrigin may be ingested from historical logs or come from an inconsistent time source which has been corrected and timeOrigin is wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, thanks for the response.
My point is that we are doing the same computation in 2 different places in 2 different ways. This comment is a suggestion to unify it (and maybe extract the logic into a function while we are at it). Could be done in this PR, later one, or not at all

packages/opentelemetry-core/src/common/time.ts Outdated Show resolved Hide resolved
packages/opentelemetry-core/src/common/time.ts Outdated Show resolved Hide resolved
packages/opentelemetry-core/src/common/time.ts Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Span start and end times in the distant past on chrome
6 participants