Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Span start and end times in the distant past on chrome #3355

Closed
MisterSquishy opened this issue Oct 24, 2022 · 22 comments · Fixed by #3434
Closed

Span start and end times in the distant past on chrome #3355

MisterSquishy opened this issue Oct 24, 2022 · 22 comments · Fixed by #3434
Labels
bug Something isn't working priority:p2 Bugs and spec inconsistencies which cause telemetry to be incomplete or incorrect signal:traces Issues and PRs related to general traces signal

Comments

@MisterSquishy
Copy link
Contributor

MisterSquishy commented Oct 24, 2022

What happened?

Steps to Reproduce

call trace.getTracer('service').startSpan(...) without setting startTime from an instance of Chrome on version 106.0.5249.119 (i'm not sure exactly how far back this open bug goes, but it's definitely in that version)

Expected Result

spans emitted with start and end times reflecting the time the spans were actually started and ended

Actual Result

spans emitted with start and end times from days/weeks ago

Additional Details

this seems to be caused by a chrome bug affecting performance.timeOrigin; obviously the ideal thing here would be for chrome to just fix this bug. but wanted to bring this to your attention in case you weren't aware.

would you recommend manually setting startTime when starting a span (using Date.Now()) for the time being? Or are we better off awaiting a fix?

OpenTelemetry Setup Code

import { trace } from '@opentelemetry/api';
const span = trace.getTracer('service').startSpan('test_span') as Span;
span.end();

package.json

No response

Relevant log output

the above code emits spans that look like this (emitted around unix timestamp 1666632162339000000):

[
    {
        ...
        "instrumentationLibrarySpans": [
            {
                "spans": [
                    {
                        ...
                        "startTimeUnixNano": 1666297705002200000,
                        "endTimeUnixNano": 1666297705489300000,
                        ...
                    },
                    ...
        ]
        ...
    }
]
@MisterSquishy MisterSquishy added bug Something isn't working triage labels Oct 24, 2022
@legendecas
Copy link
Member

Would you mind sharing the version of the OpenTelemetry SDK you are using? Thank you!

@legendecas legendecas added the signal:traces Issues and PRs related to general traces signal label Oct 25, 2022
@MisterSquishy
Copy link
Contributor Author

i'm on 1.0.0-rc.0

this seems like it might be related to #3279 (comment), but to be honest im not totally sure i understand that issue. i'm not really looking for a supported way to get anchored clock times, i just want my spans to have accurate start and end times.

@oliverswitzer
Copy link

oliverswitzer commented Oct 25, 2022

Can confirm that I am also experiencing this with @opentelemetry/api @ v1.0.1 being used in the browser on Chrome.

It seems as though passing in startTime: new Date() like so tracer.startSpan(name, { root: true, startTime: new Date() }) does not override the inaccurate date time.

Is there some other way to do this that I'm missing in the documentation?

This date time is correct in Safari.

@Flarna
Copy link
Member

Flarna commented Oct 25, 2022

It's the version of SDK which matters here, not the API version.

One can provide timestamps to start and end. The span duration is calculated by endTime - startTime.
If the two timestamps use a different time source the duration is likely garbage.

So either pass no timestamp to start/end or the a timestamp based on the same source to both of them to get consistent durations. You may still see time shifted spans because of the mix of time sources which may drift apart.

There were issues with using performance as time source as it turned out that e.g. time was wrong after notebook went to sleep for a while. #3134 was intended to fix this by using additionally Date.now() as an anchor for spans.
But this change has shown issues for cases where people provide there own timestamp to start and/or end.

Unfortunately there are also instrumentations like @opentelemetry/instrumentation-fetch which call start/end with their own timestamp.

#3279 discusses a solution for this problem by providing an API to read the anchored clock and allow users to take timestamps matching to that one create by SDK internally.

fyi @dyladan

@oliverswitzer
Copy link

@Flarna thanks for that info. Apologies, looks like I was depending on:

    "@opentelemetry/tracing": "~0.22.0",
    "@opentelemetry/web": "~0.22.0"

Which have seemingly been replaced by:

    "@opentelemetry/sdk-trace-base": "^1.7.0",
    "@opentelemetry/sdk-trace-web": "^1.7.0",

Since I last worked on this project... Is that correct to assume?

It seems that switching to these libraries cause the inaccurate timestamp issue to go away for me

@MisterSquishy
Copy link
Contributor Author

ah sorry for the delay @Flarna -- super helpful context. i have the same SDK dependencies as @oliverswitzer, but haven't attempted an upgrade. it seems that #3134 was reverted, so is it safe to say this is still an open issue?

is it worth keeping this open if it's so closely related to #3279? i just worry that that latter issue isnt quite as discoverable for someone who's just observing the symptoms of this bug.

@dyladan dyladan added duplicate and removed triage labels Nov 2, 2022
@dyladan
Copy link
Member

dyladan commented Nov 2, 2022

i'm on 1.0.0-rc.0

I do not recommend that version. All the -rc.x versions may have major compatibility issues with the stable APIs

It seems that switching to these libraries cause the inaccurate timestamp issue to go away for me

The inaccurate timestamp issue @Flarna mentioned was actually introduced in 1.7.0. Not sure what happened in 0.22.x but it is not the same. In any case, I would recommend updating packages. If the timestamp bug is affecting you then you can either hold to 1.6.0 or you can wait for the 1.8.0 release which should be this week (today?).

it seems that #3134 was reverted, so is it safe to say this is still an open issue?

Yes it was reverted but the fix is not yet released

is it worth keeping this open if it's so closely related to #3279? i just worry that that latter issue isnt quite as discoverable for someone who's just observing the symptoms of this bug.

In this case I wish github issues had better issue linking. I added the duplicate label. I'm fine closing or leaving open.

@dyladan dyladan added the priority:p2 Bugs and spec inconsistencies which cause telemetry to be incomplete or incorrect label Nov 2, 2022
@MisterSquishy
Copy link
Contributor Author

which timestamp bug will be fixed in 1.8.0? sounds like there are a couple:

  • 1.6 and earlier: inaccurate span start/end timestamps caused by performance.timeOrigin being wrong sometimes/in certain browsers (described in this issue)
  • 1.7: timestamps are accurate, but durations can be inaccurate due to feat: anchored clock #3134 (described in Provide a supported way to get anchored clock times #3279)
  • 1.8: timestamps and durations are both accurate? or does 1.8 just revert to the pre-1.7 behavior (relying solely on performance to get start/end times)?

@dyladan
Copy link
Member

dyladan commented Nov 3, 2022

1.8 will revert to <=1.6 behavior as it was causing less (but not 0) problems

The long-term fix is tricky because some timestamps are provided by the instrumentation author from the performance API. When span.end is provided with a timestamp there is no way to know if it was provided by the performance timing API or from Date.now (or some other clock).

@MisterSquishy
Copy link
Contributor Author

Gotcha. Sounds like the best avenue for now is to provide explicit start and end times using Date.now() whenever possible. I think it makes sense to leave this issue open; the feature requested in #3279 represents one way to solve the bug described in this issue, so they're kinda different but closely related.

(as an aside that y'all are likely aware of, I'm also noticing the same behavior for metrics, I assume the anchored clock would fix that as well)

@dyladan
Copy link
Member

dyladan commented Nov 4, 2022

metrics should likely just be changed to use Date.now as there is really no benefit to the monotonic clock in that context

@MisterSquishy
Copy link
Contributor Author

ah interesting -- worth a separate issue? that sounds like a pretty easy fix, maybe i could just do that one

@dyladan
Copy link
Member

dyladan commented Nov 4, 2022

@MisterSquishy maybe a separate issue yes, but i wouldn't work on it until there is consensus with the other maintainers. We may also want to introduce a global clock api which uses Date by default but can be swapped with some user clock implementation (important for nontraditional runtimes)

@MisterSquishy
Copy link
Contributor Author

ah fair enough -- ill raise the issue and see where it goes!

@yuanman0109
Copy link

It's the version of SDK which matters here, not the API version.

One can provide timestamps to start and end. The span duration is calculated by endTime - startTime. If the two timestamps use a different time source the duration is likely garbage.

So either pass no timestamp to start/end or the a timestamp based on the same source to both of them to get consistent durations. You may still see time shifted spans because of the mix of time sources which may drift apart.

There were issues with using performance as time source as it turned out that e.g. time was wrong after notebook went to sleep for a while. #3134 was intended to fix this by using additionally Date.now() as an anchor for spans. But this change has shown issues for cases where people provide there own timestamp to start and/or end.

Unfortunately there are also instrumentations like @opentelemetry/instrumentation-fetch which call start/end with their own timestamp.

#3279 discusses a solution for this problem by providing an API to read the anchored clock and allow users to take timestamps matching to that one create by SDK internally.

fyi @dyladan

Does the next 1.8.0 version solve this problem? After I update to the latest version, the time will still pass. Do I need to set the time manually?

@Flarna
Copy link
Member

Flarna commented Nov 14, 2022

1.8.0 is like >=1.6.0. Only 1.7.0 was different.

The change included in 1.7.0 solved the problem of drifting performance clock but it introduced problems for locations where timestamps were given manually (most browser instrumentations) therefore it was reverted (#3359).

There is #3384 to improve this in upcoming release. Any input there is welcome, maybe you can even test it in your setup already now.

Do I need to set the time manually?
This option will always work but it restricts you from using any 3rd party instrumentation as they either rely on SDK or use their own time source which may be a different one then yours.
But if you have all start/end spans in your control you can choose the your preferred, optimized time source.

@tl-sebastien-kovacs
Copy link

tl-sebastien-kovacs commented Nov 18, 2022

Thanks for the investigation team,

I can see that this issue is linked to a chromium ticket, do we know versions of chrome where it's not an issue ?
I did upgrade to chrome 107.0.5304.110 and seeing the same problem.

Also maybe stating the obvious (as it's linked to chrome implementation) but from local testing, Firefox and Safari don't seem to be affected by this problem.

@SimenB
Copy link
Contributor

SimenB commented Nov 30, 2022

FWIW, Chrome 108 was released 9 hours ago with a fix - from my testing the timings aren't broken anymore.

EDIT: Might have misread the bug tracker and it's currently slated for 109. But we are seeing less of this bug from 108. Might be confirmation bias, but I don't see any traces from a quick search broken with 109.

@dyladan
Copy link
Member

dyladan commented Nov 30, 2022

@SimenB that sounds great do you have a link to the bug? Even if it is fixed in chrome, it still affects other browsers (to an admittedly much lesser extent) and faas services like lambda.

edit: i see the link in description

@t2t2
Copy link
Contributor

t2t2 commented Dec 1, 2022

FWIW, Chrome 108 was released 9 hours ago with a fix - from my testing the timings aren't broken anymore.

Also possible that as the new version had been just released there's less chances for drift to accumulate. One of the ways the drift is known to start is when the computer goes to sleep, eg most usual:

  1. work on laptop
  2. once done with work close laptop lid (-> laptop goes to hibernate)
  3. next day reopen laptop - performance api is drifted, reporting times as if it's still at right after hibernation

9 hours of in the wild is definitely too early to see clear patterns like above. Also the drift goes away at browser restart, which means updating the browser would clear the drift. I'd more likely assume 109 would be the release as per chromiumdash linked in crbug issue

That being said idk if it's a full fix, this seems like a rollback of "Unix epoch / monotonic clock should be sync across Performance objects" (introduced in 105), but I've known about sleeping laptop bug for years (back in Plumbr's RUM which is from about 2017) - only then I came to conclusion that opening a new tab or even just hardnav (non-SPA page navigation) would fix the drift, so now I was surprised that this bug was now persisting across navigations and into new tabs - and this would explain why

@yuanman0109
Copy link

Have you made any progress on this issue recently? At present, because span is always in the past, I can only force changes in the custom exported hooks

@RazGvili
Copy link
Contributor

RazGvili commented Dec 27, 2022

I am having the same issues testing the web-sdk.
Since my timeOrigin is on Dec 11, spans startTime is way off.

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working priority:p2 Bugs and spec inconsistencies which cause telemetry to be incomplete or incorrect signal:traces Issues and PRs related to general traces signal
Projects
None yet
10 participants