
Sleeping computer causes span timings to be incorrect in browser #852

Closed
dyladan opened this issue Mar 12, 2020 · 43 comments · Fixed by #3259
Assignees
Labels
bug (Something isn't working), priority:p2 (Bugs and spec inconsistencies which cause telemetry to be incomplete or incorrect)
Projects

Comments

@dyladan
Member

dyladan commented Mar 12, 2020

While debugging an issue for a user here https://gitter.im/open-telemetry/opentelemetry-node?at=5e6a4b54d17593652b7c8154 it was found that while a computer is asleep or hibernating, the performance.now() monotonic clock may be paused. This breaks the assumption that performance.timeOrigin + performance.now() ~= Date.now(); the two can differ by some arbitrary amount of time which may be hours or days.

@dyladan dyladan added the bug Something isn't working label Mar 12, 2020
@dyladan
Member Author

dyladan commented Mar 13, 2020

Originally reported by @nvolker

@mayurkale22
Member

@draffensperger @obecny would like to get your expertise on this.

@dyladan
Member Author

dyladan commented Mar 13, 2020

One solution would be to use Date.now() to capture startTime, but use performance.now() to calculate the duration of the span. This would still give us high-fidelity span durations while mitigating this issue. One problem with this is that Date.now() trusts the system clock, so if the system clock ever changes, Date.now() will use the updated time. performance.timeOrigin and performance.now() are monotonic and ignore changes to the system clock, which makes them ideal for calculating durations.

@Flarna
Member

Flarna commented Mar 16, 2020

Is this only affecting browsers, or also Node? As far as I know, Node also uses performance.timeOrigin captured at startup, so I would expect similar behavior. If it works in Node, we could take a look at how they solved it there.

@obecny
Member

obecny commented Mar 16, 2020

Can you provide more details? Are spans still being created while the computer sleeps, or how exactly does this affect things?

@dyladan
Member Author

dyladan commented Mar 16, 2020

Sleep your computer for a few hours. Any spans created after you wake up have timestamps a few hours in the past because the clock paused. This behavior continues until the browser is restarted.

@obecny
Member

obecny commented Mar 16, 2020

And what system/laptop has that? I'm working on a Mac and my computer sleeps every day; the browser is not restarted for weeks. I think I would have noticed that before, hmm.

@dyladan
Member Author

dyladan commented Mar 16, 2020

The machine which had the problem was a Mac using Chrome. Possibly this is a sleep vs. hibernate issue?

@obecny
Member

obecny commented Mar 16, 2020

ok, I see it now too:
new Date(performance.timeOrigin + performance.now()) is behind real time

@obecny
Member

obecny commented Mar 16, 2020

maybe we should raise this as a Chrome bug?

@draffensperger
Contributor

This issue of someone sleeping the computer, or leaving the tab open to go get coffee, etc. is sort of a basic limitation of measurement in the browser world. If a user starts an operation, shuts their computer, and then comes back 8 hours later and it finishes - then how long did the operation take?

Here are a few ideas for how we could deal with this:

  • Provide some documentation that this is suspected to be a known issue, and suggest that people ignore large outlier spans from the client in their debugging (this is the easiest approach)
  • Attempt to do some detection in the web library itself of when the user closes the tab or moves to a different tab (see https://stackoverflow.com/questions/3888902/detect-browser-or-tab-closing), and do some handling of that, e.g. add a special tag to all spans that the browser tab was switched and possibly even end and send what spans we currently have (not sure on that).
  • Record both the high-resolution timestamp of spans (using performance.now) to make sure we aren't losing accuracy, but also record the Date.now times, and then we can do some correction computations before spans are written and if the times don't cohere, we can make a best-effort fix, and possibly add some special tag to the span to indicate that it had this monotonic clock issue.

I think it would be worth us digging more into the browser specs, docs, etc. to better understand this Chrome on Mac behavior and what was intended by the authors for performance.now. Maybe it is a Chrome bug.

When I run new Date(performance.timeOrigin+performance.now()).getTime() - new Date().getTime() in my browser, I seem to get 0.

@obecny
Member

obecny commented Mar 16, 2020

When I run new Date(performance.timeOrigin+performance.now()).getTime() - new Date().getTime() in my browser, I seem to get 0.

just close the laptop for 30 seconds, open it and run it again; seems like a valid bug

@obecny
Member

obecny commented Mar 16, 2020

I think all we have to do is keep the performance timer, but then when we export spans we should add a delta:
const delta = new Date().getTime() - new Date(performance.timeOrigin + performance.now()).getTime()

@dyladan
Member Author

dyladan commented Mar 16, 2020

I think you would have to add the delta at the time you generate the timestamp. If you do it at export time, the whole span will shift if it started before the computer slept.
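A toy illustration of why the sampling point matters, with made-up numbers: if the monotonic clock pauses for 60 000 ms mid-span, a single delta applied at export shifts the (correct) start time too, while a delta sampled per timestamp corrects only the end.

```javascript
// Made-up numbers: the monotonic clock pauses for 60 000 ms mid-span,
// so the drift (delta) is 0 when the span starts and 60 000 when it ends.
const raw = { start: 1000, end: 2000 }; // monotonic-derived timestamps

// Export-time correction: one delta shifts BOTH ends, moving the
// already-correct start into the future.
function exportTimeShift(span, deltaAtExport) {
  return { start: span.start + deltaAtExport, end: span.end + deltaAtExport };
}

// Per-timestamp correction: apply the delta observed when each
// timestamp was taken, so only the end time is adjusted.
function perSampleShift(span, startDelta, endDelta) {
  return { start: span.start + startDelta, end: span.end + endDelta };
}

const shifted = exportTimeShift(raw, 60000);
const corrected = perSampleShift(raw, 0, 60000);
```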

@dyladan
Member Author

dyladan commented Mar 16, 2020

@obecny do you know if this affects the CORS preflight timing?

@obecny
Member

obecny commented Mar 16, 2020

@dyladan this might affect everything, but I think we should create an issue to investigate what is affected and then fix everything. I can take it after finishing gRPC.

@Flarna
Member

Flarna commented Mar 18, 2020

I tried Node.js 13.11.0 and Firefox 74.0 on Windows 10 on my Dell E7440.
Both show no issue in case of sleep, but hibernate seems to be a problem. The strange detail is that the time difference I see after 30 seconds of hibernation is about 6.5 seconds, and after 60 seconds of hibernation it's once again around 6.5 seconds.

I would not expect Node.js server processes to run on notebooks that hibernate frequently, but AWS Lambda suspends/freezes the containers it uses.

@dyladan
Member Author

dyladan commented Mar 18, 2020

This is a difficult thing to test in Lambda, as you have so little control over the sleep.

@pauldraper
Contributor

This means there is simply no way to get true wall time on those platforms.

These have the same appearance:

  1. Start span, sleeping for 30 minutes, and end.
  2. Start span, adjust system time forward 30 minutes, and end.

Though the first is 30 minutes of wall time and the second is not.

I think it would be worth us digging more into the browser specs, docs, etc. to better understand this Chrome on Mac behavior and what was intended by the authors for performance.now. Maybe it is a Chrome bug.

IDK, but it's a common issue. Go has the same problem.

@dyladan
Member Author

dyladan commented Mar 24, 2020

In opentelemetry JS we use the performance timer which is not affected by adjustments to the system clock, so (2) would have the correct duration.

@dyladan
Member Author

dyladan commented Mar 25, 2020

Here are the options so far:

  1. Document this as a known issue and move on
  2. Each time a timestamp is requested, detect if the performance timer has diverged significantly from the system clock and apply a correction
  3. Record startTime using the system clock, and calculate duration using the performance timer.

For what should be obvious reasons, I do not think we should go with (1).

On the call today we discussed options 2 and 3.

Option 2 has the advantage of having timings always accurate according to the system clock, but if the system clock changes during a span the timings could be incorrect.

Option 3 has the advantage of using a monotonic high resolution clock for duration independent from system clock changes, but may report incorrect span timings if the performance timer pauses during a span.

@pauldraper
Contributor

Aren't (1) and (3) the same?

@dyladan
Member Author

dyladan commented Mar 25, 2020

No. With (1) your spans may show as the whole span being in the past even if the system clock is correct, and the duration could be incorrect if the performance clock pauses during the span. With (3) you will get the span start time as the correct time (if the system clock is correct), but the duration could be incorrect if the performance timer pauses during the span.

@draffensperger
Contributor

My suggestion would be that we try to keep the precision of the monotonic clock where possible for all aspects of the span: its start/end times and its timed events. That kind of accuracy can help people troubleshoot networking issues or short timings in a UI (e.g. how long did it take to set up an SSL connection, how long did it take to render these frames, etc.).

One idea would be to focus first on detecting the condition and annotating spans about it. That way at least we are beginning to communicate the challenge to users in such a way that they could tell how often it's happening in their own data. Then in a basic way they could just exclude those spans from their queries, etc.

Then as the next phase once we have a better understanding of how often this happens and how exactly the different browser clocks relate, we could do some more effective mitigations.

Just a thought!

@dyladan
Member Author

dyladan commented Mar 26, 2020

@draffensperger so you're suggesting allow the span timings to be incorrect, but maybe annotate spans with the offset when the system clock diverges from the performance timer by more than a second or so?

@draffensperger
Contributor

Correct - my suggestion would be that for version 1 of this work, we just detect when the span's timings are probably incorrect and add some extra labels to the span to indicate how we think it's incorrect. For starters it could just be a likelyClientPause=true label that people could filter out in their queries to exclude the spans that exhibit this behavior.

Maybe version 1.1 of the idea would be to add labels that communicate what we think about the pause, e.g. likelyClientPauseStart and likelyClientPauseEnd or something like that.

Then in version 2, once we better understand how often this happens, maybe do some experimentation with the performance.now clock, maybe write some design doc / RFC, etc. and get a solid understanding of the limitations of different browsers, time changes, users adjusting their time, etc., we actually write code to do the correction.

I like the idea of shifting the span forward, but it feels tricky to do that well. What if the first part of a larger operation triggers an API call that finishes, but the second part (done after a coffee break, once the computer resumes) triggers a second API call? Shifting the span forwards or backwards would dislocate it from the backend API spans it's associated with. So I think it's hard to really win here.

But the idea would be that with version 2, some mitigation strategy, we would still annotate everything we know about the span as best we can. E.g. add clientDateStartTime and clientDateEndTime labels that capture the Date.now values of the span start and end, but then the actual span start and end times might be shifted by whatever algorithm we came up with.
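As a sketch, the version-1 annotation could look like this (likelyClientPause and clientDateEndTime are the label names proposed above; the helper and the threshold are hypothetical):

```javascript
const PAUSE_THRESHOLD_MS = 1000; // assumed tolerance, not from the SDK

// Compare the monotonic-clock end time with the system-clock end time
// and attach the proposed labels when they disagree.
function annotateIfPaused(span, perfEndTime, dateEndTime) {
  const drift = dateEndTime - perfEndTime;
  if (Math.abs(drift) > PAUSE_THRESHOLD_MS) {
    span.attributes = {
      ...span.attributes,
      likelyClientPause: true,
      clientDateEndTime: dateEndTime,
    };
  }
  return span;
}

const normal = annotateIfPaused({ attributes: {} }, 5000, 5010);  // in sync
const paused = annotateIfPaused({ attributes: {} }, 5000, 65000); // drifted
```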

@dyladan dyladan added this to Needs triage in Bug Tracker Apr 29, 2020
@dyladan dyladan moved this from Needs triage to High priority in Bug Tracker Apr 29, 2020
@dyladan
Member Author

dyladan commented May 5, 2020

I think another user is being affected by this, this time in a Kubernetes setup. It seems like for some reason the performance timer may not be reliable in his setup; I'm not familiar enough with Kubernetes to know what might cause this behind the scenes. Date.now() is reporting correct timestamps in his pod (currently 1588685810545), but let performance = require("perf_hooks").performance; console.log(performance.now() + performance.timeOrigin) is returning an incorrect timestamp (currently 894664825.411137).

/cc @romilpunetha

@jchamberlain

jchamberlain commented Dec 21, 2020

Hey all, wanted to chime in since I've spent several hours highly confused as to why I couldn't query my spans, and it sounds like this might be why.

It looks to me like performance.timeOrigin is scoped per page load in Chrome, but per browser load in Firefox. That is, opening a new tab or refreshing the page gets me a new value for performance.timeOrigin in Chrome, but not in Firefox. (I'm on Firefox 84 on Mac.)

Until a few minutes ago, I hadn't restarted Firefox for a couple of weeks, but had put my computer to sleep every night. As a result, all my OpenTelemetry traces had timestamps from Dec 11, and I was pulling my hair out trying to understand why I couldn't find my spans in Lightstep, Honeycomb, or Elastic. Restarting Firefox magically solved my problem.

Am I correct that this bug is probably to blame? And does that mean this bug is minor for Chrome but kinda huge for Firefox? (I.e., any Firefox user on my site will be invisible in my monitoring tool of choice unless they've recently restarted their browser.)

@ajbouh

ajbouh commented Dec 21, 2020

I saw an issue similar to this in Chrome. I ended up sampling the delta just before initializing the exporter and then adjusting start times before upload. I doubt this is enough to actually address all possible permutations of clock skew though :/

@dyladan
Member Author

dyladan commented Dec 21, 2020

Am I correct that this bug is probably to blame?

Most likely

I doubt this is enough to actually address all possible permutations of clockskew though :/

Unfortunately this is also likely correct

I tried to tackle this a while ago in #1019. It is unfortunately a tricky problem. I do think it should be solved though.

@dobesv
Contributor

dobesv commented Jan 5, 2021

I see two related but separate issues here:

  1. If the device sleeps/hibernates while the app is running, Date.now() and performance.timeOrigin + performance.now() can get out of sync, because performance.now() only increments while the computer is awake; thus start and end times on spans show up at the wrong "wall clock time"
  2. If the device sleeps/hibernates in the middle of a span, only the end time is affected by the first issue, resulting in a much shorter duration and an earlier end time than expected

Considerations:

  • if a low-resolution timer is used (e.g. Date.now()) then the times would be more accurate but less precise, to the point of being useless for very short spans or series of short spans close together. Also, if the clock is set backwards mid-operation you might get a span that ends before it starts and have to account for that
  • there's no high-resolution timer that gives accurate "wall clock" times
  • complicated logic to try and detect and compensate for drift between the performance timer and the wall clock every time a span is created is probably not desirable
  • A trace can be made up of multiple spans and ideally the spans within a single trace (or span hierarchy) will be ordered on a timeline in the same order they were executed
  • for a distributed trace it is better for the start/end times of spans to line up with their counterparts in other services
  • changes to the system clock should probably not shorten span durations or cause them to be zero or negative

Not quite sure of the solution; I suppose it is impossible to completely solve this in general.

@dyladan
Member Author

dyladan commented Jan 5, 2021

Not quite sure of the solution; I suppose it is impossible to completely solve this in general.

I suspect it may be impossible to truly solve this in the general case without some server to synchronize clocks. One solution is to use the low-resolution clock for start times and the high-resolution monotonic clock for durations. This would ensure (on systems whose system time is correct) that spans have close-to-accurate start times and very accurate durations, while ensuring that system clock changes do not affect span timings.

@obecny
Member

obecny commented Jan 5, 2021

I think we have been discussing this for quite a long time :). I would be in favor of creating some solution and then validating it. If the only thing that is 100% up to date is new Date().getTime(), then we will have to use it one way or another. We can then mix Date with performance.now, calculate the deltas, and return the corrected result. Even if we lose some microseconds, it will still be much better and more bulletproof compared to what we have now. But I would really be in favor of creating some MVP, as we all have some ideas how this can be done; now is probably the right time to make it happen.

@dobesv
Contributor

dobesv commented Jan 5, 2021

Is there a way to make this configurable on the client so people can experiment?

For example, provide a clock implementation to the opentelemetry API to use to generate timestamps?

@dyladan
Member Author

dyladan commented Jan 5, 2021

The interface for a clock implementation would have to make assumptions about the final implementation. For instance do we implement now() to get the current timestamp, or do we implement t = start(); t.end() and get durations?

@Flarna
Member

Flarna commented Jan 5, 2021

I think we should differentiate between Node and browser here. I don't think we should add unneeded overhead/complexity in Node because of hibernate/sleep issues that effectively affect only browsers.

Besides that I think the API should offer both, get an absolute timestamp and get a duration.

Currently users can provide timestamps for span start/end, but they have no way via the API to use the same time source as the SDK. Having a clock interface in the API would improve this.
Having the possibility to install a time provider (similar to a context manager) would even allow users to use something fancier, like a clock synced with their backend.

@Oberon00
Member

Oberon00 commented Jul 12, 2022

We are actually affected by this bug on Microsoft Azure (confirmed with Azure Functions, Node.js v14.18.1 on Windows); we observe time drifts of the span start on the order of 10 seconds relative to the wall clock time. I can also reproduce the issue locally by hibernating my notebook, with Node v15.9.0 on Windows 11.

So it seems this does not only affect Node 8.

@Oberon00
Member

Oberon00 commented Jul 12, 2022

I think the wall clock time could get out of sync with the high resolution timer for any number of reasons, so many that I think it can be taken for a fact that it will drift if the application is running for any longer amount of time. Examples:

Only affecting wall clock time but not high resolution timer:

  • NTP sync or other time correction
  • Leap seconds (which do affect UTC, contrary to daylight saving time changes)

Affecting high resolution timer:

  • Any kind of suspension:
    • "I closed my notebook lid" / hibernate / suspend / sleep -- this will mostly affect the browser and dev machines
    • VM being migrated to another physical host, and probably other VM shenanigans you may face when running on any PaaS/FaaS product where who-knows-what may happen to the underlying VM.
    • CPU clock changes are known to cause inaccuracies in some configurations
  • Some hardware / firmware bugs & corner cases (at least historically)

It seems like performance.now is uv_hrtime: https://github.com/nodejs/node/blob/5fad0b93667ffc6e4def52996b9529ac99b26319/src/node_perf_common.h#L19

You found nodejs/node#17893 already, but even if the offset is calculated correctly at startup, there is just no way to solve this issue with a constant calculated-only-once offset. There is libuv/libuv#1674 for the underlying uv_hrtime API, which has been closed as stale.

uv_hrtime is implemented with QueryPerformanceCounter on Windows. https://github.com/libuv/libuv/blob/f250c6c73ee45aa93ec44133c9e0c635780ea741/src/win/util.c#L490-L514, which is definitely liable to the aforementioned time drift.

The Linux implementation (actually Linux-specific; other Unixes have different implementations) is more complicated: the generic Unix entrypoint has only one line https://github.com/libuv/libuv/blob/0b1c752b5c40a85d5c749cd30ee6811997a8f71e/src/unix/core.c#L110-L112 calling the specific part with UV_CLOCK_PRECISE https://github.com/libuv/libuv/blob/c40f8cb9f8ddf69d116952f8924a11ec0623b445/src/unix/linux-core.c#L121-L155. This ends up calling clock_gettime(CLOCK_MONOTONIC), which is a libc/POSIX API, described for Linux e.g. here https://man7.org/linux/man-pages/man3/clock_gettime.3.html:

This clock does not count time that the system is suspended.

On Linux there would be CLOCK_BOOTTIME which would (supposedly) solve this problem but I believe it is not accessible through Node. Also, there is no equivalent for Windows (unless you want something a bit less precise; though maybe CLOCK_BOOTTIME is also less precise than CLOCK_MONOTONIC).

You could look at how other SDKs solve this problem. Python has a different API https://docs.python.org/3/library/time.html#time.time_ns, which ends up in this implementation https://github.com/python/cpython/blob/0ff626f210c69643d0d5afad1e6ec6511272b3ce/Python/pytime.c#L847-L955, but that's probably not really applicable to Node.js (and is "a bit less precise", at least on Windows).

I think you could implement rather one to one what Java does though: There, each local root span takes a timestamp with both the most precise available wall clock time and the high resolution timer when it starts to calculate the offset. From then on, only the HR timer is used, i.e. relative offsets of local child span start times and all end times are precise. Of course, if a suspension / leap seconds / ... happens during the trace, you still have the time drift, but that is much less likely and the next trace will be correct again. This is implemented with the aptly named https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk/trace/src/main/java/io/opentelemetry/sdk/trace/AnchoredClock.java and this logic at span creation: https://github.com/open-telemetry/opentelemetry-java/blob/16be81aed803e15694de29c9cea25f7bcf4d77c1/sdk/trace/src/main/java/io/opentelemetry/sdk/trace/SdkSpan.java#L151-L182

@dyladan dyladan self-assigned this Jul 29, 2022
@dyladan dyladan added the priority:p2 Bugs and spec inconsistencies which cause telemetry to be incomplete or incorrect label Jul 29, 2022
@guyguy333

Is this issue fixed now that #3134 has been merged?

@dyladan
Member Author

dyladan commented Sep 16, 2022

It is waiting on #3259 for the release, but yes.

@dyladan dyladan linked a pull request Sep 16, 2022 that will close this issue
pichlermarc pushed a commit to dynatrace-oss-contrib/opentelemetry-js that referenced this issue Dec 15, 2023
* chore(deps): update dependency @types/node to v16

* fix: typescript issues
martinkuba pushed a commit to martinkuba/opentelemetry-js that referenced this issue Mar 13, 2024
* chore(deps): update dependency @types/node to v16

* fix: typescript issues
martinkuba pushed a commit to martinkuba/opentelemetry-js that referenced this issue Mar 16, 2024
* chore(deps): update dependency @types/node to v16

* fix: typescript issues