Rethink how we check whether node's time is synchronized #1815

Open
marta-lokhova opened this issue Oct 24, 2018 · 13 comments · May be fixed by #4101

@marta-lokhova (Contributor) commented Oct 24, 2018

It looks like we currently check whether local time is synchronized with an NTP server (via NtpSynchronizationChecker/NtpWork/NtpClient). A few observations:

  • The implementation looks complex, and I don't think it's actually using the Work/WorkParent interfaces correctly, so it should be refactored/simplified.
  • There also appears to be a potential race condition, since the work modifies mIsSynchronized from the worker thread without using a mutex or std::atomic (see the sketch below).
  • Additionally, @MonsieurNicolas proposed that we might not even need to check against an NTP server, but against other nodes instead.
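
For illustration, a minimal sketch of how the cross-thread flag from the second bullet could be made safe (the class and method names here are hypothetical, not stellar-core's actual interfaces): the result is published through std::atomic so the worker thread and the main thread never race on it.

```cpp
// Hypothetical sketch only -- not the actual NtpSynchronizationChecker code.
// The NTP result is written on a worker thread and read on the main thread,
// so the flag must be std::atomic (or mutex-protected) to avoid a data race.
#include <atomic>

class NtpSyncFlag
{
    std::atomic<bool> mIsSynchronized{false};

  public:
    // Called from the worker thread once the NTP response has been evaluated.
    void
    setSynchronized(bool value)
    {
        mIsSynchronized.store(value, std::memory_order_relaxed);
    }

    // Called from the main thread, e.g. when reporting node status.
    bool
    isSynchronized() const
    {
        return mIsSynchronized.load(std::memory_order_relaxed);
    }
};
```
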
@MonsieurNicolas (Contributor)

A few more ideas here: ideally we can report on both how the local node is doing compared to the network and how other nodes are (not) nominating properly.

@MonsieurNicolas MonsieurNicolas added this to To do in v11.2.0 via automation Apr 22, 2019
@MonsieurNicolas MonsieurNicolas removed this from To do in v11.2.0 May 28, 2019
@MonsieurNicolas MonsieurNicolas added this to To do in v12.0.0 via automation May 28, 2019
@MonsieurNicolas MonsieurNicolas added this to To do in v12.1.0 via automation Sep 11, 2019
@MonsieurNicolas MonsieurNicolas removed this from To do in v12.0.0 Sep 11, 2019
@MonsieurNicolas MonsieurNicolas added this to To do in v13.0.0 via automation Oct 10, 2019
@MonsieurNicolas MonsieurNicolas removed this from To do in v12.1.0 Oct 10, 2019
@MonsieurNicolas MonsieurNicolas removed this from To do in v13.0.0 Apr 5, 2020
@MonsieurNicolas (Contributor)

We've run into this recently. See logs:

October 11th 2021, 07:28:03.934
2021-10-11T14:28:03.934 GCGB2 [Ledger INFO] Got consensus: [seq=37760044, prev=7c864e, txs=269, ops=507, sv: [ SIGNED@wirex_sg txH: 7718e1, ct: 1633962483, upgrades: [ ] ]]
ct=2021-10-11T07:28:03

October 11th 2021, 07:28:09.277
2021-10-11T14:28:09.277 GCM6Q [Ledger INFO] Got consensus: [seq=37760045, prev=a26bde, txs=294, ops=510, sv: [ SIGNED@coinqvest_fi txH: c1c4f8, ct: 1633962484, upgrades: [ ] ]]
ct=2021-10-11T07:28:04

October 11th 2021, 07:28:14.758
2021-10-11T14:28:14.758 GCM6Q [Ledger INFO] Got consensus: [seq=37760046, prev=73b1fc, txs=273, ops=501, sv: [ SIGNED@wirex_us txH: 61b418, ct: 1633962493, upgrades: [ ] ]]
ct=2021-10-11T07:28:13

You can clearly see that coinqvest_fi introduced a ct that was off by quite a bit. A simple way to report such an anomaly is to compare the ct of the local node (if it started nomination) against the value that actually externalized.
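
A minimal sketch of the comparison suggested above (the helper and its threshold are hypothetical, not existing stellar-core code): if the local node nominated a close time for a slot, compare it with the close time that actually externalized and warn once the gap exceeds some tolerance.

```cpp
// Hypothetical sketch: compare the locally nominated close time for a slot
// with the close time that externalized, and warn past a tolerance.
#include <cstdint>
#include <iostream>

void
reportCloseTimeAnomaly(uint64_t localNominatedCt, uint64_t externalizedCt,
                       uint64_t toleranceSeconds)
{
    uint64_t delta = localNominatedCt > externalizedCt
                         ? localNominatedCt - externalizedCt
                         : externalizedCt - localNominatedCt;
    if (delta > toleranceSeconds)
    {
        std::cout << "WARNING: externalized close time differs from the "
                     "locally nominated close time by "
                  << delta << " seconds\n";
    }
}
```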

@MonsieurNicolas MonsieurNicolas added this to To do in v18.2.0 via automation Oct 11, 2021
@graydon (Contributor) commented Oct 28, 2021

NB: 000875a removed the NTP checker entirely. So we do not notice such drift-from-shared-reality at all.

@marta-lokhova marta-lokhova self-assigned this Oct 28, 2021
@MonsieurNicolas MonsieurNicolas removed this from To do in v18.2.0 Dec 1, 2021
@MonsieurNicolas MonsieurNicolas added this to To do in v18.4.0 via automation Dec 1, 2021
@MonsieurNicolas MonsieurNicolas removed this from To do in v18.4.0 Jan 25, 2022
@MonsieurNicolas MonsieurNicolas added this to To do in v18.5.0 via automation Jan 25, 2022
@marta-lokhova (Contributor, Author)

@MonsieurNicolas I think trying to report on how other nodes are not nominating correctly may lead to false positives. Specifically, the ct picked for nomination depends on the node externalizing the latest ledger. So if the node is lagging for whatever reason, it may report other nodes as drifting (it will have a ct higher than what it sees externalized).

Reporting the other way around seems good enough: a node with a drifted clock will see that its ct is behind whatever was externalized. Note that if the node is lagging, we may not get a warning, but that's probably okay, as long as the node recovers and starts reporting the drift correctly (I'd prefer an occasional false negative over a false positive that may cause confusion among operators). Let me know what you think.

@MonsieurNicolas (Contributor)

I think you're mixing "a node not nominating in time" with "a node has a bad clock".

The first kind of problem leads to things like:

  • node starts nomination late, causing consensus to stall and time out during nomination as a different leader is selected

The second kind of problem leads to:

  • "jitter" in close time, that gives the impression that some ledgers were slow/fast.
    • If a node has a fast clock, ct will be set too high, which will appear to be a slow ledger in the dashboard. The next leader will rectify the drift back (which will show up as a "fast ledger" in the dashboard).
    • The other way is true as well. If a node has a slow clock, it will nominate with ct too small which looks like a fast ledger.
  • if the delta is too high, it may lead to multiple "1 second" ledgers, as nodes are forced to vote with ct=lcl.ct+1 (see the sketch below)
  • if the delta is "way too high", the value will be considered invalid by other nodes until it gets back into the validity range, or it may cause the node to throw away messages from other nodes (and fall out of sync).

This issue deals with the second kind of problem: "bad clock".
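
As an illustration of the lcl.ct+1 point above, here is a hedged sketch of the constraint (a hypothetical helper, not the actual stellar-core function): a candidate close time cannot go backwards, so a node whose clock lags ends up proposing one-second increments until its clock catches up.

```cpp
// Hypothetical sketch of the constraint referenced above: a nominated close
// time must be strictly greater than the last closed ledger's close time,
// so a lagging clock degenerates into repeated lcl.ct + 1 ("1 second")
// ledgers until the clock catches up.
#include <algorithm>
#include <cstdint>

uint64_t
pickCandidateCloseTime(uint64_t localWallClockSeconds, uint64_t lclCloseTime)
{
    return std::max(localWallClockSeconds, lclCloseTime + 1);
}
```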

Ideally, whatever solution we come up with allows us to warn:

  • the local operator that they should fix their clock
  • other node operators that they have a bad validator in their quorum set

I think it's acceptable to warn if a node is always behind: this points to a problem with that node's ability to follow quorums.
I also think that we can work off the assumption that clocks drift slowly, so we can average things out over relatively long periods (minutes, say) if needed; this should allow us to cancel out the natural variation caused by network conditions (see the sketch below).
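
A sketch of that "average over minutes" idea (the class, window size, and thresholds are all hypothetical): record the signed difference between each externalized close time and the local clock at the moment of externalization, and only treat the node as drifting when the average stays beyond a bound.

```cpp
// Hypothetical sketch: smooth out per-ledger jitter by averaging the signed
// difference between externalized close times and the local clock over a
// window of recent ledgers.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <numeric>

class ClockDriftEstimator
{
    std::deque<int64_t> mDeltas;   // externalized ct - local time, in seconds
    size_t const mWindowSize = 60; // roughly five minutes of 5-second ledgers

  public:
    void
    recordExternalize(uint64_t externalizedCt)
    {
        auto now = std::chrono::system_clock::now().time_since_epoch();
        int64_t localSeconds =
            std::chrono::duration_cast<std::chrono::seconds>(now).count();
        mDeltas.push_back(static_cast<int64_t>(externalizedCt) - localSeconds);
        if (mDeltas.size() > mWindowSize)
        {
            mDeltas.pop_front();
        }
    }

    // A persistently positive average suggests the local clock runs slow
    // relative to the network; persistently negative suggests it runs fast.
    double
    averageDriftSeconds() const
    {
        if (mDeltas.empty())
        {
            return 0.0;
        }
        return std::accumulate(mDeltas.begin(), mDeltas.end(), 0.0) /
               static_cast<double>(mDeltas.size());
    }
};
```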

@marta-lokhova (Contributor, Author)

All I was saying is that the ct picked by a node with a bad clock is also affected by how the node is doing overall, because ct also depends on ledger trigger timing. So my question was about the accuracy of reporting based on that.

@marta-lokhova (Contributor, Author)

(copying a discussion from elsewhere for posterity)

What I'm trying to convey is that with the approach proposed in this issue, where we rely on relative time from other validators, we will get a lot of noise. Even with perfectly identical clocks on all validators, nodes will experience various levels of drift based on how they process network events.

What I don't want is operators getting confused by bogus reporting, so I'd rather report conservatively, but with a higher probability that there is actually an issue.

As for the questions above:

  • "delta way too high" - I’m not considering that case. We allow a large enough close time slip. Assuming clocks are drifting slowly, this is plenty of time for the operator to fix this.
  • “1 second” ledgers vs “slow” ledgers - this is a valid case to consider. Self reporting vs reporting on others are kind of similar here. Let’s say your local CT is behind what was externalized. Is it your local clock or does the nomination leader have a bad clock? Locally, you have CT for every ledger so it’s easy to average the data and see if you’re consistency behind, whereas for others you only know the CT when they’re a leader. I mean, it’s possible to extract that info and store it per node, but this solution is quite a bit more complicated and it's unclear how valuable it would be.

@marta-lokhova (Contributor, Author)

Summarizing the discussion on this issue:

  • A simple solution would be to implement some self-reporting based on either NTP or close time. But from experience, self-reporting isn't sufficient: operators tend to neglect warnings, and typically someone else discovers that the node has issues.
  • A better alternative would be to have other nodes report on a node with a bad clock. However, that approach is quite a bit more complicated (we need to track each node in a qset), and the ROI doesn't seem very high here.
  • Another problem with relative close time checks is that close time heavily depends on how well the node keeps up with the network, which can result in a lot of noise in ct measurements.

So for now we concluded that an actual code change isn't worth pursuing here, and that nodes should keep their clocks correct outside of stellar-core.

v18.5.0 automation moved this from To do to Done Mar 29, 2022
@leighmcculloch (Member)

I recently experienced a situation where I was running stellar-core inside a VM and, for reasons irrelevant here, the VM's time drifted significantly. Upon a fresh boot of stellar-core with no prior data, the node didn't join the network, but it also didn't present any error messages, so it was difficult to figure out that time drift was the culprit.

root@a9559ec5a45a:/# cat /var/log/supervisor/stellar-core-stderr---supervisor-8ve9wgve.log 
2022-10-22T00:12:50.299 [default INFO] Config from /opt/stellar/core/etc/stellar-core.cfg
2022-10-22T00:12:50.317 [default INFO] Generated QUORUM_SET: {
   "t" : 1,
   "v" : [ "sdf_futurenet_1" ]
}

2022-10-22T00:12:50.327 GBTM5 [default INFO] Starting stellar-core stellar-core 19.4.1.soroban (4e813f20e26a7faad4f5f5f533b82e527cb79e9a)
2022-10-22T00:12:50.339 GBTM5 [Database INFO] Connecting to: postgresql://dbname=core host=localhost user=stellar password=********
2022-10-22T00:12:50.505 GBTM5 [SCP INFO] LocalNode::LocalNode@GBTM5 qSet: 94ea46
2022-10-22T00:12:50.540 GBTM5 [Database INFO] DB schema is in current version
2022-10-22T00:12:50.548 GBTM5 [default INFO] Listening on 0.0.0.0:11626 for HTTP requests
2022-10-22T00:12:50.550 GBTM5 [default INFO] Connection effective settings:
2022-10-22T00:12:50.551 GBTM5 [default INFO] TARGET_PEER_CONNECTIONS: 8
2022-10-22T00:12:50.551 GBTM5 [default INFO] MAX_ADDITIONAL_PEER_CONNECTIONS: 64
2022-10-22T00:12:50.551 GBTM5 [default INFO] MAX_PENDING_CONNECTIONS: 500
2022-10-22T00:12:50.551 GBTM5 [default INFO] MAX_OUTBOUND_PENDING_CONNECTIONS: 56
2022-10-22T00:12:50.551 GBTM5 [default INFO] MAX_INBOUND_PENDING_CONNECTIONS: 444
2022-10-22T00:12:50.551 GBTM5 [History INFO] Archive 'sdf_futurenet_1' has 'get' command only, will not be written
2022-10-22T00:12:50.551 GBTM5 [History WARNING] No writable archives configured, history will not be written.
2022-10-22T00:12:50.552 GBTM5 [Ledger INFO] Last closed ledger (LCL) hash is 471c12c0dd073c1c8a70cd722a7c4ae26fc6b9d42923a52459354dc69d4efd03
2022-10-22T00:12:50.564 GBTM5 [Ledger INFO] Loaded LCL header from database: [seq=1, hash=471c12]
2022-10-22T00:12:50.569 GBTM5 [Ledger INFO] Starting up application
2022-10-22T00:12:50.571 GBTM5 [Ledger INFO] Last closed ledger (LCL) hash is 471c12c0dd073c1c8a70cd722a7c4ae26fc6b9d42923a52459354dc69d4efd03
2022-10-22T00:12:50.574 GBTM5 [Ledger INFO] Loaded LCL header from database: [seq=1, hash=471c12]
2022-10-22T00:12:50.586 GBTM5 [Ledger INFO] Assumed bucket-state for LCL: [seq=1, hash=471c12]
2022-10-22T00:12:50.603 GBTM5 [Overlay INFO] Binding to endpoint 0.0.0.0:11625
2022-10-22T00:12:50.615 GBTM5 [History INFO] Downloading history archive state: .well-known/stellar-history.json
2022-10-22T00:12:50.973 GBTM5 [History INFO] Archive information: [name: sdf_futurenet_1, server: stellar-core 19.4.1.soroban (4e813f20e26a7faad4f5f5f533b82e527cb79e9a), currentLedger: 440383]
2022-10-22T00:18:49.630 GBTM5 [default INFO] Performing maintenance
2022-10-22T00:18:49.640 GBTM5 [History INFO] Trimming history <= ledger 0 (rmin=1, qmin=1, lmin=0)
2022-10-22T00:24:48.726 GBTM5 [default INFO] Performing maintenance
2022-10-22T00:24:48.729 GBTM5 [History INFO] Trimming history <= ledger 0 (rmin=1, qmin=1, lmin=0)

@MonsieurNicolas mentioned to me that it relates to this issue.

@MonsieurNicolas (Contributor)

I think it would be super useful to give some indication that there is something wrong with the clock when the node is out of sync.
As mentioned in earlier comments, we would have to be careful not to log too aggressively even in that situation, but dealing with that scenario (most validators in the node's quorum being "ahead") seems much easier than the general problem (see the sketch below).
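
A hedged sketch of that narrower check (all names are hypothetical): while the node is out of sync, compare the latest close time heard from the quorum against the local clock and emit a single, explicit hint when the gap is large.

```cpp
// Hypothetical sketch: while out of sync, if the close times heard from the
// quorum are far ahead of (or behind) the local clock, log one clear hint
// that the system clock may be the culprit.
#include <chrono>
#include <cstdint>
#include <iostream>

void
maybeWarnClockSkewWhileOutOfSync(uint64_t latestHeardCloseTime,
                                 int64_t maxToleratedSkewSeconds)
{
    auto now = std::chrono::system_clock::now().time_since_epoch();
    int64_t localSeconds =
        std::chrono::duration_cast<std::chrono::seconds>(now).count();
    int64_t skew = static_cast<int64_t>(latestHeardCloseTime) - localSeconds;
    int64_t magnitude = skew < 0 ? -skew : skew;
    if (magnitude > maxToleratedSkewSeconds)
    {
        std::cout << "WARNING: close times heard from the network are " << skew
                  << " seconds away from local time; check the system clock\n";
    }
}
```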

@MonsieurNicolas MonsieurNicolas added this to To do in v19.6.0 via automation Oct 24, 2022
@marta-lokhova (Contributor, Author)

I think that, as part of this issue, we should also audit all the places where we use the system clock and determine how problematic drifting nodes would be for the network.

@MonsieurNicolas MonsieurNicolas removed this from To do in v19.6.0 Dec 7, 2022
@MonsieurNicolas MonsieurNicolas added this to To do in v19.7.0 via automation Dec 7, 2022
@marta-lokhova marta-lokhova removed this from To do in v19.7.0 Jan 25, 2023
@marta-lokhova marta-lokhova added this to To do in v19.8.0 via automation Jan 25, 2023
@marta-lokhova (Contributor, Author)

Observed on pubnet today:

2023-10-02T17:24:51.088 GABMK [Ledger INFO] Got consensus: [seq=48379557, prev=3e6e1a, txs=238, ops=1000, sv: [ SIGNED@lobstr_3_north_america txH: d3f0f0, ct: 1696267490, upgrades: [ ] ]]

2023-10-02T17:24:45.477 GABMK [Ledger INFO] Got consensus: [seq=48379556, prev=5e6e5a, txs=307, ops=778, sv: [ SIGNED@coinqvest_de txH: 23ceb6, ct: 1696267479, upgrades: [ ] ]]

2023-10-02T17:24:40.050 GABMK [Ledger INFO] Got consensus: [seq=48379555, prev=944a10, txs=326, ops=921, sv: [ SIGNED@satoshipay_us txH: baddd0, ct: 1696267478, upgrades: [ ] ]]

coinqvest_de's clock has drifted, causing misleading ct values.

@MonsieurNicolas (Contributor)

We may want to split the problem into:

  • the local node is drifting compared to other validators (we need to signal this to the operator)
  • some other node is drifting compared to the local node

Maybe a new metric, "number of validators in the quorum set with clock skew compared to local", would help in both cases? I think we can only estimate it over time, which is not great, but better than nothing (see the sketch below).
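
A sketch of what such a metric might look like (hypothetical names; not an existing stellar-core metric): estimate per-validator clock skew from the close times each validator proposes when it wins nomination, smoothed over time, and expose the count of validators whose estimated skew exceeds a threshold.

```cpp
// Hypothetical sketch: estimate per-validator clock skew from the close times
// each validator proposes as nomination leader, smoothed over time, and count
// how many validators in the quorum set exceed a skew threshold.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

class QuorumClockSkewMetric
{
    // node id -> exponentially weighted estimate of (proposed ct - local time)
    std::map<std::string, double> mSkewEstimates;
    double const mAlpha = 0.1; // smoothing factor; estimates converge slowly

  public:
    void
    recordLeaderCloseTime(std::string const& nodeID, uint64_t proposedCt)
    {
        auto now = std::chrono::system_clock::now().time_since_epoch();
        double localSeconds = static_cast<double>(
            std::chrono::duration_cast<std::chrono::seconds>(now).count());
        double sample = static_cast<double>(proposedCt) - localSeconds;
        auto it = mSkewEstimates.find(nodeID);
        if (it == mSkewEstimates.end())
        {
            mSkewEstimates[nodeID] = sample;
        }
        else
        {
            it->second = mAlpha * sample + (1.0 - mAlpha) * it->second;
        }
    }

    // The proposed metric: "number of validators in the quorum set with clock
    // skew compared to local".
    size_t
    countSkewedValidators(double thresholdSeconds) const
    {
        size_t count = 0;
        for (auto const& kv : mSkewEstimates)
        {
            if (kv.second > thresholdSeconds || kv.second < -thresholdSeconds)
            {
                ++count;
            }
        }
        return count;
    }
};
```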

@marta-lokhova marta-lokhova linked a pull request Dec 19, 2023 that will close this issue
@marta-lokhova marta-lokhova self-assigned this Jan 9, 2024