Rethink how we check whether node's time is synchronized #1815

Open
marta-lokhova opened this issue Oct 24, 2018 · 13 comments · May be fixed by #4101

@marta-lokhova (Contributor) commented Oct 24, 2018

It looks like we currently check whether local time is synchronized with an NTP server (via NtpSynchronizationChecker/NtpWork/NtpClient). A few observations:

  • The implementation looks complex, and I don't think it's actually using the Work/WorkParent interfaces correctly, so it should be refactored/simplified.
  • There also appears to be a potential race condition, since the work modifies mIsSynchronized from the worker thread without using a mutex or std::atomic (see the sketch below).
  • Additionally, @MonsieurNicolas proposed that we might not even need to check against an NTP server, but against other nodes instead.
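
For illustration, a minimal sketch of how the cross-thread flag from the second bullet could be made safe (the class and method names here are hypothetical, not stellar-core's actual interfaces): the result is published through std::atomic so the worker thread and the main thread never race on it.

```cpp
// Hypothetical sketch only -- not the actual NtpSynchronizationChecker code.
// The NTP result is written on a worker thread and read on the main thread,
// so the flag must be std::atomic (or mutex-protected) to avoid a data race.
#include <atomic>

class NtpSyncFlag
{
    std::atomic<bool> mIsSynchronized{false};

  public:
    // Called from the worker thread once the NTP response has been evaluated.
    void
    setSynchronized(bool value)
    {
        mIsSynchronized.store(value, std::memory_order_relaxed);
    }

    // Called from the main thread, e.g. when reporting node status.
    bool
    isSynchronized() const
    {
        return mIsSynchronized.load(std::memory_order_relaxed);
    }
};
```
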
@MonsieurNicolas (Contributor)

A few more ideas here: ideally we can report on both how the local node is doing compared to the network and how other nodes are (not) nominating properly.

@MonsieurNicolas MonsieurNicolas added this to To do in v11.2.0 via automation Apr 22, 2019
@MonsieurNicolas MonsieurNicolas removed this from To do in v11.2.0 May 28, 2019
@MonsieurNicolas MonsieurNicolas added this to To do in v12.0.0 via automation May 28, 2019
@MonsieurNicolas MonsieurNicolas added this to To do in v12.1.0 via automation Sep 11, 2019
@MonsieurNicolas MonsieurNicolas removed this from To do in v12.0.0 Sep 11, 2019
@MonsieurNicolas MonsieurNicolas added this to To do in v13.0.0 via automation Oct 10, 2019
@MonsieurNicolas MonsieurNicolas removed this from To do in v12.1.0 Oct 10, 2019
@MonsieurNicolas MonsieurNicolas removed this from To do in v13.0.0 Apr 5, 2020
@MonsieurNicolas (Contributor)

We've run into this recently. See logs:

October 11th 2021, 07:28:03.934
2021-10-11T14:28:03.934 GCGB2 [Ledger INFO] Got consensus: [seq=37760044, prev=7c864e, txs=269, ops=507, sv: [ SIGNED@wirex_sg txH: 7718e1, ct: 1633962483, upgrades: [ ] ]]
ct=2021-10-11T07:28:03

October 11th 2021, 07:28:09.277
2021-10-11T14:28:09.277 GCM6Q [Ledger INFO] Got consensus: [seq=37760045, prev=a26bde, txs=294, ops=510, sv: [ SIGNED@coinqvest_fi txH: c1c4f8, ct: 1633962484, upgrades: [ ] ]]
ct=2021-10-11T07:28:04

October 11th 2021, 07:28:14.758
2021-10-11T14:28:14.758 GCM6Q [Ledger INFO] Got consensus: [seq=37760046, prev=73b1fc, txs=273, ops=501, sv: [ SIGNED@wirex_us txH: 61b418, ct: 1633962493, upgrades: [ ] ]]
ct=2021-10-11T07:28:13

You can clearly see that coinqvest_fi introduced a ct that was off by quite a bit. A simple way to report such an anomaly is to compare the ct of the local node (if it started nomination) against the value that actually externalized.
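
A minimal sketch of the comparison suggested above (the helper and its threshold are hypothetical, not existing stellar-core code): if the local node nominated a close time for a slot, compare it with the close time that actually externalized and warn once the gap exceeds some tolerance.

```cpp
// Hypothetical sketch: compare the locally nominated close time for a slot
// with the close time that externalized, and warn past a tolerance.
#include <cstdint>
#include <iostream>

void
reportCloseTimeAnomaly(uint64_t localNominatedCt, uint64_t externalizedCt,
                       uint64_t toleranceSeconds)
{
    uint64_t delta = localNominatedCt > externalizedCt
                         ? localNominatedCt - externalizedCt
                         : externalizedCt - localNominatedCt;
    if (delta > toleranceSeconds)
    {
        std::cout << "WARNING: externalized close time differs from the "
                     "locally nominated close time by "
                  << delta << " seconds\n";
    }
}
```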

@MonsieurNicolas MonsieurNicolas added this to To do in v18.2.0 via automation Oct 11, 2021
@graydon (Contributor) commented Oct 28, 2021

NB: 000875a removed the NTP checker entirely. So we do not notice such drift-from-shared-reality at all.

@marta-lokhova marta-lokhova self-assigned this Oct 28, 2021
@MonsieurNicolas MonsieurNicolas removed this from To do in v18.2.0 Dec 1, 2021
@MonsieurNicolas MonsieurNicolas added this to To do in v18.4.0 via automation Dec 1, 2021
@MonsieurNicolas MonsieurNicolas removed this from To do in v18.4.0 Jan 25, 2022
@MonsieurNicolas MonsieurNicolas added this to To do in v18.5.0 via automation Jan 25, 2022
@marta-lokhova (Contributor, Author)

@MonsieurNicolas I think trying to report on how other nodes are not nominating correctly may lead to false positives. Specifically, the ct picked for nomination depends on the node externalizing the latest ledger. So if the node is lagging for whatever reason, it may report other nodes as drifting (it will have a ct higher than what it sees externalized).

Reporting the other way around seems good enough: a node with a drifted clock will see that its ct is behind whatever was externalized. Note that if the node is lagging, we may not get a warning, but that's probably okay, as long as the node recovers and starts reporting the drift correctly (I'd prefer an occasional false negative over a false positive that may cause confusion among operators). Let me know what you think.

@MonsieurNicolas (Contributor)

I think you're mixing "a node not nominating in time" with "a node has a bad clock".

The first kind of problem leads to things like:

  • node starts nomination late, causing consensus to stall and time out during nomination as a different leader is selected

The second kind of problem leads to:

  • "jitter" in close time, that gives the impression that some ledgers were slow/fast.
    • If a node has a fast clock, ct will be set too high, which will appear to be a slow ledger in the dashboard. The next leader will rectify the drift back (which will show up as a "fast ledger" in the dashboard).
    • The other way is true as well. If a node has a slow clock, it will nominate with ct too small which looks like a fast ledger.
  • if the delta is too high, it may lead to multiple "1 second" ledgers, as nodes are forced to vote with ct=lcl.ct+1 (see the sketch below)
  • if the delta is "way too high", the value will be considered invalid by other nodes until it gets back into the validity range, or it may cause the node to throw away messages from other nodes (and fall out of sync).

This issue deals with the second kind of problem: "bad clock".
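
As an illustration of the lcl.ct+1 point above, here is a hedged sketch of the constraint (a hypothetical helper, not the actual stellar-core function): a candidate close time cannot go backwards, so a node whose clock lags ends up proposing one-second increments until its clock catches up.

```cpp
// Hypothetical sketch of the constraint referenced above: a nominated close
// time must be strictly greater than the last closed ledger's close time,
// so a lagging clock degenerates into repeated lcl.ct + 1 ("1 second")
// ledgers until the clock catches up.
#include <algorithm>
#include <cstdint>

uint64_t
pickCandidateCloseTime(uint64_t localWallClockSeconds, uint64_t lclCloseTime)
{
    return std::max(localWallClockSeconds, lclCloseTime + 1);
}
```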

Ideally, whatever solution we come up with allows us to warn:

  • the local operator that they should fix their clock
  • other node operators that they have a bad validator in their quorum set

I think it's acceptable to warn if a node is always behind: this points to a problem with that node's ability to follow quorums.
I also think that we can work off the assumption that clocks drift slowly, so we can average things out over relatively long periods (minutes, say) if needed; this should allow us to cancel out the natural variation caused by network conditions (see the sketch below).
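
A sketch of that "average over minutes" idea (the class, window size, and thresholds are all hypothetical): record the signed difference between each externalized close time and the local clock at the moment of externalization, and only treat the node as drifting when the average stays beyond a bound.

```cpp
// Hypothetical sketch: smooth out per-ledger jitter by averaging the signed
// difference between externalized close times and the local clock over a
// window of recent ledgers.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <numeric>

class ClockDriftEstimator
{
    std::deque<int64_t> mDeltas;   // externalized ct - local time, in seconds
    size_t const mWindowSize = 60; // roughly five minutes of 5-second ledgers

  public:
    void
    recordExternalize(uint64_t externalizedCt)
    {
        auto now = std::chrono::system_clock::now().time_since_epoch();
        int64_t localSeconds =
            std::chrono::duration_cast<std::chrono::seconds>(now).count();
        mDeltas.push_back(static_cast<int64_t>(externalizedCt) - localSeconds);
        if (mDeltas.size() > mWindowSize)
        {
            mDeltas.pop_front();
        }
    }

    // A persistently positive average suggests the local clock runs slow
    // relative to the network; persistently negative suggests it runs fast.
    double
    averageDriftSeconds() const
    {
        if (mDeltas.empty())
        {
            return 0.0;
        }
        return std::accumulate(mDeltas.begin(), mDeltas.end(), 0.0) /
               static_cast<double>(mDeltas.size());
    }
};
```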

@marta-lokhova (Contributor, Author)

All I was saying is that the ct picked by a node with a bad clock is also affected by how the node is doing overall, because ct also depends on ledger trigger timing. So my question was about the accuracy of reporting based on that.

@marta-lokhova (Contributor, Author)

(copying a discussion from elsewhere for posterity)

What I'm trying to convey is that with the approach proposed in this issue, where we rely on relative time from other validators, we will get a lot of noise. Even with perfectly identical clocks on all validators, nodes will experience various levels of drift based on how they process network events.

What I don't want is operators getting confused by bogus reporting, so I'd rather report conservatively, but with a higher probability that there is actually an issue.

As for the questions above:

  • "delta way too high" - I’m not considering that case. We allow a large enough close time slip. Assuming clocks are drifting slowly, this is plenty of time for the operator to fix this.
  • “1 second” ledgers vs “slow” ledgers - this is a valid case to consider. Self reporting vs reporting on others are kind of similar here. Let’s say your local CT is behind what was externalized. Is it your local clock or does the nomination leader have a bad clock? Locally, you have CT for every ledger so it’s easy to average the data and see if you’re consistency behind, whereas for others you only know the CT when they’re a leader. I mean, it’s possible to extract that info and store it per node, but this solution is quite a bit more complicated and it's unclear how valuable it would be.

@marta-lokhova (Contributor, Author)

Summarizing the discussion on this issue:

  • A simple solution would be to implement some self-reporting based on either NTP or close time. But from experience, self-reporting isn't sufficient: operators tend to neglect warnings, and typically someone else discovers that the node has issues.
  • A better alternative would be to have other nodes report on a node with a bad clock. However, that approach is quite a bit more complicated (we need to track each node in a qset), and the ROI doesn't seem very high here.
  • Another problem with relative close time checks is that close time heavily depends on how well the node keeps up with the network, which can result in a lot of noise in ct measurements.

So for now we concluded that an actual code change isn't worth pursuing here, and that nodes should keep their clocks correct outside of stellar-core.

v18.5.0 automation moved this from To do to Done Mar 29, 2022
@leighmcculloch (Member)

I recently experienced a situation where I was running stellar-core inside a VM and, for reasons irrelevant here, the VM's time drifted significantly. Upon a fresh boot of stellar-core with no prior data, the node didn't join the network, but it also didn't present any error messages, so it was difficult to figure out that time drift was the culprit.

root@a9559ec5a45a:/# cat /var/log/supervisor/stellar-core-stderr---supervisor-8ve9wgve.log 
2022-10-22T00:12:50.299 [default INFO] Config from /opt/stellar/core/etc/stellar-core.cfg
2022-10-22T00:12:50.317 [default INFO] Generated QUORUM_SET: {
   "t" : 1,
   "v" : [ "sdf_futurenet_1" ]
}

2022-10-22T00:12:50.327 GBTM5 [default INFO] Starting stellar-core stellar-core 19.4.1.soroban (4e813f20e26a7faad4f5f5f533b82e527cb79e9a)
2022-10-22T00:12:50.339 GBTM5 [Database INFO] Connecting to: postgresql://dbname=core host=localhost user=stellar password=********
2022-10-22T00:12:50.505 GBTM5 [SCP INFO] LocalNode::LocalNode@GBTM5 qSet: 94ea46
2022-10-22T00:12:50.540 GBTM5 [Database INFO] DB schema is in current version
2022-10-22T00:12:50.548 GBTM5 [default INFO] Listening on 0.0.0.0:11626 for HTTP requests
2022-10-22T00:12:50.550 GBTM5 [default INFO] Connection effective settings:
2022-10-22T00:12:50.551 GBTM5 [default INFO] TARGET_PEER_CONNECTIONS: 8
2022-10-22T00:12:50.551 GBTM5 [default INFO] MAX_ADDITIONAL_PEER_CONNECTIONS: 64
2022-10-22T00:12:50.551 GBTM5 [default INFO] MAX_PENDING_CONNECTIONS: 500
2022-10-22T00:12:50.551 GBTM5 [default INFO] MAX_OUTBOUND_PENDING_CONNECTIONS: 56
2022-10-22T00:12:50.551 GBTM5 [default INFO] MAX_INBOUND_PENDING_CONNECTIONS: 444
2022-10-22T00:12:50.551 GBTM5 [History INFO] Archive 'sdf_futurenet_1' has 'get' command only, will not be written
2022-10-22T00:12:50.551 GBTM5 [History WARNING] No writable archives configured, history will not be written.
2022-10-22T00:12:50.552 GBTM5 [Ledger INFO] Last closed ledger (LCL) hash is 471c12c0dd073c1c8a70cd722a7c4ae26fc6b9d42923a52459354dc69d4efd03
2022-10-22T00:12:50.564 GBTM5 [Ledger INFO] Loaded LCL header from database: [seq=1, hash=471c12]
2022-10-22T00:12:50.569 GBTM5 [Ledger INFO] Starting up application
2022-10-22T00:12:50.571 GBTM5 [Ledger INFO] Last closed ledger (LCL) hash is 471c12c0dd073c1c8a70cd722a7c4ae26fc6b9d42923a52459354dc69d4efd03
2022-10-22T00:12:50.574 GBTM5 [Ledger INFO] Loaded LCL header from database: [seq=1, hash=471c12]
2022-10-22T00:12:50.586 GBTM5 [Ledger INFO] Assumed bucket-state for LCL: [seq=1, hash=471c12]
2022-10-22T00:12:50.603 GBTM5 [Overlay INFO] Binding to endpoint 0.0.0.0:11625
2022-10-22T00:12:50.615 GBTM5 [History INFO] Downloading history archive state: .well-known/stellar-history.json
2022-10-22T00:12:50.973 GBTM5 [History INFO] Archive information: [name: sdf_futurenet_1, server: stellar-core 19.4.1.soroban (4e813f20e26a7faad4f5f5f533b82e527cb79e9a), currentLedger: 440383]
2022-10-22T00:18:49.630 GBTM5 [default INFO] Performing maintenance
2022-10-22T00:18:49.640 GBTM5 [History INFO] Trimming history <= ledger 0 (rmin=1, qmin=1, lmin=0)
2022-10-22T00:24:48.726 GBTM5 [default INFO] Performing maintenance
2022-10-22T00:24:48.729 GBTM5 [History INFO] Trimming history <= ledger 0 (rmin=1, qmin=1, lmin=0)

@MonsieurNicolas mentioned to me that it relates to this issue.

@MonsieurNicolas (Contributor)

I think it would be super useful to give some indication that there is something wrong with the clock when the node is out of sync.
As mentioned in earlier comments, we would have to be careful not to log too aggressively even in that situation, but dealing with that scenario (most validators in the node's quorum being "ahead") seems much easier than the general problem (see the sketch below).
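
A hedged sketch of that narrower check (all names are hypothetical): while the node is out of sync, compare the latest close time heard from the quorum against the local clock and emit a single, explicit hint when the gap is large.

```cpp
// Hypothetical sketch: while out of sync, if the close times heard from the
// quorum are far ahead of (or behind) the local clock, log one clear hint
// that the system clock may be the culprit.
#include <chrono>
#include <cstdint>
#include <iostream>

void
maybeWarnClockSkewWhileOutOfSync(uint64_t latestHeardCloseTime,
                                 int64_t maxToleratedSkewSeconds)
{
    auto now = std::chrono::system_clock::now().time_since_epoch();
    int64_t localSeconds =
        std::chrono::duration_cast<std::chrono::seconds>(now).count();
    int64_t skew = static_cast<int64_t>(latestHeardCloseTime) - localSeconds;
    int64_t magnitude = skew < 0 ? -skew : skew;
    if (magnitude > maxToleratedSkewSeconds)
    {
        std::cout << "WARNING: close times heard from the network are " << skew
                  << " seconds away from local time; check the system clock\n";
    }
}
```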

@MonsieurNicolas MonsieurNicolas added this to To do in v19.6.0 via automation Oct 24, 2022
@marta-lokhova (Contributor, Author)

I think that, as part of this issue, we should also audit all the places where we use the system clock and determine how problematic drifting nodes would be for the network.

@MonsieurNicolas MonsieurNicolas removed this from To do in v19.6.0 Dec 7, 2022
@MonsieurNicolas MonsieurNicolas added this to To do in v19.7.0 via automation Dec 7, 2022
@marta-lokhova marta-lokhova removed this from To do in v19.7.0 Jan 25, 2023
@marta-lokhova marta-lokhova added this to To do in v19.8.0 via automation Jan 25, 2023
@marta-lokhova (Contributor, Author)

Observed on pubnet today:

2023-10-02T17:24:51.088 GABMK [Ledger INFO] Got consensus: [seq=48379557, prev=3e6e1a, txs=238, ops=1000, sv: [ SIGNED@lobstr_3_north_america txH: d3f0f0, ct: 1696267490, upgrades: [ ] ]]

2023-10-02T17:24:45.477 GABMK [Ledger INFO] Got consensus: [seq=48379556, prev=5e6e5a, txs=307, ops=778, sv: [ SIGNED@coinqvest_de txH: 23ceb6, ct: 1696267479, upgrades: [ ] ]]

2023-10-02T17:24:40.050 GABMK [Ledger INFO] Got consensus: [seq=48379555, prev=944a10, txs=326, ops=921, sv: [ SIGNED@satoshipay_us txH: baddd0, ct: 1696267478, upgrades: [ ] ]]

coinqvest_de's clock has drifted, causing misleading ct values.

@MonsieurNicolas (Contributor)

We may want to split the problem into:

  • the local node is drifting compared to other validators (we need to signal this to the operator)
  • some other node is drifting compared to the local node

Maybe a new metric, "number of validators in the quorum set with clock skew compared to local", would help in both cases? I think we can only estimate it over time, which is not great, but better than nothing (see the sketch below).
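
A sketch of what such a metric might look like (hypothetical names; not an existing stellar-core metric): estimate per-validator clock skew from the close times each validator proposes when it wins nomination, smoothed over time, and expose the count of validators whose estimated skew exceeds a threshold.

```cpp
// Hypothetical sketch: estimate per-validator clock skew from the close times
// each validator proposes as nomination leader, smoothed over time, and count
// how many validators in the quorum set exceed a skew threshold.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

class QuorumClockSkewMetric
{
    // node id -> exponentially weighted estimate of (proposed ct - local time)
    std::map<std::string, double> mSkewEstimates;
    double const mAlpha = 0.1; // smoothing factor; estimates converge slowly

  public:
    void
    recordLeaderCloseTime(std::string const& nodeID, uint64_t proposedCt)
    {
        auto now = std::chrono::system_clock::now().time_since_epoch();
        double localSeconds = static_cast<double>(
            std::chrono::duration_cast<std::chrono::seconds>(now).count());
        double sample = static_cast<double>(proposedCt) - localSeconds;
        auto it = mSkewEstimates.find(nodeID);
        if (it == mSkewEstimates.end())
        {
            mSkewEstimates[nodeID] = sample;
        }
        else
        {
            it->second = mAlpha * sample + (1.0 - mAlpha) * it->second;
        }
    }

    // The proposed metric: "number of validators in the quorum set with clock
    // skew compared to local".
    size_t
    countSkewedValidators(double thresholdSeconds) const
    {
        size_t count = 0;
        for (auto const& kv : mSkewEstimates)
        {
            if (kv.second > thresholdSeconds || kv.second < -thresholdSeconds)
            {
                ++count;
            }
        }
        return count;
    }
};
```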

@marta-lokhova marta-lokhova linked a pull request Dec 19, 2023 that will close this issue
@marta-lokhova marta-lokhova self-assigned this Jan 9, 2024