Recovery process can result in old snapshot getting sent to node #1717

otoolep opened this issue Mar 5, 2024 · 43 comments

otoolep commented Mar 5, 2024

Started with #1711

This may be another race between nodes coming up during the Recovery process and pushing various snapshots to each other, with no guarantees when they arrive and are installed -- and the reap removing older snapshots.

otoolep commented Mar 5, 2024

@dwco-z - this may not be solvable through code, and may require some changes to the Recovery process steps.

dwco-z commented Mar 5, 2024

@otoolep do you mean maybe I should adjust the way I'm doing the recovery? e.g., adding some random timeout during each node's startup to reduce the chances of the race condition?
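For concreteness, that kind of random startup jitter in a launcher script might look like the sketch below. This is a hypothetical illustration, not code from the actual rqlite-throubleshoot project:

```python
import random
import subprocess
import time

def start_node_with_jitter(cmd, max_delay_s=3.0):
    """Sleep a random amount before launching a node, so the three nodes
    are less likely to race each other through the Recovery process."""
    time.sleep(random.uniform(0.0, max_delay_s))
    return subprocess.Popen(cmd)

# Example usage (flags and data directories are placeholders):
# start_node_with_jitter(["rqlited", "-node-id", "1", "data1"])
```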

otoolep commented Mar 5, 2024

Not sure, perhaps blow away the snapshots in every node's directory except one. Not advocating this right now, I need to look into the code more.

otoolep commented Mar 5, 2024

@dwco-z -- maybe you can run the latest code, which will be available once this build completes: https://ci.appveyor.com/project/otoolep/rqlite/builds/49333481

I couldn't reproduce the issue locally, so your help would be good. Re-run your test, but with -raft-log-level=INFO, get the issue to occur, and then attach the logs to this bug.

dwco-z commented Mar 5, 2024

Sure, will do

dwco-z commented Mar 5, 2024

Here it is. I changed from -raft-log-level=DEBUG to -raft-log-level=INFO and captured the issue again:
node2.log
node3.log
node1.log

otoolep commented Mar 5, 2024

OK, as I suspected, more snapshot reaping is going on.

It's not immediately clear to me how to solve this issue. I'll need to think about it.

otoolep commented Mar 5, 2024

Couple of ideas, for my own reference:

  • dig into why the Leader is sending an old snapshot to the follower
  • consider reaping the snapshot in the future, if a node gets an older snapshot. Need to be cautious, as this could be a sign that something has gone wrong.

dwco-z commented Mar 5, 2024

> OK, as I suspected, more snapshot reaping is going on.
>
> It's not immediately clear to me how to solve this issue. I'll need to think about it.

OK, anytime you need me to test changes, just let me know. I believe you couldn't reproduce the bug because of some timing difference, as you mentioned before.

dwco-z commented Mar 6, 2024

Just in case you need logs from the last created PR:
node2.log
node3.log
node1.log

otoolep commented Mar 6, 2024

Thanks, no surprises, what I expected is happening.

otoolep commented Mar 6, 2024

@dwco-z -- can you run again, using the latest build from the same PR (I've updated the code a little) and run with -raft-log-level=DEBUG? Then send me the logs again -- thanks.

#1719

dwco-z commented Mar 6, 2024

Sure

otoolep commented Mar 6, 2024

I've actually merged the PR to master, because until we get to the bottom of this, getting an older snapshot breaks rqlite. The Raft paper does reference this possibility (and suggests ways to deal with it), but either the Raft layer itself is sending it incorrectly, or rqlite needs to handle it better. I want to know which it is before removing this check.

dwco-z commented Mar 6, 2024

> I've actually merged the PR to master, because until we get to the bottom of this, getting an older snapshot breaks rqlite. The Raft paper does reference this possibility (and suggests ways to deal with it), but either the Raft layer itself is sending it incorrectly, or rqlite needs to handle it better. I want to know which it is before removing this check.

That makes sense. As soon as AppVeyor completes I'll send you the logs.

dwco-z commented Mar 6, 2024

New logs
node2.log
node3.log
node1.log

otoolep commented Mar 6, 2024

OK, thanks, the logs are good to see. However, this issue is quite complex, with a lot of moving parts. Unless I can get it to happen locally, it's going to be difficult to solve. It's very timing-dependent, and I don't know if it's a bug in the Raft code, a bug in the rqlite code, or just that recovering a cluster like this is at the edge of what Raft and rqlite support (Recovery is not something that is described in the Raft paper).

I'll continue to see if I can reproduce this locally, but unless I do, I do not expect to solve this anytime soon. A more robust Recovery process may simply be to always recover one node first, and then join new nodes to it.
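
A sketch of that "recover one node first" ordering, assuming a launcher script along the lines of the one used in this issue (the flags, ports, and data directories below are illustrative, not the real test harness):

```python
import subprocess
import time

def start_node(extra_args):
    """Launch one rqlited process; arguments here are illustrative only."""
    return subprocess.Popen(["rqlited"] + extra_args)

# 1. Recover a single node on its own and wait for it to come up.
#    (A real script would poll the node's /readyz endpoint rather than sleep.)
node1 = start_node(["-node-id", "1", "data1"])
time.sleep(10)

# 2. Only then start the remaining nodes and join them to the recovered node,
#    instead of running the Recovery process on all three concurrently.
node2 = start_node(["-node-id", "2", "-join", "localhost:4002", "data2"])
node3 = start_node(["-node-id", "3", "-join", "localhost:4002", "data3"])
```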

otoolep commented Mar 6, 2024

@dwco-z -- when you run your script, does it exit automatically when the issue happens? Or are you just watching the logs?

dwco-z commented Mar 6, 2024

> @dwco-z -- when you run your script, does it exit automatically when the issue happens? Or are you just watching the logs?

It exits automatically:

leader still not ready...
leader still not ready...
leader still not ready...
leader still not ready...
leader still not ready...
leader still not ready...
leader still not ready...
leader still not ready...
stopping rqlited...
stopped!
stopping rqlited...
stopped!
stopping rqlited...
stopped!
Traceback (most recent call last):
  File "C:\Projects\rqlite-throubleshoot\main.py", line 85, in <module>
    start_all_instances(all_instances)
  File "C:\Projects\rqlite-throubleshoot\main.py", line 55, in start_all_instances
    future.result()
  File "C:\Users\DayvidCastro\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\DayvidCastro\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\_base.py", line 401, in __get_result
    raise self._exception
  File "C:\Users\DayvidCastro\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Projects\rqlite-throubleshoot\main.py", line 49, in start_manager
    manager.start_rqlited(joinAddresses)
  File "C:\Projects\rqlite-throubleshoot\rqlite_manager.py", line 58, in start_rqlited
    raise Exception('something is wrong, leader is not ready...')
Exception: something is wrong, leader is not ready...
→ C:\Projects\rqlite-throubleshoot [rqlite-8.22.1-issue]›
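
For context, the loop behind the "leader still not ready..." messages and the final exception is presumably something like the following (a hypothetical reconstruction, not the actual rqlite_manager.py; the attempt count, delay, and use of /readyz?sync are guesses based on the output above):

```python
import time
import urllib.request

def wait_for_leader(http_addr, attempts=8, delay_s=5):
    """Poll the node until it reports ready (i.e. a leader exists), or give up."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(f"http://{http_addr}/readyz?sync") as resp:
                if resp.status == 200:
                    return
        except OSError:
            pass
        print("leader still not ready...")
        time.sleep(delay_s)
    raise Exception('something is wrong, leader is not ready...')
```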

dwco-z commented Mar 6, 2024

I'll try running on my Linux machine with Python 3.8 to see if I can reproduce it there.

dwco-z commented Mar 6, 2024

I pushed some changes to rqlite-throubleshooting to make it compatible with Linux, and I was able to hit the issue on Linux Mint using the latest release, rqlite-v8.22.2-linux-amd64.tar.gz.

otoolep commented Mar 6, 2024

Great, it's happening for me now with your latest script.

otoolep commented Mar 6, 2024

Took about 1-2 minutes.

dwco-z commented Mar 6, 2024

Nice!

otoolep commented Mar 6, 2024

Sometimes the script is exiting, but the logs don't show the issue. Just an FYI.

otoolep commented Mar 6, 2024

node1.log
node2.log
node3.log

otoolep commented Mar 6, 2024

The logs are helping to understand this issue.

I'm not sure yet, but the RecoverNode documentation (which I got from Consul) does state that uncommitted entries and data loss can result from this process. When the issue happens, I am seeing confusing elections taking place, and the logs are not entirely clear -- it could be a side effect of this process. It may be that the RecoverNode process is not entirely bulletproof, and is a best-effort thing. Even the Raft docs say they've had issues in the past with this area of the code.

I will continue to think about this, and have some ideas for a fix, but I'm reluctant to make changes at this time that affect the core of rqlite for an edge case (which may even be a best effort thing). The only truly robust way to recover a cluster that has suffered a big loss is to take the SQLite database out from under a node (or use a backup) and rebuild the cluster.

otoolep changed the title from "Failed to find snapshot after being installed" to "Recovery process can result in old snapshot getting sent to node" on Mar 6, 2024

dwco-z commented Mar 7, 2024

> Sometimes the script is exiting, but the logs don't show the issue. Just an FYI.

Judging from node3.log, it seems node3 received an interrupt signal ([rqlited] 2024/03/06 15:30:07 received signal "interrupt"), but for some reason it cannot stop, because it keeps trying to contact the other two nodes that have already stopped. So 30 seconds later, at 15:30:37, RqliteManager.stop_rqlited times out and falls back to process.terminate(), which for some reason didn't stop everything before the next start, so the port was still being held. I guess changing to process.kill() might avoid this.
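
A minimal sketch of that stop-with-escalation idea (the stop_rqlited internals here are assumed, not copied from rqlite_manager.py):

```python
import subprocess

def stop_rqlited(process: subprocess.Popen, grace_s: int = 30) -> None:
    """Ask rqlited to stop, then escalate if it hasn't exited within grace_s."""
    process.terminate()  # polite request; lets rqlited shut down cleanly
    try:
        process.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        # Escalate so the process, and the ports it holds, are definitely
        # gone before the next node start.
        process.kill()
        process.wait()
```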

dwco-z commented Mar 7, 2024

One thing I find odd is that although snapshotting and readyz?sync are two completely different concepts, as you already mentioned, I cannot reproduce the issue without the sync flag (I tried 260 times), while if I make other changes, like inserting a random timeout before each node's start, I can still hit the issue eventually.

otoolep commented Mar 7, 2024

My suspicion is that sync simply changes the timing of the whole test substantially, and that explains the difference. Sync is absolutely a read-only operation, and does not change the operation of a node.

otoolep added a commit that referenced this issue Mar 7, 2024

otoolep commented Mar 7, 2024

So this is what is happening, according to logs from my latest run of this:

  • All 3 nodes start fresh.
  • Node 1 wins the initial election, increasing its term to 2.
  • Node 2 starts an election too, increases its term to 3, and wins the election.
  • Node 3 next starts an election, increases its term to 4, and wins the election.

So basically all 3 nodes decide to start an election fairly quickly due to -raft-timeout being 2 seconds.

Then the nodes are stopped, and a recovery process is triggered. As part of the Recovery process, each node creates a snapshot.

  • Node 1 creates 2-2-TTTTT as its snapshot where TTTTT is a timestamp. 2-2 comes from the last log index and term in the node's log.
  • Node 2 creates 3-3-TTTTT as its snapshot.
  • Node 3 creates 4-4-TTTTT as its snapshot.

Node 2 then triggers an election, bumps its term to 5, and wins with votes from itself and Node 3 (Node 1 hasn't finished its recovery so doesn't vote). Node 2 starts replication of logs to the other nodes (via AppendEntries requests). Node 2 checks if Node 3 has the log at Index 3, which Node 3 doesn't, as it compacted its log during the Recovery process. This triggers Node 2 to send its snapshot 3-3-TTTTT to Node 3, which Node 3 rejects as it has a snapshot from a later term, 4-4-TTTTT.

And then we're stuck. This is because rqlite assumes that any snapshot received from another node, with instructions to install that snapshot, will have an index and term at least equal to those of the latest snapshot the receiving node has (this was made explicit on master recently, but was implicit in the Snapshot reaping logic). This is not true on Node 3.
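
Expressed as a sketch (illustrative Python rather than rqlite's actual Go code), that assumption amounts to a check like this on the receiving node:

```python
from typing import NamedTuple

class SnapshotMeta(NamedTuple):
    index: int
    term: int

def acceptable(incoming: SnapshotMeta, latest_local: SnapshotMeta) -> bool:
    """Only install a snapshot whose index and term are at least those of
    the newest snapshot the receiving node already holds (simplified)."""
    return incoming.index >= latest_local.index and incoming.term >= latest_local.term

# Node 3's situation: the leader (Node 2) streams its 3-3 snapshot, but
# Node 3 already created a 4-4 snapshot during recovery, so it refuses it.
print(acceptable(SnapshotMeta(3, 3), SnapshotMeta(4, 4)))  # False, and the node is stuck
```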

One solution to this issue is to remove this assumption, and allow "older" snapshots to be installed. To allow this the Snapshot-reaping process would need to be changed so that it can reap snapshots both older and newer (in terms of index and term) than the most recently installed snapshot (today it just reaps everything but the "newest in index and term", which wipes out the most recently installed snapshot). The Raft paper indicates that this should be OK because it is Node 2 that is sending the snapshot, Node 2 is leader, and the Raft paper says that Node 2 couldn't become Leader in the first place if it didn't have all the log entries that every other node had too (section 5.4.1). So in that sense it should be "safe" and no data should be lost.

I'll think about this more, but I think Snapshot Reaping needs to be changed from "reap all but the latest snapshot" to "reap all but the most recently installed snapshot". This should solve this problem.
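
A rough sketch of that policy change (again illustrative Python, not the actual snapshot-store code; the IDs follow the naming described above, with made-up timestamps):

```python
def newest_by_index_and_term(snapshot_ids):
    """Roughly today's policy: keep only the snapshot with the highest
    index and term, which would wipe out a just-installed older snapshot."""
    return max(snapshot_ids, key=lambda s: tuple(int(p) for p in s.split("-")[:2]))

def most_recently_installed(snapshot_ids, installed_id):
    """Proposed policy: keep whichever snapshot was installed last,
    regardless of how its index and term compare to the others."""
    assert installed_id in snapshot_ids
    return installed_id

# Node 3 after installing the leader's snapshot (timestamps are invented):
snaps = ["4-4-1709761001", "3-3-1709761050"]
print(newest_by_index_and_term(snaps))                    # "4-4-..." -> breaks Node 3
print(most_recently_installed(snaps, "3-3-1709761050"))   # "3-3-..." -> Node 3 can proceed
```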

dwco-z commented Mar 7, 2024

Oh, now I can see why the timing is relevant in leading to this series of events. Yeah, that makes sense.

otoolep commented Mar 8, 2024

Yeah, very timing dependent, that's why it takes a few attempts.

I can't rush into fixing this, however. While in principle it's fairly easy to see a possible fix, the hard part is dealing with crash scenarios. rqlite must handle crashing at any time, particularly while taking a snapshot. There is important logic that rqlite runs at start-up, checking for inconsistencies. That logic today assumes that the most recent snapshot it finds is the desired one. That assumption could be changed by any fix for this particular issue -- and that might involve even more complicated consistency-checking code, which can result in new bugs. Since crashing during snapshotting is much more likely to happen than this issue (rqlited can be killed at any time by an operator), I have to consider changes very carefully.

otoolep commented Mar 8, 2024

By crashing I also mean "exiting hard unexpectedly", perhaps due to SIGKILL being sent to rqlited.

otoolep added a commit to rqlite/rqlite.io that referenced this issue Mar 8, 2024

otoolep commented Mar 8, 2024

I've updated the node-recovery docs due to this issue.

otoolep commented Mar 8, 2024

@dwco-z -- if my reasoning is correct here, I suspect you'll see much more robust behavior if you increase the election timeout; maybe try 5 seconds. We want to reduce the rate of elections during the recovery process.

otoolep commented Mar 8, 2024

@dwco-z -- correction: increase the election timeout for two of the three nodes, thereby ensuring one node will be elected much faster than the other two. Set two of the nodes to, say, 10 seconds, and leave the other at 2 seconds.

Yes, this is lots of special casing, but Node Recovery, by its nature, is meant to be something done under very controlled conditions.
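
In the launcher, that suggestion comes down to something like this (a sketch; -raft-timeout is the flag already used in this test, the other arguments are illustrative):

```python
import subprocess

# One node keeps the short 2s election timeout so it becomes leader first;
# the other two wait 10s before they would even start an election.
raft_timeouts = {"1": "2s", "2": "10s", "3": "10s"}

procs = [
    subprocess.Popen([
        "rqlited",
        "-node-id", node_id,
        "-raft-timeout", timeout,
        f"data{node_id}",
    ])
    for node_id, timeout in raft_timeouts.items()
]
```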

dwco-z commented Mar 8, 2024

I understand your point of view. I updated the rqlite-throubleshooting code with your recommendation of having one of the nodes use a smaller raft-timeout (2s) while the others use a higher timeout (10s), but I couldn't notice any difference in stability; it seems to hit the issue with roughly the same frequency. I'll try different settings to see if it improves.

otoolep commented Mar 8, 2024

> having one of the nodes use a smaller raft-timeout (2s) while the others use a higher timeout (10s), but I couldn't notice any difference in stability

That's interesting. I would definitely be interested in seeing the logs from such a setup, where the problem happens. The logs should show clearly that the other two nodes take much longer to trigger an election.

dwco-z commented Mar 8, 2024

Sure, I just hit a snapshot issue in the second run:

node1.log

node3.log

node2.log

otoolep commented Mar 8, 2024

Can you try bumping the timeout to 30s on two of the nodes? Run the test again, and send me the logs if you hit it.

dwco-z commented Mar 8, 2024

node2.log
node3.log
node1.log

otoolep commented Mar 8, 2024

OK, thanks. Well, clearly I don't fully understand the innards of some of the Raft code, because that flag is not having the effect I want, at least not for the very first election. The node appears to be going straight to candidate.

I'll just need to keep digging into the Raft code.
