Recovery process can result in old snapshot getting sent to node #1717

otoolep opened this issue Mar 5, 2024 · 43 comments

otoolep commented Mar 5, 2024

Started with #1711

This may be another race between nodes coming up during the Recovery process and pushing various snapshots to each other, with no guarantees when they arrive and are installed -- and the reap removing older snapshots.

otoolep commented Mar 5, 2024

@dwco-z - this may not be solvable through code, and may require some changes to the Recovery process steps.

dwco-z commented Mar 5, 2024

@otoolep do you mean maybe I should adjust the way I'm doing the recovery? e.g., adding some random timeout during each node's startup to reduce the chances of the race condition?
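For concreteness, that kind of random startup jitter in a launcher script might look like the sketch below. This is a hypothetical illustration, not code from the actual rqlite-throubleshoot project:

```python
import random
import subprocess
import time

def start_node_with_jitter(cmd, max_delay_s=3.0):
    """Sleep a random amount before launching a node, so the three nodes
    are less likely to race each other through the Recovery process."""
    time.sleep(random.uniform(0.0, max_delay_s))
    return subprocess.Popen(cmd)

# Example usage (flags and data directories are placeholders):
# start_node_with_jitter(["rqlited", "-node-id", "1", "data1"])
```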

otoolep commented Mar 5, 2024

Not sure, perhaps blow away the snapshots in every node's directory except one. Not advocating this right now, I need to look into the code more.

otoolep commented Mar 5, 2024

@dwco-z -- maybe you can run the latest code, which will be available once this build completes: https://ci.appveyor.com/project/otoolep/rqlite/builds/49333481

I couldn't reproduce the issue locally, so your help would be good. Re-run your test, but with -raft-log-level=INFO, get the issue to occur, and then attach the logs to this bug.

dwco-z commented Mar 5, 2024

Sure, will do

dwco-z commented Mar 5, 2024

Here it is. I changed from -raft-log-level=DEBUG to -raft-log-level=INFO and captured the issue again:
node2.log
node3.log
node1.log

otoolep commented Mar 5, 2024

OK, as I suspected, more snapshot reaping is going on.

It's not immediately clear to me how to solve this issue. I'll need to think about it.

otoolep commented Mar 5, 2024

Couple of ideas, for my own reference:

  • dig into why the Leader is sending an old snapshot to the follower
  • consider reaping the snapshot in the future, if a node gets an older snapshot. Need to be cautious, as this could be a sign that something has gone wrong.

dwco-z commented Mar 5, 2024

> OK, as I suspected, more snapshot reaping is going on.
>
> It's not immediately clear to me how to solve this issue. I'll need to think about it.

OK, anytime you need me to test changes, just let me know. I believe you couldn't reproduce the bug because of some timing difference, as you mentioned before.

dwco-z commented Mar 6, 2024

Just in case you need logs from the last created PR:
node2.log
node3.log
node1.log

otoolep commented Mar 6, 2024

Thanks, no surprises, what I expected is happening.

otoolep commented Mar 6, 2024

@dwco-z -- can you run again, using the latest build from the same PR (I've updated the code a little) and run with -raft-log-level=DEBUG? Then send me the logs again -- thanks.

#1719

dwco-z commented Mar 6, 2024

Sure

otoolep commented Mar 6, 2024

I've actually merged the PR to master, because until we get to the bottom of this, getting an older snapshot breaks rqlite. The Raft paper does reference this possibility (and suggests ways to deal with it), but either the Raft layer itself is sending it incorrectly, or rqlite needs to handle it better. I want to know which it is before removing this check.

dwco-z commented Mar 6, 2024

> I've actually merged the PR to master, because until we get to the bottom of this, getting an older snapshot breaks rqlite. The Raft paper does reference this possibility (and suggests ways to deal with it), but either the Raft layer itself is sending it incorrectly, or rqlite needs to handle it better. I want to know which it is before removing this check.

That makes sense. As soon as AppVeyor completes I'll send you the logs.

dwco-z commented Mar 6, 2024

New logs
node2.log
node3.log
node1.log

otoolep commented Mar 6, 2024

OK, thanks, the logs are good to see. However, this issue is quite complex, with a lot of moving parts. Unless I can get it to happen locally, it's going to be difficult to solve. It's very timing-dependent, and I don't know if it's a bug in the Raft code, a bug in the rqlite code, or just that recovering a cluster like this is at the edge of what Raft and rqlite support (Recovery is not something that is described in the Raft paper).

I'll continue to see if I can reproduce this locally, but unless I do, I do not expect to solve this anytime soon. A more robust Recovery process may simply be to always recover one node first, and then join new nodes to it.
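
A sketch of that "recover one node first" ordering, assuming a launcher script along the lines of the one used in this issue (the flags, ports, and data directories below are illustrative, not the real test harness):

```python
import subprocess
import time

def start_node(extra_args):
    """Launch one rqlited process; arguments here are illustrative only."""
    return subprocess.Popen(["rqlited"] + extra_args)

# 1. Recover a single node on its own and wait for it to come up.
#    (A real script would poll the node's /readyz endpoint rather than sleep.)
node1 = start_node(["-node-id", "1", "data1"])
time.sleep(10)

# 2. Only then start the remaining nodes and join them to the recovered node,
#    instead of running the Recovery process on all three concurrently.
node2 = start_node(["-node-id", "2", "-join", "localhost:4002", "data2"])
node3 = start_node(["-node-id", "3", "-join", "localhost:4002", "data3"])
```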

otoolep commented Mar 6, 2024

@dwco-z -- when you run your script, does it exit automatically when the issue happens? Or are you just watching the logs?

dwco-z commented Mar 6, 2024

> @dwco-z -- when you run your script, does it exit automatically when the issue happens? Or are you just watching the logs?

It exits automatically:

leader still not ready...
leader still not ready...
leader still not ready...
leader still not ready...
leader still not ready...
leader still not ready...
leader still not ready...
leader still not ready...
stopping rqlited...
stopped!
stopping rqlited...
stopped!
stopping rqlited...
stopped!
Traceback (most recent call last):
  File "C:\Projects\rqlite-throubleshoot\main.py", line 85, in <module>
    start_all_instances(all_instances)
  File "C:\Projects\rqlite-throubleshoot\main.py", line 55, in start_all_instances
    future.result()
  File "C:\Users\DayvidCastro\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\DayvidCastro\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\_base.py", line 401, in __get_result
    raise self._exception
  File "C:\Users\DayvidCastro\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Projects\rqlite-throubleshoot\main.py", line 49, in start_manager
    manager.start_rqlited(joinAddresses)
  File "C:\Projects\rqlite-throubleshoot\rqlite_manager.py", line 58, in start_rqlited
    raise Exception('something is wrong, leader is not ready...')
Exception: something is wrong, leader is not ready...
→ C:\Projects\rqlite-throubleshoot [rqlite-8.22.1-issue]›
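
For context, the loop behind the "leader still not ready..." messages and the final exception is presumably something like the following (a hypothetical reconstruction, not the actual rqlite_manager.py; the attempt count, delay, and use of /readyz?sync are guesses based on the output above):

```python
import time
import urllib.request

def wait_for_leader(http_addr, attempts=8, delay_s=5):
    """Poll the node until it reports ready (i.e. a leader exists), or give up."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(f"http://{http_addr}/readyz?sync") as resp:
                if resp.status == 200:
                    return
        except OSError:
            pass
        print("leader still not ready...")
        time.sleep(delay_s)
    raise Exception('something is wrong, leader is not ready...')
```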

dwco-z commented Mar 6, 2024

I'll try running on my Linux machine with Python 3.8 to see if I can reproduce it there.

dwco-z commented Mar 6, 2024

I pushed some changes to rqlite-throubleshooting to make it compatible with Linux, and I was able to hit the issue on Linux Mint using the latest release, rqlite-v8.22.2-linux-amd64.tar.gz.

otoolep commented Mar 6, 2024

Great, it's happening for me now with your latest script.

otoolep commented Mar 6, 2024

Took about 1-2 minutes.

dwco-z commented Mar 6, 2024

Nice!

otoolep commented Mar 6, 2024

Sometimes the script is exiting, but the logs don't show the issue. Just an FYI.

otoolep commented Mar 6, 2024

node1.log
node2.log
node3.log

otoolep commented Mar 6, 2024

The logs are helping to understand this issue.

I'm not sure yet, but the RecoverNode documentation (which I got from Consul) does state that uncommitted entries and data loss can result from this process. When the issue happens, I am seeing confusing elections taking place, and the logs are not entirely clear -- it could be a side effect of this process. It may be that the RecoverNode process is not entirely bulletproof, and is a best-effort thing. Even the Raft docs say they've had issues in the past with this area of the code.

I will continue to think about this, and have some ideas for a fix, but I'm reluctant to make changes at this time that affect the core of rqlite for an edge case (which may even be a best effort thing). The only truly robust way to recover a cluster that has suffered a big loss is to take the SQLite database out from under a node (or use a backup) and rebuild the cluster.

otoolep changed the title from "Failed to find snapshot after being installed" to "Recovery process can result in old snapshot getting sent to node" on Mar 6, 2024

dwco-z commented Mar 7, 2024

> Sometimes the script is exiting, but the logs don't show the issue. Just an FYI.

Judging from node3.log, it seems node3 received an interrupt signal ([rqlited] 2024/03/06 15:30:07 received signal "interrupt"), but for some reason it cannot stop, because it keeps trying to contact the other two nodes that have already stopped. So 30 seconds later, at 15:30:37, RqliteManager.stop_rqlited times out and falls back to process.terminate(), which for some reason didn't stop everything before the next start, so the port was still being held. I guess changing to process.kill() might avoid this.
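
A minimal sketch of that stop-with-escalation idea (the stop_rqlited internals here are assumed, not copied from rqlite_manager.py):

```python
import subprocess

def stop_rqlited(process: subprocess.Popen, grace_s: int = 30) -> None:
    """Ask rqlited to stop, then escalate if it hasn't exited within grace_s."""
    process.terminate()  # polite request; lets rqlited shut down cleanly
    try:
        process.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        # Escalate so the process, and the ports it holds, are definitely
        # gone before the next node start.
        process.kill()
        process.wait()
```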

dwco-z commented Mar 7, 2024

One thing I find odd is that although snapshotting and readyz?sync are two completely different concepts, as you already mentioned, I cannot reproduce the issue without the sync flag (I tried 260 times), while if I make other changes, like inserting a random timeout before each node's start, I can still hit the issue eventually.

otoolep commented Mar 7, 2024

My suspicion is that sync simply changes the timing of the whole test substantially, and that explains the difference. Sync is absolutely a read-only operation, and does not change the operation of a node.

otoolep added a commit that referenced this issue Mar 7, 2024

otoolep commented Mar 7, 2024

So this is what is happening, according to logs from my latest run of this:

  • All 3 nodes start fresh.
  • Node 1 wins the initial election, increasing its term to 2.
  • Node 2 starts an election too, increases its term to 3, and wins the election.
  • Node 3 next starts an election, increases its term to 4, and wins the election.

So basically all 3 nodes decide to start an election fairly quickly due to -raft-timeout being 2 seconds.

Then the nodes are stopped, and a recovery process is triggered. As part of the Recovery process, each node creates a snapshot.

  • Node 1 creates 2-2-TTTTT as its snapshot where TTTTT is a timestamp. 2-2 comes from the last log index and term in the node's log.
  • Node 2 creates 3-3-TTTTT as its snapshot.
  • Node 3 creates 4-4-TTTTT as its snapshot.

Node 2 then triggers an election, bumps its term to 5, and wins with votes from itself and Node 3 (Node 1 hasn't finished its recovery so doesn't vote). Node 2 starts replication of logs to the other nodes (via AppendEntries requests). Node 2 checks if Node 3 has the log at Index 3, which Node 3 doesn't, as it compacted its log during the Recovery process. This triggers Node 2 to send its snapshot 3-3-TTTTT to Node 3, which Node 3 rejects as it has a snapshot from a later term, 4-4-TTTTT.

And then we're stuck. This is because rqlite assumes that any snapshot received from another node, with instructions to install that snapshot, will have an index and term at least equal to those of the latest snapshot the receiving node has (this was made explicit on master recently, but was implicit in the Snapshot reaping logic). This is not true on Node 3.
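
Expressed as a sketch (illustrative Python rather than rqlite's actual Go code), that assumption amounts to a check like this on the receiving node:

```python
from typing import NamedTuple

class SnapshotMeta(NamedTuple):
    index: int
    term: int

def acceptable(incoming: SnapshotMeta, latest_local: SnapshotMeta) -> bool:
    """Only install a snapshot whose index and term are at least those of
    the newest snapshot the receiving node already holds (simplified)."""
    return incoming.index >= latest_local.index and incoming.term >= latest_local.term

# Node 3's situation: the leader (Node 2) streams its 3-3 snapshot, but
# Node 3 already created a 4-4 snapshot during recovery, so it refuses it.
print(acceptable(SnapshotMeta(3, 3), SnapshotMeta(4, 4)))  # False, and the node is stuck
```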

One solution to this issue is to remove this assumption, and allow "older" snapshots to be installed. To allow this the Snapshot-reaping process would need to be changed so that it can reap snapshots both older and newer (in terms of index and term) than the most recently installed snapshot (today it just reaps everything but the "newest in index and term", which wipes out the most recently installed snapshot). The Raft paper indicates that this should be OK because it is Node 2 that is sending the snapshot, Node 2 is leader, and the Raft paper says that Node 2 couldn't become Leader in the first place if it didn't have all the log entries that every other node had too (section 5.4.1). So in that sense it should be "safe" and no data should be lost.

I'll think about this more, but I think Snapshot Reaping needs to be changed from "reap all but the latest snapshot" to "reap all but the most recently installed snapshot". This should solve this problem.
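
A rough sketch of that policy change (again illustrative Python, not the actual snapshot-store code; the IDs follow the naming described above, with made-up timestamps):

```python
def newest_by_index_and_term(snapshot_ids):
    """Roughly today's policy: keep only the snapshot with the highest
    index and term, which would wipe out a just-installed older snapshot."""
    return max(snapshot_ids, key=lambda s: tuple(int(p) for p in s.split("-")[:2]))

def most_recently_installed(snapshot_ids, installed_id):
    """Proposed policy: keep whichever snapshot was installed last,
    regardless of how its index and term compare to the others."""
    assert installed_id in snapshot_ids
    return installed_id

# Node 3 after installing the leader's snapshot (timestamps are invented):
snaps = ["4-4-1709761001", "3-3-1709761050"]
print(newest_by_index_and_term(snaps))                    # "4-4-..." -> breaks Node 3
print(most_recently_installed(snaps, "3-3-1709761050"))   # "3-3-..." -> Node 3 can proceed
```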

dwco-z commented Mar 7, 2024

Oh, now I can see why the timing is relevant in leading to this series of events. Yeah, that makes sense.

otoolep commented Mar 8, 2024

Yeah, very timing dependent, that's why it takes a few attempts.

I can't rush into fixing this, however. While in principle it's fairly easy to see a possible fix, the hard part is dealing with crash scenarios. rqlite must handle crashing at any time, particularly while taking a snapshot. There is important logic that rqlite runs at start-up, checking for inconsistencies. That logic today assumes that the most recent snapshot it finds is the desired one. That assumption could be changed by any fix for this particular issue -- and that might involve even more complicated consistency-checking code, which can result in new bugs. Since crashing during snapshotting is much more likely to happen than this issue (rqlited can be killed at any time by an operator), I have to consider changes very carefully.

otoolep commented Mar 8, 2024

By crashing I also mean "exiting hard unexpectedly", perhaps due to SIGKILL being sent to rqlited.

otoolep added a commit to rqlite/rqlite.io that referenced this issue Mar 8, 2024

otoolep commented Mar 8, 2024

I've updated the node-recovery docs due to this issue.

otoolep commented Mar 8, 2024

@dwco-z -- if my reasoning is correct here, I suspect you'll see much more robust behavior if you increase the election timeout; maybe try 5 seconds. We want to reduce the rate of elections during the recovery process.

otoolep commented Mar 8, 2024

@dwco-z -- correction: increase the election timeout for two of the three nodes, thereby ensuring one node will be elected much faster than the other two. Set two of the nodes to, say, 10 seconds, and leave the other at 2 seconds.

Yes, this is lots of special casing, but Node Recovery, by its nature, is meant to be something done under very controlled conditions.
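
In the launcher, that suggestion comes down to something like this (a sketch; -raft-timeout is the flag already used in this test, the other arguments are illustrative):

```python
import subprocess

# One node keeps the short 2s election timeout so it becomes leader first;
# the other two wait 10s before they would even start an election.
raft_timeouts = {"1": "2s", "2": "10s", "3": "10s"}

procs = [
    subprocess.Popen([
        "rqlited",
        "-node-id", node_id,
        "-raft-timeout", timeout,
        f"data{node_id}",
    ])
    for node_id, timeout in raft_timeouts.items()
]
```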

dwco-z commented Mar 8, 2024

I understand your point of view. I updated the rqlite-throubleshooting code with your recommendation of having one of the nodes use a smaller raft-timeout (2s) while the others use a higher timeout (10s), but I couldn't notice any difference in stability; it seems to hit the issue with roughly the same frequency. I'll try different settings to see if it improves.

otoolep commented Mar 8, 2024

> having one of the nodes use a smaller raft-timeout (2s) while the others use a higher timeout (10s), but I couldn't notice any difference in stability

That's interesting. I would definitely be interested in seeing the logs from such a setup, where the problem happens. The logs should show clearly that the other two nodes take much longer to trigger an election.

dwco-z commented Mar 8, 2024

Sure, I just hit a snapshot issue in the second run:

node1.log

node3.log

node2.log

otoolep commented Mar 8, 2024

Can you try bumping the timeout to 30s on two of the nodes? Run the test again, and send me the logs if you hit it.

dwco-z commented Mar 8, 2024

node2.log
node3.log
node1.log

otoolep commented Mar 8, 2024

OK, thanks. Well, clearly I don't fully understand the innards of some of the Raft code, because that flag is not having the effect I want, at least not for the very first election. The node appears to be going straight to candidate.

I'll just need to keep digging into the Raft code.
