Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

raft: fix out-of-order terms #47

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

pav-kv
Copy link
Contributor

@pav-kv pav-kv commented Apr 14, 2023

This commit adds a couple invariant checks which were silently ignored and resulted in incorrectly setup tests passing, and incorrect messages exchanged.

Property 1: for a MsgApp message m, m.Entries[i].Term >= m.LogTerm. Or, more generally, m.Entries is a slice of contiguous entries with a non-decreasing Term, and m.Index+1 == m.Entries[0].Index && m.LogTerm <= m.Entries[0].Term.

This property was broken in a few tests. The root cause was that leader appends out-of-order entries to its own log when it becomeLeader(). This was a result of incorrectly set up tests: they restored to a snapshot at (index=11,term=11), but became leader at an earlier term 1. Then it was sending the following, obviously incorrect, MsgApp: {Index=11,LogTerm=11,Entries={{Index=12,Term=1}}.

The fix in tests is either going through a follower state at the correct term, by calling becomeFollower(term, ...), or initializing from a correct HardState in storage.

Property 2: For a MsgSnap message m, m.Term >= m.Snapshot.Metadata.Term, because the leader doesn't know of any higher term than its own, and hence can't send a message with such an inversion.

This was broken in TestRestoreFromSnapMsg, and is now fixed.

This commit adds a couple invariant checks which were silently ignored and
resulted in incorrectly setup tests passing, and incorrect messages exchanged.

Property 1: for a MsgApp message m, m.Entries[i].Term >= m.LogTerm. Or, more
generally, m.Entries is a slice of contiguous entries with a non-decreasing
Term, and m.Index+1 == m.Entries[0].Index && m.LogTerm <= m.Entries[0].Term.

This property was broken in a few tests. The root cause was that leader appends
out-of-order entries to its own log when it becomeLeader(). This was a result
of incorrectly set up tests: they restored to a snapshot at (index=11,term=11),
but became leader at an earlier term 1. Then it was sending the following,
obviously incorrect, MsgApp: {Index=11,LogTerm=11,Entries={{Index=12,Term=1}}.

The fix in tests is either going through a follower state at the correct term,
by calling becomeFollower(term, ...), or initializing from a correct HardState
in storage.

Property 2: For a MsgSnap message m, m.Term >= m.Snapshot.Metadata.Term,
because the leader doesn't know of any higher term than its own, and hence
can't send a message with such an inversion.

This was broken in TestRestoreFromSnapMsg, and is now fixed.

Signed-off-by: Pavel Kalinnikov <pavel@cockroachlabs.com>
@pav-kv pav-kv force-pushed the fix-out-of-order-term-in-tests branch from 9ce72e6 to 0b198cf Compare April 14, 2023 15:40
@pav-kv
Copy link
Contributor Author

pav-kv commented Apr 14, 2023

@tbg @ahrtr PTAL

@@ -757,7 +757,10 @@ func (r *raft) reset(term uint64) {
}

func (r *raft) appendEntry(es ...pb.Entry) (accepted bool) {
li := r.raftLog.lastIndex()
li, lt := r.raftLog.tip()
if r.Term < lt {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after the first successful appendEntry lt == r.Term, because es[i].Term = r.Term on line 765. So, I don't think this check should be here. It probably should happen at the end of raft.newRaft().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

li := r.raftLog.lastIndex()
li, lt := r.raftLog.tip()
if r.Term < lt {
r.logger.Panicf("%x appending out-of-order term: %d < %d", r.id, r.Term, lt)
Copy link

@lavacat lavacat Apr 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: you aren't checking Terms in es, so msg is a bit confusing. Maybe appending out-of-order term -> last raft log entry has term greater than raft struct

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about using warning instead of panic?

@@ -1366,11 +1366,11 @@ func TestHandleHeartbeat(t *testing.T) {
// TestHandleHeartbeatResp ensures that we re-send log entries when we get a heartbeat response.
func TestHandleHeartbeatResp(t *testing.T) {
storage := newTestMemoryStorage(withPeers(1, 2))
storage.SetHardState(pb.HardState{Term: 3, Vote: 1, Commit: 3})
storage.Append([]pb.Entry{{Index: 1, Term: 1}, {Index: 2, Term: 2}, {Index: 3, Term: 3}})
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these entries necessary for the test? Maybe both storage.SetHardState and storage.Append can be removed to keep things simpler

@tbg tbg self-requested a review April 26, 2023 13:10
@tbg tbg removed their request for review June 20, 2023 09:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants