Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accept any snapshot that allows replication #110

Merged
merged 2 commits into from
Jan 10, 2024

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Nov 17, 2023

This PR is adopted from #84.


A leader will not take into account snapshots reported by a follower unless they match or exceed the tracked PendingSnapshot index (which is the leader's last indexat the time of requesting the snapshot). This is too inflexible: the leader should take into account any snapshot that reconnects the follower to its log. This PR makes that change.

In doing so, it addresses long-standing problems that we've encountered in CockroachDB. Unless you create the snapshot immediately and locally when raft emits an MsgSnap, it's difficult/impossible to later synthesize a snapshot at the requested index. It is possible to get one above the requested index which raft always accepted, but CockroachDB delegates snapshots to followers who might be behind on applying the log, and it is awkward to have to wait for log application to send the snapshot just to satisfy an overly strict condition in raft. Additionally, CockroachDB also sends snapshots preemptively when adding a new replica since there are qualitative differences between an initial snapshot and one needed to reconnect to the log and one does not want to wait for raft to round-trip to the follower to realize that a snapshot is needed. In this case, the sent snapshot is commonly behind the PendingSnapshot since the leader transitions the follower into StateProbe when a snapshot is already in flight.

Touches cockroachdb/cockroach#84242.
Touches cockroachdb/cockroach#87553.
Touches cockroachdb/cockroach#87554.
Touches cockroachdb/cockroach#97971.
Touches cockroachdb/cockroach#114349.
See also https://github.com/cockroachdb/cockroach/blob/2b91c3829270eb512c5380201c26a3d838fc567a/pkg/kv/kvserver/raft_snapshot_queue.go#L131-L143.

Signed-off-by: Erik Grinaker grinaker@cockroachlabs.com
Signed-off-by: Tobias Grieger tobias.b.grieger@gmail.com

@erikgrinaker
Copy link
Contributor Author

cc @pavelkalinnikov @ahrtr

@ahrtr
Copy link
Member

ahrtr commented Nov 20, 2023

CockroachDB delegates snapshots to followers who might be behind on applying the log

I am curious about the design. In raft, data only flows outwards from the leader to other servers. But it seems that data could also flow from a follower to another follower in CockroachDB?

raft.go Outdated Show resolved Hide resolved
Copy link
Contributor

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@erikgrinaker Thanks for driving this change. Left some improvements suggestions and nits. Looks like there will be some follow-up work depending on what we pick in this PR.

rafttest/interaction_env_handler_process_append_thread.go Outdated Show resolved Hide resolved
testdata/snapshot_reject_via_app_resp_behind.txt Outdated Show resolved Hide resolved
testdata/snapshot_reject_via_app_resp_behind.txt Outdated Show resolved Hide resolved
testdata/snapshot_reject_via_app_resp_behind.txt Outdated Show resolved Hide resolved
Comment on lines 50 to 54
// The follower will transition back to StateReplicate if the leader
// receives an MsgAppResp from it at or above PendingSnapshot, or if
// ResumeReplicateBelowPendingSnapshot is enabled, one that reconnects the
// follower to the leader's log (such an MsgAppResp is emitted when the
// follower applies a snapshot).
Copy link
Contributor

@pav-kv pav-kv Nov 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is more complicated than it needs to be, hence bugs. For example, consider this part of the comment: The follower will transition back to StateReplicate if the leader receives an MsgAppResp from it at or above PendingSnapshot. What if by the time we receive a reply >= PendingSnapshot, the log is already truncated beyond this index? Then it's pointless to enter the StateReplicate, and there will be another roundtrip to/from StateSnapshot.

I think we should frame A->B state transitions in terms of the pre-conditions/invariants of state B, and only based on that elaborate the messages in state A that can induce these transitions.

In this case we should aim for something like:

// The leader transitions the follower to StateReplicate upon receiving a
// confirmation that there is no gap between the follower's and the leader's
// log, i.e. when Match >= FirstIndex-1.
//
// In StateSnapshot and StateProbe this generally happens when the leader
// receives a successful MsgAppResp at Index >= FirstIndex-1.

If we frame it this way, then we don't even need to compare against the PendingSnapshot to decide whether to unblock, and we don't need the PendingSnapshot variable in the first place (except maybe for informational purposes).

The only other usage of the PendingShapshot field is this:

pendingSnapshot := pr.PendingSnapshot

but we could do better without it. BecomeProbe assumes that the PendingSnapshot is / will be applied, and optimistically sets Next to after it. It also seemingly assumes that the caller of BecomeProbe knows what they're doing, and that PendingSnapshot is exactly that snapshot that succeeded. We could do all that with just the Next variable (set Next = index + 1 upon sending a snapshot @ index and entering the StateSnapshot).

The pending snapshot API is a bit troublesome because it assumes there is only one snapshot in flight, and it also puts some unspecified responsibility on the ReportSnapshot caller. We should not assume that ReportSnapshot reports the snapshot at PendingSnapshot index, but rather allow the caller to pass the index in to be sure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there are likely improvements we can make here, and this is also related to cockroachdb/cockroach#87583.

But I'm inclined to make a targeted change for now, gated behind a config flag, and once we're confident in the change and enable it unconditionally we can consider a broader refactor.

raft.go Outdated Show resolved Hide resolved
Comment on lines -1563 to -1566
// TODO(tbg): this code is very similar to the snapshot handling in
// MsgAppResp above. In fact, the code there is more correct than the
// code here and should likely be updated to match (or even better, the
// logic pulled into a newly created Progress state machine handler).
Copy link
Contributor

@pav-kv pav-kv Nov 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This TODO wasn't true, because the code is quite different (we don't check Match/Next indices here whatsoever, or communicate the Index of the snapshot that succeeded - we should!), but I think something still needs to be done here. We should be able to transition from StateSnapshot straight to StateReplicate if the conditions allow for it (see other comments: the condition is that the follower's Match is re-connected to the leader's log). And we should share the same transition code here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we don't check Match/Next indices here whatsoever, or communicate the Index of the snapshot that succeeded - we should!

Yeah, I think any better logic here is predicated on passing in the snapshot index. See cockroachdb/cockroach#87583.

We should be able to transition from StateSnapshot straight to StateReplicate if the conditions allow for it (see other comments: the condition is that the follower's Match is re-connected to the leader's log).

Without the snapshot index being passed in, I don't think we can know the Match until we receive an MsgAppResp, so moving to StateProbe seems like the right thing to do until that happens.

rafttest/interaction_env_handler.go Outdated Show resolved Hide resolved
tracker/progress.go Outdated Show resolved Hide resolved
@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Nov 27, 2023

I am curious about the design. In raft, data only flows outwards from the leader to other servers. But it seems that data could also flow from a follower to another follower in CockroachDB?

That's right, snapshots can be sent from any replica via delegated snapshots. For example, cross-region traffic is often much more expensive than inter-region traffic, so in multi-region clusters we prefer sending snapshots from other replicas in the same region where possible. This commonly happens when moving a replica from one CRDB node to a different node in the same region due to rebalancing.

@erikgrinaker erikgrinaker force-pushed the accept-more-raft-snaps branch 3 times, most recently from 9b71e68 to 226ef1c Compare January 5, 2024 11:12
@erikgrinaker
Copy link
Contributor Author

I think this should be ready for another look now @pav-kv @ahrtr.

@ahrtr
Copy link
Member

ahrtr commented Jan 7, 2024

That's right, snapshots can be sent from any replica via delegated snapshots. For example, cross-region traffic is often much more expensive than inter-region traffic, so in multi-region clusters we prefer sending snapshots from other replicas in the same region where possible.

thx for the info. I understood the motivation now.

But the implementation in this PR is a little strange to me. The behaviour of delegating snapshot is completely outside of raft's control, instead it's just application's behaviour (CockroachDB in this case), but this PR modifies raft to adapt to the behaviour which is outside of raft.

Is it possible to reset the follower's state to StateProbe in such case, something like #123, so that we can remove the option ResumeReplicateBelowPendingSnapshot?

One more not-directly-related comment, once the flag ResumeReplicateBelowPendingSnapshot is enabled, it seems that CockroachDB also needs to ignore the snapshot driven by raft, otherwise the application will send two snapshots in a short period? Won't it complicate the application's design?

@pav-kv
Copy link
Contributor

pav-kv commented Jan 8, 2024

@ahrtr

But the implementation in this PR is a little strange to me. The behaviour of delegating snapshot is completely outside of raft's control, instead it's just application's behaviour (CockroachDB in this case), but this PR modifies raft to adapt to the behaviour which is outside of raft.

This behaviour will not impact users who don't send snapshots outside of raft. Consider the caveat:

Raft code internally assumes that there is always at most one snapshot in flight. This only seems to be true. Even if raft is used the standard way, I'm moderately sure this property can be violated. Raft messages can be duplicated and reordered, and the process can be restarted. Consider the following scenario (which can be driven entirely by raft, without delegated snapshots):

  1. Send a bunch of append messages at indices 100-150 to a follower.
  2. Compact the log at index 150 (for whatever reason, e.g. realize that most followers are up-to-date).
  3. Restart. Realize that the follower is behind, send snapshot at index 150.
  4. Restart. Handle a bunch of appends at indices 150-200.
  5. Realize that the follower is behind, send snapshot at index 200.
    • // NB: now we have appends 100-200, snapshot@150 and snapshot@200 in flight.
  6. The snapshot@150 from step (3) finally arrives at the follower, follower applies it and acks.
  7. The leader whose pendingSnapshot is 200, receives an ack at index 150.
    • It would be nice to switch to Replicate here. Moreover, indices 150-200 are known to be in flight. So we can even resume optimistic replication from index 200 rather than 150, under some conditions.

With this observation in mind, the scenarios similar to what this PR fixes, can actually happen for any user. This PR is therefore a strict improvement.

The question of what's raft- and what's application- specific is tricky. In my opinion, the Probe/Replicate/Snapshot state behaviours are not inherent to raft algorithm itself. These are "flow control" mechanisms of this specific raft implementation, and they don't impact "correctness" of raft. As such, these mechanisms should aim to reflect the needs of the package users (and best be flexible to fit any application).

Is it possible to reset the follower's state to StateProbe in such case, something like #123, so that we can remove the option ResumeReplicateBelowPendingSnapshot?

The only difference between this PR and #123 is the state to which we transition into. This PR translates to StateReplicate, while #123 transitions to StateProbe. The ResumeReplicateBelowPendingSnapshot flag is seemingly orthogonal to the choice of state here. I half-agree that the flag is not needed, if we make a clean and compatible code change.

Could you explain why choose transition to StateProbe though?

I'll explain why transitioning to StateReplicate makes more sense to me. Some reasoning can be found in an earlier comment. Each of StateReplicate, StateProbe, StateSnapshot should have an "invariant". While the invariant is true, the flow stays in this state. When transitioning between states, we should a) make sure the new state's invariant is true, b) if multiple states are eligible, pick the "best" one.

  • The invariant of the StateReplicate is that, according to the latest information (received via messages), the follower's log is "compatible" with the leader's log, i.e. Next >= firstIndex(); and for each "optimistic" index (from Match+1 to Next-1) there is an in-flight append message that isn't rejected.
  • The invariant of the StateProbe is that the follower's log is not compatible with the leader's log, or this compatibility has not been established yet. While in StateProbe, we don't optimistically chain appends.

From these definitions, it makes sense to transition to StateReplicate as soon as its invariant becomes true. That's what this PR does. #123 would unnecessarily hop to StateProbe, which we know will then transition to StateReplicate almost immediately.

As a side note, I think we need to clean up the flow state machine and make these invariants verbose.

@pav-kv
Copy link
Contributor

pav-kv commented Jan 8, 2024

For easing the mental model, think about a snapshot message as a MsgApp with all entries in [0, snapshotIndex]. From this standpoint, any set of in-flight MsgApp and snapshot messages is equivalent to just a set of in-flight MsgApp messages. Snapshots, just like all other messages, can be dropped, duplicated, reordered and delayed.

From this standpoint, we should treat snapshot message sends and replies just the same way we treat MsgApp. We should not assume there is only one such message in flight (except as best effort for avoiding sending too much), and we should gracefully handle acks at indices below the index of the last sent message. This PR is a step in that direction.

@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Jan 8, 2024

it seems that CockroachDB also needs to ignore the snapshot driven by raft, otherwise the application will send two snapshots in a short period? Won't it complicate the application's design?

CRDB drops the MsgSnap sent from Raft, and handles the snapshot sending itself.

https://github.com/cockroachdb/cockroach/blob/2f7bf071af146e1e238f985884b99b2a6636115a/pkg/kv/kvserver/replica_raft.go#L1799-L1803

The delegated snapshot is initiated when the leader sees the MsgSnap sent from Raft, where the leader can delegate the actual snapshot sending to a follower. However, the follower can't produce a snapshot at any arbitrary index, it will be produced at whatever applied index the follower happens to be at when it receives the snapshot delegation request. This index may lag the leader's PendingSnapshot index. Of course, we could pass the PendingSnapshot index to the follower, and have the follower either wait to catch up or reject it, but this seems unnecessary.

There are also other cases where this can happen -- for example:

Even if raft is used the standard way, I'm moderately sure this property can be violated.

Yeah, I think this can probably happen across leader changes too -- see above. I can write up a test case to confirm.

it makes sense to transition to StateReplicate as soon as its invariant becomes true. That's what this PR does. #123 would unnecessarily hop to StateProbe, which we know will then transition to StateReplicate almost immediately.

I agree with this. We're processing an MsgAppResp here, so we already know the follower's state, and don't have to wait for another MsgApp probe -- we can step straight to StateReplicate.

remove the option ResumeReplicateBelowPendingSnapshot

I'm not particularly attached to this option. We could merge it as-is without the option, and consider a revert or option if we discover any problems with it. There shouldn't be any correctness or availability implications, just possibly additional work in rare cases.

@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Jan 8, 2024

Yeah, I think this can probably happen across leader changes too -- see above. I can write up a test case to confirm.

This isn't the case, because the follower will reject an MsgSnap from a past term. I think it can happen with CRDB though, because we send and apply snapshots out-of-band.

I think this is what happened in cockroachdb/cockroach#114349

No, I think that was simply an eager initial snapshot that was sent before the leader transitioned to StateSnapshot.

@pav-kv
Copy link
Contributor

pav-kv commented Jan 8, 2024

Yeah, then the scenario in my comment can't happen either.

That's lucky, but unfortunate in some sense too. It doesn't matter which leader term sent a snapshot/append: as long as a snapshot/append "happens after" the follower's state and is recent, it should be ok to accept. E.g. if during a term change an append from previous term is slightly delayed, it still likely adheres to append-only properties etc, so with some safety checks it can be applied, and retries avoided. This gives some room for optimizations, and what enables it is moving safety checks down the stack and relaxing dependency on the "always reject non-matching term" behaviour.

Anyway, my point is still that this PR (and, in broader sense, formalizing and cleaning up the Probe/Replicate/etc states/transitions) benefits all users and shouldn't be considered CRDB specific.

@ahrtr
Copy link
Member

ahrtr commented Jan 8, 2024

The main concern is about adding public user-facing flag. Usually when we add a public flag for new features, e.g. AsyncStorageWrites, it brings new feature/benefit but with risk of introducing regression. It makes perfect sense to add a new public flag for such case, and it's super clear to users as well. "clear" means it's clear when/why to enable or disable it.

But this flag ResumeReplicateBelowPendingSnapshot, is it for a new feature? It seems that it's just safe-guard flag for a fix to a bug? It's not clear for users when/why ( and even how ) to enable/disable it.

I agree that the comment #110 (comment) makes more sense. I like the idea of changing state based on invariants. If we make some enhance and add a flag for it, then it makes more sense.

From implementation perspective, I agree this PR is safe, because the behaviour will keep unchanged as long as users do not enable ResumeReplicateBelowPendingSnapshot. We could let it merged for now if you are planning to eventually remove it and resolve #110 (comment).

Raft code internally assumes that there is always at most one snapshot in flight.

The concerning isn't about guaranteeing at most one snapshot in flight, it's about waste of bandwidth, because a snapshot may be huge in size, e.g a couple of GBs. We should try to avoid sending snapshot as much as possible. [If CockRoachDB enable ResumeReplicateBelowPendingSnapshot, and doesn't take any special action to drop the MsgSnap sent from Raft, it seems that it will almost always send two snapshots in a short period for cases handling a slow follower which lag too much?]

Could you explain why choose transition to StateProbe though?

Based on current raft protocol/implementation, when a follower is in StateSnapshot, when a leader received a not-out-of- date MsgAppResp from the follower, the condition pr.Match >= pr.PendingSnapshot should be always true. Otherwise, there must be something wrong for most users [For CRDB, it may be expected behaviour though due to the delegating snapshot feature], so we'd better to probe the follower's index again? This is just my immature thought.

This adds a test that documents the following behavior:

It the leader is tracking a follower as StateSnapshot with PendingSnapshot equal
to, say, 100, and the follower applies a snapshot at a lower index that
reconnects the follower to the leader's log, then the leader will still ignore
this snapshot.

Signed-off-by: Erik Grinaker <grinaker@cockroachlabs.com>
Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Jan 9, 2024

The main concern is about adding public user-facing flag.

I've removed the option.

The concerning isn't about guaranteeing at most one snapshot in flight, it's about waste of bandwidth, because a snapshot may be huge in size, e.g a couple of GBs.

Yes, that's also the motivation for us. In CRDB, we have already sent a snapshot to the follower (below PendingSnapshot), but the leader rejects this snapshot and insists we send a new one above PendingSnapshot. We would like to avoid this.

Based on current raft protocol/implementation, when a follower is in StateSnapshot, when a leader received a not-out-of- date MsgAppResp from the follower, the condition pr.Match >= pr.PendingSnapshot should be always true. Otherwise, there must be something wrong for most users [For CRDB, it may be expected behaviour though due to the delegating snapshot feature], so we'd better to probe the follower's index again?

Well, from etcd/raft's point of view, even if we did receive an out-of-date MsgAppResp that was below PendingSnapshot but above the leader's first log index, then the follower must have advanced in the meanwhile (or the leader's log storage was temporarily unavailable). If the follower has advanced and recovered from needing a snapshot, it seems fine to move it back to StateReplicate.

With StateProbe, we'll eat the cost of an MsgApp roundtrip only to discover that Next: PendingSnapshot + 1 was incorrect, then resume replication at the correct index. Is the motivation for using StateProbe rather than StateReplicate to avoid replicating entries below PendingSnapshot that the follower may already have received from the snapshot in the case of an MsgAppResp race? Are there any other downsides to using StateReplicate?

To be clear, I'd be fine with StateProbe here -- it's also what we use with e.g. ReportSnapshot(). Just curious if there is a compelling reason to prefer it over StateReplicate.

@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Jan 9, 2024

Is the motivation for using StateProbe rather than StateReplicate to avoid replicating entries below PendingSnapshot that the follower may already have received from the snapshot in the case of an MsgAppResp race?

Actually, that can't really happen, because the snapshot does not carry the actual log entries, and may have truncated the log. It's possible that we'll replicate log entries that will be clobbered by the later snapshot, thus wasting work, but that can also happen in the StateProbe case.

@ahrtr
Copy link
Member

ahrtr commented Jan 9, 2024

Overall looks good, thx.

Note that I am planning to release raft 3.6.0, please see #89. Please feel free to add comment under that issue on whether you are going to resolve #110 (comment) in 3.6.0 or next release e.g. 4.0.

Just curious if there is a compelling reason to prefer it over StateReplicate.

No for now.

@erikgrinaker
Copy link
Contributor Author

I think we'll have to do the PendingSnapshot removal and other improvements in 4.0, since they would be breaking changes to the public APIs. Wrote up a comment, and I'll write up a quick issue shortly.

Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Leave to @pav-kv to take a second look

A leader will not take into account snapshots reported by a follower
unless they match or exceed the tracked PendingSnapshot index (which is
the leader's last indexat the time of requesting the snapshot). This is
too inflexible: the leader should take into account any snapshot that
reconnects the follower to its log. This PR makes that change.

In doing so, it addresses long-standing problems that we've encountered
in CockroachDB. Unless you create the snapshot immediately and locally
when raft emits an MsgSnap, it's difficult/impossible to later
synthesize a snapshot at the requested index. It is possible to get one
above the requested index which raft always accepted, but CockroachDB
delegates snapshots to followers who might be behind on applying the
log, and it is awkward to have to wait for log application to send the
snapshot just to satisfy an overly strict condition in raft.
Additionally, CockroachDB also sends snapshots preemptively when adding
a new replica since there are qualitative differences between an initial
snapshot and one needed to reconnect to the log and one does not want to
wait for raft to round-trip to the follower to realize that a snapshot
is needed. In this case, the sent snapshot is commonly behind the
PendingSnapshot since the leader transitions the follower into
StateProbe when a snapshot is already in flight.

Touches cockroachdb/cockroach#84242.
Touches cockroachdb/cockroach#87553.
Touches cockroachdb/cockroach#87554.
Touches cockroachdb/cockroach#97971.
Touches cockroachdb/cockroach#114349.
See also https://github.com/cockroachdb/cockroach/blob/2b91c3829270eb512c5380201c26a3d838fc567a/pkg/kv/kvserver/raft_snapshot_queue.go#L131-L143.

Signed-off-by: Erik Grinaker <grinaker@cockroachlabs.com>
Signed-off-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@erikgrinaker erikgrinaker changed the title Add option to accept any snapshot that allows replication Accept any snapshot that allows replication Jan 9, 2024
@erikgrinaker
Copy link
Contributor Author

Moved follow-up work to #124

Copy link
Contributor

@pav-kv pav-kv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

// TODO(tbg): we should also enter this branch if a snapshot is
// received that is below pr.PendingSnapshot but which makes it
// possible to use the log again.
case pr.State == tracker.StateSnapshot && pr.Match+1 >= r.raftLog.firstIndex():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logically, it can also be (maybe should?) pr.Next >= r.raftLog.firstIndex().

Next is the next index to send. Indices between Match+1 and Next-1 are in-flight (by invariant which we haven't formalized yet), so don't necessarily need to be present in the log.

The MaybeUpdate() call 4 lines above should set Next to be at least Match+1, so this should be safe.

Feel free to leave this for #124.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's do it for #124. Semantically, Match+1 seems better since that's where the leader's and follower's logs match, but they're equivalent here and I don't have a particularly strong opinion either way.

@ahrtr ahrtr added this to the v3.6.0 milestone Jan 10, 2024
@ahrtr ahrtr merged commit f1c02c9 into etcd-io:main Jan 10, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants