Ctrl+C blocked during snapshot transfer & node restore #1695
This seems to be a limitation of raft itself. We could simply not wait on the Shutdown future, but I imagine that could lead to buggy behavior. IMHO this is working as intended.
Shouldn't an issue be raised with Hashicorp's raft then?
@otoolep Could this network transfer be done in a future in raft, so we can check for shutdown while copying? Something along the lines of this:

```diff
diff --git a/raft.go b/raft.go
index 28c1128..73d85c9 100644
--- a/raft.go
+++ b/raft.go
@@ -1755,32 +1755,49 @@ func (r *Raft) installSnapshot(rpc RPC, req *InstallSnapshotRequest) {
 	countingRPCReader := newCountingReader(rpc.Reader)
 
 	// Spill the remote snapshot to disk
-	transferMonitor := startSnapshotRestoreMonitor(r.logger, countingRPCReader, req.Size, true)
-	n, err := io.Copy(sink, countingRPCReader)
-	transferMonitor.StopAndWait()
-	if err != nil {
-		sink.Cancel()
-		r.logger.Error("failed to copy snapshot", "error", err)
-		rpcErr = err
-		return
-	}
+	diskCopyErrCh := make(chan error, 1)
+	// NOTE: Maybe this is a good candidate for a future?
+	go func() {
+		transferMonitor := startSnapshotRestoreMonitor(r.logger, countingRPCReader, req.Size, true)
+		n, err := io.Copy(sink, countingRPCReader)
+		transferMonitor.StopAndWait()
+		if err != nil {
+			sink.Cancel()
+			r.logger.Error("failed to copy snapshot", "error", err)
+			diskCopyErrCh <- err
+			return
+		}
 
-	// Check that we received it all
-	if n != req.Size {
-		sink.Cancel()
-		r.logger.Error("failed to receive whole snapshot",
-			"received", hclog.Fmt("%d / %d", n, req.Size))
-		rpcErr = fmt.Errorf("short read")
-		return
-	}
+		// Check that we received it all
+		if n != req.Size {
+			sink.Cancel()
+			r.logger.Error("failed to receive whole snapshot",
+				"received", hclog.Fmt("%d / %d", n, req.Size))
+			diskCopyErrCh <- fmt.Errorf("short read")
+			return
+		}
 
-	// Finalize the snapshot
-	if err := sink.Close(); err != nil {
-		r.logger.Error("failed to finalize snapshot", "error", err)
-		rpcErr = err
+		// Finalize the snapshot
+		if err := sink.Close(); err != nil {
+			r.logger.Error("failed to finalize snapshot", "error", err)
+			diskCopyErrCh <- err
+			return
+		}
+		r.logger.Info("copied to local snapshot", "bytes", n)
+		diskCopyErrCh <- nil
+	}()
+
+	// Wait for snapshot copy or shutdown
+	select {
+	case err := <-diskCopyErrCh:
+		if err != nil {
+			rpcErr = err
+			return
+		}
+	case <-r.shutdownCh:
+		// future.respond(ErrRaftShutdown)
 		return
 	}
-	r.logger.Info("copied to local snapshot", "bytes", n)
 
 	// Restore snapshot
 	future := &restoreFuture{ID: sink.ID()}
```
Maybe, sure, but obviously we'd need to work with the Raft maintainers to do that.
Thanks, I just wanted to know if that was feasible. We should raise an issue in the hashicorp/raft repo and discuss with the maintainers there.
What version are you running?
8.21.1
Are you using Docker or Kubernetes to run your system?
No
Are you running a single node or a cluster?
1 node + 1 reader
What did you do?
Pressed Ctrl+C (sent SIGINT) while rqlited was in the middle of a snapshot transfer and node restore.
What did you expect to happen?
I expected rqlited to exit, aborting the snapshot transfer.
What happened instead?
It stopped its own HTTP daemon but continued with the snapshot transfer and only exited when the transfer was complete and the node restored.