
Ctrl+C blocked during snapshot transfer & node restore #1695

Open
aderouineau opened this issue Feb 23, 2024 · 5 comments
@aderouineau

What version are you running?

8.21.1

Are you using Docker or Kubernetes to run your system?

No

Are you running a single node or a cluster?

1 node + 1 reader

What did you do?

  1. Launched first node with 10+ GB SQLite DB
  2. Launched read-only node
  3. Pressed Ctrl+C during the snapshot transfer.

What did you expect to happen?

I expected rqlited to exit, aborting the snapshot transfer.

What happened instead?

It stopped its own HTTP daemon but continued the snapshot transfer, and only exited once the transfer was complete and the node had been restored.

[rqlited] 2024/02/23 01:54:49 connect using the command-line tool via 'rqlite -H localhost -p 4011'
[raft] 2024/02/23 01:54:49 [WARN]  failed to get previous log: previous-index=10 last-index=0 error="log not found"
[raft] 2024/02/23 01:54:59 [INFO]  snapshot network transfer progress: read-bytes=213843968 percent-complete="1.61%"
...
[raft] 2024/02/23 01:55:39 [INFO]  snapshot network transfer progress: read-bytes=1186398208 percent-complete="8.92%"
^C[rqlited] 2024/02/23 01:55:40 received signal "interrupt"
[http] 2024/02/23 01:55:40 closing HTTP service on 127.0.0.1:4011
[http] 2024/02/23 01:55:40 HTTP service on 127.0.0.1:4011 stopped: http: Server closed
[raft] 2024/02/23 01:55:49 [INFO]  snapshot network transfer progress: read-bytes=1546649600 percent-complete="11.63%"
...
[raft] 2024/02/23 02:01:06 [INFO]  copied to local snapshot: bytes=13301739520
[store] 2024/02/23 02:01:06 initiating node restore on node ID 2
[raft] 2024/02/23 02:01:16 [INFO]  snapshot restore progress: id=35-7-1708653289498 last-index=7 last-term=35 size-in-bytes=13301739520 read-bytes=6834028544 percent-complete="51.38%"
^C[rqlited] 2024/02/23 02:01:21 received signal "interrupt"
[raft] 2024/02/23 02:01:26 [INFO]  snapshot restore progress: id=35-7-1708653289498 last-index=7 last-term=35 size-in-bytes=13301739520 read-bytes=9677799424 percent-complete="72.76%"
[raft] 2024/02/23 02:01:36 [INFO]  snapshot restore progress: id=35-7-1708653289498 last-index=7 last-term=35 size-in-bytes=13301739520 read-bytes=12529172480 percent-complete="94.19%"
[store] 2024/02/23 02:01:38 successfully opened database at /home/derouin/rqlite/data2/db.sqlite due to restore
[store] 2024/02/23 02:01:38 node restored in 32.769750195s
[raft] 2024/02/23 02:01:38 [INFO]  Installed remote snapshot
[store] 2024/02/23 02:01:38 store closed with node ID 2, listening on localhost:4012
2024-02-23T02:01:38.939Z [ERROR] raft-net: failed to decode incoming command: error="transport shutdown"
[rqlited] 2024/02/23 02:01:38 rqlite server stopped
@mauri870
Contributor

This seems to be a limitation of raft itself: calling raft.Shutdown() returns a future that you can then wait on until shutdown finishes, but inside the raft implementation the network transfer does not check for shutdown.
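For illustration, a minimal sketch of that wait (the package and function names are assumptions for the example, not rqlite's actual shutdown code):

package rqlitedemo // hypothetical package, for illustration only

import (
	"log"

	"github.com/hashicorp/raft"
)

// closeRaft waits for raft to shut down. The future returned by Shutdown()
// only completes once raft has finished its in-flight work, which here
// includes an ongoing snapshot install.
func closeRaft(r *raft.Raft) {
	f := r.Shutdown()                 // returns a raft.Future immediately
	if err := f.Error(); err != nil { // blocks until raft has fully stopped
		log.Printf("raft shutdown: %v", err)
	}
}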

We could simply not wait on the Shutdown future, but I imagine that could lead to some buggy behavior.

IMHO this is working as intended.

@aderouineau
Author

Shouldn't an issue be raised with Hashicorp's raft then?

@mauri870
Contributor

mauri870 commented Feb 28, 2024

@otoolep Could this network transfer be done in a future in raft, so we can check for shutdown while copying?

Something along the lines of this:

diff --git a/raft.go b/raft.go
index 28c1128..73d85c9 100644
--- a/raft.go
+++ b/raft.go
@@ -1755,32 +1755,49 @@ func (r *Raft) installSnapshot(rpc RPC, req *InstallSnapshotRequest) {
 	countingRPCReader := newCountingReader(rpc.Reader)
 
 	// Spill the remote snapshot to disk
-	transferMonitor := startSnapshotRestoreMonitor(r.logger, countingRPCReader, req.Size, true)
-	n, err := io.Copy(sink, countingRPCReader)
-	transferMonitor.StopAndWait()
-	if err != nil {
-		sink.Cancel()
-		r.logger.Error("failed to copy snapshot", "error", err)
-		rpcErr = err
-		return
-	}
+	diskCopyErrCh := make(chan error, 1)
+	// NOTE: Maybe this is a good candidate for a future?
+	go func() {
+		transferMonitor := startSnapshotRestoreMonitor(r.logger, countingRPCReader, req.Size, true)
+		n, err := io.Copy(sink, countingRPCReader)
+		transferMonitor.StopAndWait()
+		if err != nil {
+			sink.Cancel()
+			r.logger.Error("failed to copy snapshot", "error", err)
+			diskCopyErrCh <- err
+			return
+		}
 
-	// Check that we received it all
-	if n != req.Size {
-		sink.Cancel()
-		r.logger.Error("failed to receive whole snapshot",
-			"received", hclog.Fmt("%d / %d", n, req.Size))
-		rpcErr = fmt.Errorf("short read")
-		return
-	}
+		// Check that we received it all
+		if n != req.Size {
+			sink.Cancel()
+			r.logger.Error("failed to receive whole snapshot",
+				"received", hclog.Fmt("%d / %d", n, req.Size))
+			diskCopyErrCh <- fmt.Errorf("short read")
+			return
+		}
 
-	// Finalize the snapshot
-	if err := sink.Close(); err != nil {
-		r.logger.Error("failed to finalize snapshot", "error", err)
-		rpcErr = err
+		// Finalize the snapshot
+		if err := sink.Close(); err != nil {
+			r.logger.Error("failed to finalize snapshot", "error", err)
+			diskCopyErrCh <- err
+			return
+		}
+		r.logger.Info("copied to local snapshot", "bytes", n)
+		diskCopyErrCh <- nil
+	}()
+
+	// Wait for snapshot copy or shutdown
+	select {
+	case err := <-diskCopyErrCh:
+		if err != nil {
+			rpcErr = err
+			return
+		}
+	case <-r.shutdownCh:
+		// future.respond(ErrRaftShutdown)
 		return
 	}
-	r.logger.Info("copied to local snapshot", "bytes", n)
 
 	// Restore snapshot
 	future := &restoreFuture{ID: sink.ID()}

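The core of the change is the usual Go pattern of pushing a blocking io.Copy into a goroutine and racing its result against a shutdown channel. A standalone sketch with generic names (none of these are raft's actual types):

package main

import (
	"errors"
	"fmt"
	"io"
	"time"
)

// copyOrShutdown copies src to dst in a goroutine and returns early if
// shutdownCh is closed, instead of blocking until the copy finishes.
func copyOrShutdown(dst io.Writer, src io.Reader, shutdownCh <-chan struct{}) error {
	errCh := make(chan error, 1)
	go func() {
		_, err := io.Copy(dst, src) // blocks until EOF or error
		errCh <- err
	}()
	select {
	case err := <-errCh:
		return err
	case <-shutdownCh:
		return errors.New("shutdown requested during copy")
	}
}

// slowReader simulates a large, slow snapshot stream: it keeps returning data,
// so the copy never completes on its own.
type slowReader struct{}

func (slowReader) Read(p []byte) (int, error) {
	time.Sleep(50 * time.Millisecond)
	return copy(p, "chunk"), nil
}

func main() {
	shutdownCh := make(chan struct{})
	go func() {
		time.Sleep(200 * time.Millisecond)
		close(shutdownCh) // stands in for Ctrl+C reaching raft
	}()
	fmt.Println(copyOrShutdown(io.Discard, slowReader{}, shutdownCh))
}

One caveat of this sketch (and of the diff above): the copy goroutine keeps reading after shutdown wins the select, so the real change would presumably also need to cancel the sink or close the reader in that branch.
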
@otoolep
Member

otoolep commented Feb 28, 2024

Maybe, sure, but obviously we'd need to work with the Raft maintainers to do that.

@mauri870
Contributor

Thanks, I just wanted to know if that was feasible. We should raise an issue in the hashicorp/raft repo and discuss with the maintainers there.
