
Ctrl+C blocked during snapshot transfer & node restore #1695

Open
aderouineau opened this issue Feb 23, 2024 · 5 comments
@aderouineau

What version are you running?

8.21.1

Are you using Docker or Kubernetes to run your system?

No

Are you running a single node or a cluster?

1 node + 1 reader

What did you do?

  1. Launched first node with 10+ GB SQLite DB
  2. Launched read-only node
  3. Pressed Ctrl+C during the snapshot transfer.

What did you expect to happen?

I expected rqlited to exit, aborting the snapshot transfer.

What happened instead?

It stopped its own HTTP daemon but continued the snapshot transfer, and only exited once the transfer was complete and the node had been restored.

[rqlited] 2024/02/23 01:54:49 connect using the command-line tool via 'rqlite -H localhost -p 4011'
[raft] 2024/02/23 01:54:49 [WARN]  failed to get previous log: previous-index=10 last-index=0 error="log not found"
[raft] 2024/02/23 01:54:59 [INFO]  snapshot network transfer progress: read-bytes=213843968 percent-complete="1.61%"
...
[raft] 2024/02/23 01:55:39 [INFO]  snapshot network transfer progress: read-bytes=1186398208 percent-complete="8.92%"
^C[rqlited] 2024/02/23 01:55:40 received signal "interrupt"
[http] 2024/02/23 01:55:40 closing HTTP service on 127.0.0.1:4011
[http] 2024/02/23 01:55:40 HTTP service on 127.0.0.1:4011 stopped: http: Server closed
[raft] 2024/02/23 01:55:49 [INFO]  snapshot network transfer progress: read-bytes=1546649600 percent-complete="11.63%"
...
[raft] 2024/02/23 02:01:06 [INFO]  copied to local snapshot: bytes=13301739520
[store] 2024/02/23 02:01:06 initiating node restore on node ID 2
[raft] 2024/02/23 02:01:16 [INFO]  snapshot restore progress: id=35-7-1708653289498 last-index=7 last-term=35 size-in-bytes=13301739520 read-bytes=6834028544 percent-complete="51.38%"
^C[rqlited] 2024/02/23 02:01:21 received signal "interrupt"
[raft] 2024/02/23 02:01:26 [INFO]  snapshot restore progress: id=35-7-1708653289498 last-index=7 last-term=35 size-in-bytes=13301739520 read-bytes=9677799424 percent-complete="72.76%"
[raft] 2024/02/23 02:01:36 [INFO]  snapshot restore progress: id=35-7-1708653289498 last-index=7 last-term=35 size-in-bytes=13301739520 read-bytes=12529172480 percent-complete="94.19%"
[store] 2024/02/23 02:01:38 successfully opened database at /home/derouin/rqlite/data2/db.sqlite due to restore
[store] 2024/02/23 02:01:38 node restored in 32.769750195s
[raft] 2024/02/23 02:01:38 [INFO]  Installed remote snapshot
[store] 2024/02/23 02:01:38 store closed with node ID 2, listening on localhost:4012
2024-02-23T02:01:38.939Z [ERROR] raft-net: failed to decode incoming command: error="transport shutdown"
[rqlited] 2024/02/23 02:01:38 rqlite server stopped
@mauri870
Contributor

This seems to be a limitation of raft itself: calling raft.Shutdown() returns a future that you can then wait on until shutdown finishes, but inside the raft implementation the network transfer does not check for shutdown.
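For illustration, a minimal sketch of that wait (the package and function names are assumptions for the example, not rqlite's actual shutdown code):

package rqlitedemo // hypothetical package, for illustration only

import (
	"log"

	"github.com/hashicorp/raft"
)

// closeRaft waits for raft to shut down. The future returned by Shutdown()
// only completes once raft has finished its in-flight work, which here
// includes an ongoing snapshot install.
func closeRaft(r *raft.Raft) {
	f := r.Shutdown()                 // returns a raft.Future immediately
	if err := f.Error(); err != nil { // blocks until raft has fully stopped
		log.Printf("raft shutdown: %v", err)
	}
}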

We could simply not wait on the Shutdown future, but I imagine that could lead to some buggy behavior.

IMHO this is working as intended.

@aderouineau
Author

Shouldn't an issue be raised with Hashicorp's raft then?

@mauri870
Contributor

mauri870 commented Feb 28, 2024

@otoolep Could this network transfer be done in a future in raft, so we can check for shutdown while copying?

Something along the lines of this:

diff --git a/raft.go b/raft.go
index 28c1128..73d85c9 100644
--- a/raft.go
+++ b/raft.go
@@ -1755,32 +1755,49 @@ func (r *Raft) installSnapshot(rpc RPC, req *InstallSnapshotRequest) {
 	countingRPCReader := newCountingReader(rpc.Reader)
 
 	// Spill the remote snapshot to disk
-	transferMonitor := startSnapshotRestoreMonitor(r.logger, countingRPCReader, req.Size, true)
-	n, err := io.Copy(sink, countingRPCReader)
-	transferMonitor.StopAndWait()
-	if err != nil {
-		sink.Cancel()
-		r.logger.Error("failed to copy snapshot", "error", err)
-		rpcErr = err
-		return
-	}
+	diskCopyErrCh := make(chan error, 1)
+	// NOTE: Maybe this is a good candidate for a future?
+	go func() {
+		transferMonitor := startSnapshotRestoreMonitor(r.logger, countingRPCReader, req.Size, true)
+		n, err := io.Copy(sink, countingRPCReader)
+		transferMonitor.StopAndWait()
+		if err != nil {
+			sink.Cancel()
+			r.logger.Error("failed to copy snapshot", "error", err)
+			diskCopyErrCh <- err
+			return
+		}
 
-	// Check that we received it all
-	if n != req.Size {
-		sink.Cancel()
-		r.logger.Error("failed to receive whole snapshot",
-			"received", hclog.Fmt("%d / %d", n, req.Size))
-		rpcErr = fmt.Errorf("short read")
-		return
-	}
+		// Check that we received it all
+		if n != req.Size {
+			sink.Cancel()
+			r.logger.Error("failed to receive whole snapshot",
+				"received", hclog.Fmt("%d / %d", n, req.Size))
+			diskCopyErrCh <- fmt.Errorf("short read")
+			return
+		}
 
-	// Finalize the snapshot
-	if err := sink.Close(); err != nil {
-		r.logger.Error("failed to finalize snapshot", "error", err)
-		rpcErr = err
+		// Finalize the snapshot
+		if err := sink.Close(); err != nil {
+			r.logger.Error("failed to finalize snapshot", "error", err)
+			diskCopyErrCh <- err
+			return
+		}
+		r.logger.Info("copied to local snapshot", "bytes", n)
+		diskCopyErrCh <- nil
+	}()
+
+	// Wait for snapshot copy or shutdown
+	select {
+	case err := <-diskCopyErrCh:
+		if err != nil {
+			rpcErr = err
+			return
+		}
+	case <-r.shutdownCh:
+		// future.respond(ErrRaftShutdown)
 		return
 	}
-	r.logger.Info("copied to local snapshot", "bytes", n)
 
 	// Restore snapshot
 	future := &restoreFuture{ID: sink.ID()}

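The core of the change is the usual Go pattern of pushing a blocking io.Copy into a goroutine and racing its result against a shutdown channel. A standalone sketch with generic names (none of these are raft's actual types):

package main

import (
	"errors"
	"fmt"
	"io"
	"time"
)

// copyOrShutdown copies src to dst in a goroutine and returns early if
// shutdownCh is closed, instead of blocking until the copy finishes.
func copyOrShutdown(dst io.Writer, src io.Reader, shutdownCh <-chan struct{}) error {
	errCh := make(chan error, 1)
	go func() {
		_, err := io.Copy(dst, src) // blocks until EOF or error
		errCh <- err
	}()
	select {
	case err := <-errCh:
		return err
	case <-shutdownCh:
		return errors.New("shutdown requested during copy")
	}
}

// slowReader simulates a large, slow snapshot stream: it keeps returning data,
// so the copy never completes on its own.
type slowReader struct{}

func (slowReader) Read(p []byte) (int, error) {
	time.Sleep(50 * time.Millisecond)
	return copy(p, "chunk"), nil
}

func main() {
	shutdownCh := make(chan struct{})
	go func() {
		time.Sleep(200 * time.Millisecond)
		close(shutdownCh) // stands in for Ctrl+C reaching raft
	}()
	fmt.Println(copyOrShutdown(io.Discard, slowReader{}, shutdownCh))
}

One caveat of this sketch (and of the diff above): the copy goroutine keeps reading after shutdown wins the select, so the real change would presumably also need to cancel the sink or close the reader in that branch.
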
@otoolep
Member

otoolep commented Feb 28, 2024

Maybe, sure, but obviously we'd need to work with the Raft maintainers to do that.

@mauri870
Contributor

Thanks, I just wanted to know if that was feasible. We should raise an issue in the hashicorp/raft repo and discuss with the maintainers there.
