Updating topology concurrently with global token metadata barrier may cause requests to fail #18699

tgrabiec · 2024-05-15T22:09:32Z

The topology version may be updated, for example, by a REST API call that moves a tablet. If that happens concurrently with an ongoing token metadata barrier executed by the topology coordinator (because there is an active tablet migration, for example), then some requests may fail because they are fenced out unnecessarily.

Problem seen in CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/8806/artifact/testlog/x86_64/debug/scylla-3447.log

Topology coordinator log:

INFO  2024-05-15 06:22:00,811 [shard 0:strm] raft_topology - updating topology state: Tablet migration
INFO  2024-05-15 06:22:00,881 [shard 0:strm] raft_topology - Moving tablet 48d95250-126a-11ef-94f5-1b64c01a2d71:0 from 01333e37-d6f5-4377-8b2c-7a059519426b:1 to a7928125-224f-49de-8eed-0b2f44abefb6:1

^^ Sets topology version to 23

INFO  2024-05-15 06:22:00,938 [shard 0:strm] raft_topology - entered `tablet migration` transition state
INFO  2024-05-15 06:22:00,938 [shard 0:strm] raft_topology - executing global topology command barrier_and_drain, excluded nodes: {}

^^ Barrier for version 23 starts, other nodes are only guaranteed to catch up with version 23

INFO  2024-05-15 06:22:00,946 [shard 0:strm] raft_topology - Moving tablet 48d95250-126a-11ef-94f5-1b64c01a2d71:1 from a7928125-224f-49de-8eed-0b2f44abefb6:1 to 01333e37-d6f5-4377-8b2c-7a059519426b:1

^^ Sets topology version to 24

INFO  2024-05-15 06:22:01,002 [shard 0:strm] raft_topology - updating topology state: advance fence version to 24

^^ Fence is set to version 24 but other nodes may still be at version 23

Request started on another node fails:

ERROR 2024-05-15 06:22:01,070 [shard 1:stmt] storage_proxy - Exception when communicating with 127.119.201.41, to read from test.test: replica::stale_topology_exception (stale topology exception, caller version 23, callee fence version 24)

The problem is that the barrier assumes there are no concurrent topology updates, so it sets the fence version to the version that is current after the other nodes are drained. We should instead set the fence to the version that was current before the other nodes were drained, because they are only guaranteed to have caught up with that version.
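
To make the race concrete, here is a minimal toy model of the sequence in the log above. It is only a sketch and not the actual fix: `Coordinator`, `update_topology`, `barrier_and_drain`, `simulate` and `request_accepted` are hypothetical names invented for illustration, and the starting version 22 is assumed so that the two updates produce versions 23 and 24 as in the log.

```python
from dataclasses import dataclass


@dataclass
class Coordinator:
    """Toy model of the coordinator state; names are hypothetical, not Scylla's API."""
    version: int = 22        # current topology version (assumed starting point)
    fence_version: int = 22  # replicas reject callers whose version is below this
    peer_version: int = 22   # version other nodes are guaranteed to have reached

    def update_topology(self) -> int:
        # Any topology change (e.g. moving a tablet) bumps the version.
        self.version += 1
        return self.version

    def barrier_and_drain(self, started_at: int) -> None:
        # Other nodes are only guaranteed to catch up with the version
        # that was current when the barrier started.
        self.peer_version = max(self.peer_version, started_at)


def simulate(fence_to_barrier_version: bool) -> Coordinator:
    c = Coordinator()
    barrier_version = c.update_topology()  # version 23, barrier starts for it
    c.update_topology()                    # concurrent update, version 24
    c.barrier_and_drain(barrier_version)   # peers guaranteed at 23 only
    if fence_to_barrier_version:
        c.fence_version = barrier_version  # proposed: fence at pre-drain version
    else:
        c.fence_version = c.version        # observed: fence at post-drain version
    return c


def request_accepted(c: Coordinator) -> bool:
    # A request coordinated by a peer carries the peer's topology version.
    return c.peer_version >= c.fence_version
```

In this model `simulate(False)` ends with the fence at 24 while the peer may still be at 23, so `request_accepted` returns `False`, matching the `stale_topology_exception` above. `simulate(True)` leaves the fence at 23 and the same request is accepted.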


tgrabiec added the area/topology changes and area/tablets labels May 15, 2024

tgrabiec self-assigned this May 15, 2024

tgrabiec mentioned this issue May 15, 2024

Balance tablets within nodes (intra-node migration) #18026

Merged

scylladb-promoter closed this as completed in fad6c41 May 20, 2024

scylladb-promoter added the Backport candidate label May 20, 2024
