The topology version may be updated, for example, by executing a RESTful API call to move a tablet. If that happens concurrently with an ongoing token metadata barrier executed by the topology coordinator (because a tablet migration is active, for example), some requests may fail because they are fenced out unnecessarily.
Topology coordinator log:
INFO 2024-05-15 06:22:00,811 [shard 0:strm] raft_topology - updating topology state: Tablet migration
INFO 2024-05-15 06:22:00,881 [shard 0:strm] raft_topology - Moving tablet 48d95250-126a-11ef-94f5-1b64c01a2d71:0 from 01333e37-d6f5-4377-8b2c-7a059519426b:1 to a7928125-224f-49de-8eed-0b2f44abefb6:1
^^ Sets topology version to 23
INFO 2024-05-15 06:22:00,938 [shard 0:strm] raft_topology - entered `tablet migration` transition state
INFO 2024-05-15 06:22:00,938 [shard 0:strm] raft_topology - executing global topology command barrier_and_drain, excluded nodes: {}
^^ Barrier for version 23 starts, other nodes are only guaranteed to catch up with version 23
INFO 2024-05-15 06:22:00,946 [shard 0:strm] raft_topology - Moving tablet 48d95250-126a-11ef-94f5-1b64c01a2d71:1 from a7928125-224f-49de-8eed-0b2f44abefb6:1 to 01333e37-d6f5-4377-8b2c-7a059519426b:1
^^ Sets topology version to 24
INFO 2024-05-15 06:22:01,002 [shard 0:strm] raft_topology - updating topology state: advance fence version to 24
^^ Fence is set to version 24 but other nodes may still be at version 23
A request started on another node fails:
ERROR 2024-05-15 06:22:01,070 [shard 1:stmt] storage_proxy - Exception when communicating with 127.119.201.41, to read from test.test: replica::stale_topology_exception (stale topology exception, caller version 23, callee fence version 24)
The problem is that the barrier assumes no concurrent topology updates, so it sets the fence version to the one that is current after the other nodes are drained. We should instead set the fence to the version that was current before the other nodes were drained, as they are only guaranteed to have caught up with that version.
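The race and the proposed fix can be illustrated with a small, self-contained Python model. This is only a sketch under stated assumptions, not the Scylla implementation: `Cluster`, `advance_fence_buggy`, `advance_fence_fixed`, `check_request`, and `StaleTopologyError` are hypothetical names invented for illustration, and the concurrent tablet move is modeled as a callback that bumps the version while the barrier is draining.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class StaleTopologyError(Exception):
    """Hypothetical stand-in for replica::stale_topology_exception."""


@dataclass
class Cluster:
    version: int              # current topology version
    fence: int                # requests with a lower version are rejected
    node_versions: List[int]  # version each other node has caught up with

    def barrier_and_drain(
        self,
        target: int,
        during_drain: Optional[Callable[["Cluster"], None]] = None,
    ) -> None:
        # Other nodes are only guaranteed to catch up with `target`.
        self.node_versions = [max(v, target) for v in self.node_versions]
        # Model a concurrent topology update (e.g. a REST tablet move).
        if during_drain is not None:
            during_drain(self)


def advance_fence_buggy(cluster: Cluster, during_drain=None) -> None:
    cluster.barrier_and_drain(cluster.version, during_drain)
    # Bug: reads the version AFTER the drain, which may be newer than
    # what the other nodes were asked to catch up with.
    cluster.fence = cluster.version


def advance_fence_fixed(cluster: Cluster, during_drain=None) -> None:
    # Fix: remember the version the barrier was started for.
    barrier_version = cluster.version
    cluster.barrier_and_drain(barrier_version, during_drain)
    cluster.fence = barrier_version


def check_request(caller_version: int, fence_version: int) -> None:
    # Replica-side fencing check, as suggested by the error message.
    if caller_version < fence_version:
        raise StaleTopologyError(
            f"caller version {caller_version}, "
            f"callee fence version {fence_version}"
        )


def concurrent_move(cluster: Cluster) -> None:
    # Second tablet move arrives while the barrier is in progress.
    cluster.version = 24
```

With a starting version of 23 and `concurrent_move` running during the drain, the buggy variant ends with the fence at 24 while the other nodes are at 23, so `check_request(23, 24)` raises, matching the error in the log. The fixed variant leaves the fence at 23 and the same request passes.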
Problem seen in CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/8806/artifact/testlog/x86_64/debug/scylla-3447.log