Updating topology concurrently with global token metadata barrier may cause requests to fail #18699

Closed
tgrabiec opened this issue May 15, 2024 · 0 comments

The topology version may be updated, for example, by a REST API call that moves a tablet. If that happens concurrently with an ongoing token metadata barrier executed by the topology coordinator (because there is an active tablet migration, for example), then some requests may fail because they are fenced out unnecessarily.

Problem seen in CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/8806/artifact/testlog/x86_64/debug/scylla-3447.log

Topology coordinator log:

INFO  2024-05-15 06:22:00,811 [shard 0:strm] raft_topology - updating topology state: Tablet migration
INFO  2024-05-15 06:22:00,881 [shard 0:strm] raft_topology - Moving tablet 48d95250-126a-11ef-94f5-1b64c01a2d71:0 from 01333e37-d6f5-4377-8b2c-7a059519426b:1 to a7928125-224f-49de-8eed-0b2f44abefb6:1

^^ Sets topology version to 23

INFO  2024-05-15 06:22:00,938 [shard 0:strm] raft_topology - entered `tablet migration` transition state
INFO  2024-05-15 06:22:00,938 [shard 0:strm] raft_topology - executing global topology command barrier_and_drain, excluded nodes: {}

^^ Barrier for version 23 starts, other nodes are only guaranteed to catch up with version 23

INFO  2024-05-15 06:22:00,946 [shard 0:strm] raft_topology - Moving tablet 48d95250-126a-11ef-94f5-1b64c01a2d71:1 from a7928125-224f-49de-8eed-0b2f44abefb6:1 to 01333e37-d6f5-4377-8b2c-7a059519426b:1

^^ Sets topology version to 24

INFO  2024-05-15 06:22:01,002 [shard 0:strm] raft_topology - updating topology state: advance fence version to 24

^^ Fence is set to version 24 but other nodes may still be at version 23

Request started on another node fails:

ERROR 2024-05-15 06:22:01,070 [shard 1:stmt] storage_proxy - Exception when communicating with 127.119.201.41, to read from test.test: replica::stale_topology_exception (stale topology exception, caller version 23, callee fence version 24)

The problem is that the barrier assumes there are no concurrent topology updates, so it sets the fence version to whatever version is current after the other nodes have been drained. We should instead set the fence to the version that was current before the other nodes were drained, as they are only guaranteed to have caught up with that version.
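
To illustrate the race and the proposed fix, here is a minimal single-process sketch in Python. It is not ScyllaDB code: `Node`, `Cluster`, `run_barrier`, `simulate`, and the `use_pre_drain_version` flag are hypothetical names invented for this example, and the concurrent tablet move is modeled as a callback that fires while the barrier is in progress. The version numbers mirror the log above.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


class StaleTopologyError(Exception):
    """Raised when a request carries a version older than the callee's fence."""


@dataclass
class Node:
    version: int        # topology version this node has caught up with
    fence_version: int = 0

    def handle_request(self, caller_version: int) -> None:
        # Replica-side fencing check (hypothetical model of the real check).
        if caller_version < self.fence_version:
            raise StaleTopologyError(
                f"caller version {caller_version}, "
                f"callee fence version {self.fence_version}"
            )


@dataclass
class Cluster:
    version: int        # current topology version on the coordinator
    nodes: List[Node]

    def run_barrier(
        self,
        use_pre_drain_version: bool,
        concurrent_update: Optional[Callable[[], None]] = None,
    ) -> int:
        # Version the barrier is started for; nodes are only
        # guaranteed to catch up with this one.
        version_before = self.version
        for node in self.nodes:
            node.version = max(node.version, version_before)

        # A concurrent topology change (e.g. a tablet move) bumps the
        # version while the barrier is still in progress.
        if concurrent_update is not None:
            concurrent_update()

        # Buggy behavior: fence at the version current *after* the drain.
        # Proposed behavior: fence at the version captured *before* it.
        fence = version_before if use_pre_drain_version else self.version
        for node in self.nodes:
            node.fence_version = fence
        return fence


def simulate(use_pre_drain_version: bool):
    """Return (fence version, whether a node-to-node request succeeded)."""
    nodes = [Node(version=22), Node(version=22)]
    cluster = Cluster(version=23, nodes=nodes)

    def move_tablet() -> None:
        cluster.version += 1  # 23 -> 24, as in the log

    fence = cluster.run_barrier(use_pre_drain_version, move_tablet)
    try:
        # A request started on nodes[0] is sent to nodes[1].
        nodes[1].handle_request(caller_version=nodes[0].version)
        return fence, True
    except StaleTopologyError:
        return fence, False
```

In this model, `simulate(use_pre_drain_version=False)` fences at 24 while the caller is still at 23, so the request is rejected as in the CI log. `simulate(use_pre_drain_version=True)` fences at 23 and the same request goes through.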
