
raft topology: a node finishes bootstrapping and "completes initialization" while other nodes still didn't learn that it's normal #18678

Open
kbr-scylla opened this issue May 14, 2024 · 9 comments
Labels: area/raft, P2 High Priority

@kbr-scylla
Contributor

kbr-scylla commented May 14, 2024

I actually don't know if this is a bug or whether it should be considered an issue, but we can use this issue to discuss it and potentially decide whether it should be fixed.

A dtest failed in debug mode: update_cluster_layout_tests.py::TestUpdateClusterLayout::test_simple_add_two_nodes_in_parallel[case_1]
attaching logs:
dtest-gw2.log
node1.log
node2.log
node3.log

In short, the test bootstrapped 3 nodes. After confirming that they all saw each other as alive and all printed "initialization completed", it connected to node2 (exclusive CQL connection) and tried to run a query with CL=THREE.

The test failed because the query returned an unavailable exception:

E cassandra.Unavailable: Error from server: code=1000 [Unavailable exception] message="Cannot achieve consistency level for cl THREE. Requires 3, alive 2" info={'consistency': 'THREE', 'required_replicas': 3, 'alive_replicas': 2}

First of all, if I understand correctly how this exception is constructed, the printed numbers are actually wrong when there is a bootstrapping node and we hit a key that is replicated to a pending replica. In that case, with CL=THREE, it should say "Requires 4", because we increase the required replica count by 1 for the pending replica.
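
To make that arithmetic concrete, here is a minimal Python sketch of the rule as I understand it (illustrative only, not Scylla's actual implementation; the function and names are made up):

```python
# Illustrative sketch of the rule described above, not Scylla's actual code:
# every pending (bootstrapping) replica of the hit range raises the number of
# live replicas the coordinator must block for.

def required_replicas(cl: str, rf: int, pending_replicas: int) -> int:
    if cl == "ALL":
        base = rf
    elif cl == "QUORUM":
        base = rf // 2 + 1
    else:
        base = {"ONE": 1, "TWO": 2, "THREE": 3}[cl]
    return base + pending_replicas

# The failing test's scenario: RF=3, CL=THREE, node3 still pending in node2's
# view of the ring -> 3 + 1 = 4 required, while only 2 normal replicas are alive.
assert required_replicas("THREE", rf=3, pending_replicas=1) == 4
```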

I haven't confirmed 100% that the explanation below is the cause of this failure, but it is something that could happen, it is plausible, and I have no other hypotheses -- so that is probably it.

The explanation is that even though node 3 printed "initialization completed", node 2 still thinks it is bootstrapping, so token_metadata contains tokens of node 3 as pending. This can happen because we don't do a global read barrier between leaving write_both_read_new and the bootstrapping node printing "initialization completed". So those last commands that update topology state might not have yet reached node 2.

Therefore, at the moment of the test trying to do a CL=THREE query through node 2, the node thinks that there is a pending replica, so it requires 4 alive nodes for that query.

Also note that we recently removed the "wait for gossip to settle" step before printing "initialization completed" and opening the CQL port (65cfb9b) -- that wait would actually have prevented this failure from happening. (And from what I can see, this test only started failing recently, so that removal could be the cause.)

The question is:

  • should this be considered a bug, which we could fix by e.g. adding a new transition state at the end of the bootstrap procedure, which only does a global read barrier? If I remember correctly, @tgrabiec proposed a state like that (use_new) when writing down the spec for consistent topology changes
  • or should we ignore it and just adjust the test to wait... but then the question is: what exactly should the test wait for (see the sketch after this list)? And note that this is a problem not only for tests -- it is a problem for any kind of automation that sets up clusters. On the other hand, I'm not sure anybody uses CLs like TWO or THREE; most should use QUORUM or ALL. It could potentially also happen with CL=ONE and RF=1, for example, if I understand correctly, and that is more realistic. So we could expect issue reports from the Scylla Operator team and others if we don't fix it.
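
For the "what should the test wait for" part, one rough possibility (a sketch only, not an existing dtest helper) would be to poll nodetool status on every node until each of them reports the new node as UN (Up/Normal), i.e. nobody still considers it pending:

```python
# Sketch only, not an existing dtest helper: wait until every node reports the
# newly bootstrapped node as "UN" (Up/Normal) in nodetool status, i.e. nobody
# still considers it joining/pending.
import subprocess
import time

def wait_until_normal_everywhere(node_addresses, new_node_ip, timeout=120):
    def sees_as_normal(addr):
        out = subprocess.run(["nodetool", "-h", addr, "status"],
                             capture_output=True, text=True, check=True).stdout
        return any(line.startswith("UN") and new_node_ip in line
                   for line in out.splitlines())

    deadline = time.time() + timeout
    while time.time() < deadline:
        if all(sees_as_normal(addr) for addr in node_addresses):
            return
        time.sleep(1)
    raise TimeoutError(f"{new_node_ip} not seen as UN by all nodes within {timeout}s")
```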

@kostja @gleb-cloudius @tgrabiec -- opinions?

@tgrabiec
Contributor

> The explanation is that even though node 3 printed "initialization completed", node 2 still thinks it is bootstrapping, so token_metadata contains tokens of node 3 as pending. This can happen because we don't do a global read barrier between leaving write_both_read_new and the bootstrapping node printing "initialization completed". So those last commands that update topology state might not have yet reached node 2.
>
> Therefore, at the moment of the test trying to do a CL=THREE query through node 2, the node thinks that there is a pending replica, so it requires 4 alive nodes for that query. (CL=ALL would work.)

If node 3 is pending according to the erm, why would the coordinator require 4 nodes? In that topology version there are 2 normal replicas + 1 pending.

@tgrabiec
Contributor

Ok I see, it's because CL=3 is supposed to fail during bootstrap of the third node, unlike RF=3 which works fine.

In the test, you could wait on node2 for the topology to quiesce. I added an API for that in 190bdc3 (not merged yet).

@kostja
Contributor

kostja commented May 14, 2024

I believe this should be fixed. I have stated this on earlier occasions: ScyllaDB should actively cooperate with clients on its bootstrap status during bootstrap, and it should only begin serving queries after all reasonable efforts have been made to get up to date with the cluster and to make the cluster aware of this node.

Indeed it's impossible to guarantee in all cases that a starting node can achieve a desired state, but the goal is to avoid errors like this one in a healthy network with healthy nodes.

Issuing an extra read barrier therefore seems like a reasonable effort -- it should not be mandatory for becoming live; perhaps it should time out with a warning in the log after 5 seconds.

@tgrabiec
Contributor

@kostja Sounds reasonable to me -- except maybe the 5 sec timeout, which may not be enough in our CI environment. It could be a regular barrier; we do several of them already, so one more won't make us more vulnerable to unavailability.

@gleb-cloudius
Contributor

> Therefore, at the moment of the test trying to do a CL=THREE query through node 2, the node thinks that there is a pending replica, so it requires 4 alive nodes for that query. (CL=ALL would work.)

Why would CL=ALL work?

@kbr-scylla
Contributor Author

kbr-scylla commented May 15, 2024

> Why would CL=ALL work?

Sorry, I think I made a mistake. With RF=3 and CL=ALL, we would calculate that 3 natural replicas are needed (the calculation doesn't take into account how many token owners there are in the ring; it looks at the keyspace configuration instead). Therefore, during bootstrap, when hitting a pending range, we'd require 3 natural + 1 pending, 4 in total, so CL=ALL should also fail.
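
In the same illustrative terms as the earlier sketch, the CL=ALL arithmetic would be:

```python
# Same illustrative rule as in the earlier sketch: for CL=ALL the base comes
# from the keyspace's RF, and the pending replica is added on top of it.
rf, pending = 3, 1
required = rf + pending  # 4 -- so CL=ALL also fails while node 3 is pending
assert required == 4
```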

Will edit the post and remove that part (edit: done)


> It could be a regular barrier; we do several of them already, so one more won't make us more vulnerable to unavailability.

It could work like that:

  • introduce a new transition_state after write_both_read_new, e.g. use_new,
  • in use_new, the tokens of the booting node are already normal,
  • the only thing we do in this state is a global barrier, then we finish the saga (i.e., remove the transition_state and mark the topology request as completed),
  • therefore, the bootstrapping node, which is waiting for the request to complete, would only print "initialization completed" after all nodes have learned that its tokens are normal.

Or do you have something else (maybe simpler) in mind @tgrabiec?
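
To make the above concrete, here is a minimal sketch of the proposed flow (purely illustrative: apart from write_both_read_new and the proposed use_new, the names and structure are hypothetical, not the topology coordinator's actual code):

```python
# Purely illustrative sketch of the proposed bootstrap tail; the helper names
# (group0, request, global_read_barrier, mark_completed) are hypothetical
# placeholders, not Scylla's actual code.
from enum import Enum, auto

class TransitionState(Enum):
    WRITE_BOTH_READ_NEW = auto()
    USE_NEW = auto()  # proposed: the booting node's tokens are already normal here

async def coordinator_step(state, group0, request):
    if state is TransitionState.WRITE_BOTH_READ_NEW:
        # ...existing work; when done, promote the booting node's tokens to
        # normal and move to the new state instead of finishing immediately.
        return TransitionState.USE_NEW
    if state is TransitionState.USE_NEW:
        # The only work in this state: make sure every node has applied the
        # command that made the tokens normal...
        await group0.global_read_barrier()
        # ...and only then finish the saga, so the bootstrapping node (which
        # waits for the request to complete) prints "initialization completed"
        # after everyone has learned that its tokens are normal.
        await request.mark_completed()
        return None  # transition_state removed, topology request completed
```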

@kbr-scylla
Contributor Author

> I added an API for that in 190bdc3 (not merged yet).

Looks suspiciously similar to "wait for gossip to settle"... which we disabled...

@kbr-scylla
Contributor Author

A potential alternative to introducing barriers etc. would be to bring back "wait for gossip to settle" at the end of the node startup procedure, before it prints "initialization completed".

If we want to speed things up, we could introduce a "finish early" condition to this wait. For example, we could send RPCs to all nodes to check whether they already see us as UP and NORMAL, and if all of those RPCs finish, we finish the wait, even if gossip is theoretically not settled yet. In a healthy network this would finish almost instantaneously.
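
A rough sketch of that "finish early" wait, assuming a hypothetical per-peer sees_as_up_and_normal RPC and taking the old gossip-settle wait as a fallback passed in by the caller:

```python
# Rough sketch of the "finish early" condition; sees_as_up_and_normal() and
# the gossip_settle_fallback coroutine are hypothetical placeholders.
import asyncio

async def wait_until_cluster_sees_us(peers, my_host_id, gossip_settle_fallback,
                                     rpc_timeout=30):
    async def peer_acks(peer):
        # Hypothetical RPC: "do you already see my_host_id as UP and NORMAL?"
        return await peer.sees_as_up_and_normal(my_host_id)

    try:
        # Happy path: every peer answers positively; in a healthy network this
        # finishes almost instantaneously and we skip waiting for gossip.
        acks = await asyncio.wait_for(
            asyncio.gather(*(peer_acks(p) for p in peers)), rpc_timeout)
        if all(acks):
            return
    except asyncio.TimeoutError:
        pass
    # Some peer didn't answer (or answered "no"): fall back to the old
    # "wait for gossip to settle" behaviour.
    await gossip_settle_fallback()
```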

@kostja
Contributor

kostja commented May 15, 2024

> A potential alternative to introducing barriers etc. would be to bring back "wait for gossip to settle" at the end of the node startup procedure, before it prints "initialization completed".
>
> If we want to speed things up, we could introduce a "finish early" condition to this wait. For example, we could send RPCs to all nodes to check whether they already see us as UP and NORMAL, and if all of those RPCs finish, we finish the wait, even if gossip is theoretically not settled yet. In a healthy network this would finish almost instantaneously.

I am totally OK with an RPC, which is very similar to the happy path of the read barrier. I think we should avoid going back to gossip.

@kbr-scylla kbr-scylla added the P2 High Priority label May 16, 2024