
raft topology: a node finishes bootstrapping and "completes initialization" while other nodes still didn't learn that it's normal #18678

Open
kbr-scylla opened this issue May 14, 2024 · 9 comments
Labels: area/raft, P2 High Priority

@kbr-scylla
Contributor

kbr-scylla commented May 14, 2024

I actually don't know if this is a bug or whether it should be considered an issue, but we can use this issue to discuss it and potentially decide whether it should be fixed.

A dtest failed in debug mode: update_cluster_layout_tests.py::TestUpdateClusterLayout::test_simple_add_two_nodes_in_parallel[case_1]
attaching logs:
dtest-gw2.log
node1.log
node2.log
node3.log

In short, the test bootstrapped 3 nodes. After confirming that they all saw each other as alive and all printed "initialization completed", it connected to node2 (exclusive CQL connection) and tried to run a query with CL=THREE.

The test failed because the query returned an unavailable exception:

E cassandra.Unavailable: Error from server: code=1000 [Unavailable exception] message="Cannot achieve consistency level for cl THREE. Requires 3, alive 2" info={'consistency': 'THREE', 'required_replicas': 3, 'alive_replicas': 2}

First of all, if I understand correctly how this exception is constructed, the printed numbers are actually wrong when there is a bootstrapping node and we hit a key that is replicated to a pending replica. In that case, with CL=THREE, it should say "Requires 4", because we increase the required replica count by 1 for the pending replica.
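
To make that arithmetic concrete, here is a minimal Python sketch of the rule as I understand it (illustrative only, not Scylla's actual implementation; the function and names are made up):

```python
# Illustrative sketch of the rule described above, not Scylla's actual code:
# every pending (bootstrapping) replica of the hit range raises the number of
# live replicas the coordinator must block for.

def required_replicas(cl: str, rf: int, pending_replicas: int) -> int:
    if cl == "ALL":
        base = rf
    elif cl == "QUORUM":
        base = rf // 2 + 1
    else:
        base = {"ONE": 1, "TWO": 2, "THREE": 3}[cl]
    return base + pending_replicas

# The failing test's scenario: RF=3, CL=THREE, node3 still pending in node2's
# view of the ring -> 3 + 1 = 4 required, while only 2 normal replicas are alive.
assert required_replicas("THREE", rf=3, pending_replicas=1) == 4
```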

I haven't confirmed 100% that the explanation below is the cause of this failure, but it is something that could happen, it is plausible, and I have no other hypotheses -- so that is probably it.

The explanation is that even though node 3 printed "initialization completed", node 2 still thinks it is bootstrapping, so token_metadata contains tokens of node 3 as pending. This can happen because we don't do a global read barrier between leaving write_both_read_new and the bootstrapping node printing "initialization completed". So those last commands that update topology state might not have yet reached node 2.

Therefore, at the moment of the test trying to do a CL=THREE query through node 2, the node thinks that there is a pending replica, so it requires 4 alive nodes for that query.

Also note that we recently removed the "wait for gossip to settle" step before printing "initialization completed" and opening the CQL port (65cfb9b) -- that wait would actually have prevented this failure from happening. (And from what I can see, this test only started failing recently, so that removal could be the cause.)

The question is:

  • should this be considered a bug, which we could fix by e.g. adding a new transition state at the end of the bootstrap procedure, which only does a global read barrier? If I remember correctly, @tgrabiec proposed a state like that (use_new) when writing down the spec for consistent topology changes
  • or should we ignore it and just adjust the test to wait... but then the question is: what exactly should the test wait for (see the sketch after this list)? And note that this is a problem not only for tests -- it is a problem for any kind of automation that sets up clusters. On the other hand, I'm not sure anybody uses CLs like TWO or THREE; most should use QUORUM or ALL. It could potentially also happen with CL=ONE and RF=1, for example, if I understand correctly, and that is more realistic. So we could expect issue reports from the Scylla Operator team and others if we don't fix it.
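
For the "what should the test wait for" part, one rough possibility (a sketch only, not an existing dtest helper) would be to poll nodetool status on every node until each of them reports the new node as UN (Up/Normal), i.e. nobody still considers it pending:

```python
# Sketch only, not an existing dtest helper: wait until every node reports the
# newly bootstrapped node as "UN" (Up/Normal) in nodetool status, i.e. nobody
# still considers it joining/pending.
import subprocess
import time

def wait_until_normal_everywhere(node_addresses, new_node_ip, timeout=120):
    def sees_as_normal(addr):
        out = subprocess.run(["nodetool", "-h", addr, "status"],
                             capture_output=True, text=True, check=True).stdout
        return any(line.startswith("UN") and new_node_ip in line
                   for line in out.splitlines())

    deadline = time.time() + timeout
    while time.time() < deadline:
        if all(sees_as_normal(addr) for addr in node_addresses):
            return
        time.sleep(1)
    raise TimeoutError(f"{new_node_ip} not seen as UN by all nodes within {timeout}s")
```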

@kostja @gleb-cloudius @tgrabiec -- opinions?

@tgrabiec
Contributor

> The explanation is that even though node 3 printed "initialization completed", node 2 still thinks it is bootstrapping, so token_metadata contains tokens of node 3 as pending. This can happen because we don't do a global read barrier between leaving write_both_read_new and the bootstrapping node printing "initialization completed". So those last commands that update topology state might not have yet reached node 2.
>
> Therefore, at the moment of the test trying to do a CL=THREE query through node 2, the node thinks that there is a pending replica, so it requires 4 alive nodes for that query. (CL=ALL would work.)

If node 3 is pending according to the erm, why would the coordinator require 4 nodes? In that topology version there are 2 normal replicas + 1 pending.

@tgrabiec
Contributor

Ok I see, it's because CL=3 is supposed to fail during bootstrap of the third node, unlike RF=3 which works fine.

In the test, you could wait on node2 for the topology to quiesce. I added an API for that in 190bdc3 (not merged yet).

@kostja
Contributor

kostja commented May 14, 2024

I believe this should be fixed. I have stated this on earlier occasions: ScyllaDB should actively cooperate with clients on its bootstrap status during bootstrap, and it should only begin serving queries after all reasonable efforts have been made to get up to date with the cluster and to make the cluster aware of this node.

Indeed it's impossible to guarantee in all cases that a starting node can achieve a desired state, but the goal is to avoid errors like this one in a healthy network with healthy nodes.

Issuing an extra read barrier therefore seems like a reasonable effort -- it should not be mandatory for becoming live; perhaps it should time out with a warning in the log after 5 seconds.

@tgrabiec
Contributor

@kostja Sounds reasonable to me -- except maybe the 5 sec timeout, which may not be enough in our CI environment. It could be a regular barrier; we do several of them already, so one more won't make us more vulnerable to unavailability.

@gleb-cloudius
Contributor

> Therefore, at the moment of the test trying to do a CL=THREE query through node 2, the node thinks that there is a pending replica, so it requires 4 alive nodes for that query. (CL=ALL would work.)

Why would CL=ALL work?

@kbr-scylla
Contributor Author

kbr-scylla commented May 15, 2024

> Why would CL=ALL work?

Sorry, I think I made a mistake. With RF=3 and CL=ALL, we would calculate that 3 natural replicas are needed (the calculation doesn't take into account how many token owners there are in the ring; it looks at the keyspace configuration instead). Therefore, during bootstrap, when hitting a pending range, we'd require 3 natural + 1 pending, 4 in total, so CL=ALL should also fail.
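
In the same illustrative terms as the earlier sketch, the CL=ALL arithmetic would be:

```python
# Same illustrative rule as in the earlier sketch: for CL=ALL the base comes
# from the keyspace's RF, and the pending replica is added on top of it.
rf, pending = 3, 1
required = rf + pending  # 4 -- so CL=ALL also fails while node 3 is pending
assert required == 4
```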

Will edit the post and remove that part (edit: done)


> It could be a regular barrier; we do several of them already, so one more won't make us more vulnerable to unavailability.

It could work like that:

  • introduce a new transition_state after write_both_read_new, e.g. use_new,
  • in use_new, the tokens of the booting node are already normal,
  • the only thing we do in this state is a global barrier, then we finish the saga (i.e., remove the transition_state and mark the topology request as completed),
  • therefore, the bootstrapping node, which is waiting for the request to complete, would only print "initialization completed" after all nodes have learned that its tokens are normal.

Or do you have something else (maybe simpler) in mind @tgrabiec?
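
To make the above concrete, here is a minimal sketch of the proposed flow (purely illustrative: apart from write_both_read_new and the proposed use_new, the names and structure are hypothetical, not the topology coordinator's actual code):

```python
# Purely illustrative sketch of the proposed bootstrap tail; the helper names
# (group0, request, global_read_barrier, mark_completed) are hypothetical
# placeholders, not Scylla's actual code.
from enum import Enum, auto

class TransitionState(Enum):
    WRITE_BOTH_READ_NEW = auto()
    USE_NEW = auto()  # proposed: the booting node's tokens are already normal here

async def coordinator_step(state, group0, request):
    if state is TransitionState.WRITE_BOTH_READ_NEW:
        # ...existing work; when done, promote the booting node's tokens to
        # normal and move to the new state instead of finishing immediately.
        return TransitionState.USE_NEW
    if state is TransitionState.USE_NEW:
        # The only work in this state: make sure every node has applied the
        # command that made the tokens normal...
        await group0.global_read_barrier()
        # ...and only then finish the saga, so the bootstrapping node (which
        # waits for the request to complete) prints "initialization completed"
        # after everyone has learned that its tokens are normal.
        await request.mark_completed()
        return None  # transition_state removed, topology request completed
```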

@kbr-scylla
Contributor Author

> I added an API for that in 190bdc3 (not merged yet).

Looks suspiciously similar to "wait for gossip to settle"... which we disabled...

@kbr-scylla
Contributor Author

A potential alternative to introducing barriers etc. would be to bring back "wait for gossip to settle" at the end of the node startup procedure, before it prints "initialization completed".

If we want to speed things up, we could introduce a "finish early" condition to this wait. For example, we could send RPCs to all nodes to check whether they already see us as UP and NORMAL, and if all of those RPCs finish, we finish the wait, even if gossip is theoretically not settled yet. In a healthy network this would finish almost instantaneously.
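
A rough sketch of that "finish early" wait, assuming a hypothetical per-peer sees_as_up_and_normal RPC and taking the old gossip-settle wait as a fallback passed in by the caller:

```python
# Rough sketch of the "finish early" condition; sees_as_up_and_normal() and
# the gossip_settle_fallback coroutine are hypothetical placeholders.
import asyncio

async def wait_until_cluster_sees_us(peers, my_host_id, gossip_settle_fallback,
                                     rpc_timeout=30):
    async def peer_acks(peer):
        # Hypothetical RPC: "do you already see my_host_id as UP and NORMAL?"
        return await peer.sees_as_up_and_normal(my_host_id)

    try:
        # Happy path: every peer answers positively; in a healthy network this
        # finishes almost instantaneously and we skip waiting for gossip.
        acks = await asyncio.wait_for(
            asyncio.gather(*(peer_acks(p) for p in peers)), rpc_timeout)
        if all(acks):
            return
    except asyncio.TimeoutError:
        pass
    # Some peer didn't answer (or answered "no"): fall back to the old
    # "wait for gossip to settle" behaviour.
    await gossip_settle_fallback()
```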

@kostja
Contributor

kostja commented May 15, 2024

> A potential alternative to introducing barriers etc. would be to bring back "wait for gossip to settle" at the end of the node startup procedure, before it prints "initialization completed".
>
> If we want to speed things up, we could introduce a "finish early" condition to this wait. For example, we could send RPCs to all nodes to check whether they already see us as UP and NORMAL, and if all of those RPCs finish, we finish the wait, even if gossip is theoretically not settled yet. In a healthy network this would finish almost instantaneously.

I am totally OK with an RPC, which is very similar to the happy path of the read barrier. I think we should avoid going back to gossip.

@kbr-scylla kbr-scylla added the P2 High Priority label May 16, 2024