
[KRaft] Scaling of controller nodes #9429

Open

scholzj opened this issue Dec 4, 2023 · 3 comments
scholzj (Member) commented Dec 4, 2023

Scaling of KRaft controller-only nodes currently doesn't seem to work. The situations and issues are described below:

Scale-up

The scale-up seems to currently work like this:

  1. The controller node pool is scaled up
  2. First, the new controller nodes are started with the new voter configuration
  3. Next, the old controller nodes are rolled with the new voter configuration
  4. As the old controller nodes are rolled, the controller leadership shifts, and it seems to typically end up on one of the new nodes.
    At that point, all broker nodes seem to fail with the following error, as they have not been rolled yet and do not know the new leader controller:
     2023-12-04 14:16:50,669 ERROR Encountered fatal fault: Unexpected error in raft IO thread (org.apache.kafka.server.fault.ProcessTerminatingFaultHandler) [kafka-2001-raft-io-thread]
     java.lang.IllegalStateException: Cannot transition to Follower with leaderId=4 and epoch=48 since it is not one of the voters [0, 1, 2]
         at org.apache.kafka.raft.QuorumState.transitionToFollower(QuorumState.java:382)
         at org.apache.kafka.raft.KafkaRaftClient.transitionToFollower(KafkaRaftClient.java:522)
         at org.apache.kafka.raft.KafkaRaftClient.maybeTransition(KafkaRaftClient.java:1575)
         at org.apache.kafka.raft.KafkaRaftClient.maybeHandleCommonResponse(KafkaRaftClient.java:1532)
         at org.apache.kafka.raft.KafkaRaftClient.handleFetchResponse(KafkaRaftClient.java:1113)
         at org.apache.kafka.raft.KafkaRaftClient.handleResponse(KafkaRaftClient.java:1609)
         at org.apache.kafka.raft.KafkaRaftClient.handleInboundMessage(KafkaRaftClient.java:1735)
         at org.apache.kafka.raft.KafkaRaftClient.poll(KafkaRaftClient.java:2310)
         at kafka.raft.KafkaRaftManager$RaftIoThread.doWork(RaftManager.scala:64)
         at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:130)
    
  5. After rolling all the controllers, the operator rolls the brokers to introduce the new voter configuration, which gets them working properly again.

This worked for me this way several times, for scaling from 3->5 and from 3->4->5.
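
The root cause is the static controller.quorum.voters list. A minimal sketch of the mismatch (the property name and the id@host:port format are real Kafka configuration; the host names and port are made up for illustration):

    # What the not-yet-rolled brokers still have:
    controller.quorum.voters=0@controller-0:9090,1@controller-1:9090,2@controller-2:9090

    # What the already-rolled controllers have after scaling 3->5:
    controller.quorum.voters=0@controller-0:9090,1@controller-1:9090,2@controller-2:9090,3@controller-3:9090,4@controller-4:9090

When the leadership lands on node 3 or 4, a broker that still knows only voters [0, 1, 2] cannot follow it, which is exactly the IllegalStateException above.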

I guess we could improve this by adding the new nodes only after rolling the existing nodes? E.g.:

  1. Do the rolling => roll the controllers to expect more voters and roll the brokers to expect more voters
  2. Add the new nodes

If that is done in steps without breaking the quorum, it will likely work without causing the error in the broker nodes?
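
In terms of the same voter property, the proposed ordering would look roughly like this (a sketch, not operator code; the host names are again made up):

    # Phase 1: roll controllers 0-2 and all brokers with the extended voter list,
    # while the pods for nodes 3 and 4 do not exist yet:
    controller.quorum.voters=0@controller-0:9090,1@controller-1:9090,2@controller-2:9090,3@controller-3:9090,4@controller-4:9090

    # Phase 2: only then create controller nodes 3 and 4 with the same voter list.

Whether the existing nodes tolerate a voter list with entries that are not resolvable yet is exactly the part that would need testing.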

Scaling up mixed nodes seems to behave similarly. But because the brokers are controllers as well, they do not fail with the error seen with dedicated controller nodes.

Scale-down

Scale-down currently works like this (in theory only):

  1. Remove the controllers to be scaled down
  2. Roll the remaining controllers to reconfigure the voters
  3. Roll the brokers to reconfigure the voters

Scale-down from 5->4 nodes seems to work fine by following the steps above as it does not break the quorum. However, scale-down from 4->3 nodes will break the quorum:

  1. You remove the 4th controller
  2. You need to roll the remaining 3 controllers
  3. But they are configured with 4 voters, so the quorum needs 3 nodes online. As we have only 3 nodes left, we cannot shut down any of them to roll it and update the voters to 3 (see the arithmetic below).
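
The quorum arithmetic behind step 3 (a worked example, not Strimzi code):

    majority(n) = floor(n / 2) + 1
    majority(4) = 3   # all 3 remaining controllers must stay up, so none can be rolled
    majority(3) = 2   # once the voter list is back to 3, one controller can be rolled safely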

As a result, it gets stuck with the following error:

2023-12-04 17:13:06 WARN  KafkaQuorumCheck:98 - Reconciliation #1609(timer) Kafka(myproject/my-cluster): No valid lastCaughtUpTimestamp is found for controller 3
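
The warning comes from the quorum state that Kafka exposes through the Admin API. A minimal sketch of the underlying query (my own illustration, assuming a reachable bootstrap address; this is not the actual KafkaQuorumCheck code):

    import java.util.Map;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.QuorumInfo;

    public class QuorumCheckSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical bootstrap address; in Strimzi this would be the cluster's bootstrap service.
            Map<String, Object> conf = Map.of(
                    AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "my-cluster-kafka-bootstrap:9092");
            try (Admin admin = Admin.create(conf)) {
                QuorumInfo quorum = admin.describeMetadataQuorum().quorumInfo().get();
                for (QuorumInfo.ReplicaState voter : quorum.voters()) {
                    // lastCaughtUpTimestamp() is an OptionalLong; an empty value is what surfaces
                    // above as "No valid lastCaughtUpTimestamp is found for controller N".
                    System.out.printf("controller %d: lastCaughtUpTimestamp=%s%n",
                            voter.replicaId(), voter.lastCaughtUpTimestamp());
                }
            }
        }
    }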

Scale-down of mixed nodes seems to go similarly. It looks like there is a race condition: sometimes the deleted node is still seen as in-sync, and the first node is rolled. But it gets stuck on the next one; in that case, controller 3 again has no valid timestamp.

Would removing the controllers only at the end help? Can the KafkaRoller force the roll despite breaking the quorum when there is a scale-down?

Next steps

Implementing these things could be quite complicated (even assuming the delayed creation / removal of the scaled nodes really works, as I did not test it). Is it required for the GA of KRaft in Strimzi? Should we wait for KIP-853 to be implemented, which might allow us to change the controller quorum dynamically?

scholzj changed the title from [KRaft] Scaling of controller-only nodes to [KRaft] Scaling of controller nodes on Dec 4, 2023
showuon (Contributor) commented Dec 5, 2023

I agree we don't need to work around it for now because KIP-853 is still under discussion in the Kafka community.

ppatierno (Member) commented
I agree, we should wait for a proper fix in Kafka upstream. Making it work with workarounds and/or manual intervention is nothing more than a hack, and it loses all the automation that the operator should provide in an operation like scaling up/down.

Is it required for the GA of KRaft in Strimzi?

Controllers are replacing ZooKeeper nodes. I was wondering how many users need to scale a ZooKeeper ensemble up or down today; I don't think that many. Of course, right now you can scale ZooKeeper if needed. Without KIP-853 you could not scale controllers properly, so it would not be feature parity. If by GA we want to provide feature parity, we should wait; otherwise I am fine with going GA without it.

scholzj added the KRaft label on Dec 5, 2023
scholzj (Member, Author) commented Dec 14, 2023

Discussed in the community call on 14.12.: this should be fixed, but it seems to make sense to wait for KIP-853.
