
[KRaft] Scaling of controller nodes #9429

Open

scholzj opened this issue Dec 4, 2023 · 3 comments
scholzj (Member) commented Dec 4, 2023

Scaling of KRaft controller-only nodes currently doesn't seem to work. The situations and issues are described below:

Scale-up

The scale-up seems to currently work like this:

  1. The controller node pool is scaled up
  2. First, the new controller nodes are started with the new voter configuration
  3. Next, the old controller nodes are rolled with the new voter configuration
  4. As the old controller nodes are rolled, the controller leadership shifts, and it seems to typically end up on one of the new nodes.
    At that point, all broker nodes seem to fail with the following error, as they have not been rolled yet and do not know the new leader controller:
     2023-12-04 14:16:50,669 ERROR Encountered fatal fault: Unexpected error in raft IO thread (org.apache.kafka.server.fault.ProcessTerminatingFaultHandler) [kafka-2001-raft-io-thread]
     java.lang.IllegalStateException: Cannot transition to Follower with leaderId=4 and epoch=48 since it is not one of the voters [0, 1, 2]
         at org.apache.kafka.raft.QuorumState.transitionToFollower(QuorumState.java:382)
         at org.apache.kafka.raft.KafkaRaftClient.transitionToFollower(KafkaRaftClient.java:522)
         at org.apache.kafka.raft.KafkaRaftClient.maybeTransition(KafkaRaftClient.java:1575)
         at org.apache.kafka.raft.KafkaRaftClient.maybeHandleCommonResponse(KafkaRaftClient.java:1532)
         at org.apache.kafka.raft.KafkaRaftClient.handleFetchResponse(KafkaRaftClient.java:1113)
         at org.apache.kafka.raft.KafkaRaftClient.handleResponse(KafkaRaftClient.java:1609)
         at org.apache.kafka.raft.KafkaRaftClient.handleInboundMessage(KafkaRaftClient.java:1735)
         at org.apache.kafka.raft.KafkaRaftClient.poll(KafkaRaftClient.java:2310)
         at kafka.raft.KafkaRaftManager$RaftIoThread.doWork(RaftManager.scala:64)
         at org.apache.kafka.server.util.ShutdownableThread.run(ShutdownableThread.java:130)
    
  5. After rolling all the controllers, the operator rolls the brokers to introduce the new voter configuration, which gets them working properly again.

This worked for me this way several times, for scaling from 3->5 and from 3->4->5.
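
The root cause is the static controller.quorum.voters list. A minimal sketch of the mismatch (the property name and the id@host:port format are real Kafka configuration; the host names and port are made up for illustration):

    # What the not-yet-rolled brokers still have:
    controller.quorum.voters=0@controller-0:9090,1@controller-1:9090,2@controller-2:9090

    # What the already-rolled controllers have after scaling 3->5:
    controller.quorum.voters=0@controller-0:9090,1@controller-1:9090,2@controller-2:9090,3@controller-3:9090,4@controller-4:9090

When the leadership lands on node 3 or 4, a broker that still knows only voters [0, 1, 2] cannot follow it, which is exactly the IllegalStateException above.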

I guess we could improve this by adding the new nodes only after rolling the existing nodes? E.g.:

  1. Do the rolling => roll the controllers to expect more voters and roll the brokers to expect more voters
  2. Add the new nodes

If that is done in steps without breaking the quorum, it will likely work without causing the error in the broker nodes?
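
In terms of the same voter property, the proposed ordering would look roughly like this (a sketch, not operator code; the host names are again made up):

    # Phase 1: roll controllers 0-2 and all brokers with the extended voter list,
    # while the pods for nodes 3 and 4 do not exist yet:
    controller.quorum.voters=0@controller-0:9090,1@controller-1:9090,2@controller-2:9090,3@controller-3:9090,4@controller-4:9090

    # Phase 2: only then create controller nodes 3 and 4 with the same voter list.

Whether the existing nodes tolerate a voter list with entries that are not resolvable yet is exactly the part that would need testing.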

Scaling up mixed nodes seems to behave similarly. But because the brokers are controllers as well, they do not fail with the error seen with dedicated controller nodes.

Scale-down

Scale-down currently works like this (in theory only):

  1. Remove the controllers to be scaled down
  2. Roll the remaining controllers to reconfigure the voters
  3. Roll the brokers to reconfigure the voters

Scale-down from 5->4 nodes seems to work fine by following the steps above as it does not break the quorum. However, scale-down from 4->3 nodes will break the quorum:

  1. You remove the 4th controller
  2. You need to roll the remaining 3 controllers
  3. But they are configured with 4 voters, so the quorum needs 3 nodes online. As we have only 3 nodes left, we cannot shut down any of them to roll it and update the voters to 3 (see the arithmetic below).
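
The quorum arithmetic behind step 3 (a worked example, not Strimzi code):

    majority(n) = floor(n / 2) + 1
    majority(4) = 3   # all 3 remaining controllers must stay up, so none can be rolled
    majority(3) = 2   # once the voter list is back to 3, one controller can be rolled safely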

As a result, it gets stuck with the following error:

2023-12-04 17:13:06 WARN  KafkaQuorumCheck:98 - Reconciliation #1609(timer) Kafka(myproject/my-cluster): No valid lastCaughtUpTimestamp is found for controller 3
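
The warning comes from the quorum state that Kafka exposes through the Admin API. A minimal sketch of the underlying query (my own illustration, assuming a reachable bootstrap address; this is not the actual KafkaQuorumCheck code):

    import java.util.Map;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.QuorumInfo;

    public class QuorumCheckSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical bootstrap address; in Strimzi this would be the cluster's bootstrap service.
            Map<String, Object> conf = Map.of(
                    AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "my-cluster-kafka-bootstrap:9092");
            try (Admin admin = Admin.create(conf)) {
                QuorumInfo quorum = admin.describeMetadataQuorum().quorumInfo().get();
                for (QuorumInfo.ReplicaState voter : quorum.voters()) {
                    // lastCaughtUpTimestamp() is an OptionalLong; an empty value is what surfaces
                    // above as "No valid lastCaughtUpTimestamp is found for controller N".
                    System.out.printf("controller %d: lastCaughtUpTimestamp=%s%n",
                            voter.replicaId(), voter.lastCaughtUpTimestamp());
                }
            }
        }
    }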

Scale-down of mixed nodes seems to go similarly. It looks like there is a race condition: sometimes the deleted node is still seen as in-sync, and the first node is rolled. But it gets stuck on the next one; in that case, controller 3 again has no valid timestamp.

Would removing the controllers only at the end help? Can the KafkaRoller force the roll despite breaking the quorum when there is a scale-down?

Next steps

Implementing these things could be quite complicated (even assuming the delayed creation / removal of the scaled nodes really works, as I did not test it). Is it required for the GA of KRaft in Strimzi? Should we wait for KIP-853 to be implemented, which might allow us to change the controller quorum dynamically?

scholzj changed the title from [KRaft] Scaling of controller-only nodes to [KRaft] Scaling of controller nodes on Dec 4, 2023
showuon (Contributor) commented Dec 5, 2023

I agree we don't need to work around it for now because KIP-853 is still under discussion in the Kafka community.

ppatierno (Member) commented
I agree, we should wait for a proper fix in Kafka upstream. Making it work with workarounds and/or manual intervention is nothing more than a hack, and it loses all the automation that the operator should provide in an operation like scaling up/down.

Is it required for the GA of KRaft in Strimzi?

Controllers are replacing ZooKeeper nodes. I was wondering how many users need to scale a ZooKeeper ensemble up or down today; I don't think that many. Of course, right now you can scale ZooKeeper if needed. Without KIP-853 you could not scale controllers properly, so it would not be feature parity. If by GA we want to provide feature parity, we should wait; otherwise I am fine with going GA without it.

scholzj added the KRaft label on Dec 5, 2023
scholzj (Member, Author) commented Dec 14, 2023

Discussed in the community call on 14.12.: this should be fixed, but it seems to make sense to wait for KIP-853.
