
[Bug]: KafkaRebalance resource reconciliation gets stuck if CC pod is restarted during rebalance #10091

Open
urbandan opened this issue May 9, 2024 · 19 comments

@urbandan
Contributor

urbandan commented May 9, 2024

Bug Description

Encountered a weird behaviour in 0.40: if the CC pod is restarted in the middle of a rebalance, reconciliation of the KafkaRebalance resource gets stuck, with the following cluster operator logs:

2024-05-09 09:27:54 INFO KafkaRebalanceAssemblyOperator:1018 - Reconciliation #184(kafkarebalance-watch) KafkaRebalance(strimzi/my-rebalance): Getting Cruise Control rebalance user task status
2024-05-09 09:27:54 ERROR ContextImpl: - Unhandled exception
java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0
    at jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64) ~[?:?]
    at jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70) ~[?:?]
    at jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266) ~[?:?]
    at java.util.Objects.checkIndex(Objects.java:361) ~[?:?]
    at java.util.ArrayList.get(ArrayList.java:427) ~[?:?]
    at io.vertx.core.json.JsonArray.getJsonObject(JsonArray.java:243) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.strimzi.operator.cluster.operator.resource.cruisecontrol.CruiseControlApiImpl.lambda$getUserTaskStatus$14(CruiseControlApiImpl.java:328) ~[io.strimzi.cluster-operator-0.40.0.1.0.0-b274.jar:0.40.0.1.0.0-b274]
    at io.vertx.core.impl.future.FutureImpl$1.onSuccess(FutureImpl.java:91) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.impl.future.FutureBase.lambda$emitSuccess$0(FutureBase.java:60) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.impl.ContextImpl.execute(ContextImpl.java:305) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.impl.ContextImpl.execute(ContextImpl.java:295) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:57) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:246) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.http.impl.HttpEventHandler.handleEnd(HttpEventHandler.java:79) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.http.impl.HttpClientResponseImpl.handleEnd(HttpClientResponseImpl.java:250) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.http.impl.Http1xClientConnection$StreamImpl.lambda$new$0(Http1xClientConnection.java:418) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.streams.impl.InboundBuffer.handleEvent(InboundBuffer.java:255) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.streams.impl.InboundBuffer.write(InboundBuffer.java:134) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.http.impl.Http1xClientConnection$StreamImpl.handleEnd(Http1xClientConnection.java:705) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.impl.ContextImpl.execute(ContextImpl.java:320) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.impl.ContextImpl.execute(ContextImpl.java:300) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.http.impl.Http1xClientConnection.handleResponseEnd(Http1xClientConnection.java:941) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.http.impl.Http1xClientConnection.handleHttpMessage(Http1xClientConnection.java:811) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.http.impl.Http1xClientConnection.handleMessage(Http1xClientConnection.java:775) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.net.impl.ConnectionBase.read(ConnectionBase.java:159) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.net.impl.VertxHandler.channelRead(VertxHandler.java:153) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:436) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) ~[io.netty.netty-codec-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318) ~[io.netty.netty-codec-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475) ~[io.netty.netty-handler-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1338) ~[io.netty.netty-handler-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1387) ~[io.netty.netty-handler-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529) ~[io.netty.netty-codec-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468) ~[io.netty.netty-codec-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) ~[io.netty.netty-codec-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[io.netty.netty-common-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.netty.netty-common-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty.netty-common-4.1.107.Final.jar:4.1.107.Final]
    at java.lang.Thread.run(Thread.java:840) ~[?:?]
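
For reference, the failing call can be reproduced in isolation with Vert.x's JSON API. A minimal sketch (the class name is made up and the JSON is hand-built rather than coming from Cruise Control) of why an empty userTasks array breaks the unguarded getJsonObject(0) access in CruiseControlApiImpl.getUserTaskStatus:

import io.vertx.core.json.JsonArray;
import io.vertx.core.json.JsonObject;

public class EmptyUserTasksRepro {
    public static void main(String[] args) {
        // Shape of a user_tasks response after a CC restart: the list exists but is empty
        JsonObject response = new JsonObject().put("userTasks", new JsonArray());

        // Same unguarded pattern as in getUserTaskStatus: getJsonObject(0) on an empty
        // JsonArray delegates to ArrayList.get(0) and throws IndexOutOfBoundsException
        JsonObject firstTask = response.getJsonArray("userTasks").getJsonObject(0);
        System.out.println(firstTask);
    }
}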

Steps to reproduce

  1. Create a Kafka cluster with 1 node pool, 3 brokers and Cruise Control
  2. Create an imbalanced state, e.g. 1 topic with 120 partitions, RF=2 and all replicas on brokers 0 and 1 (see the sketch after this list); just make sure that rebalancing from this state would take at least a few minutes
  3. Trigger a rebalance
  4. After the rebalance has started, scale up the node pool to 4
  5. Observe that the CC pod gets restarted in the middle of the rebalance because of the new broker
  6. Observe how the reconciliation of the KafkaRebalance resource gets stuck (CC does not persist tasks)
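
For step 2, one way to pin every replica onto brokers 0 and 1 is an explicit replica assignment through the Kafka Admin client. A rough sketch, assuming a plain listener is reachable at the address below (topic name and bootstrap address are placeholders):

import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateImbalancedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "my-cluster-kafka-bootstrap:9092");

        // 120 partitions, RF=2, every replica pinned to brokers 0 and 1 so the later
        // rebalance has enough data movement to run for a few minutes
        Map<Integer, List<Integer>> assignment = IntStream.range(0, 120)
                .boxed()
                .collect(Collectors.toMap(p -> p, p -> List.of(0, 1)));

        try (Admin admin = Admin.create(props)) {
            admin.createTopics(List.of(new NewTopic("imbalanced-topic", assignment))).all().get();
        }
    }
}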

Expected behavior

No response

Strimzi version

0.40.0

Kubernetes version

v1.28.7+k3s1

Installation method

helm

Infrastructure

Bare-metal

Configuration files and logs

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: pool-a
  labels:
    strimzi.io/cluster: my-cluster
spec:
  replicas: 3
  roles:
    - broker
  storage:
    type: ephemeral
---
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
  annotations:
    strimzi.io/node-pools: enabled
spec:
  kafka:
    version: 3.7.0
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      inter.broker.protocol.version: "3.7"
  zookeeper:
    replicas: 3
    storage:
      type: ephemeral
  entityOperator:
    topicOperator: {}
    userOperator: {}
  cruiseControl: {}
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance
  labels:
    strimzi.io/cluster: my-cluster
# no goals specified, using the default goals from the Cruise Control configuration
spec: {}

Additional context

No response

@scholzj
Member

scholzj commented May 9, 2024

Thanks for raising the issue. Could you please format the YAMLs to make them readable? Thanks.

@urbandan
Contributor Author

urbandan commented May 9, 2024

Thanks, fixed the formatting.

@ShubhamRwt
Contributor

ShubhamRwt commented May 16, 2024

@urbandan I tried it with the latest operator (0.41.0) and was not able to reproduce this error. For me, when you increase the node pool to 4 brokers, the KafkaRebalance moves into the NotReady state, since the Kafka cluster is NotReady while the pods are still coming up, which should be what we want. Then, once the Kafka cluster is up with all pods running, the KafkaRebalance moves back to the ProposalReady state. I will try with 0.40.0 now.

@scholzj
Member

scholzj commented May 16, 2024

@ShubhamRwt Why would the Kafka cluster move to NotReady just because you scaled up the node pool to 4 nodes? That sounds like you had some other issue interfering with the reproducer.

@ShubhamRwt
Contributor

ShubhamRwt commented May 16, 2024

@scholzj I meant that when we scale up the nodes, the pods for Cruise Control and the 4th node are still coming up. That means the Kafka cluster is not ready yet, and the rebalance operator has logic that reports the Kafka cluster as not Ready if it is not up yet.

@scholzj
Member

scholzj commented May 16, 2024

No, the Kafka cluster should stay in Ready while scaling up.

@scholzj
Member

scholzj commented May 16, 2024

Also, look at it differently -> what happens if you simply delete the CC pod while a rebalance is in progress?

@scholzj
Member

scholzj commented May 16, 2024

Discussed on the community call on 16.5.2024: (Assuming this can be reproduced - see the discussion above), this should be addressed by failing the rebalance or restarting the process if possible. (Let's keep it in triage for next time to make sure it is reproducible and discuss the options)

Note: This should already be handled by the Topic Operator when changing the replication factor, where the TO detects this and automatically restarts the RF change. @fvaleri will double-check.

@fvaleri
Contributor

fvaleri commented May 16, 2024

Yeah, we have this corner case covered in the Topic Operator.

TL;DR: Cruise Control has no memory of the task it was working on before a restart, so the operator is responsible for detecting this event and resending any ongoing task.

The operator periodically calls the user_tasks endpoint with one or more Cruise Control generated User-Task-IDs, in order to check the status of pending tasks. If it gets back an empty task list, it means that Cruise Control has restarted, and the tasks may or may not have been completed. This means that the operator must reset its internal state, switching all ongoing resources back to pending, and resend them.

Note that there is a small chance that the task was completed just before Cruise Control restarted, but the operator didn't have time to learn that. In this case, the new task submission would be a duplicate. This is not a problem in practice, as the work has already been done, and the duplicated task would complete quickly (no-op).
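
A rough sketch of that check, assuming a plain HTTP call to Cruise Control's user_tasks endpoint (service name, port and task ID are placeholders, and the TLS/auth the operator normally uses is left out):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import io.vertx.core.json.JsonArray;
import io.vertx.core.json.JsonObject;

public class UserTaskPoller {
    public static void main(String[] args) throws Exception {
        String userTaskId = "previously-returned-user-task-id";  // placeholder
        URI uri = URI.create("http://my-cluster-cruise-control:9090"
                + "/kafkacruisecontrol/user_tasks?json=true&user_task_ids=" + userTaskId);

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(HttpRequest.newBuilder(uri).GET().build(), HttpResponse.BodyHandlers.ofString());

        JsonArray tasks = new JsonObject(response.body()).getJsonArray("userTasks");
        if (tasks == null || tasks.isEmpty()) {
            // Empty list for a task we are still tracking: Cruise Control restarted and
            // forgot it, so reset the resource to pending and resend the request
            System.out.println("Task " + userTaskId + " is gone, resubmitting");
        } else {
            System.out.println("Task still tracked, keep polling its status");
        }
    }
}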

@ppatierno
Member

Triaged on 30/5/2024: it seems to be a bug that needs to be investigated and reproduced first. @ShubhamRwt let's get in touch to understand how you tried to reproduce it without success and what we can do.

@ShubhamRwt ShubhamRwt self-assigned this Jun 3, 2024
@ShubhamRwt
Contributor

ShubhamRwt commented Jun 3, 2024

Hi, I was finally able to reproduce this issue with both the latest branch and 0.40.0. I assume the issue happens as soon as the Cruise Control pod is shut down: in onRebalancing we call getUserTaskStatus, and inside the getUserTaskStatus method we have this line of code

JsonObject jsonUserTask = json.getJsonArray("userTasks").getJsonObject(0);

which tries to read the userTasks field, but since the CC pod was shut down and then restarted, I assume the field is missing or empty. I have also thought of a couple of workarounds to fix this (a rough sketch of the guarded check follows the list):

  1. If this happens, we update the status of the KafkaRebalance to NotReady, stating as the reason that the CC pod was shut down while rebalancing
  2. Another option could be that, instead of just staying in the NotReady state, once the CC pod is up again we generate the proposal again so that the KafkaRebalance moves back to ProposalReady

@scholzj @ppatierno WDYT, what should be the best approach?
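
A minimal sketch of the guarded access both options would need (hypothetical helper, not the actual Strimzi patch): return the first task if present, or empty when Cruise Control has forgotten the task, so the caller can decide between NotReady (option 1) and requesting a fresh proposal (option 2) instead of throwing:

import java.util.Optional;

import io.vertx.core.json.JsonArray;
import io.vertx.core.json.JsonObject;

public class GuardedUserTaskStatus {
    static Optional<JsonObject> firstUserTask(JsonObject json) {
        JsonArray userTasks = json.getJsonArray("userTasks");
        if (userTasks == null || userTasks.isEmpty()) {
            // Cruise Control restarted and no longer knows the User-Task-ID
            return Optional.empty();
        }
        return Optional.of(userTasks.getJsonObject(0));
    }
}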

@ppatierno
Member

I think approach 2 is in line with the suggestion made by @fvaleri and with what we have in the TO.
We have the KafkaRebalance, and its status carries the user task ID; if checking the status with it returns an empty result, it's likely CC was restarted, so we should try to re-issue the request. The only thing I am not sure about is whether you can re-issue the rebalance directly or whether you need to go through a new proposal first, effectively resetting the state machine. @kyguy what's your take on this?

@ShubhamRwt
Contributor

Hi @ppatierno, I think the best way would be to ask for a new proposal, since we have now added a new broker. Regarding re-issuing the rebalance, would that mean rebalancing based on the previous proposal, which was generated with fewer brokers (in case we can even do that)?

@fvaleri
Contributor

fvaleri commented Jun 5, 2024

@ShubhamRwt when scaling up, CC's internal model will take some time to pick up the new broker's metrics, so you will likely get a transient error, but this should already be handled.

If the spec changed, I think you have to stop the current rebalance, then delete and recreate the resource. Otherwise, you can simply refresh the existing one.

@scholzj
Member

scholzj commented Jun 5, 2024

Hi @ppatierno, I think the best way would be to ask for a new proposal, since we have now added a new broker. Regarding re-issuing the rebalance, would that mean rebalancing based on the previous proposal, which was generated with fewer brokers (in case we can even do that)?

I think requesting a new proposal sounds reasonable. But you should consider it from a wider perspective. The scale-up described in this issue is just one scenario where it happens. But I assume the same will happen if the CC Pod is just restarted or evicted during the rebalance.

@urbandan
Contributor Author

urbandan commented Jun 5, 2024

I think that if the original proposal was reviewed and manually triggered, an automatic new proposal + trigger is counter-intuitive. Especially if the proposal is significantly different than the previously reviewed one.
For auto-approved proposals, the automatic trigger sounds reasonable.

@scholzj
Member

scholzj commented Jun 5, 2024

I think that if the original proposal was reviewed and manually triggered, an automatic new proposal + trigger is counter-intuitive. Especially if the proposal is significantly different than the previously reviewed one. For auto-approved proposals, the automatic trigger sounds reasonable.

Yeah, I definitely agree with this. If there is a new proposal, it would need a new approval (unless the auto-approval is enabled).

@ppatierno
Member

+1 for a new proposal, but I agree with Jakub. Scaling up is just one use case; the fix should apply to other situations as well (like CC restarting).

@kyguy
Member

kyguy commented Jun 5, 2024

The scale-up described in this issue is just one scenario where it happens. But I assume the same will happen if the CC Pod is just restarted or evicted during the rebalance.

Until CC can handle dynamic configuration updates without restarting, or until we can devise a simple method of determining the cause of a CC restart (whether it is due to a scaling operation or a CC crash/eviction), it may make sense to just recover from all CC restarts in the same way (assuming the performance costs are reasonable). From what is discussed above, we appear to have two options for recovering from a CC restart:

(A) Request a new proposal

PROS: 
     - Based on latest cluster model, including all brokers
CONS:
     - Slower than reusing existing proposal of `KafkaRebalance` resource

(B) Resend existing proposal of KafkaRebalance resource

PROS: 
     - Faster than requesting a new proposal
CONS:
     - Existing proposal may be stale due to the reason or duration of the CC restart

Given that Cruise Control's window samples persist in Kafka, the delay for proposal generation shouldn't be long. Another thing worth thinking about is what we would do if CC did support dynamic configuration updates without a restart: would we stop an ongoing rebalance because a new broker was added? What about one being removed? Should scaling operations be blocked while there is an ongoing rebalance?

I think that if the original proposal was reviewed and manually triggered, an automatic new proposal + trigger is counter-intuitive. Especially if the proposal is significantly different than the previously reviewed one.
For auto-approved proposals, the automatic trigger sounds reasonable.

If there is a new proposal, it would need a new approval (unless the auto-approval is enabled).

+1 sounds reasonable to me
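
To summarize where the thread seems to be converging, a compact sketch of the recovery policy (hypothetical types, not Strimzi's actual state machine): on a detected CC restart, always request a new proposal, and only execute it automatically when auto-approval is enabled on the KafkaRebalance:

public class RestartRecoveryPolicy {

    enum Action { REQUEST_NEW_PROPOSAL, RESEND_EXISTING_PROPOSAL }
    enum Next { AUTO_REBALANCE, WAIT_FOR_APPROVAL }

    // Option (A): after a CC restart is detected (empty userTasks for a tracked task),
    // always ask for a fresh proposal so the plan reflects the current cluster,
    // e.g. the newly added broker
    static Action onCruiseControlRestart() {
        return Action.REQUEST_NEW_PROPOSAL;
    }

    // A regenerated proposal is only executed automatically when auto-approval is
    // enabled on the KafkaRebalance; otherwise it waits in ProposalReady for a manual approve
    static Next afterNewProposal(boolean autoApprovalEnabled) {
        return autoApprovalEnabled ? Next.AUTO_REBALANCE : Next.WAIT_FOR_APPROVAL;
    }
}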
