
[Bug]: KafkaRebalance resource reconciliation gets stuck if CC pod is restarted during rebalance #10091

Open
urbandan opened this issue May 9, 2024 · 19 comments

@urbandan
Contributor

urbandan commented May 9, 2024

Bug Description

Encountered a weird behaviour in 0.40: if the CC pod is restarted in the middle of a rebalance, reconciliation of the KafkaRebalance resource gets stuck, with the following cluster operator logs:

2024-05-09 09:27:54 INFO KafkaRebalanceAssemblyOperator:1018 - Reconciliation #184(kafkarebalance-watch) KafkaRebalance(strimzi/my-rebalance): Getting Cruise Control rebalance user task status
2024-05-09 09:27:54 ERROR ContextImpl: - Unhandled exception
java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0
    at jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64) ~[?:?]
    at jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70) ~[?:?]
    at jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266) ~[?:?]
    at java.util.Objects.checkIndex(Objects.java:361) ~[?:?]
    at java.util.ArrayList.get(ArrayList.java:427) ~[?:?]
    at io.vertx.core.json.JsonArray.getJsonObject(JsonArray.java:243) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.strimzi.operator.cluster.operator.resource.cruisecontrol.CruiseControlApiImpl.lambda$getUserTaskStatus$14(CruiseControlApiImpl.java:328) ~[io.strimzi.cluster-operator-0.40.0.1.0.0-b274.jar:0.40.0.1.0.0-b274]
    at io.vertx.core.impl.future.FutureImpl$1.onSuccess(FutureImpl.java:91) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.impl.future.FutureBase.lambda$emitSuccess$0(FutureBase.java:60) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.impl.ContextImpl.execute(ContextImpl.java:305) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.impl.ContextImpl.execute(ContextImpl.java:295) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.impl.future.FutureBase.emitSuccess(FutureBase.java:57) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.impl.future.FutureImpl.tryComplete(FutureImpl.java:246) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.http.impl.HttpEventHandler.handleEnd(HttpEventHandler.java:79) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.http.impl.HttpClientResponseImpl.handleEnd(HttpClientResponseImpl.java:250) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.http.impl.Http1xClientConnection$StreamImpl.lambda$new$0(Http1xClientConnection.java:418) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.streams.impl.InboundBuffer.handleEvent(InboundBuffer.java:255) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.streams.impl.InboundBuffer.write(InboundBuffer.java:134) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.http.impl.Http1xClientConnection$StreamImpl.handleEnd(Http1xClientConnection.java:705) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.impl.ContextImpl.execute(ContextImpl.java:320) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.impl.ContextImpl.execute(ContextImpl.java:300) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.http.impl.Http1xClientConnection.handleResponseEnd(Http1xClientConnection.java:941) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.http.impl.Http1xClientConnection.handleHttpMessage(Http1xClientConnection.java:811) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.http.impl.Http1xClientConnection.handleMessage(Http1xClientConnection.java:775) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.net.impl.ConnectionBase.read(ConnectionBase.java:159) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.vertx.core.net.impl.VertxHandler.channelRead(VertxHandler.java:153) ~[io.vertx.vertx-core-4.5.4.jar:4.5.4]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.fireChannelRead(CombinedChannelDuplexHandler.java:436) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346) ~[io.netty.netty-codec-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318) ~[io.netty.netty-codec-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.CombinedChannelDuplexHandler.channelRead(CombinedChannelDuplexHandler.java:251) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475) ~[io.netty.netty-handler-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1338) ~[io.netty.netty-handler-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1387) ~[io.netty.netty-handler-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529) ~[io.netty.netty-codec-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468) ~[io.netty.netty-codec-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) ~[io.netty.netty-codec-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) ~[io.netty.netty-transport-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) ~[io.netty.netty-common-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[io.netty.netty-common-4.1.107.Final.jar:4.1.107.Final]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty.netty-common-4.1.107.Final.jar:4.1.107.Final]
    at java.lang.Thread.run(Thread.java:840) ~[?:?]
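
For reference, the failing call can be reproduced in isolation with Vert.x's JSON API. A minimal sketch (the class name is made up and the JSON is hand-built rather than coming from Cruise Control) of why an empty userTasks array breaks the unguarded getJsonObject(0) access in CruiseControlApiImpl.getUserTaskStatus:

import io.vertx.core.json.JsonArray;
import io.vertx.core.json.JsonObject;

public class EmptyUserTasksRepro {
    public static void main(String[] args) {
        // Shape of a user_tasks response after a CC restart: the list exists but is empty
        JsonObject response = new JsonObject().put("userTasks", new JsonArray());

        // Same unguarded pattern as in getUserTaskStatus: getJsonObject(0) on an empty
        // JsonArray delegates to ArrayList.get(0) and throws IndexOutOfBoundsException
        JsonObject firstTask = response.getJsonArray("userTasks").getJsonObject(0);
        System.out.println(firstTask);
    }
}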

Steps to reproduce

  1. Create a Kafka cluster with 1 node pool, 3 brokers and Cruise Control
  2. Create an imbalanced state, e.g. 1 topic with 120 partitions, RF=2 and all replicas on brokers 0 and 1 (see the sketch after this list); just make sure that rebalancing from this state would take at least a few minutes
  3. Trigger a rebalance
  4. After the rebalance has started, scale up the node pool to 4
  5. Observe that the CC pod gets restarted in the middle of the rebalance because of the new broker
  6. Observe how the reconciliation of the KafkaRebalance resource gets stuck (CC does not persist tasks)
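
For step 2, one way to pin every replica onto brokers 0 and 1 is an explicit replica assignment through the Kafka Admin client. A rough sketch, assuming a plain listener is reachable at the address below (topic name and bootstrap address are placeholders):

import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateImbalancedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "my-cluster-kafka-bootstrap:9092");

        // 120 partitions, RF=2, every replica pinned to brokers 0 and 1 so the later
        // rebalance has enough data movement to run for a few minutes
        Map<Integer, List<Integer>> assignment = IntStream.range(0, 120)
                .boxed()
                .collect(Collectors.toMap(p -> p, p -> List.of(0, 1)));

        try (Admin admin = Admin.create(props)) {
            admin.createTopics(List.of(new NewTopic("imbalanced-topic", assignment))).all().get();
        }
    }
}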

Expected behavior

No response

Strimzi version

0.40.0

Kubernetes version

v1.28.7+k3s1

Installation method

helm

Infrastructure

Bare-metal

Configuration files and logs

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: pool-a
  labels:
    strimzi.io/cluster: my-cluster
spec:
  replicas: 3
  roles:
    - broker
  storage:
    type: ephemeral
---
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
  annotations:
    strimzi.io/node-pools: enabled
spec:
  kafka:
    version: 3.7.0
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      inter.broker.protocol.version: "3.7"
  zookeeper:
    replicas: 3
    storage:
      type: ephemeral
  entityOperator:
    topicOperator: {}
    userOperator: {}
  cruiseControl: {}
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: my-rebalance
  labels:
    strimzi.io/cluster: my-cluster
# no goals specified, using the default goals from the Cruise Control configuration
spec: {}

Additional context

No response

@scholzj
Member

scholzj commented May 9, 2024

Thanks for raising the issue. Could you please format the YAMLs to make them readable? Thanks.

@urbandan
Contributor Author

urbandan commented May 9, 2024

Thanks, fixed the formatting.

@ShubhamRwt
Contributor

ShubhamRwt commented May 16, 2024

@urbandan I tried it with the latest operator (0.41.0) and was not able to reproduce this error. For me, when you increase the node pool to 4 brokers, the KafkaRebalance moves into the NotReady state, since the Kafka cluster is NotReady while the pods are still coming up, which should be what we want. Then, once the Kafka cluster is up with all pods running, the KafkaRebalance moves back to the ProposalReady state. I will try with 0.40.0 now.

@scholzj
Member

scholzj commented May 16, 2024

@ShubhamRwt Why would the Kafka cluster move to NotReady just because you scaled up the node pool to 4 nodes? That sounds like you had some other issue interfering with the reproducer.

@ShubhamRwt
Contributor

ShubhamRwt commented May 16, 2024

@scholzj I meant that when we scale up the nodes, the pods for Cruise Control and the 4th node are still coming up. That means the Kafka cluster is not ready yet, and the rebalance operator has logic that reports the Kafka cluster as not Ready if it is not up yet.

@scholzj
Member

scholzj commented May 16, 2024

No, the Kafka cluster should stay in Ready while scaling up.

@scholzj
Member

scholzj commented May 16, 2024

Also, look at it differently -> what happens if you simply delete the CC pod while a rebalance is in progress?

@scholzj
Member

scholzj commented May 16, 2024

Discussed on the community call on 16.5.2024: (Assuming this can be reproduced - see the discussion above), this should be addressed by failing the rebalance or restarting the process if possible. (Let's keep it in triage for next time to make sure it is reproducible and discuss the options)

Note: This should already be handled by the Topic Operator when changing the replication factor, where the TO detects this and automatically restarts the RF change. @fvaleri will double-check.

@fvaleri
Contributor

fvaleri commented May 16, 2024

Yeah, we have this corner case covered in the Topic Operator.

TL;DR: Cruise Control has no memory of the task it was working on before a restart, so the operator is responsible for detecting this event and resending any ongoing task.

The operator periodically calls the user_tasks endpoint with one or more Cruise Control generated User-Task-IDs, in order to check the status of pending tasks. If it gets back an empty task list, it means that Cruise Control has restarted, and the tasks may or may not have been completed. This means that the operator must reset its internal state, switching all ongoing resources back to pending, and resend them.

Note that there is a small chance that the task was completed just before Cruise Control restarted, but the operator didn't have time to learn that. In this case, the new task submission would be a duplicate. This is not a problem in practice, as the work has already been done, and the duplicated task would complete quickly (no-op).
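
A rough sketch of that check, assuming a plain HTTP call to Cruise Control's user_tasks endpoint (service name, port and task ID are placeholders, and the TLS/auth the operator normally uses is left out):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import io.vertx.core.json.JsonArray;
import io.vertx.core.json.JsonObject;

public class UserTaskPoller {
    public static void main(String[] args) throws Exception {
        String userTaskId = "previously-returned-user-task-id";  // placeholder
        URI uri = URI.create("http://my-cluster-cruise-control:9090"
                + "/kafkacruisecontrol/user_tasks?json=true&user_task_ids=" + userTaskId);

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(HttpRequest.newBuilder(uri).GET().build(), HttpResponse.BodyHandlers.ofString());

        JsonArray tasks = new JsonObject(response.body()).getJsonArray("userTasks");
        if (tasks == null || tasks.isEmpty()) {
            // Empty list for a task we are still tracking: Cruise Control restarted and
            // forgot it, so reset the resource to pending and resend the request
            System.out.println("Task " + userTaskId + " is gone, resubmitting");
        } else {
            System.out.println("Task still tracked, keep polling its status");
        }
    }
}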

@ppatierno
Member

Triaged on 30/5/2024: it seems to be a bug that needs to be investigated and reproduced first. @ShubhamRwt let's get in touch to understand how you tried to reproduce it without success and what we can do.

@ShubhamRwt ShubhamRwt self-assigned this Jun 3, 2024
@ShubhamRwt
Contributor

ShubhamRwt commented Jun 3, 2024

Hi, I was finally able to reproduce this issue with both the latest branch and 0.40.0. I assume the issue happens as soon as the Cruise Control pod is shut down: in onRebalancing we call getUserTaskStatus, and inside the getUserTaskStatus method we have this line of code

JsonObject jsonUserTask = json.getJsonArray("userTasks").getJsonObject(0);

which tries to read the userTasks field, but since the CC pod was shut down and then restarted, I assume the field is missing or empty. I have also thought of a couple of workarounds to fix this (a rough sketch of the guarded check follows the list):

  1. If this happens, we update the status of the KafkaRebalance to NotReady, stating as the reason that the CC pod was shut down while rebalancing
  2. Another option could be that, instead of just staying in the NotReady state, once the CC pod is up again we generate the proposal again so that the KafkaRebalance moves back to ProposalReady

@scholzj @ppatierno WDYT, what should be the best approach?
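
A minimal sketch of the guarded access both options would need (hypothetical helper, not the actual Strimzi patch): return the first task if present, or empty when Cruise Control has forgotten the task, so the caller can decide between NotReady (option 1) and requesting a fresh proposal (option 2) instead of throwing:

import java.util.Optional;

import io.vertx.core.json.JsonArray;
import io.vertx.core.json.JsonObject;

public class GuardedUserTaskStatus {
    static Optional<JsonObject> firstUserTask(JsonObject json) {
        JsonArray userTasks = json.getJsonArray("userTasks");
        if (userTasks == null || userTasks.isEmpty()) {
            // Cruise Control restarted and no longer knows the User-Task-ID
            return Optional.empty();
        }
        return Optional.of(userTasks.getJsonObject(0));
    }
}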

@ppatierno
Member

I think approach 2 is in line with the suggestion made by @fvaleri and with what we have in the TO.
We have the KafkaRebalance, and its status carries the user task ID; if checking the status with it returns an empty result, it's likely CC was restarted, so we should try to re-issue the request. The only thing I am not sure about is whether you can re-issue the rebalance directly or whether you need to go through a new proposal first, effectively resetting the state machine. @kyguy what's your take on this?

@ShubhamRwt
Contributor

Hi @ppatierno, I think the best way would be to ask for a new proposal, since we have now added a new broker. Regarding re-issuing the rebalance, would that mean rebalancing based on the previous proposal, which was generated with fewer brokers (in case we can even do that)?

@fvaleri
Contributor

fvaleri commented Jun 5, 2024

@ShubhamRwt when scaling up, CC's internal model will take some time to pick up the new broker's metrics, so you will likely get a transient error, but this should already be handled.

If the spec changed, I think you have to stop the current rebalance, then delete and recreate the resource. Otherwise, you can simply refresh the existing one.

@scholzj
Member

scholzj commented Jun 5, 2024

Hi @ppatierno, I think the best way would be to ask for a new proposal, since we have now added a new broker. Regarding re-issuing the rebalance, would that mean rebalancing based on the previous proposal, which was generated with fewer brokers (in case we can even do that)?

I think requesting a new proposal sounds reasonable. But you should consider it from a wider perspective. The scale-up described in this issue is just one scenario where it happens. But I assume the same will happen if the CC Pod is just restarted or evicted during the rebalance.

@urbandan
Contributor Author

urbandan commented Jun 5, 2024

I think that if the original proposal was reviewed and manually triggered, an automatic new proposal + trigger is counter-intuitive. Especially if the proposal is significantly different than the previously reviewed one.
For auto-approved proposals, the automatic trigger sounds reasonable.

@scholzj
Member

scholzj commented Jun 5, 2024

I think that if the original proposal was reviewed and manually triggered, an automatic new proposal + trigger is counter-intuitive. Especially if the proposal is significantly different than the previously reviewed one. For auto-approved proposals, the automatic trigger sounds reasonable.

Yeah, I definitely agree with this. If there is a new proposal, it would need a new approval (unless the auto-approval is enabled).

@ppatierno
Member

+1 for a new proposal, but I agree with Jakub. Scaling up is just one use case; the fix should apply to other situations as well (like CC restarting).

@kyguy
Member

kyguy commented Jun 5, 2024

The scale-up described in this issue is just one scenario where it happens. But I assume the same will happen if the CC Pod is just restarted or evicted during the rebalance.

Until CC can handle dynamic configuration updates without restarting, or until we can devise a simple method of determining the cause of a CC restart (whether it is due to a scaling operation or a CC crash/eviction), it may make sense to just recover from all CC restarts in the same way (assuming the performance costs are reasonable). From what is discussed above, we appear to have two options for recovering from a CC restart:

(A) Request a new proposal

PROS: 
     - Based on latest cluster model, including all brokers
CONS:
     - Slower than reusing existing proposal of `KafkaRebalance` resource

(B) Resend existing proposal of KafkaRebalance resource

PROS: 
     - Faster than requesting a new proposal
CONS:
     - Existing proposal may be stale due to the reason or duration of the CC restart

Given that Cruise Control's window samples persist in Kafka, the delay for proposal generation shouldn't be long. Another thing worth thinking about is what we would do if CC did support dynamic configuration updates without a restart: would we stop an ongoing rebalance because a new broker was added? What about one being removed? Should scaling operations be blocked while there is an ongoing rebalance?

I think that if the original proposal was reviewed and manually triggered, an automatic new proposal + trigger is counter-intuitive. Especially if the proposal is significantly different than the previously reviewed one.
For auto-approved proposals, the automatic trigger sounds reasonable.

If there is a new proposal, it would need a new approval (unless the auto-approval is enabled).

+1 sounds reasonable to me
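
To summarize where the thread seems to be converging, a compact sketch of the recovery policy (hypothetical types, not Strimzi's actual state machine): on a detected CC restart, always request a new proposal, and only execute it automatically when auto-approval is enabled on the KafkaRebalance:

public class RestartRecoveryPolicy {

    enum Action { REQUEST_NEW_PROPOSAL, RESEND_EXISTING_PROPOSAL }
    enum Next { AUTO_REBALANCE, WAIT_FOR_APPROVAL }

    // Option (A): after a CC restart is detected (empty userTasks for a tracked task),
    // always ask for a fresh proposal so the plan reflects the current cluster,
    // e.g. the newly added broker
    static Action onCruiseControlRestart() {
        return Action.REQUEST_NEW_PROPOSAL;
    }

    // A regenerated proposal is only executed automatically when auto-approval is
    // enabled on the KafkaRebalance; otherwise it waits in ProposalReady for a manual approve
    static Next afterNewProposal(boolean autoApprovalEnabled) {
        return autoApprovalEnabled ? Next.AUTO_REBALANCE : Next.WAIT_FOR_APPROVAL;
    }
}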
