
[Bug]: KafkaRebalance not respecting replicationThrottle #10122

Closed
AlbertoEAF opened this issue May 16, 2024 · 7 comments
@AlbertoEAF
Bug Description

Hello,

I was triggering a KafkaRebalance in the mode add-brokers after adding one broker to have a cluster with 6 brokers.

I set a really low threshold of replicationThrottle=10 (which is apparently in bytes/s according to the source code), as shown in the description of the KafkaRebalance resource in k9s:
(screenshot: KafkaRebalance description showing replicationThrottle: 10)

however, I am observing writes to the new broker at a rate of ~140MiB/s:
(screenshot: write throughput to the new broker at ~140MiB/s)

Steps to reproduce

This is the KafkaRebalance CRD I wrote and approved at first:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: rebalance-proposal
  labels:
    strimzi.io/cluster: kafka-cluster
spec:
  # We expect to have to transfer ~800GB. At a max rate of 10MB/s (the cluster is currently operating at ~20-30MB/s), this amounts to ~22h:
  replicationThrottle: 10 # max bytes/s. By default there is no limit.  
  mode: add-brokers
  brokers: [5]  # Add 6th broker (index 5) to the cluster.

I first started with replicationThrottle=10000 (which, if the unit is bytes/s, would be only ~10KB/s), but no throttling happened, so I tried 10 and the same write throughput of 140MiB/s to the new broker happened -- that's when I understood the parameter wasn't taking effect at all.
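For reference, since replicationThrottle is specified in bytes/s, a 10 MB/s cap on the same resource would look like the sketch below (note that in this report even very low values had no effect, so the unit alone does not explain the behavior):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: rebalance-proposal
  labels:
    strimzi.io/cluster: kafka-cluster
spec:
  # replicationThrottle is in bytes/s: 10 MiB/s = 10 * 1024 * 1024
  replicationThrottle: 10485760
  mode: add-brokers
  brokers: [5]
```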

Expected behavior

The replicationThrottle parameter should be respected, meaning my new broker would write at most <replicationThrottle value> [bytes/s].

Strimzi version

0.40.0

Kubernetes version

Kubernetes v1.26.15-eks-adc7111

Installation method

Unknown, but I can ask if needed.

Infrastructure

Amazon EKS

Configuration files and logs

No response

Additional context

No response

@scholzj (Member) commented May 16, 2024

Is it a duplicate of #9972?

@AlbertoEAF (Author) commented May 16, 2024

Not sure.

But I was just now reading the docs on KafkaTopic, and indeed the default values (which I'm using) leave both leader.replication.throttled.replicas and follower.replication.throttled.replicas empty, i.e. no throttled replicas.

Does this mean I was simply misusing the system and I should set both these parameters in all my topics to "*" to benefit from this replicationThrottle?

Or it should be working already and this is not the solution?

@scholzj (Member) commented May 16, 2024

I'm not sure what exactly you mean with setting them to "*". But yes, configuring the throttling in the KafkaTopic resource might help.

@AlbertoEAF (Author) commented May 16, 2024

> I'm not sure what exactly you mean with setting them to "*". But yes, configuring the throttling in the KafkaTopic resource might help.

In the docs they state this:

follower.replication.throttled.replicas
A list of replicas for which log replication should be throttled on the follower side. The list should describe a set of replicas in the form [PartitionId]:[BrokerId],[PartitionId]:[BrokerId]:… or alternatively the wildcard ‘*’ can be used to throttle all replicas for this topic.

Shouldn't I simply set "*" on those 2 parameters on all my topics to fix this?

I'll try it in our staging environment and report back, but if you have other ideas feel free to make suggestions!
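A sketch of that idea, assuming these settings can be applied through the config section of the KafkaTopic resource (topic name and partition/replica counts are illustrative):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: my-topic  # illustrative topic name
  labels:
    strimzi.io/cluster: kafka-cluster
spec:
  partitions: 3  # illustrative
  replicas: 3    # illustrative
  config:
    # '*' throttles replication for all replicas of this topic,
    # on both the leader and the follower side
    leader.replication.throttled.replicas: "*"
    follower.replication.throttled.replicas: "*"
```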

@scholzj (Member) commented May 16, 2024

Ahh, ok ... I'm afraid I do not know how it works in detail, sorry.

@fvaleri (Contributor) commented May 17, 2024

@AlbertoEAF these throttling configs are updated dynamically by Cruise Control while the rebalance is running. The issue is caused by the Topic Operator which reverts them because they are not part of the topic spec. If the Topic Operator is not running, then it should work fine (not ideal). See #9972 for more details and a possible workaround.
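One shape such a workaround could take (a sketch, not taken from #9972: Strimzi supports pausing reconciliation of a custom resource with the strimzi.io/pause-reconciliation annotation, which would stop the Topic Operator from reverting the dynamic throttle configs while the rebalance runs):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: my-topic  # illustrative topic name
  labels:
    strimzi.io/cluster: kafka-cluster
  annotations:
    # While paused, the Topic Operator ignores this resource, so it
    # will not revert the throttle configs set by Cruise Control.
    strimzi.io/pause-reconciliation: "true"
spec:
  partitions: 3  # illustrative
  replicas: 3    # illustrative
```

The annotation would need to be removed again after the rebalance completes so the operator resumes managing the topic.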

Cruise Control uses single partitions when setting these configs and updates them as needed according to the current traffic (workload model). Having a static configuration with * could make the rebalance less efficient, but it may help with reducing the impact on clients. Not sure how this would work in practice. Let us know your findings.

We are already working on a proper solution where the Topic Operator will automatically ignore both leader.replication.throttled.replicas and follower.replication.throttled.replicas when Cruise Control integration is enabled.

@ppatierno (Member) commented

Triaged on 30/5/2024: @AlbertoEAF we are going to close this one as a duplicate of #9972, and @fvaleri is going to ping the community user interested in fixing it. If not, he will take care of it.
