Consumer group rebalancing bug when switching from eager to cooperative consumers #686
Got the same issue on 1.14.0 and 1.15.0.
It becomes clear when each group of consumers uses its own `ClientID`. Using the same `ClientID` for all consumers leads to randomness. It also looks like Kafka itself thinks that all partitions were assigned after the last rebalance.
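(For anyone reproducing this, here is a minimal sketch of giving each consumer in a group its own client ID with franz-go; the broker address, group, topic, and ID scheme are placeholders, not anything from this issue.)

```go
package main

import (
	"fmt"

	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// Give each consumer in the group a distinct client ID so members
	// are distinguishable during rebalances. Broker, group, and topic
	// names are placeholders.
	clients := make([]*kgo.Client, 0, 3)
	for i := 0; i < 3; i++ {
		cl, err := kgo.NewClient(
			kgo.SeedBrokers("localhost:9092"),
			kgo.ClientID(fmt.Sprintf("my-consumer-%d", i)), // unique per consumer
			kgo.ConsumerGroup("my-group"),
			kgo.ConsumeTopics("my-topic"),
		)
		if err != nil {
			panic(err)
		}
		clients = append(clients, cl)
	}
	defer func() {
		for _, cl := range clients {
			cl.Close()
		}
	}()
	// ... poll each client in its own goroutine ...
}
```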
I'll probably be able to look at this on Friday.
Was looking into this a bit and what I think is happening is:
And I think that leaves us in a situation where the previous assignment is still being tracked. I tried nil-ing out lastAssigned, and I'll keep looking and trying to understand what's happening here 😄
The fix is accurate, and the diagnosis is almost correct. The final step -- set as `OwnedPartitions` -- is a red herring. `OwnedPartitions` is used by the sticky balancer to guard against zombies (I'd have to reread the code to recall exactly what it guards against). The bug is right here: franz-go/pkg/kgo/consumer_group.go, lines 588 to 590 at 351e7fa.
It's ok to nil out `lastAssigned`, because it's truly meant for tracking state between rebalances for cooperative group balancers specifically. It's not important to keep the prior state around for an eager balancer because, well, for an eager balancer there isn't meant to be any prior state at the start of every group session. (Also, sorry for the delay in looking into this.)
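(Not franz-go's actual code, but a self-contained toy sketch of the invariant described above: `lastAssigned` may carry across sessions only for cooperative balancers, so an eager session must start with it cleared. All names here are illustrative.)

```go
package main

import "fmt"

// A toy model of the state described above: lastAssigned is only meant
// to carry partition ownership between sessions for cooperative
// balancers. Names are illustrative, not franz-go's actual internals.
type groupConsumer struct {
	cooperative  bool
	lastAssigned map[string][]int32 // topic -> partitions from the prior session
}

// onSessionStart models the fix: an eager balancer revokes everything
// before rejoining, so no prior assignment may leak into the new session.
func (g *groupConsumer) onSessionStart() {
	if !g.cooperative {
		g.lastAssigned = nil // eager: there is no prior state by definition
	}
}

func main() {
	g := &groupConsumer{
		cooperative:  false, // the group fell back to an eager balancer
		lastAssigned: map[string][]int32{"my-topic": {0, 1, 2}},
	}
	g.onSessionStart()
	fmt.Println("carried over:", g.lastAssigned) // carried over: map[]
}
```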
@twmb I was thinking of writing up a PR to include the fix along with a test; would that be helpful? I'm wondering about the best way to go about writing the test and was thinking of doing something similar to what …
It's helpful. I don't think I fixed this in a branch locally. My own holdup on fixing this myself is that one of the KIPs is harder to implement than I thought. I spent some time on a plane implementing the fix, and the more I worked on it, the bigger the scope turned out to be. I've set it aside and have been prioritizing some of my own stuff lately, so work has been essentially frozen. I aim to get back to this sooner rather than later; if you go ahead and implement the fix and test before I get to it, I'll appreciate it -- no timeline on merging and releasing yet (though if I take too long, I'll just go ahead and do a bugfix release).
Awesome, will give it a go 👍
Are you referring to the fix for this issue or the KIP you were working on? If it's the fix, then I assume that means there's more to it than simply nil-ing out `lastAssigned`?
I'm referring to "KIP-951 - Leader discovery optimisations for the client". The client isn't implemented in a way that can "move" partitions internally outside of the metadata loop, so hooking into this properly has been a PITA. For this bug, niling out `lastAssigned` should be all that's needed.
Hello 👋!
We've been using `franz-go` for a while and have our consumers currently configured with the `RangeBalancer`. We want to switch to the `CooperativeStickyBalancer` and were trying to follow the instructions listed in both KIP-429 and the `franz-go` docs for `CooperativeStickyBalancer`, which state that we essentially need to perform a double bounce to upgrade. However, we noticed that during the first bounce, where we add `cooperative-sticky` to our set of balancers, when our old `range` consumers would leave the consumer group we kept getting consumption lag on some of the partitions. There's a minimal reproduction in this repo here.
In the reproduction we:

1. Start consumers with the `[RangeBalancer]` balancer
2. Roll out consumers with the `[CooperativeStickyBalancer, RangeBalancer]` balancers
3. Shut down the `[RangeBalancer]` consumer

We also see that the issue seems to persist if we then spin up a new consumer that uses the `[CooperativeStickyBalancer, RangeBalancer]` balancers (see the config sketch below); some partitions are seemingly never re-assigned and are stuck. We found that performing a full restart of all the consumers in the consumer group once we're in this state fixes the issue (i.e. unsticks the partitions).

Let me know if we're missing something here!
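(For concreteness, here is a rough sketch, not taken from the linked repro repo, of the franz-go balancer options for each phase of the KIP-429 double bounce; broker, group, and topic names are placeholders, and error handling is elided.)

```go
package main

import "github.com/twmb/franz-go/pkg/kgo"

// consumerOpts bundles the shared options; only the balancers differ
// between phases. All names are placeholders.
func consumerOpts(balancers ...kgo.GroupBalancer) []kgo.Opt {
	return []kgo.Opt{
		kgo.SeedBrokers("localhost:9092"),
		kgo.ConsumerGroup("my-group"),
		kgo.ConsumeTopics("my-topic"),
		kgo.Balancers(balancers...),
	}
}

func main() {
	// Phase 1: the original, eager-only consumers.
	before, _ := kgo.NewClient(consumerOpts(kgo.RangeBalancer())...)
	defer before.Close()

	// Phase 2 (first bounce): prefer cooperative-sticky, keeping range
	// as a fallback so old and new members can still agree on a protocol.
	during, _ := kgo.NewClient(consumerOpts(
		kgo.CooperativeStickyBalancer(),
		kgo.RangeBalancer(),
	)...)
	defer during.Close()

	// Phase 3 (second bounce): cooperative-sticky only.
	after, _ := kgo.NewClient(consumerOpts(kgo.CooperativeStickyBalancer())...)
	defer after.Close()
}
```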