Consumer group rebalancing bug when switching from eager to cooperative consumers #686

Closed · hamdanjaveed opened this issue Mar 6, 2024 · 8 comments · Fixed by #720
Labels: bug (Something isn't working), has pr

Comments

@hamdanjaveed
Contributor

hamdanjaveed commented Mar 6, 2024

Hello 👋!

We've been using franz-go for a while and currently have our consumers configured with the RangeBalancer. We want to switch to the CooperativeStickyBalancer and were following the instructions in both KIP-429 and the franz-go docs for CooperativeStickyBalancer, which state that we essentially need to perform a double bounce to upgrade. However, during the first bounce, where we add cooperative-sticky to our set of balancers, we noticed that when our old range consumers left the consumer group we kept getting consumption lag on some of the partitions.
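For context, this is a minimal sketch of how we construct the client for the first bounce (group and topic names are placeholders, not our real config):

package main

import "github.com/twmb/franz-go/pkg/kgo"

// newFirstBounceClient sketches the first-bounce configuration: cooperative-sticky
// listed first, with range kept as a fallback so the group can still agree on a
// common protocol with the remaining eager members.
func newFirstBounceClient() (*kgo.Client, error) {
	return kgo.NewClient(
		kgo.ConsumerGroup("example-group"),
		kgo.ConsumeTopics("example-topic"),
		kgo.Balancers(kgo.CooperativeStickyBalancer(), kgo.RangeBalancer()),
	)
}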

There's a minimal reproduction in this repo here.

In the reproduction we:

  • Spin up Kafka and a producer that publishes events
  • Spin up a consumer that uses the [RangeBalancer] balancer
  • Spin up a consumer that uses the [CooperativeStickyBalancer, RangeBalancer] balancers
  • Shut down the [RangeBalancer] consumer
  • The remaining consumer seems to revoke half of its partitions but never re-assigns them to itself

We also see that the issue seems to persist if we then spin up a new consumer that uses the [CooperativeStickyBalancer, RangeBalancer] balancers: some partitions are seemingly never re-assigned and remain stuck. We found that performing a full restart of all the consumers in the consumer group once we're in this state fixes the issue (i.e. unsticks the partitions).

Let me know if we're missing something here!

@hamdanjaveed changed the title from "Consumer group rebalancing bug when using a mix of eager and cooperative GroupBalancers" to "Consumer group rebalancing bug when switching from eager to cooperative consumers" on Mar 6, 2024
@YellowCataclysm

YellowCataclysm commented Mar 7, 2024

Got the same issue on 1.14.0 and 1.15.0.
Tested with franz-go cooperative consumers + Sarama, and franz-go cooperative consumers + franz-go consumers with round-robin/range balancers. Same effect.
My steps were:

  1. Create a topic with 12 partitions
  2. Start 12 eager consumers (Sarama/franz-go). Wait a few seconds -> first rebalance
  3. Start 12 cooperative consumers (franz-go) -> second rebalance
  4. Shut down the eager consumers -> third rebalance
  5. Consumption of partitions that were assigned to cooperative-sticky clients in step 2 (before the eager consumers shut down) gets stuck right after the third rebalance

It becomes clear when each group of consumers uses its own ClientID, since all consumers in a range assignment are sorted by MemberID before partitions are assigned. The problem reproduces reliably when the cooperative consumers' MemberIDs (which are ClientID + a suffix) sort earlier than the eager consumers'.
For example, say the cooperative consumers have ClientID = "a-consumer" and the eager consumers have ClientID = "z-consumer". That configuration leaves all the partitions stuck after the last rebalance in my example.
In the opposite case (cooperative ClientID = "z-consumer", eager ClientID = "a-consumer"), all the consumers work properly after the last rebalance.

Using the same ClientID for all consumers makes the outcome random.
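A minimal sketch of how one might pin the ClientIDs in franz-go to make the member ordering deterministic for reproduction (group/topic names and ClientIDs are illustrative):

package main

import "github.com/twmb/franz-go/pkg/kgo"

// newReproClients builds one cooperative and one eager client whose ClientIDs
// force the cooperative member to sort first in the range assignment.
func newReproClients() (coop, eager *kgo.Client, err error) {
	coop, err = kgo.NewClient(
		kgo.ClientID("a-consumer"), // sorts before "z-consumer"
		kgo.ConsumerGroup("repro-group"),
		kgo.ConsumeTopics("repro-topic"),
		kgo.Balancers(kgo.CooperativeStickyBalancer(), kgo.RangeBalancer()),
	)
	if err != nil {
		return nil, nil, err
	}
	eager, err = kgo.NewClient(
		kgo.ClientID("z-consumer"), // sorts after "a-consumer"
		kgo.ConsumerGroup("repro-group"),
		kgo.ConsumeTopics("repro-topic"),
		kgo.Balancers(kgo.RangeBalancer()),
	)
	return coop, eager, err
}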

Also, it looks like Kafka itself thinks that all partitions were assigned after the last rebalance.
As mentioned above, a full restart helps. A rolling restart also helps (start the new consumers and then shut down the old ones) - that looks like it's because of the leader change.

@twmb
Owner

twmb commented Mar 13, 2024

I'll probably be able to look at this on Friday.

hamdanjaveed added a commit to hamdanjaveed/franz-go that referenced this issue Mar 23, 2024
@hamdanjaveed
Contributor Author

hamdanjaveed commented Mar 23, 2024

Was looking into this a bit and what I think is happening is:

  • Say we have a consumer c0 with [RangeBalancer] balancers in a consumer group consuming from a topic t0 with partitions [t0p0, t0p1]
  • A new consumer c1 joins the consumer group with the [CooperativeStickyBalancer, RangeBalancer] balancers, say the assignment is now c0: [t0p0], c1: [t0p1]
  • Consumer c0 leaves the consumer group
  • This causes consumer c1 to eagerly revoke its prior assigned partitions (which would be t0p1) by nil-ing out nowAssigned
  • However lastAssigned remains populated with the old assignment (t0p1)
  • When that cooperative consumer now continues its consumer group rebalance, it uses its lastAssigned=[t0p1] as its current assignment
  • This gets sent as part of the JoinGroupMetadata as the currentAssignments which gets set as the OwnedPartitions for consumer c1

And I think that leaves us in a situation where c1 has revoked its previously owned partitions (t0p1) but performs the next rebalance thinking it still owns them and only adds the partitions it thinks it doesn't own (t0p0).

I tried nil-ing out lastAssigned in groupConsumer::revoke, which seems to fix the locally reproducible issue from my repo, but I have no confidence that that's correct (and if I had to guess, it's probably not).
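To show what I mean concretely, here's a toy model of that experiment (not the actual franz-go code, just the state change in spirit):

package main

import "fmt"

// toyGroup is a stand-in for the relevant bits of group state: what the member
// currently owns and what it owned at the last rebalance.
type toyGroup struct {
	nowAssigned  map[string][]int32
	lastAssigned map[string][]int32
}

// eagerRevoke models an eager revoke clearing lastAssigned along with
// nowAssigned, which is the experimental extra line.
func (g *toyGroup) eagerRevoke() {
	g.nowAssigned = nil
	g.lastAssigned = nil
}

func main() {
	g := &toyGroup{
		nowAssigned:  map[string][]int32{"t0": {1}},
		lastAssigned: map[string][]int32{"t0": {1}},
	}
	g.eagerRevoke()
	fmt.Println(g.nowAssigned, g.lastAssigned) // map[] map[] -- nothing claimed going into the next rebalance
}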

I'll keep looking and trying to understand what's happening here 😄

@twmb
Owner

twmb commented Mar 26, 2024

The fix is accurate, and the diagnosis is almost correct. The final step -- set as OwnedPartitions -- is a red herring. OwnedPartitions is used by the sticky balancer to guard against zombies (I'd have to read the code more to remind myself exactly what this guards against).

The bug is right here:

if _, exists := g.lastAssigned[topic]; !exists {
added[topic] = nowPartitions
}

It's ok to nil out lastAssigned, because it's truly meant for tracking state between rebalances for cooperative group balancers specifically. It's not important to keep the prior state around for an eager balancer because, well, for an eager balancer there isn't meant to be any prior state at the start of every group session.
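To make the consequence concrete, here is a standalone illustration with mock maps -- not the library's actual diff code, which this only paraphrases -- of how a stale lastAssigned hides a re-added partition after an eager revoke:

package main

import "fmt"

func main() {
	// c1 eagerly revoked t0p1, but lastAssigned still records it.
	lastAssigned := map[string]map[int32]struct{}{"t0": {1: {}}}
	// The next rebalance assigns c1 both partitions.
	nowAssigned := map[string][]int32{"t0": {0, 1}}

	// Diff the new assignment against the stale lastAssigned.
	added := make(map[string][]int32)
	for topic, nowPartitions := range nowAssigned {
		last, exists := lastAssigned[topic]
		if !exists {
			added[topic] = nowPartitions
			continue
		}
		for _, p := range nowPartitions {
			if _, ok := last[p]; !ok {
				added[topic] = append(added[topic], p)
			}
		}
	}
	fmt.Println(added) // map[t0:[0]] -- t0p1 never shows up as added, so it is never fetched again
}

With lastAssigned nil'd out in the eager revoke path, the whole new assignment is treated as added, which matches the fix described above.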

(also sorry for the delay in looking into this)

@twmb added the bug (Something isn't working) label on Mar 26, 2024
@twmb mentioned this issue on Mar 26, 2024
hamdanjaveed added a commit to hamdanjaveed/franz-go that referenced this issue Apr 18, 2024
@hamdanjaveed
Contributor Author

@twmb I was thinking of writing up a PR with the fix along with a test, would that be helpful? I'm wondering about the best way to write the test; I was thinking of doing something similar to what testChainETL() does in helpers_test.go, but instead spinning up and shutting down consumers that have different GroupBalancers. I tried to re-use the existing testConsumer but it felt a bit clunky in the context of this test and wasn't working how I'd expect. Would you have any thoughts on how best to approach writing a test for this?

@twmb
Owner

twmb commented Apr 24, 2024

It's helpful. I don't think I fixed this in a branch locally. My own holdup on fixing this myself is that one of the KIPs is harder to implement than I thought. I spent some time on a plane implementing the fix and the more I worked on it, the bigger the scope turned out to be. I've set it aside and have been prioritizing some of my own stuff for a bit lately, so work has been essentially frozen. I aim to get back to this stuff sooner than later; if you go ahead and implement the fix and test before I get to it, I'll appreciate it -- no timeline on merging and releasing yet (though if I take too too long, I'll just go ahead and do a bugfix release).

@hamdanjaveed
Contributor Author

hamdanjaveed commented Apr 24, 2024

Awesome, will give it a go 👍

"I spent some time on a plane implementing the fix and the more I worked on it, the bigger the scope turned out to be"

Are you referring to the fix for this issue or the KIP you were working on?

If it's the fix then I assume that means there's more to it than simply nil-ing out lastAssigned

@twmb
Owner

twmb commented Apr 24, 2024

I'm referring to "KIP-951 - Leader discovery optimisations for the client". The client isn't implemented in a way to "move" partitions internally outside of the metadata loop, so hooking into this properly has been a PITA.

For this bug, niling out lastAssigned is all that's necessary, with the reasoning I gave above in that comment (i.e. not due to OwnedPartitions, but due to a different reason).

hamdanjaveed added a commit to hamdanjaveed/franz-go that referenced this issue May 8, 2024
hamdanjaveed added a commit to hamdanjaveed/franz-go that referenced this issue May 8, 2024
hamdanjaveed added a commit to hamdanjaveed/franz-go that referenced this issue May 8, 2024
@twmb added the has pr label on May 9, 2024
@twmb closed this as completed in #720 on May 26, 2024