Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Skip calculate lag when a topic has no offset commits #718

Closed
wants to merge 1 commit into from

Conversation

kylege
Copy link

@kylege kylege commented May 8, 2024

If topic has no commit offset for a consumer group, then we should not calculate lag for it.

For example, topic A current high water offset is 100, but has no message because of delete mode

And there will not be any offset commits for consumer group because there is no messages to consume

Function CalculateGroupLag will calculate lag 100 for the consumer group, which is wrong.

@twmb
Copy link
Owner

twmb commented May 8, 2024

The current code is deliberate. If a consumer in a group is assigned a topic, then if that topic/partition has no commit, it means the entire topic/partition is lagged.

The original implementation did not consider topics with no commits as candidates to have lag. This was problematic because people were trying to calculate lag for a group that was bugged and never actually committed. The lag showed as zero, but technically by definition of how lag is calculated, the calculation should have shown the entire topic lagging.

@kylege
Copy link
Author

kylege commented May 9, 2024

The current code is deliberate. If a consumer in a group is assigned a topic, then if that topic/partition has no commit, it means the entire topic/partition is lagged.

The original implementation did not consider topics with no commits as candidates to have lag. This was problematic because people were trying to calculate lag for a group that was bugged and never actually committed. The lag showed as zero, but technically by definition of how lag is calculated, the calculation should have shown the entire topic lagging.

We have too may topics that cleanup.policy=delete, so the topics will have zero messages after some time, when a consumer starts, it will never commit offset because no messages consumed, so the current code will calculate a very big lag number other than zero.

So how can i fix this scenario? Should i implement the calculation myself like kowl did?

Maybe we can get partition water marks to get how many messages in topic, but this method requires more requests to kafka server.

@twmb
Copy link
Owner

twmb commented May 9, 2024

We've actually had multiple reports of Kowl's (now Console) lag page being inaccurate -- for basically the reason that I do it differently in the code above.

However, I think given what you're saying, a different fix is to subtract the log start offset from the lag. If a topic has log end offset 30 and log start offset 30 and no commit, I think today it'll show a lag of 30, whereas this can be changed to show a lag of 0.

@twmb twmb added TODO waiting kadm Anything related to kadm specifically and removed TODO waiting labels May 23, 2024
twmb added a commit that referenced this pull request May 26, 2024
…on-zero

This adds one more round trip in the `Lag` function to request start
offsets, but this improves the case when topics have segments deleted
over time. We _could_ go through some detailed way of avoiding requests
for partitions that have commits, but having the start offset is pretty
valuable a lot of time time anyway -- and we keep the logic simpler.

Now,
* if a topic is assigned to a group,
* and a partition in that topic has no commits,
* and the log-start-offset for the partition is non-zero
Lag will no longer show the entire partition as lagging from zero, but
instead will show more accurately the log-end-offset - log-start-offset.
If LEO == LSO, the lag is now zero, rather than LEO itself.

This also no longer says lag is negative if a commit is past the end of
a partition; instead the lag is zero.

Closes #718.
twmb added a commit that referenced this pull request May 26, 2024
…on-zero

This adds one more round trip in the `Lag` function to request start
offsets, but this improves the case when topics have segments deleted
over time. We _could_ go through some detailed way of avoiding requests
for partitions that have commits, but having the start offset is pretty
valuable a lot of time time anyway -- and we keep the logic simpler.

Now,
* if a topic is assigned to a group,
* and a partition in that topic has no commits,
* and the log-start-offset for the partition is non-zero
Lag will no longer show the entire partition as lagging from zero, but
instead will show more accurately the log-end-offset - log-start-offset.
If LEO == LSO, the lag is now zero, rather than LEO itself.

This also no longer says lag is negative if a commit is past the end of
a partition; instead the lag is zero.

Closes #718.
@twmb twmb closed this in #744 May 26, 2024
@twmb
Copy link
Owner

twmb commented May 26, 2024

Check out #744, I think that solves this. I'm going to release kadm soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kadm Anything related to kadm specifically TODO
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants