
Split a per-partition WriteRequest into multiple Kafka records if bigger than max allowed size #8077

Draft
pracucci wants to merge 2 commits into main

Conversation

@pracucci (Collaborator) commented May 7, 2024

What this PR does

To be written...

Which issue(s) this PR fixes or relates to

N/A

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

Split a per-partition WriteRequest into multiple Kafka records if bigger than max allowed size

Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pstibrany (Member) left a comment


Gave this PR an early look, and it would work.

I wonder if it would have been easier to work at serialized-message level, splitting the message by fields with tag 1 (timeseries) and tag 3 (metadata) until they fill the size, while copying tag 2 (source) and 1000 (skip_label_name_validation) into each submessage.
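For illustration, a rough sketch of this alternative (not the PR's implementation; the function and variable names are mine), using google.golang.org/protobuf/encoding/protowire to walk the serialized request field by field:

package split

import "google.golang.org/protobuf/encoding/protowire"

// splitSerialized distributes the repeated fields (tag 1 timeseries, tag 3 metadata)
// across sub-messages no bigger than maxSize, copying every other field (tag 2 source,
// tag 1000 skip_label_name_validation) into each sub-message. Error handling is minimal.
func splitSerialized(msg []byte, maxSize int) [][]byte {
	var common []byte     // fields copied into every sub-message
	var repeated [][]byte // individual timeseries/metadata fields, tag included

	for len(msg) > 0 {
		num, typ, n := protowire.ConsumeTag(msg)
		if n < 0 {
			return nil // malformed message
		}
		m := protowire.ConsumeFieldValue(num, typ, msg[n:])
		if m < 0 {
			return nil
		}
		field := msg[:n+m] // tag + value, copied verbatim
		switch num {
		case 1, 3: // timeseries, metadata: distribute across sub-messages
			repeated = append(repeated, field)
		default: // source, skip_label_name_validation: copy into each sub-message
			common = append(common, field...)
		}
		msg = msg[n+m:]
	}

	var out [][]byte
	cur := append([]byte(nil), common...)
	for _, f := range repeated {
		// Flush the current sub-message once it would exceed maxSize and already
		// contains at least one repeated field.
		if len(cur)+len(f) > maxSize && len(cur) > len(common) {
			out = append(out, cur)
			cur = append([]byte(nil), common...)
		}
		cur = append(cur, f...)
	}
	return append(out, cur)
}

The appeal of this approach would be avoiding the unmarshal/re-marshal of each timeseries, at the cost of working directly with wire-format bytes.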

}

// We assume that different timeseries roughly have the same size (no huge outliers)
// so we preallocate the returned slice just adding 1 extra item (+2 because a +1 is to round up).
@pstibrany (Member)

I could understand +1, but why +2 again?

@pracucci (Collaborator, Author) May 7, 2024

+1 to round up, and +1 for an extra item. The +1 round-up doesn't guarantee us space for 1 extra item (it depends on what the remainder of the division was), but the 2nd +1 does. Does this answer your question?
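
For illustration, the arithmetic could look like this (a hedged sketch; reqSize, maxSize and the variable names are my assumptions, not necessarily the PR's identifiers):

// The first +1 rounds the integer division up, the second +1 reserves room for one
// extra partial request regardless of the remainder.
estimatedPartialReqs := (reqSize / maxSize) + 2 // e.g. 100/30 = 3, +2 => capacity 5
partialReqs := make([]*WriteRequest, 0, estimatedPartialReqs)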

return []*WriteRequest{partialReq}
}

// We assume that different timeseries roughly have the same size (no huge outliers)
@pstibrany (Member)

Given that the size of each timeseries is dominated by labels, I have doubts that this assumption holds.

@pracucci (Collaborator, Author)

In practice we split into 16MB partial requests. At this scale, the size of labels shouldn't matter much.

if partialReqSize+seriesSize > maxSize && !partialReq.IsEmpty() {
// The current partial request is full (or close to being full), so we create a new one.
partialReqs = append(partialReqs, partialReq)
partialReq = newPartialReq(estimatedTimeseriesPerPartialReq)
@pstibrany (Member)

Don't we need to reset partialReqSize here?

@pracucci (Collaborator, Author)

Damn yes. It was a bad bug. Fixed in 0379d8b
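
Roughly, the fix amounts to resetting the running size whenever the current partial request is flushed (my reconstruction of the idea, not the actual diff in 0379d8b):

if partialReqSize+seriesSize > maxSize && !partialReq.IsEmpty() {
	// The current partial request is full (or close to being full), so we create a new one.
	partialReqs = append(partialReqs, partialReq)
	partialReq = newPartialReq(estimatedTimeseriesPerPartialReq)
	partialReqSize = 0 // reset the running size for the new partial request
}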

Signed-off-by: Marco Pracucci <marco@pracucci.com>