
Consolidate on references instead of owned data #27157

Closed

Conversation

antiguru
Member

The current implementation converts a TupleRef iterator into a vector and consolidates that. This might cause extra work if the data consolidates well because it needs to convert the whole input into owned data.

With this change, we first collect the references into a vector, consolidate this, and then convert into an owned allocation.

I didn't measure the performance impact, but I saw this as a source of memory allocations in https://pprof.me/c3516597e535d8c64a2184c143c408db
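The change described above can be sketched roughly as follows. This is a minimal illustration, not the actual Materialize code: the function name, the `&str` keys, and the `i64` diffs are assumptions standing in for the real `TupleRef` and diff types. The point is that sorting and merging happen on references, and only the surviving records are cloned into owned data:

```rust
// Hedged sketch: consolidate *references* first, then clone only the
// survivors into owned data. If the input consolidates well, most
// records are merged or canceled before any owned allocation happens.
// Types and names are illustrative, not Materialize's API.
fn consolidate_refs<'a>(input: &mut Vec<(&'a str, i64)>) -> Vec<(String, i64)> {
    // Sort by key so equal keys become adjacent.
    input.sort_by(|a, b| a.0.cmp(b.0));
    let mut out: Vec<(String, i64)> = Vec::new();
    for &(key, diff) in input.iter() {
        match out.last_mut() {
            // Same key as the previous survivor: just add the diffs.
            Some((k, d)) if k.as_str() == key => *d += diff,
            // New key: this is the only place an owned clone is made.
            _ => out.push((key.to_owned(), diff)),
        }
    }
    // Drop entries whose diffs canceled to zero.
    out.retain(|(_, d)| *d != 0);
    out
}

fn main() {
    let mut data = vec![("b", 1), ("a", 2), ("b", -1), ("a", 3)];
    let owned = consolidate_refs(&mut data);
    // The two "a" records merge to diff 5; the "b" records cancel.
    assert_eq!(owned, vec![("a".to_string(), 5)]);
}
```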


Signed-off-by: Moritz Hoffmann <mh@materialize.com>
@antiguru antiguru requested a review from a team as a code owner May 17, 2024 17:37
@danhhz danhhz requested a review from bkirwi May 17, 2024 17:42
@bkirwi
Contributor

bkirwi commented May 17, 2024

I'd have a hard time guessing how much this specific change will help - in many cases the inputs to this function should have few duplicates, and in that case this change is a slight pessimization. (We're consolidating mostly to get the data in sorted order.)

We could almost certainly get away with allocating & sorting a Vec of indices into the data instead of the data itself, though. That's a more invasive change but would eliminate all these small allocations entirely.
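The index-based alternative could look roughly like the sketch below. Again, the names and the `&str`/`i64` types are illustrative assumptions, not the real API: the only extra allocation is a single `Vec<usize>` permutation over the input, rather than owned copies or many small vecs:

```rust
// Hedged sketch of the index-based idea: sort a Vec<usize> of indices
// into the data instead of sorting (or copying) the records themselves.
// Types are illustrative placeholders for the real record types.
fn consolidate_by_index(data: &[(&str, i64)]) -> Vec<(String, i64)> {
    // The one extra allocation: a permutation of indices into `data`.
    let mut idx: Vec<usize> = (0..data.len()).collect();
    idx.sort_by(|&a, &b| data[a].0.cmp(data[b].0));
    let mut out: Vec<(String, i64)> = Vec::new();
    for &i in &idx {
        let (key, diff) = data[i];
        match out.last_mut() {
            // Adjacent equal keys (after the index sort) merge in place.
            Some((k, d)) if k.as_str() == key => *d += diff,
            _ => out.push((key.to_owned(), diff)),
        }
    }
    // Discard records whose diffs canceled out.
    out.retain(|(_, d)| *d != 0);
    out
}

fn main() {
    let data = vec![("b", 1), ("a", 2), ("b", -1), ("a", 3)];
    assert_eq!(consolidate_by_index(&data), vec![("a".to_string(), 5)]);
}
```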

It's a bit unusual to see so much data going through this function, though I know of a couple cases where it's possible. (Load generators and source snapshots, mostly.) Can you say a bit more about the workload you were running that generated this trace?

@antiguru
Member Author

It's a bit unusual to see so much data going through this function, though I know of a couple cases where it's possible. (Load generators and source snapshots, mostly.) Can you say a bit more about the workload you were running that generated this trace?

It was a load generator someone else was running!

@bkirwi
Contributor

bkirwi commented May 20, 2024

Sketched out what I was talking about above: #27168

The idea is that you only allocate a collection of indices into the original collection, not a vec-of-vecs. That should mean less allocation overall and also less fragmentation, since we're not allocating a ton of small vecs. I haven't benchmarked it yet, however.

@antiguru
Member Author

Cool, I'll close this one in favor of your change! Thanks very much!
