
Consolidate on references instead of owned data #27157

Closed

Conversation

antiguru
Member

The current implementation converts a TupleRef iterator into a vector and consolidates that. This might cause extra work if the data consolidates well because it needs to convert the whole input into owned data.

With this change, we first collect the references into a vector, consolidate this, and then convert into an owned allocation.

I didn't measure the performance impact, but I saw this as a source of memory allocations in https://pprof.me/c3516597e535d8c64a2184c143c408db
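The change described above can be sketched roughly as follows. This is a minimal illustration, not the actual Materialize code: the function name, the `&str` keys, and the `i64` diffs are assumptions standing in for the real `TupleRef` and diff types. The point is that sorting and merging happen on references, and only the surviving records are cloned into owned data:

```rust
// Hedged sketch: consolidate *references* first, then clone only the
// survivors into owned data. If the input consolidates well, most
// records are merged or canceled before any owned allocation happens.
// Types and names are illustrative, not Materialize's API.
fn consolidate_refs<'a>(input: &mut Vec<(&'a str, i64)>) -> Vec<(String, i64)> {
    // Sort by key so equal keys become adjacent.
    input.sort_by(|a, b| a.0.cmp(b.0));
    let mut out: Vec<(String, i64)> = Vec::new();
    for &(key, diff) in input.iter() {
        match out.last_mut() {
            // Same key as the previous survivor: just add the diffs.
            Some((k, d)) if k.as_str() == key => *d += diff,
            // New key: this is the only place an owned clone is made.
            _ => out.push((key.to_owned(), diff)),
        }
    }
    // Drop entries whose diffs canceled to zero.
    out.retain(|(_, d)| *d != 0);
    out
}

fn main() {
    let mut data = vec![("b", 1), ("a", 2), ("b", -1), ("a", 3)];
    let owned = consolidate_refs(&mut data);
    // The two "a" records merge to diff 5; the "b" records cancel.
    assert_eq!(owned, vec![("a".to_string(), 5)]);
}
```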


Signed-off-by: Moritz Hoffmann <mh@materialize.com>
@antiguru antiguru requested a review from a team as a code owner May 17, 2024 17:37
@danhhz danhhz requested a review from bkirwi May 17, 2024 17:42
@bkirwi
Contributor

bkirwi commented May 17, 2024

I'd have a hard time guessing how much this specific change will help - in many cases the inputs to this function should have few duplicates, and in that case this change is a slight pessimization. (We're consolidating mostly to get the data in sorted order.)

We could almost certainly get away with allocating & sorting a Vec of indices into the data instead of the data itself, though. That's a more invasive change but would eliminate all these small allocations entirely.
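The index-based alternative could look roughly like the sketch below. Again, the names and the `&str`/`i64` types are illustrative assumptions, not the real API: the only extra allocation is a single `Vec<usize>` permutation over the input, rather than owned copies or many small vecs:

```rust
// Hedged sketch of the index-based idea: sort a Vec<usize> of indices
// into the data instead of sorting (or copying) the records themselves.
// Types are illustrative placeholders for the real record types.
fn consolidate_by_index(data: &[(&str, i64)]) -> Vec<(String, i64)> {
    // The one extra allocation: a permutation of indices into `data`.
    let mut idx: Vec<usize> = (0..data.len()).collect();
    idx.sort_by(|&a, &b| data[a].0.cmp(data[b].0));
    let mut out: Vec<(String, i64)> = Vec::new();
    for &i in &idx {
        let (key, diff) = data[i];
        match out.last_mut() {
            // Adjacent equal keys (after the index sort) merge in place.
            Some((k, d)) if k.as_str() == key => *d += diff,
            _ => out.push((key.to_owned(), diff)),
        }
    }
    // Discard records whose diffs canceled out.
    out.retain(|(_, d)| *d != 0);
    out
}

fn main() {
    let data = vec![("b", 1), ("a", 2), ("b", -1), ("a", 3)];
    assert_eq!(consolidate_by_index(&data), vec![("a".to_string(), 5)]);
}
```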

It's a bit unusual to see so much data going through this function, though I know of a couple cases where it's possible. (Load generators and source snapshots, mostly.) Can you say a bit more about the workload you were running that generated this trace?

@antiguru
Member Author

It's a bit unusual to see so much data going through this function, though I know of a couple cases where it's possible. (Load generators and source snapshots, mostly.) Can you say a bit more about the workload you were running that generated this trace?

It was a load generator someone else was running!

@bkirwi
Contributor

bkirwi commented May 20, 2024

Sketched out what I was talking about above: #27168

The idea is that you only allocate a collection of indices into the original collection, not a vec-of-vecs. That should mean less allocation overall and also less fragmentation, since we're not allocating a ton of small vecs. I haven't benchmarked it yet, however.

@antiguru
Member Author

Cool, I'll close this one in favor of your change! Thanks very much!
