Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: rework decoding to fix bugs in nested struct decoding #2337

Merged
merged 2 commits into from
May 17, 2024

Conversation

westonpace
Copy link
Contributor

This PR modifies both scheduling and the decode stream. These modifications should mainly affect the logical encodings (list / struct). The changes to the other encodings are fairly minimal.

In the scheduler we now track where we are in the field tree. This is both for debugging purposes (so we can log in trace messages the current path) and because we now need to send the path as part of the message we send to the decode stream.

The decoder changes are more significant. Previously, we combined waiting for I/O and waiting for additional encoders into the same phase. This logic was more complex, but more importantly, it also assumed that the decoder could recreate the order in which the scheduler scheduled pages. For example, if we scan through the columns and encounter one that needs more data, then we grab exactly one page, and continue the column scan, grabbing the next page when we come by in another pass. This works in many cases but doesn't work in others. For example, the scheduler might schedule the many pages in a row for the same column if that column is wide or the page size is small. The decoder would get out of sync in these cases.

Now, the logic is simpler, and more correct. The decoder first waits for enough scheduling to be done that a batch can be delivered. During this pass we drain encoders from the scheduler and insert them into the current decoders not by "current position in the decode" but by the path that is now included with the decoder.

Once enough scheduling has been done we then wait for I/O.

Once I/O is done we then drain the decoders in the same fashion we did before.

@westonpace westonpace marked this pull request as draft May 15, 2024 15:16
@westonpace
Copy link
Contributor Author

westonpace commented May 15, 2024

Leaving in draft until #2331 is complete

@codecov-commenter
Copy link

codecov-commenter commented May 15, 2024

Codecov Report

Attention: Patch coverage is 70.46784% with 101 lines in your changes are missing coverage. Please review.

Project coverage is 80.77%. Comparing base (00cda83) to head (7698a10).

Files Patch % Lines
rust/lance-encoding/src/decoder.rs 72.14% 34 Missing and 5 partials ⚠️
...-encoding/src/encodings/logical/fixed_size_list.rs 0.00% 21 Missing ⚠️
...ust/lance-encoding/src/encodings/logical/struct.rs 82.88% 14 Missing and 5 partials ⚠️
rust/lance-encoding/src/encodings/logical/list.rs 65.71% 6 Missing and 6 partials ⚠️
...ust/lance-encoding/src/encodings/logical/binary.rs 73.33% 3 Missing and 1 partial ⚠️
.../lance-encoding/src/encodings/logical/primitive.rs 66.66% 3 Missing ⚠️
rust/lance-file/src/v2/reader.rs 0.00% 1 Missing and 1 partial ⚠️
rust/lance-encoding/src/testing.rs 75.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2337      +/-   ##
==========================================
- Coverage   80.80%   80.77%   -0.04%     
==========================================
  Files         192      192              
  Lines       56189    56343     +154     
  Branches    56189    56343     +154     
==========================================
+ Hits        45403    45509     +106     
- Misses       8117     8169      +52     
+ Partials     2669     2665       -4     
Flag Coverage Δ
unittests 80.77% <70.46%> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

… waiting for scheduled decoders and waiting for IO into two different phases. See PR for more details.
@westonpace westonpace marked this pull request as ready for review May 17, 2024 13:42
current_top_level_row - top_level_row
);
context
.sink
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should we have a similar method on context like emit to send a ScanLine to the sink?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kind of intentionally hid this because encodings shouldn't be able to send scan lines. The struct encoding is a "special" encoding because we will always have one struct encoding that lives at the top of the decoder tree (even if there is only one column). In the next PR I make this even more obvious. So, it is somewhat intentional here that the struct encoding has to peek into the private impl of the scheduler context.

Copy link
Contributor

@albertlockett albertlockett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@westonpace westonpace merged commit 648adf8 into lancedb:main May 17, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants