Implement chunked parquet reader in cudf-python #15728
base: branch-24.08
/okay to test
@GregoryKimball This PR is ready for review, I'll add the chunked concat and then enable using chunked parquet reader in
Thank you @galipremsagar! This looks like a great addition, the debut of chunked parquet reading to cudf python ❤️
Thanks for the PR!
Just a heads up:
Eventually, we'll probably want the binding for this to live in pylibcudf (so we'd need to rewrite the stuff added in this PR again at a later date).
Unfortunately, bindings for I/O haven't landed in the dev branch yet (I just started porting over a bunch of the classes we'd need for I/O, like TableWithMetadata, in #15899).
I think I'll be able to get around to this in a couple of weeks' time after my PR lands, but I think it's still OK to put this in before then, even if we need to rewrite it a bit for pylibcudf later.
    else:
        range_index_meta = self.index_col[0]

    if self.row_groups is not None:
Is any of the implementation shareable with the non-chunked parquet reader?
    @pytest.mark.parametrize("chunk_read_limit", [0, 240, 1024000000])
    @pytest.mark.parametrize("pass_read_limit", [0, 240, 1024000000])
    def test_parquet_chunked_reader(chunk_read_limit, pass_read_limit):
Do you mind fleshing out this test to cover some of the other parameters (e.g. row_groups, filtering, etc.)?
When I get around to implementing parquet in pylibcudf, I'll probably end up stealing your test :)
Description
Partially Addresses: #14966
This PR implements chunked parquet bindings in Python.
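For readers unfamiliar with chunked reading, here is a minimal pure-Python sketch of the idea, not the bindings added in this PR. `ToyChunkedReader`, `read_all`, and the byte-string "rows" are hypothetical names made up for illustration. The `has_next` / `read_chunk` loop and the treatment of a limit of `0` as "no limit" are assumptions modelled on the chunked reader pattern; `pass_read_limit` is not modelled here.

```python
from typing import List


class ToyChunkedReader:
    """Hypothetical stand-in for a chunked reader; this is NOT the cudf API."""

    def __init__(self, rows: List[bytes], chunk_read_limit: int = 0):
        self._rows = rows
        # Assumption: a limit of 0 means "no limit on chunk size".
        self._limit = chunk_read_limit
        self._pos = 0

    def has_next(self) -> bool:
        # True while there are rows that have not been returned yet.
        return self._pos < len(self._rows)

    def read_chunk(self) -> List[bytes]:
        chunk: List[bytes] = []
        size = 0
        while self._pos < len(self._rows):
            row = self._rows[self._pos]
            # Stop once adding this row would exceed the limit, but always
            # take at least one row so the reader makes progress.
            if chunk and self._limit and size + len(row) > self._limit:
                break
            chunk.append(row)
            size += len(row)
            self._pos += 1
        return chunk


def read_all(rows: List[bytes], chunk_read_limit: int) -> List[bytes]:
    """Drain the reader and concatenate the chunks back together."""
    reader = ToyChunkedReader(rows, chunk_read_limit)
    out: List[bytes] = []
    while reader.has_next():
        out.extend(reader.read_chunk())
    return out


# Ten "rows" of 100 bytes each, used as toy data.
rows = [bytes([i]) * 100 for i in range(10)]
```

The property a test would check is that concatenating the chunks reproduces the full read regardless of the limit, for example `read_all(rows, 240) == rows`.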
Checklist