Implement chunked parquet reader in cudf-python #15728

Open
wants to merge 12 commits into
base: branch-24.08

Conversation

galipremsagar
Contributor

@galipremsagar galipremsagar commented May 12, 2024

Description

Partially Addresses: #14966

This PR implements Python bindings for the chunked parquet reader.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the cuDF (Python) Affects Python cuDF API. label May 12, 2024
@galipremsagar galipremsagar added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels May 22, 2024

copy-pr-bot bot commented May 30, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue conda conda issue cuDF (Java) Affects Java cuDF API. ci labels May 30, 2024
@galipremsagar galipremsagar changed the base branch from branch-24.06 to branch-24.08 May 30, 2024 18:31
@galipremsagar galipremsagar marked this pull request as ready for review May 30, 2024 18:31
@galipremsagar galipremsagar requested a review from a team as a code owner May 30, 2024 18:31
@galipremsagar
Contributor Author

/okay to test

@galipremsagar galipremsagar removed the libcudf Affects libcudf (C++/CUDA) code. label May 30, 2024
@galipremsagar galipremsagar added 3 - Ready for Review Ready for review by team and removed CMake CMake build issue conda conda issue cuDF (Java) Affects Java cuDF API. ci labels May 30, 2024
@galipremsagar
Contributor Author

@GregoryKimball This PR is ready for review. I'll add the chunked concat and then enable the chunked parquet reader in cudf.pandas in a follow-up PR.

@GregoryKimball
Contributor

Thank you @galipremsagar! This looks like a great addition, the debut of chunked parquet reading to cudf python ❤️

Contributor

@lithomas1 lithomas1 left a comment

Thanks for the PR!

Just a heads up:
Eventually, we'll probably want the binding for this to live in pylibcudf (so we'd need to rewrite the stuff added in this PR again at a later date).

Unfortunately, bindings for I/O haven't landed in the dev branch yet (I just started porting over a bunch of the classes we'd need for I/O like TableWithMetadata in #15899).

I think I'll be able to get around to this in a couple of weeks, after my PR lands, but I think it's still OK to put this in before then, even if we need to rewrite it a bit for pylibcudf later.

else:
    range_index_meta = self.index_col[0]

if self.row_groups is not None:
Contributor

Is any of the implementation shareable with the non-chunked parquet reader?


@pytest.mark.parametrize("chunk_read_limit", [0, 240, 1024000000])
@pytest.mark.parametrize("pass_read_limit", [0, 240, 1024000000])
def test_parquet_chunked_reader(chunk_read_limit, pass_read_limit):
Contributor

Do you mind fleshing out this test to cover some of the other parameters (e.g. row_groups, filtering, etc.)?

When I get around to implementing parquet in pylibcudf, I'll probably end up stealing your test :)
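
For illustration, here is a rough sketch of what a fleshed-out check could look like: read in chunks, concatenate, and compare against the non-chunked read across a grid of limits and row-group selections. It is a toy in pure Python and does not use the cudf API from this PR. `ToyChunkedReader`, `read_full`, `read_chunked`, `check_grid`, `ROW_GROUPS` and `ROW_NBYTES` are all hypothetical names; only the `has_next()`/`read_chunk()` loop shape and the "0 means no limit" convention follow libcudf's chunked reader. `pass_read_limit` and filtering are not modeled.

```python
from itertools import product

# Hypothetical stand-in for a parquet file: a list of row groups, each a list of rows.
ROW_GROUPS = [
    [{"a": i, "b": str(i)} for i in range(0, 4)],
    [{"a": i, "b": str(i)} for i in range(4, 10)],
    [{"a": i, "b": str(i)} for i in range(10, 12)],
]
ROW_NBYTES = 80  # assumed fixed size of one row, for illustration only


def read_full(row_groups=None):
    """Non-chunked read: return all rows of the selected row groups."""
    selected = range(len(ROW_GROUPS)) if row_groups is None else row_groups
    return [row for g in selected for row in ROW_GROUPS[g]]


class ToyChunkedReader:
    """Toy reader driven by a has_next()/read_chunk() loop; not the cudf API."""

    def __init__(self, chunk_read_limit=0, row_groups=None):
        self._rows = read_full(row_groups)
        if chunk_read_limit == 0:
            # A limit of 0 means "no limit": everything comes back in one chunk.
            self._rows_per_chunk = max(len(self._rows), 1)
        else:
            # Otherwise fit as many rows as the byte limit allows, at least one.
            self._rows_per_chunk = max(chunk_read_limit // ROW_NBYTES, 1)
        self._pos = 0

    def has_next(self):
        return self._pos < len(self._rows)

    def read_chunk(self):
        chunk = self._rows[self._pos:self._pos + self._rows_per_chunk]
        self._pos += len(chunk)
        return chunk


def read_chunked(chunk_read_limit, row_groups=None):
    """Drain the toy reader and return the list of chunks it produced."""
    reader = ToyChunkedReader(chunk_read_limit, row_groups)
    chunks = []
    while reader.has_next():
        chunks.append(reader.read_chunk())
    return chunks


def check_grid():
    """Compare chunked and non-chunked reads over a small parameter grid."""
    limits = [0, 240, 1024000000]  # same values as the test in this PR
    row_group_choices = [None, [0], [0, 2], [1]]
    for limit, groups in product(limits, row_group_choices):
        chunks = read_chunked(limit, groups)
        combined = [row for chunk in chunks for row in chunk]
        assert combined == read_full(groups)
    return len(limits) * len(row_group_choices)
```

In the real test the grid would presumably be expressed with extra `@pytest.mark.parametrize` decorators, and the comparison would be between the concatenated chunks and the result of the existing non-chunked reader.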

Labels
3 - Ready for Review Ready for review by team cuDF (Python) Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
Status: In Progress
Status: Slip
Development

Successfully merging this pull request may close these issues.

None yet

3 participants