[SPARK-19426][SQL] Custom coalescer for Dataset #46541

Open
wants to merge 8 commits into
base: master

SubhamSinghal

What changes were proposed in this pull request?

This PR adds a new coalesce API to Dataset that lets users supply a custom coalescer, which reduces an input Dataset to fewer partitions. The coalescer implementation is the same as the one used by RDD#coalesce, added in #11865 (SPARK-14042).

This is a rework of #18861.

How was this patch tested?

Added tests in DatasetSuite.

@SubhamSinghal SubhamSinghal changed the title from "[SPARK-19426][SQL][WIP] Custom coalescer for Dataset" to "[SPARK-19426][SQL] Custom coalescer for Dataset" on May 13, 2024
@hvanhovell
Contributor

Can you walk me through the actual use case for this? Coalesce has historically been incredibly hard for most end users to use, so before adding this I'd like to understand why it is needed.

@SubhamSinghal
Author

SubhamSinghal commented May 14, 2024

Coalesce does not enforce a uniform data distribution across partitions. We would like to pass a custom size-based coalescer to get a more uniform distribution. This would let us avoid repartition, and the shuffle it incurs, in some places.
Custom coalescer support is already available for RDDs, and it would be good to have it for DataFrames as well.
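
To make the size-based idea concrete, here is a minimal sketch of the grouping logic only. It is not Spark API and not the implementation in this PR: the function name `group_by_size`, the `target_size` parameter, and the partition sizes are all hypothetical, chosen for illustration. It assumes the coalescer knows an approximate size for each input partition and greedily packs adjacent partitions into an output group until the next one would push the group past the target.

```python
from typing import List


def group_by_size(partition_sizes: List[int], target_size: int) -> List[List[int]]:
    """Greedily pack adjacent partition indices into groups of at most target_size.

    Hypothetical helper for illustration; sizes are in arbitrary units (e.g. MB).
    """
    if target_size <= 0:
        raise ValueError("target_size must be positive")

    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0

    for index, size in enumerate(partition_sizes):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size

    # Flush the last, possibly underfilled, group.
    if current:
        groups.append(current)
    return groups


# Made-up input partition sizes; each inner list becomes one output partition.
sizes = [10, 20, 90, 64, 64, 5, 200]
groups = group_by_size(sizes, target_size=128)
print(groups)
```

Because coalescing only merges partitions and never splits them, a single input partition larger than the target (the last one in this example) still ends up in a group of its own. The result is therefore more uniform than a count-based grouping, but not perfectly balanced.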

@SubhamSinghal
Author

SubhamSinghal commented May 21, 2024

@hvanhovell would you be able to review this, or tag the relevant folks?
