[SPARK-19426][SQL] Custom coalescer for Dataset #46541

SubhamSinghal · 2024-05-12T14:36:15Z

What changes were proposed in this pull request?

This PR adds a new `coalesce` API to Dataset that lets users supply a custom coalescer, which reduces an input Dataset to fewer partitions. The coalescer implementation is the same as the one in RDD#coalesce, added in #11865 (SPARK-14042).

This is a rework of #18861.

How was this patch tested?

Added tests in DatasetSuite.

hvanhovell · 2024-05-14T14:07:43Z

Can you walk me through the actual use case for this? Coalesce has historically been incredibly hard for most end users to use, so before adding this I'd like to understand why.

SubhamSinghal · 2024-05-14T15:59:20Z

Coalesce does not enforce a uniform data distribution across partitions. We would like to pass a custom size-based coalescer to get a more uniform distribution, which would let us avoid repartition and its shuffle in some places.
Custom coalescer support is already available for RDDs, and it would be good to have it for DataFrames as well.
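
To illustrate the use case described above, here is a minimal, Spark-free sketch of why a size-based grouping can yield more uniform output partitions than a grouping that ignores sizes. Everything in it is an assumption made for illustration: the function names, the partition sizes, and the greedy strategy are hypothetical and are not taken from this PR. A real coalescer would implement Spark's `PartitionCoalescer` interface and would need some source of per-partition size estimates.

```python
import heapq
from typing import List


def coalesce_by_position(sizes: List[int], max_partitions: int) -> List[List[int]]:
    """Group parent partition indices into contiguous ranges, ignoring their sizes."""
    n = len(sizes)
    groups = []
    for i in range(max_partitions):
        start = (i * n) // max_partitions
        end = ((i + 1) * n) // max_partitions
        groups.append(list(range(start, end)))
    return groups


def coalesce_by_size(sizes: List[int], max_partitions: int) -> List[List[int]]:
    """Greedy grouping: assign each parent partition, largest first,
    to the group whose running total is currently smallest."""
    groups: List[List[int]] = [[] for _ in range(max_partitions)]
    # Min-heap of (running total size, group index).
    heap = [(0, g) for g in range(max_partitions)]
    heapq.heapify(heap)
    for idx in sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True):
        total, g = heapq.heappop(heap)
        groups[g].append(idx)
        heapq.heappush(heap, (total + sizes[idx], g))
    return groups


def group_totals(sizes: List[int], groups: List[List[int]]) -> List[int]:
    """Total size of each output partition."""
    return [sum(sizes[i] for i in group) for group in groups]


# Hypothetical sizes (in MB) of six parent partitions, coalesced into two.
sizes_mb = [900, 800, 100, 100, 50, 50]
by_position = coalesce_by_position(sizes_mb, 2)
by_size = coalesce_by_size(sizes_mb, 2)

print(group_totals(sizes_mb, by_position))  # [1800, 200]
print(group_totals(sizes_mb, by_size))      # [1000, 1000]
```

In both cases parent partitions are only regrouped, never split, so no shuffle is implied. The size-based variant simply chooses which parents end up together.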

SubhamSinghal · 2024-05-21T05:54:49Z

@hvanhovell would you be able to review this, or tag the relevant folks?

subham611 added 2 commits May 12, 2024 10:29

Add support for custom partitionCoalescer

bb43535

fix uts

db71631

github-actions bot added SQL CORE PYTHON labels May 12, 2024

subham611 added 6 commits May 12, 2024 20:27

fix scala lint

e8f089d

Fix lint issue

d52473d

Fix UT

ba3c963

Fix UT

192f2f7

Adds UT in CollapseRepartitionSuite

fa6ed6f

Fix lint

f3ebeb2

SubhamSinghal changed the title ~~[SPARK-19426][SQL][WIP] Custom coalescer for Dataset~~ [SPARK-19426][SQL] Custom coalescer for Dataset May 13, 2024

