[FEATURE] support generating larger block size during shuffle map task by spill partial partitions data #1594

Open

leslizhang opened this issue Mar 20, 2024 · 3 comments

@leslizhang (Contributor)

Code of Conduct

  • I agree to follow this project's Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the feature

At least three situations can trigger a large number of small blocks in a shuffle map task:

  1. Data skew: when the data processed by a map task mostly belongs to a few reduce tasks, those few reduce tasks alone push the buffered data past the spill threshold, and the remaining long-tail reduce tasks are spilled passively along with them even though they hold little data.
  2. An extremely large number of reduce tasks (>10k): even when each reduce partition holds very little data, the total buffer usage of the shuffle map task's ShuffleBufferManager can still exceed the spill threshold, producing a large number of small blocks.
  3. Memory competition between tasks: when an executor runs multiple tasks concurrently, a task that starts later may fail to acquire enough memory, and the resulting executor-wide memory pressure forces the ShuffleBufferManager to spill as a MemoryConsumer. In that case the ShuffleBufferManager may hold only a small amount of data per reduce partition.

Problems caused by small blocks:

  • Unnecessary network overhead: each send carries less data, which increases the number of network interactions between the executor and the shuffle server; the number of interactions can grow by a factor of 10 or even 100.
  • Larger index files: when the shuffle server persists shuffle data, every block adds an entry to the index file, so many small blocks inflate the index.
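For a rough sense of scale (numbers assumed for illustration): with the default spark.rss.writer.buffer.size of 3 MB, a reduce partition is normally flushed as one block of roughly 3 MB; if a forced spill instead flushes partitions holding only about 30 KB each, the same volume of data is sent as roughly 3 MB / 30 KB = 100 times as many blocks, and therefore about 100 times as many requests and index entries.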

Motivation

No response

Describe the solution

When spilling shuffle data, spill only the reduce partitions whose buffers occupy the most space.
So, in each spill, the WriteBufferManager.clear() method needs one extra step: sort the to-be-spilled buffers by size and select only the top-N largest buffers to spill, as in the sketch below; the small buffers stay in memory and keep growing into larger blocks.
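A minimal sketch of that selection step, assuming a hypothetical view of the per-partition buffer sizes; the class, method, and parameter names below are illustrative, not the actual WriteBufferManager API:

```java
// Illustrative only: pick the partitions whose buffers hold the most data
// until enough memory would be released. bufferSizes and bytesToRelease are
// assumed inputs, not real WriteBufferManager fields.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class PartialSpillSelector {

  /** Pick the largest partition buffers first; stop once the spill goal is met. */
  public static List<Integer> selectPartitionsToSpill(
      Map<Integer, Long> bufferSizes, long bytesToRelease) {
    List<Map.Entry<Integer, Long>> entries = new ArrayList<>(bufferSizes.entrySet());
    // Sort by buffer size, largest first.
    entries.sort(
        Comparator.comparingLong((Map.Entry<Integer, Long> e) -> e.getValue()).reversed());

    List<Integer> selected = new ArrayList<>();
    long released = 0L;
    for (Map.Entry<Integer, Long> entry : entries) {
      if (released >= bytesToRelease) {
        break; // the remaining (small) buffers stay in memory and keep growing
      }
      selected.add(entry.getKey());
      released += entry.getValue();
    }
    return selected;
  }
}
```

Bounding the spill by a bytes-to-release target (or a fixed top-N) keeps each spill cheap while leaving the long-tail partitions untouched.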

Additional context

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
zuston assigned zuston and leslizhang and unassigned zuston Mar 22, 2024

@zuston (Member) commented Mar 22, 2024

+1.

@EnricoMi (Contributor)

Sorting the partition data by the reduce partition id should maximize block sizes. Is that an option?

@leslizhang (Contributor, Author)

With this feature, the buffer size of a single partition (spark.rss.writer.buffer.size) can be increased to a larger value, for example from the current default of 3 MB to 10 MB or more.
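As a hypothetical illustration of that tuning (the 10m value below is only an example, not a recommended default), the setting could be raised through SparkConf:

```java
// Illustrative only: raise the per-partition writer buffer once partial spill
// is available. The value "10m" is an example, not a recommendation.
import org.apache.spark.SparkConf;

public class RssBufferSizeExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("rss-buffer-size-example")
        .set("spark.rss.writer.buffer.size", "10m");
    // ...pass conf to the SparkContext / SparkSession as usual
  }
}
```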

zuston pushed a commit that referenced this issue May 14, 2024
…fle map task by spill partial partitions data (#1670)

### What changes were proposed in this pull request?

When spilling shuffle data, spill only the reduce partitions whose buffers occupy the most space.
So, in each spill, the WriteBufferManager.clear() method needs one extra step: sort the to-be-spilled buffers by size and select only the top-N largest buffers to spill.

### Why are the changes needed?

Related feature: #1594

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit tests (UTs).

---------

Co-authored-by: leslizhang <leslizhang@tencent.com>