[#1579][part-1] fix(spark): Adjust reassigned time to ensure that all previous data is cleared for stage retry #1584

zuston · 2024-03-15T03:14:03Z

What changes were proposed in this pull request?

Clear out the previous stage attempt's data synchronously when registering the reassigned shuffleIds.

Why are the changes needed?

If the previous stage attempt is still in the purge queue on the shuffle-server side, writes from the retried stage can fail with unknown exceptions, so we should clear out all data of the previous stage attempt before re-registering.

This PR synchronously removes the previous stage data when the first attempt writer is initialized.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

zuston · 2024-03-15T03:18:41Z

cc @dingshun3016 @yl09099 PTAL

github-actions · 2024-03-15T03:27:35Z

Test Results

2 419 files ±0 2 419 suites ±0 4h 58m 9s ⏱️ +16s
933 tests ±0 932 ✅ ±0 1 💤 ±0 0 ❌ ±0
10 819 runs ±0 10 805 ✅ ±0 14 💤 ±0 0 ❌ ±0

Results for commit 3dd6b34. ± Comparison against base commit a0e88da.

♻️ This comment has been updated with latest results.

zuston · 2024-03-15T03:54:16Z

After rethinking this, I think reassignAllShuffleServersForWholeStage could be invoked by the retry writer rather than by the previously failed writer, which would ensure that no older data reaches the server after re-registering.

codecov-commenter · 2024-03-22T08:50:59Z

Codecov Report

Attention: Patch coverage is 3.29670% with 352 lines in your changes are missing coverage. Please review.

Project coverage is 53.42%. Comparing base (6f6d35a) to head (5c9d9e3).
Report is 34 commits behind head on master.

Files	Patch %	Lines
...uniffle/shuffle/manager/RssShuffleManagerBase.java	0.00%	187 Missing ⚠️
.../shuffle/handle/StageAttemptShuffleHandleInfo.java	0.00%	43 Missing ⚠️
...pache/uniffle/server/ShuffleServerGrpcService.java	0.00%	32 Missing ⚠️
.../apache/spark/shuffle/RssStageResubmitManager.java	0.00%	22 Missing ⚠️
...spark/shuffle/handle/MutableShuffleHandleInfo.java	0.00%	22 Missing ⚠️
...niffle/server/netty/ShuffleServerNettyHandler.java	0.00%	9 Missing ⚠️
...ffle/client/request/RssRegisterShuffleRequest.java	0.00%	7 Missing ⚠️
...fle/shuffle/manager/ShuffleManagerGrpcService.java	0.00%	6 Missing ⚠️
...ffle/client/impl/grpc/ShuffleServerGrpcClient.java	0.00%	6 Missing ⚠️
...ffle/client/request/RssSendShuffleDataRequest.java	0.00%	5 Missing ⚠️
... and 6 more

Additional details and impacted files

@@             Coverage Diff              @@
##             master    #1584      +/-   ##
============================================
- Coverage     54.86%   53.42%   -1.45%     
- Complexity     2358     2943     +585     
============================================
  Files           368      435      +67     
  Lines         16379    23768    +7389     
  Branches       1504     2208     +704     
============================================
+ Hits           8986    12697    +3711     
- Misses         6862    10290    +3428     
- Partials        531      781     +250

jerqi · 2024-03-22T10:20:36Z

It's dangerous to delete the failed data of the stage when we retry. It's hard to reach the condition to delete the data. We should rely on data skipping to avoid reading the failed data.

zuston · 2024-03-22T10:22:34Z

It's dangerous to delete the failed data of the stage when we retry. It's hard to reach the condition to delete the data.

Could you describe more?

server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java

jerqi · 2024-03-23T12:40:31Z

It's dangerous to delete the failed data of the stage when we retry. It's hard to reach the condition to delete the data.

Could you describe more?

Some tasks may still write legacy data to the shuffle server after you delete the shuffle data. Even though we resubmit the stage, some tasks from the last attempt may still be writing data. Spark doesn't guarantee that all tasks from the last attempt have ended by the time the newest attempt starts.

jerqi · 2024-03-25T02:34:29Z

@EnricoMi If a stage is retried, the taskId may not be unique, because we don't have a stage attemptId to distinguish task 1 attempt 0 in stage attempt 0 from task 1 attempt 0 in stage attempt 1. This may cause us to read the wrong data.

zuston · 2024-03-25T03:17:05Z

It's dangerous to delete the failed data of the stage when we retry. It's hard to reach the condition to delete the data.

Could you describe more?

Some tasks may still write legacy data to the shuffle server after you delete the shuffle data. Even though we resubmit the stage, some tasks from the last attempt may still be writing data. Spark doesn't guarantee that all tasks from the last attempt have ended by the time the newest attempt starts.

If so, we'd better reject shuffle data from older attempts. This could be implemented by maintaining the latest stage attempt id.

jerqi · 2024-03-25T07:28:12Z

It's dangerous to delete the failed data of the stage when we retry. It's hard to reach the condition to delete the data.

Could you describe more?

Some tasks may still write legacy data to the shuffle server after you delete the shuffle data. Even though we resubmit the stage, some tasks from the last attempt may still be writing data. Spark doesn't guarantee that all tasks from the last attempt have ended by the time the newest attempt starts.

If so, we'd better reject shuffle data from older attempts. This could be implemented by maintaining the latest stage attempt id.

OK, maybe rejecting the legacy data is the better choice.

jerqi · 2024-03-25T07:28:43Z

@EnricoMi If a stage is retried, the taskId may not be unique, because we don't have a stage attemptId to distinguish task 1 attempt 0 in stage attempt 0 from task 1 attempt 0 in stage attempt 1. This may cause us to read the wrong data.

Ignore this. Maybe rejecting legacy data is the better choice.
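
To make the "reject legacy data" idea concrete, here is a minimal sketch of server-side tracking of the latest stage attempt per shuffle. It is an illustration only, not the code in this PR: `LatestStageAttemptTracker`, `onRegister`, and `shouldAccept` are hypothetical names, and rejecting writes for shuffles that were never registered is an assumption.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical helper: remembers the newest stage attempt registered for each shuffle. */
class LatestStageAttemptTracker {
  private final ConcurrentMap<Integer, Integer> latestAttemptByShuffle =
      new ConcurrentHashMap<>();

  /** Called on (re-)registration; keeps the highest attempt number seen so far. */
  void onRegister(int shuffleId, int stageAttemptNumber) {
    latestAttemptByShuffle.merge(shuffleId, stageAttemptNumber, Integer::max);
  }

  /** Returns true only if the request belongs to the newest registered attempt. */
  boolean shouldAccept(int shuffleId, int stageAttemptNumber) {
    Integer latest = latestAttemptByShuffle.get(shuffleId);
    // Assumption: data for a shuffle that was never registered is rejected as well.
    return latest != null && latest.intValue() == stageAttemptNumber;
  }
}
```

With this shape, a late writer from stage attempt 0 is turned away as soon as attempt 1 has registered, without needing to know whether attempt 0's data has been deleted yet.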

EnricoMi · 2024-03-25T09:07:19Z

server/src/main/java/org/apache/uniffle/server/ShuffleServerGrpcService.java

@@ -158,6 +158,30 @@ public void registerShuffle(
    String remoteStoragePath = req.getRemoteStorage().getPath();
    String user = req.getUser();

+    if (req.getIsStageRetry()) {


If removeShuffleDataSync is always called, we can avoid plumbing isStageRetry through here. When isStageRetry == false, this is a no-op.

Method removeShuffleDataSync might return true if it found data to delete, so we can conditionally log the message below.

I prefer keeping the isStageRetry param (or replacing it with the stage attempt number) for 2 reasons:

1. It makes stage retry more explicit, especially when something goes wrong, e.g. the previous data has already been purged due to an expired heartbeat. With this param, the log will indicate that something abnormal happened.

2. In the next PR, I will introduce the latest stage attempt to discard data from older attempts.

All this plumbing just for logging is peculiar.

Maybe there are better mechanisms to discard older data.
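
The first suggestion in this thread (always attempt the removal and log only when something was actually deleted) could look roughly like the sketch below. It uses an in-memory map as a stand-in for server storage; `ShuffleStoreSketch`, `removePreviousData`, and `register` are hypothetical names and do not reflect the real signature of removeShuffleDataSync.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in for the shuffle server's storage. */
class ShuffleStoreSketch {
  private final Map<Integer, byte[]> dataByShuffleId = new ConcurrentHashMap<>();

  void write(int shuffleId, byte[] data) {
    dataByShuffleId.put(shuffleId, data);
  }

  /** Removes data for the shuffle; returns true only if there was something to delete. */
  boolean removePreviousData(int shuffleId) {
    return dataByShuffleId.remove(shuffleId) != null;
  }

  /** Always attempts cleanup on registration and logs only when old data was found. */
  boolean register(int shuffleId) {
    boolean removed = removePreviousData(shuffleId);
    if (removed) {
      System.out.println(
          "Removed previous data for shuffleId " + shuffleId + " before re-registering");
    }
    return removed;
  }
}
```

The trade-off raised in the reply still applies: this variant cannot tell a first registration apart from a retry whose data was already purged, which is what an explicit flag or attempt number would make visible.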

zuston · 2024-03-26T03:55:14Z

Could you help review this, @EnricoMi @jerqi? The spark2 change will be finished once this PR is OK with you.

jerqi · 2024-03-26T05:45:28Z

Could you help review this, @EnricoMi @jerqi? The spark2 change will be finished once this PR is OK with you.

Several questions:

How to reject the legacy requests?
How to delete the legacy shuffle?

zuston · 2024-03-26T06:02:06Z

How to reject the legacy requests?

By using the latest attempt id on the server side to check whether a send request comes from an older attempt. This will be done in the next PR.

How to delete the legacy shuffle?

This is already covered in this PR.

EnricoMi · 2024-03-26T10:49:08Z

Can we register a shuffle as the tuple (shuffle_id, stage_attempt_id)? This way, we do not need to wait for (shuffle_id, 0) to be deleted synchronously, and can go on registering and writing (shuffle_id, 1). Deletion could take a significant time for large partitions (think TBs).

EnricoMi · 2024-03-26T10:53:01Z

I think deletion of earlier shuffle data should not be synchronous in the first place! That is flawed by design. Think of TBs of shuffle data. It should be removed quickly, in constant time (e.g. an HDFS move), and cleaned up asynchronously (e.g. an HDFS delete).
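
The "move now, delete later" pattern described here can be sketched as follows. This is only an illustration under stated assumptions: it uses the local filesystem via `java.nio.file` as a stand-in for HDFS, and `DeferredDeleteSketch`, `moveThenDeleteAsync`, and the `.trash-` suffix are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Stream;

/** Hypothetical helper: rename the directory out of the way, then delete it in the background. */
class DeferredDeleteSketch {

  /** The rename is cheap on the same filesystem; the returned future tracks the slow delete. */
  static CompletableFuture<Void> moveThenDeleteAsync(Path dir) {
    Path trash = dir.resolveSibling(dir.getFileName() + ".trash-" + System.nanoTime());
    try {
      Files.move(dir, trash);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return CompletableFuture.runAsync(() -> deleteRecursively(trash));
  }

  private static void deleteRecursively(Path root) {
    try (Stream<Path> walk = Files.walk(root)) {
      // Reverse order visits children before their parent directories.
      walk.sorted(Comparator.reverseOrder()).forEach(p -> {
        try {
          Files.delete(p);
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The caller can start writing to the original path as soon as the method returns, while the size-dependent cleanup proceeds on another thread.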

zuston · 2024-03-26T11:32:03Z

Can we register a shuffle as the tuple (shuffle_id, stage_attempt_id)? This way, we do not need to wait for (shuffle_id, 0) to be deleted synchronously, and can go on registering and writing (shuffle_id, 1). Deletion could take a significant time for large partitions (think TBs).

Agree with you. I’m concerned about the cost of the refactor.
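
For reference, keying registrations by (shuffle_id, stage_attempt_id) could be sketched like this. It is a hypothetical illustration of the proposal, not existing Uniffle code: `ShuffleAttemptKey`, `AttemptScopedRegistry`, and the storage-path value are assumptions.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical composite key: one registration per (shuffleId, stageAttemptId). */
final class ShuffleAttemptKey {
  final int shuffleId;
  final int stageAttemptId;

  ShuffleAttemptKey(int shuffleId, int stageAttemptId) {
    this.shuffleId = shuffleId;
    this.stageAttemptId = stageAttemptId;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof ShuffleAttemptKey)) {
      return false;
    }
    ShuffleAttemptKey other = (ShuffleAttemptKey) o;
    return shuffleId == other.shuffleId && stageAttemptId == other.stageAttemptId;
  }

  @Override
  public int hashCode() {
    return Objects.hash(shuffleId, stageAttemptId);
  }
}

/** Hypothetical registry in which attempts of the same shuffle never collide. */
class AttemptScopedRegistry {
  private final Map<ShuffleAttemptKey, String> storagePathByKey = new ConcurrentHashMap<>();

  /** Returns true if this (shuffleId, stageAttemptId) pair was not registered before. */
  boolean register(int shuffleId, int stageAttemptId, String storagePath) {
    ShuffleAttemptKey key = new ShuffleAttemptKey(shuffleId, stageAttemptId);
    return storagePathByKey.putIfAbsent(key, storagePath) == null;
  }

  boolean isRegistered(int shuffleId, int stageAttemptId) {
    return storagePathByKey.containsKey(new ShuffleAttemptKey(shuffleId, stageAttemptId));
  }
}
```

Because (shuffle_id, 1) has its own entry, it can be registered and written while (shuffle_id, 0) is still waiting to be cleaned up. The refactoring cost mentioned above comes from every lookup that currently uses shuffle_id alone.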

jerqi · 2024-05-15T08:50:26Z

proto/src/main/proto/Rss.proto

@@ -184,6 +184,7 @@ message ShuffleRegisterRequest {
  string user = 5;
  DataDistribution shuffleDataDistribution = 6;
  int32 maxConcurrencyPerPartitionToWrite = 7;
+  int32 stageAttemptNumber = 8;


How do we reject legacy data? Legacy writers won't call the register request.

Legacy requests could be rejected according to attemptNumber in the sendShuffleData and reportShuffleResult RPCs.

cc @yl09099 Please pay more attention to this.

This should be solved.

zuston · 2024-05-16T02:28:48Z

client-spark/common/src/main/java/org/apache/spark/shuffle/handle/ChainShuffleHandleInfo.java

+public class ChainShuffleHandleInfo extends ShuffleHandleInfoBase {
+  private static final Logger LOGGER = LoggerFactory.getLogger(MutableShuffleHandleInfo.class);
+
+  private Map<Integer, List<ShuffleServerInfo>> currentPartitionToServers;


Emmm. This is not right.

private ShuffleHandleInfo current; private LinkedList<ShuffleHandleInfo> historyHandles;
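
A minimal sketch of the suggested shape (one current handle plus a list of historical handles) is shown below. To stay self-contained it is generic over the handle type rather than using ShuffleHandleInfo; `ChainedHandleSketch`, `replaceCurrent`, and the accessors are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Objects;

/** Hypothetical holder: the handle of the newest stage attempt plus all earlier ones. */
class ChainedHandleSketch<H> {
  private H current;
  private final LinkedList<H> historyHandles = new LinkedList<>();

  ChainedHandleSketch(H initial) {
    this.current = Objects.requireNonNull(initial);
  }

  synchronized H getCurrent() {
    return current;
  }

  /** Returns a copy so callers cannot mutate the internal history. */
  synchronized List<H> getHistoryHandles() {
    return new ArrayList<>(historyHandles);
  }

  /** On stage retry: archive the current handle and switch to the new one. */
  synchronized void replaceCurrent(H next) {
    historyHandles.add(current);
    current = Objects.requireNonNull(next);
  }
}
```

Keeping whole handles in the history, rather than a raw partition-to-servers map, means each stage attempt retains its own complete assignment.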

zuston · 2024-05-16T02:32:26Z

client-spark/common/src/main/java/org/apache/spark/shuffle/handle/ShuffleHandleInfo.java

+   * When a Stage retry occurs, replace the current PartitionToShuffleServer and record the
+   * historical PartitionToShuffleServe.
+   */
+  default void replaceCurrentShuffleHandleInfo(


I think there is no need to introduce an extra general interface method here. If you want to update the handleInfo, you could cast it to ChainShuffleHandleInfo and then update its inner current handle.

zuston · 2024-05-16T02:34:01Z

client-spark/common/src/main/java/org/apache/uniffle/shuffle/manager/RssShuffleManagerBase.java

+  @Override
+  public boolean reassignOnStageResubmit(
+      int stageId, int stageAttemptNumber, int shuffleId, int numPartitions) {
+    synchronized (reassignLock) {


The reassignLock covers the whole app, but given the original semantics of this RPC, I think the lock should only apply at the shuffleId level.
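
One way to scope the lock to a shuffleId is a lazily populated map of locks, sketched below. The method name `getShuffleWriteLock` and its return type match a later revision of this PR, but the body and the `ShuffleLocksSketch` class are assumptions for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical holder of one read-write lock per shuffleId. */
class ShuffleLocksSketch {
  private final ConcurrentMap<Integer, ReentrantReadWriteLock> locksByShuffleId =
      new ConcurrentHashMap<>();

  /** Creates the lock on first use; later calls for the same shuffleId reuse it. */
  ReentrantReadWriteLock.WriteLock getShuffleWriteLock(int shuffleId) {
    return locksByShuffleId
        .computeIfAbsent(shuffleId, id -> new ReentrantReadWriteLock())
        .writeLock();
  }
}
```

With this layout, a reassignment for one shuffle does not block reassignments for unrelated shuffles in the same application.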

zuston · 2024-05-16T02:36:13Z

client-spark/common/src/main/java/org/apache/uniffle/shuffle/manager/RssShuffleManagerBase.java

+    synchronized (reassignLock) {
+      String stageIdAndAttempt = stageId + "_" + stageAttemptNumber;
+      Boolean needReassign =
+          rssStageResubmitManager.recordAndGetServerAssignedInfo(stageIdAndAttempt);


And if the attempt is less than the existing max attempt number, the reassign should be rejected as illegal.
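
The check described here could be sketched as follows. This is a hypothetical illustration, not the implementation of recordAndGetServerAssignedInfo: `StageAttemptGuardSketch` and `recordAndCheckNeedReassign` are invented names, and throwing `IllegalStateException` for a stale attempt is an assumption about how "illegal" would be reported.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical guard: tracks the highest attempt number seen per stage. */
class StageAttemptGuardSketch {
  private final Map<Integer, Integer> maxAttemptByStage = new HashMap<>();

  /**
   * Returns true if this attempt is newer than any seen before (reassign needed),
   * false if it was already recorded, and throws if it is older than the max.
   */
  synchronized boolean recordAndCheckNeedReassign(int stageId, int stageAttemptNumber) {
    Integer max = maxAttemptByStage.get(stageId);
    if (max != null && stageAttemptNumber < max) {
      throw new IllegalStateException(
          "Stale attempt " + stageAttemptNumber + " for stage " + stageId + ", max is " + max);
    }
    if (max != null && stageAttemptNumber == max) {
      return false; // Another writer of this attempt already triggered the reassign.
    }
    maxAttemptByStage.put(stageId, stageAttemptNumber);
    return true;
  }
}
```

Comparing numeric attempt numbers, rather than checking set membership of a `stageId + "_" + stageAttemptNumber` string, is what makes the "older than max" case detectable.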

zuston · 2024-05-16T02:37:40Z

client-spark/common/src/main/java/org/apache/spark/shuffle/RssStageResubmitManager.java

+
+public class RssStageResubmitManager {
+  /** A list of shuffleServer for Write failures */
+  private Set<String> failuresShuffleServerIds;


This lacks a default constructor to initialize these fields.

zuston · 2024-05-22T06:43:06Z

client-spark/common/src/main/java/org/apache/spark/shuffle/handle/ChainShuffleHandleInfo.java

+import org.apache.uniffle.common.ShuffleServerInfo;
+import org.apache.uniffle.proto.RssProtos;
+
+public class ChainShuffleHandleInfo extends ShuffleHandleInfoBase {


How about renaming this to StageAttemptShuffleHandleInfo?

zuston · 2024-05-22T06:43:45Z

client-spark/common/src/main/java/org/apache/spark/shuffle/handle/MutableShuffleHandleInfo.java

-                  .build();
-          replicaServersProto.put(replicaServerEntry.getKey(), item);
-        }
+    Map<Integer, RssProtos.PartitionReplicaServers> partitionToServers = new HashMap<>();


Why remove the synchronized?

zuston · 2024-05-22T06:47:43Z

client-spark/common/src/main/java/org/apache/uniffle/shuffle/manager/RssShuffleManagerBase.java

+  @Override
+  public boolean reassignOnStageResubmit(
+      int stageId, int stageAttemptNumber, int shuffleId, int numPartitions) {
+    ReentrantReadWriteLock.WriteLock shuffleWriteLock = getShuffleWriteLock(shuffleId);


I think this could also be moved into the StageResubmitManager.

zuston · 2024-05-22T06:49:57Z

...park/common/src/main/java/org/apache/uniffle/shuffle/manager/RssShuffleManagerInterface.java


-  MutableShuffleHandleInfo reassignOnBlockSendFailure(
+  ChainShuffleHandleInfo reassignOnBlockSendFailure(


This should still be MutableShuffleHandleInfo.

zuston · 2024-05-22T06:51:24Z

proto/src/main/proto/Rss.proto

@@ -184,6 +184,7 @@ message ShuffleRegisterRequest {
  string user = 5;
  DataDistribution shuffleDataDistribution = 6;
  int32 maxConcurrencyPerPartitionToWrite = 7;
+  int32 stageAttemptNumber = 8;


This should be solved.

zuston force-pushed the stageRetry2 branch from 17944ab to 1c0710a Compare March 15, 2024 03:15

leslizhang pushed a commit to leslizhang/incubator-uniffle that referenced this pull request Mar 19, 2024

[apache#1584] Add metrics about block size distribution

27eeb41

zuston force-pushed the stageRetry2 branch 2 times, most recently from b192095 to 4d3a892 Compare March 22, 2024 08:25

zuston changed the title ~~[#1579] fix(spark): clear out previous stage attempt data synchronously~~ [#1579] fix(spark): Adjust reassgin time to avoid failure to clean up previous stage data Mar 22, 2024

EnricoMi reviewed Mar 22, 2024

EnricoMi reviewed Mar 25, 2024

zuston changed the title ~~[#1579] fix(spark): Adjust reassgin time to avoid failure to clean up previous stage data~~ [#1579] fix(spark): Adjust reassigned time to ensure that all previous data is cleared for stage retry Mar 26, 2024

zuston changed the title ~~[#1579] fix(spark): Adjust reassigned time to ensure that all previous data is cleared for stage retry~~ [#1579][part-1] fix(spark): Adjust reassigned time to ensure that all previous data is cleared for stage retry Mar 26, 2024

zuston requested review from EnricoMi and jerqi March 26, 2024 03:49

zuston mentioned this pull request Mar 26, 2024

[#1579][part-2] feat(spark): allow to register shuffle in parallel #1604

Closed

zuston force-pushed the stageRetry2 branch from 536a32f to 1714ab3 Compare March 26, 2024 09:43

yl09099 force-pushed the stageRetry2 branch from af3264f to 57894b5 Compare May 15, 2024 06:18

jerqi reviewed May 15, 2024

zuston commented May 16, 2024

yl09099 force-pushed the stageRetry2 branch 17 times, most recently from d922716 to ae02409 Compare May 21, 2024 06:07

zuston commented May 22, 2024

[apache#1579][part-1] fix(spark): Adjust reassigned time to ensure that all previous data is cleared for stage retry

fa40ceb

yl09099 force-pushed the stageRetry2 branch from ae02409 to fa40ceb Compare May 23, 2024 06:11

[apache#1579][part-1] fix(spark): Adjust reassigned time to ensure that all previous data is cleared for stage retry

595750b

yl09099 force-pushed the stageRetry2 branch 4 times, most recently from 522926d to 5c9d9e3 Compare May 24, 2024 09:42

[apache#1579][part-1] fix(spark): Adjust reassigned time to ensure that all previous data is cleared for stage retry

3dd6b34

yl09099 force-pushed the stageRetry2 branch from 5c9d9e3 to 3dd6b34 Compare May 24, 2024 11:21
