[#1398] fix(mr,tez): Make attempId computable and move it to taskAttemptId in BlockId layout. #1418

qijiale76 · 2024-01-04T10:15:26Z

What changes were proposed in this pull request?

Before this PR, in the MR and Tez engines:

attemptId is stored in the sequenceNo part of the BlockId instead of the taskAttemptId part.
taskAttemptId is a long, although an int is sufficient.
attemptId has a fixed width of 6 bits.

After this PR:

attemptId is stored in the taskAttemptId part, which is more reasonable.
taskAttemptId is changed to int.
The number of attemptId bits is calculated from the maximum number of allowed failures and whether speculative execution is enabled.
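To make the last point concrete, here is a minimal Java sketch of how the bit width for attemptId could be derived from the two inputs named above. The class and method names (`AttemptBitsSketch`, `maxAttemptNo`, `attemptIdBits`) are hypothetical, and the rule that speculative execution adds exactly one extra attempt is an assumption for illustration; the conversation does not show the actual implementation.

```java
// Hypothetical sketch: class and method names are illustrative, not Uniffle's API.
final class AttemptBitsSketch {

  // Highest zero-based attempt number that can occur for one task.
  static int maxAttemptNo(int maxFailures, boolean speculation) {
    // Attempts are numbered 0 .. maxFailures - 1; values below 1 are treated as 1.
    int max = maxFailures < 1 ? 0 : maxFailures - 1;
    // Assumption: speculative execution can add one more attempt.
    return speculation ? max + 1 : max;
  }

  // Number of bits needed to store every value in 0 .. maxAttemptNo.
  static int attemptIdBits(int maxFailures, boolean speculation) {
    return 32 - Integer.numberOfLeadingZeros(maxAttemptNo(maxFailures, speculation));
  }

  public static void main(String[] args) {
    System.out.println(attemptIdBits(4, false)); // max attempt no 3, prints 2
    System.out.println(attemptIdBits(4, true));  // max attempt no 4, prints 3
  }
}
```

Compared with a fixed width, a computed width leaves every bit that attemptId does not need to the task part of taskAttemptId.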

Why are the changes needed?

Fix: #1398

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing unit tests and integration tests.

codecov-commenter · 2024-01-04T10:26:56Z

Codecov Report

Attention: Patch coverage is 79.10448%, with 14 lines in your changes missing coverage. Please review.

Project coverage is 54.88%. Comparing base (dd67774) to head (bc5585d).
Report is 4 commits behind head on master.

Files	Patch %	Lines
...c/main/java/org/apache/tez/common/RssTezUtils.java	76.19%	4 Missing and 1 partial ⚠️
...rg/apache/hadoop/mapred/RssMapOutputCollector.java	0.00%	3 Missing ⚠️
...library/common/shuffle/impl/RssTezFetcherTask.java	0.00%	3 Missing ⚠️
...library/common/shuffle/impl/RssShuffleManager.java	0.00%	2 Missing ⚠️
...n/java/org/apache/hadoop/mapreduce/RssMRUtils.java	95.00%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master    #1418      +/-   ##
============================================
+ Coverage     54.01%   54.88%   +0.87%     
- Complexity     2863     2868       +5     
============================================
  Files           438      418      -20     
  Lines         24850    22552    -2298     
  Branches       2114     2120       +6     
============================================
- Hits          13423    12378    -1045     
+ Misses        10586     9406    -1180     
+ Partials        841      768      -73


qijiale76 · 2024-01-04T10:48:54Z

@jerqi Could you please provide suggestions on areas that need improvement?

zuston · 2024-01-08T09:33:38Z

cc @zhengchenyu, could you help review this?

zhengchenyu · 2024-02-27T06:43:44Z

Since #1529 has been merged into master, I think we should review this PR.
After this PR, the block id calculation for Spark, MR and Tez will be consistent, which reduces the probability of overflow problems.

@qijiale76 Can you rework the code on top of #1529?

zuston · 2024-02-29T03:48:30Z

@qijiale76 Do you want to push this forward?

qijiale76 · 2024-02-29T04:30:42Z

@qijiale76 Do you want to push this forward?

Yes, I’ll reconstruct the code next week.

github-actions · 2024-03-21T13:37:00Z

Test Results

2 340 files ±0   2 340 suites ±0   4h 30m 5s ⏱️ -1m 56s
908 tests ±0   907 ✅ ±0   1 💤 ±0   0 ❌ ±0
10 551 runs +10   10 537 ✅ +10   14 💤 ±0   0 ❌ ±0

Results for commit bc5585d. ± Comparison against base commit 32d533d.

This pull request removes 2 and adds 2 tests. Note that renamed tests count towards both.

org.apache.uniffle.shuffle.manager.RssShuffleManagerBaseTest ‑ testGetAttemptIdBits
org.apache.uniffle.shuffle.manager.RssShuffleManagerBaseTest ‑ testGetMaxAttemptNo

org.apache.uniffle.client.ClientUtilsTest ‑ testGetMaxAttemptNo
org.apache.uniffle.client.ClientUtilsTest ‑ testGetNumberOfSignificantBits

qijiale76 · 2024-03-28T11:44:43Z

@EnricoMi Thanks for your very helpful review. I have updated this PR based on your suggestions and by referring to Spark's implementation. Could you please review the latest code again?

EnricoMi · 2024-04-18T08:44:18Z

common/src/main/java/org/apache/uniffle/common/util/BlockIdLayout.java

@@ -143,7 +143,7 @@ public int hashCode() {
    return Objects.hash(sequenceNoBits, partitionIdBits, taskAttemptIdBits);
  }

-  public long getBlockId(int sequenceNo, int partitionId, long taskAttemptId) {
+  public long getBlockId(int sequenceNo, int partitionId, int taskAttemptId) {


I think this long -> int change should be reverted, because this is where we check that the original long task attempt id fits within the block id layout constraints. Only task attempt ids that the block id layout has accepted are reduced to int.
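A simplified Java sketch of why the parameter type matters. `LayoutSketch` is a hypothetical stand-in, not the real BlockIdLayout; the 18/24/21 bit split is borrowed from the earlier PR title as an example only. With a long parameter, an id that does not fit is rejected by the range check, whereas an int parameter would force callers to cast first and could truncate the id silently.

```java
// Hypothetical, simplified layout: [sequenceNo | partitionId | taskAttemptId].
final class LayoutSketch {
  private final int partitionIdBits;
  private final int taskAttemptIdBits;
  private final long maxSequenceNo;
  private final long maxPartitionId;
  private final long maxTaskAttemptId;

  LayoutSketch(int sequenceNoBits, int partitionIdBits, int taskAttemptIdBits) {
    if (sequenceNoBits < 1 || partitionIdBits < 1 || taskAttemptIdBits < 1
        || sequenceNoBits > 31 || partitionIdBits > 31 || taskAttemptIdBits > 31
        || sequenceNoBits + partitionIdBits + taskAttemptIdBits != 63) {
      throw new IllegalArgumentException("each part needs 1..31 bits, summing to 63");
    }
    this.partitionIdBits = partitionIdBits;
    this.taskAttemptIdBits = taskAttemptIdBits;
    this.maxSequenceNo = (1L << sequenceNoBits) - 1;
    this.maxPartitionId = (1L << partitionIdBits) - 1;
    this.maxTaskAttemptId = (1L << taskAttemptIdBits) - 1;
  }

  long getBlockId(int sequenceNo, int partitionId, long taskAttemptId) {
    check("sequenceNo", sequenceNo, maxSequenceNo);
    check("partitionId", partitionId, maxPartitionId);
    // taskAttemptId is still a long here, so an oversized id is rejected
    // rather than silently truncated by an earlier (int) cast.
    check("taskAttemptId", taskAttemptId, maxTaskAttemptId);
    return ((long) sequenceNo << (partitionIdBits + taskAttemptIdBits))
        | ((long) partitionId << taskAttemptIdBits)
        | taskAttemptId;
  }

  private static void check(String name, long value, long max) {
    if (value < 0 || value > max) {
      throw new IllegalArgumentException(name + " " + value + " is outside 0.." + max);
    }
  }

  public static void main(String[] args) {
    LayoutSketch layout = new LayoutSketch(18, 24, 21);
    System.out.println(layout.getBlockId(1, 2, 3L));
  }
}
```

For instance, the id `(1L << 32) + 3` would become 3 after an `(int)` cast and pass any later check, while the long parameter lets the layout throw instead.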

EnricoMi · 2024-04-18T08:46:11Z

client-mr/core/src/main/java/org/apache/hadoop/mapreduce/RssMRUtils.java

-        taskAttemptId - (attemptId << (LAYOUT.partitionIdBits + LAYOUT.taskAttemptIdBits));
-
-    return LAYOUT.getBlockId(atomicInt, partitionId, taskId);
+  public static long getBlockId(int partitionId, int taskAttemptId, int nextSeqNo) {


Technically, the taskAttemptId can be long here, as this is before the block id layout checks the bit size constraint (though we only feed this method int taskAttemptIds produced by RssMRUtils.createRssTaskAttemptId()):

Suggested change

-  public static long getBlockId(int partitionId, int taskAttemptId, int nextSeqNo) {
+  public static long getBlockId(int partitionId, long taskAttemptId, int nextSeqNo) {

EnricoMi · 2024-04-18T08:46:53Z

client-mr/core/src/main/java/org/apache/hadoop/mapreduce/RssMRUtils.java

-
-    return LAYOUT.getBlockId(atomicInt, partitionId, taskId);
+  public static long getBlockId(int partitionId, int taskAttemptId, int nextSeqNo) {
+    return LAYOUT.getBlockId(nextSeqNo, partitionId, taskAttemptId);
  }

  public static long getTaskAttemptId(long blockId) {


This task attempt id is derived from the block id, hence it is reduced in its bit size:

Suggested change

-  public static long getTaskAttemptId(long blockId) {
+  public static int getTaskAttemptId(long blockId) {

The caller of this method can continue to upcast the returned int to long, no problem.
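A small Java sketch of that reasoning, using a hypothetical class `TaskAttemptIdExtractSketch` and an assumed width of 21 bits for the task attempt id (the real width comes from the block id layout). Because the mask keeps fewer than 32 bits, the narrowing cast cannot lose information, and callers that still hold a long get an implicit widening conversion.

```java
// Hypothetical sketch: the constant and names are assumptions, not Uniffle's API.
final class TaskAttemptIdExtractSketch {
  // Example width of the task attempt id part, assumed to be the lowest bits.
  static final int TASK_ATTEMPT_ID_BITS = 21;

  static int getTaskAttemptId(long blockId) {
    // The mask keeps at most 21 bits, so the result always fits into an int.
    return (int) (blockId & ((1L << TASK_ATTEMPT_ID_BITS) - 1));
  }

  public static void main(String[] args) {
    long blockId = (5L << 45) | (7L << 21) | 12345L;
    // Existing callers that use long keep working through implicit widening.
    long taskAttemptId = getTaskAttemptId(blockId);
    System.out.println(taskAttemptId); // prints 12345
  }
}
```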

EnricoMi · 2024-04-18T08:49:38Z

client-tez/src/main/java/org/apache/tez/common/RssTezUtils.java

-
-    return LAYOUT.getBlockId(atomicInt, partitionId, taskId);
+  public static long getBlockId(int partitionId, int taskAttemptId, int nextSeqNo) {
+    return LAYOUT.getBlockId(nextSeqNo, partitionId, taskAttemptId);
  }

  public static long getTaskAttemptId(long blockId) {


This task attempt id is derived from the block id, hence it is reduced in its bit size:

Suggested change

-  public static long getTaskAttemptId(long blockId) {
+  public static int getTaskAttemptId(long blockId) {

The caller of this method can continue to upcast the returned int to long, no problem.

EnricoMi · 2024-04-18T08:52:16Z

client-tez/src/main/java/org/apache/tez/common/RssTezUtils.java

-        taskAttemptId - (attemptId << (LAYOUT.partitionIdBits + LAYOUT.taskAttemptIdBits));
-
-    return LAYOUT.getBlockId(atomicInt, partitionId, taskId);
+  public static long getBlockId(int partitionId, int taskAttemptId, int nextSeqNo) {


Technically, the taskAttemptId can be long here, as this is before the block id layout checks the bit size constraint (though we only feed this method int taskAttemptIds produced by RssTezUtils.createRssTaskAttemptId()):

Suggested change

-  public static long getBlockId(int partitionId, int taskAttemptId, int nextSeqNo) {
+  public static long getBlockId(int partitionId, long taskAttemptId, int nextSeqNo) {

EnricoMi · 2024-04-18T09:06:39Z

common/src/main/java/org/apache/uniffle/common/util/BlockIdLayout.java

@@ -185,13 +185,13 @@ public BlockId asBlockId(long blockId) {
        blockId, this, getSequenceNo(blockId), getPartitionId(blockId), getTaskAttemptId(blockId));
  }

-  public BlockId asBlockId(int sequenceNo, int partitionId, long taskAttemptId) {


EnricoMi · 2024-04-18T09:26:29Z

client-mr/core/src/main/java/org/apache/hadoop/mapred/SortWriteBufferManager.java

@@ -64,7 +64,7 @@ public class SortWriteBufferManager<K, V> {
  private final Counters.Counter mapOutputRecordCounter;
  private long uncompressedDataLen = 0;
  private long compressTime = 0;
-  private final long taskAttemptId;
+  private final int taskAttemptId;


I am not sure about restricting taskAttemptIds to int in places like this.

Here is the situation:

1. Spark, Tez and MR provide us with long task attempt ids (for Tez and MR, (taskId, attemptId) constitutes a long task attempt id, which we restrict to int for similar reasons as in 2.).

2. For the purpose of the block id, we limit those long task attempt ids to int, since we allow fewer than 32 bits for them.

3. The task attempt id retrieved from the block id is an int because of that.

4. Still, all other places could continue to work with long task attempt ids if that makes no difference for that code. Up-casting int task attempt ids to long does no harm, as long as the code works with long.

This allows us to support truly long task attempt ids in the future without having to revert such code changes.
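The first two points of the situation above could look roughly like this for MR or Tez. This is a hedged Java sketch: `TaskAttemptIdSketch` and `createTaskAttemptId` are hypothetical names, and placing the taskId in the high bits with the attemptId in the low bits is an arbitrary choice for illustration, not necessarily what RssMRUtils or RssTezUtils do.

```java
// Hypothetical sketch: packs (taskId, attemptId) into an int that fits the layout.
final class TaskAttemptIdSketch {

  static int createTaskAttemptId(
      int taskId, int attemptId, int attemptBits, int taskAttemptIdBits) {
    if (attemptBits < 0 || attemptBits >= taskAttemptIdBits || taskAttemptIdBits > 31) {
      throw new IllegalArgumentException("invalid bit widths");
    }
    // The attempt id must fit into the bits reserved for it.
    if (attemptId < 0 || attemptId >= (1L << attemptBits)) {
      throw new IllegalArgumentException("attemptId " + attemptId + " needs more than "
          + attemptBits + " bits");
    }
    // The task id gets whatever the attempt id leaves over.
    int taskBits = taskAttemptIdBits - attemptBits;
    if (taskId < 0 || taskId >= (1L << taskBits)) {
      throw new IllegalArgumentException("taskId " + taskId + " needs more than "
          + taskBits + " bits");
    }
    // Assumed packing order: taskId in the high bits, attemptId in the low bits.
    return (taskId << attemptBits) | attemptId;
  }

  public static void main(String[] args) {
    int id = createTaskAttemptId(5, 2, 3, 21);
    long widened = id; // code that works with long can keep doing so
    System.out.println(widened); // prints 42
  }
}
```

The result is below 2^31 by construction, so storing it in a long field elsewhere costs nothing and keeps that code ready for wider ids.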

@zuston @jerqi @zhengchenyu what do you think?

zuston requested a review from zhengchenyu January 8, 2024 09:33

fix(MR)(TEZ): Limit attemptId to 4 bit and move it from sequenceNo to taskAttemptId in BlockId.

f2bbf08

qijiale76 force-pushed the issue#1398 branch from 239c418 to f2bbf08 Compare March 21, 2024 07:34

EnricoMi reviewed Mar 21, 2024

qijiale76 added 5 commits March 22, 2024 19:57

According to the review, modify the code and change taskAttemptId from long to int.

ba15386

Resolve failed tests.

8787f0d

Resolve failed tests.

8fdfda5

Resolve tez bug.

08311bc

Calculate attemptBits from conf.

fdca0a5

jerqi changed the title ~~[#1398] fix(MR)(TEZ): Limit attemptId to 4 bit and move it from 18 bit atomicInt to 21 bit taskAttemptId in 63 bit BlockId.~~ [#1398] fix(mr,tez): Limit attemptId to 4 bit and move it from 18 bit atomicInt to 21 bit taskAttemptId in 63 bit BlockId. Mar 28, 2024

Resolve failed checkstyle.

f84ad00

qijiale76 requested a review from EnricoMi March 28, 2024 11:19

qijiale76 changed the title ~~[#1398] fix(mr,tez): Limit attemptId to 4 bit and move it from 18 bit atomicInt to 21 bit taskAttemptId in 63 bit BlockId.~~ [#1398] fix(mr,tez): Make attempId computable and move it to taskAttemptId in BlockId layout. Mar 28, 2024

qijiale76 marked this pull request as ready for review March 28, 2024 11:40

EnricoMi reviewed Mar 28, 2024

qijiale76 added 2 commits March 29, 2024 15:53

Update the code according to the review.

ee21967

fix checkstyle.

bc5585d

qijiale76 requested a review from EnricoMi April 17, 2024 10:39

EnricoMi requested changes Apr 18, 2024

qijiale76 mentioned this pull request Apr 23, 2024

[#1341] fix(mr): Fix MR Combiner ArrayIndexOutOfBoundsException Bug. #1666

Merged
