Support client partition data reassign #1608

Closed
6 of 9 tasks
zuston opened this issue Mar 29, 2024 · 2 comments · Fixed by #1609 or #1693 · May be fixed by #1617
Comments

@zuston
Member

zuston commented Mar 29, 2024

Motivation

After reviewing #1445 again (partition data reassign, which is disabled by default in the master branch), I found some bugs and design problems. I will use this issue to track the further optimizations.

Subtasks tracking

Design thought

reassign rpc between the Spark driver and executors

This part is covered in the #1445 design doc, so I will not describe it further.

reassign signal propagation

In the current codebase, the latest reassigned partition -> servers plan won't be propagated to tasks that start later.
To solve this problem, I will make the writer always get the latest partition -> servers plan. Once a reassign happens, the cached shuffleHandleInfo will be updated with the result returned by the reassign rpc.

For a task (task2) that starts after the reassigned tasks have finished, task2 will get the latest plan from the replacement + normal server list. This avoids writing to the faulty servers again (see the sketch below).
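As a rough illustration of this idea, the writer could resolve servers through a small cache that the reassign rpc refreshes. This is a hypothetical sketch: `ShuffleHandleInfo` stands for the cached handle mentioned above, and the method names are assumptions, not the actual Uniffle API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: cache one handle per shuffle and refresh it whenever
// the reassign rpc returns an updated partition -> servers plan, so tasks
// that start later always see the latest plan.
public class ShuffleHandleInfoCache {
  private final Map<Integer, ShuffleHandleInfo> handles = new ConcurrentHashMap<>();

  // Called by the writer before building each send event.
  public ShuffleHandleInfo latest(int shuffleId) {
    return handles.get(shuffleId);
  }

  // Called with the plan returned by the reassign rpc.
  public void refresh(int shuffleId, ShuffleHandleInfo updated) {
    handles.put(shuffleId, updated);
  }
}
```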

reassign multiple servers for one partition

This topic is scoped to the single-replica case.

Different partition types will use different strategies for the partition -> multiple servers assignment (a minimal sketch of both strategies follows).
For a huge partition, I hope that after recognizing the huge_partition, we will request multiple reassigned servers via rpc, and each task will pick its own partition server by hashing its taskAttemptId, which keeps the load balanced.

For a normal partition, multiple servers only appear when the partition has been reassigned multiple times due to unexpected problems. In this case, the task will always write to the last assigned server.
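A minimal sketch of the two selection strategies, assuming `ShuffleServerInfo` is the server descriptor and the list is ordered oldest-to-newest; the helper names are for illustration only:

```java
import java.util.List;

// Hypothetical sketch of the two strategies described above.
public final class PartitionServerSelector {

  // Huge partition: hash the taskAttemptId over all assigned servers
  // so concurrent tasks spread their writes evenly.
  static ShuffleServerInfo forHugePartition(
      List<ShuffleServerInfo> servers, long taskAttemptId) {
    int index = (int) Math.floorMod(taskAttemptId, (long) servers.size());
    return servers.get(index);
  }

  // Normal partition: extra servers only exist because of repeated
  // reassignments, so always write to the most recently assigned one.
  static ShuffleServerInfo forNormalPartition(List<ShuffleServerInfo> servers) {
    return servers.get(servers.size() - 1);
  }
}
```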


@rickyma
Contributor

rickyma commented Mar 29, 2024

> I found some bugs and design problems

What kind of bugs did you find? What will the bug cause? Data loss, or ... ? Can you elaborate more in this issue?

@zuston zuston changed the title from "[Improvement] One partition data could be written to multiple servers" to "[Improvement] Optimize partition data reassignment" Mar 30, 2024
@zuston
Member Author

zuston commented Mar 30, 2024

> I found some bugs and design problems
>
> What kind of bugs did you find? What will the bug cause? Data loss, or ... ? Can you elaborate more in this issue?

This is just about the partition data reassignment, which does not affect data correctness or cause data loss.

@zuston zuston changed the title from "[Improvement] Optimize partition data reassignment" to "[Improvement] Partition data reassignment" Mar 30, 2024
@zuston zuston changed the title from "[Improvement] Partition data reassignment" to "[Feature] Partition data reassignment" Apr 1, 2024
zuston added a commit to zuston/incubator-uniffle that referenced this issue Apr 1, 2024
zuston added a commit that referenced this issue Apr 2, 2024
…ulty servers in one stage (#1609)

### What changes were proposed in this pull request?

1. Lock the `shuffleHandle` to ensure thread safety when reassigning partial servers for tasks (see the sketch after this list)
2. Only share the replacement servers for faulty servers within one stage rather than across the whole app
3. Simplify the reassignment logic: the single replacement server case will be supported in the future, so remove it for now
4. Correct the `partitionIds` type from `string` to `int` in the proto definition
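A minimal sketch of the locking in point 1; `isReplaced`/`replace` and the driver rpc are assumptions for illustration, not the actual API:

```java
// Hypothetical sketch: serialize reassignment on the handle so concurrent
// failing tasks cannot request overlapping replacements for the same server.
public class ReassignCoordinator {
  private final Object shuffleHandleLock = new Object();

  void reassignOnFailure(ShuffleHandleInfo handle, ShuffleServerInfo faulty) {
    synchronized (shuffleHandleLock) {
      // Another task may have already replaced this server while we waited.
      if (handle.isReplaced(faulty)) {
        return;
      }
      handle.replace(faulty, requestReplacementFromDriver(faulty));
    }
  }

  private ShuffleServerInfo requestReplacementFromDriver(ShuffleServerInfo faulty) {
    throw new UnsupportedOperationException("rpc omitted in this sketch");
  }
}
```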

### Why are the changes needed?

Fix: #1608

In the current implementation of partition reassignment, the same replacement servers are shared across different stages, which will crash for an app without registry.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

UTs
@zuston zuston reopened this Apr 2, 2024
zuston added a commit to zuston/incubator-uniffle that referenced this issue Apr 2, 2024
zuston added a commit to zuston/incubator-uniffle that referenced this issue Apr 2, 2024
zuston added a commit to zuston/incubator-uniffle that referenced this issue Apr 3, 2024
zuston added a commit that referenced this issue Apr 8, 2024
…ble block resend (#1610)

### What changes were proposed in this pull request?

1. Avoid releasing blocks prematurely when block resend is enabled
2. Introduce a max retry count for blocks

### Why are the changes needed?

For: #1608

The current partition reassignment code has some bugs, as follows:
1. Data has already been released when resending.
2. If blocks fail to resend, the task may fail fast without retrying (see the retry sketch below).
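A retry-loop sketch for point 2, with hypothetical `send`/`release` helpers; only the retry structure is the point here:

```java
// Hypothetical sketch: resend a failed block up to a configurable max retry
// count instead of failing fast on the first error, and only release the
// block's data after a send has actually succeeded.
void sendWithRetry(ShuffleBlockInfo block, int maxRetryTimes) {
  for (int attempt = 1; ; attempt++) {
    try {
      send(block);     // keep the block's buffer alive until this succeeds
      release(block);  // safe to free the data only after a successful send
      return;
    } catch (Exception e) {
      if (attempt >= maxRetryTimes) {
        throw new RuntimeException(
            "Failed to send block after " + maxRetryTimes + " attempts", e);
      }
      // A reassign may have replaced the target server in the meantime;
      // the next attempt picks up the latest assignment.
    }
  }
}
```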

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`RssShuffleWriterTest#blockFailureResendTest` tests the block resend mechanism.
zuston added a commit to zuston/incubator-uniffle that referenced this issue Apr 8, 2024
zuston added a commit to zuston/incubator-uniffle that referenced this issue Apr 8, 2024
@zuston zuston changed the title from "[Feature] Partition data reassignment" to "Support partition data reassignment" Apr 10, 2024
@zuston zuston changed the title from "Support partition data reassignment" to "Support partition data reassign" Apr 10, 2024
zuston added a commit that referenced this issue Apr 17, 2024
…signed servers (#1615)

### What changes were proposed in this pull request?

Support reading from partition block data reassignment servers.

### Why are the changes needed?

For: #1608

The writer may have written data to the reassigned servers, so it's necessary to read from those servers as well. Since blockIds are stored on the servers that own each partition, this PR reads blockIds from all of these servers while still satisfying the min-replica requirement (a sketch of the read path follows).
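A sketch of that read path; `fetchBlockIds` is a hypothetical stand-in for the real getShuffleResult-style rpc, and blockIds are assumed to be tracked as Roaring64NavigableMap bitmaps:

```java
import java.util.List;
import org.roaringbitmap.longlong.Roaring64NavigableMap;

// Hypothetical sketch: merge the blockId bitmaps reported by every server
// that ever owned the partition (original + reassigned replacements).
Roaring64NavigableMap collectBlockIds(List<ShuffleServerInfo> owners, int partitionId) {
  Roaring64NavigableMap merged = Roaring64NavigableMap.bitmapOf();
  for (ShuffleServerInfo server : owners) {
    merged.or(fetchBlockIds(server, partitionId));
  }
  return merged;
}

Roaring64NavigableMap fetchBlockIds(ShuffleServerInfo server, int partitionId) {
  throw new UnsupportedOperationException("rpc omitted in this sketch");
}
```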

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`PartitionBlockDataReassignTest` integration test.
zuston added a commit to zuston/incubator-uniffle that referenced this issue Apr 17, 2024
zuston added a commit to zuston/incubator-uniffle that referenced this issue Apr 24, 2024
zuston added a commit that referenced this issue May 9, 2024

### What changes were proposed in this pull request?

1. Make the write client always use the latest available assignment for subsequent writes when a block reassign happens.
2. Support multiple retries for partition reassign.
3. Limit the max number of reassigned servers for one partition.
4. Refactor the reassign rpc.
5. Rename faultyServer -> receivingFailureServer.

#### Reassign whole process
![image](https://github.com/apache/incubator-uniffle/assets/8609142/8afa5386-be39-4ccb-9c10-95ffb3154939)

#### Always using the latest assignment

To achieve always using the latest assignment, I introduce `TaskAttemptAssignment` to provide the latest assignment for the current task. The creation of each AddBlockEvent also applies the latest assignment via `TaskAttemptAssignment`.

It is updated by the `reassignOnBlockSendFailure` rpc. That means the original reassign rpc response is refactored and replaced by the whole latest `shuffleHandleInfo` (a minimal sketch follows).
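A minimal sketch of that flow; `getServersFor` is a hypothetical accessor on the handle, and the real API may differ:

```java
import java.util.List;

// Hypothetical sketch: each task resolves servers through a per-attempt view
// that is swapped out whenever a send failure triggers a reassign.
public class TaskAttemptAssignment {
  private volatile ShuffleHandleInfo handleInfo;

  public TaskAttemptAssignment(ShuffleHandleInfo initial) {
    this.handleInfo = initial;
  }

  // Consulted while building every AddBlockEvent.
  public List<ShuffleServerInfo> retrieve(int partitionId) {
    return handleInfo.getServersFor(partitionId);
  }

  // Applied with the whole latest handle returned by the
  // reassignOnBlockSendFailure rpc.
  public void update(ShuffleHandleInfo latest) {
    this.handleInfo = latest;
  }
}
```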

### Why are the changes needed?

This PR is the subtask for #1608.

Leveraging #1615 / #1610 / #1609, we have implemented the server reassign mechanism for when the write client encounters a failed or unhealthy server. But this is not good enough yet, because it does not share the faulty server state with unstarted tasks and later `AddBlockEvent`s.

### Does this PR introduce _any_ user-facing change?

Yes. 

### How was this patch tested?

Unit and integration tests.

Integration tests as follows:
1. `PartitionBlockDataReassignBasicTest` validates the basic reassign mechanism.
2. `PartitionBlockDataReassignMultiTimesTest` tests the partition reassign mechanism with multiple retries.

---------

Co-authored-by: Enrico Minack <github@enrico.minack.dev>
zuston added a commit to zuston/incubator-uniffle that referenced this issue May 9, 2024
@zuston zuston changed the title from "Support partition data reassign" to "Support client partition data reassign" May 11, 2024
zuston added a commit to zuston/incubator-uniffle that referenced this issue May 11, 2024
zuston added a commit to zuston/incubator-uniffle that referenced this issue May 11, 2024
zuston added a commit to zuston/incubator-uniffle that referenced this issue May 11, 2024
zuston added a commit that referenced this issue May 15, 2024
### What changes were proposed in this pull request?

Verify the sent block count in Spark write tasks

### Why are the changes needed?

For #1608.
After introducing the reassign mechanism, a block's storage location can change dynamically.
To guard against possible or potential bugs, it's necessary to verify the block count (a sketch of such a check follows).
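A sketch of such a check at the end of a write task, with hypothetical counters; the point is the invariant, not the exact API:

```java
// Hypothetical sketch: after all sends complete, the number of acknowledged
// blocks must equal the number the writer produced, regardless of how many
// times their target servers were reassigned along the way.
void verifySentBlockCount(long producedBlocks, long acknowledgedBlocks) {
  if (producedBlocks != acknowledgedBlocks) {
    throw new IllegalStateException(
        "Block count mismatch after reassign: produced " + producedBlocks
            + " blocks but only " + acknowledgedBlocks + " were acknowledged");
  }
}
```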

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests are enough to ensure safety.
zuston added a commit that referenced this issue May 15, 2024
…g options (#1693)

### What changes were proposed in this pull request?

1. Add docs about the reassign mechanism.
2. Rename the config from `rss.client.blockSendFailureRetry.enabled` to `rss.client.reassign.enabled` (see the usage example below).
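For illustration, enabling the feature after this rename might look like the following on the Spark client, assuming the usual `spark.` prefix for Uniffle client options:

```java
import org.apache.spark.SparkConf;

// Sketch: turn on client-side partition data reassign after the rename.
SparkConf conf = new SparkConf()
    .set("spark.rss.client.reassign.enabled", "true");
```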

### Why are the changes needed?

Fix: #1608

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Not needed.
dingshun3016 pushed a commit to dingshun3016/incubator-uniffle that referenced this issue May 15, 2024