
Duplicate elements in split_between_processes #2750

Closed · 4 tasks
hkunzhe opened this issue May 7, 2024 · 4 comments · Fixed by #2781

Comments

@hkunzhe
Contributor

hkunzhe commented May 7, 2024

System Info

accelerate==git+https://github.com/huggingface/accelerate.git

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

  1. Save the following test script as test_accelerate.py:
from accelerate import PartialState  # Can also be Accelerator or AcceleratorState

state = PartialState()
input_list = list(range(17))

with state.split_between_processes(input_list) as splitted_input_list:
    print(f"{state.device}, {splitted_input_list}")
  2. Run accelerate launch --num_processes 8 test_accelerate.py
  3. Observe the output:
cuda:2, [6, 7, 8]
cuda:7, [16]
cuda:1, [3, 4, 5]
cuda:4, [12, 13, 14]
cuda:3, [9, 10, 11]
cuda:6, [16]
cuda:0, [0, 1, 2]
cuda:5, [15, 16]

As we can see, the element 16 appears three times, across three different processes. This is caused by the following code in split_between_processes:

num_samples_per_process = math.ceil(length / self.num_processes)
start_index = self.process_index * num_samples_per_process
end_index = start_index + num_samples_per_process
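
For illustration, here is a minimal standalone sketch (plain Python, no GPUs or accelerate required) of that ceil-based index arithmetic for the length-17 list and 8 processes from the reproduction above; it models only the slicing math, not accelerate's full internals:

import math

length, num_processes = 17, 8
items = list(range(length))
num_samples_per_process = math.ceil(length / num_processes)  # ceil(17 / 8) = 3

for process_index in range(num_processes):
    start_index = process_index * num_samples_per_process
    end_index = start_index + num_samples_per_process
    # Processes 6 and 7 get start indices 18 and 21, past the end of the
    # 17-element list, so their raw slices come out empty; the duplicated
    # 16s in the output above suggest accelerate fills such slices from
    # the tail of the list.
    print(process_index, items[start_index:end_index])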

The code above should be modified as follows to get the expected output:

num_samples_per_process = length // self.num_processes
num_extras = length % self.num_processes
start_index = self.process_index * num_samples_per_process + min(self.process_index, num_extras)
end_index = start_index + num_samples_per_process + (1 if self.process_index < num_extras else 0)
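
As a sanity check, plugging the proposed floor-plus-remainder arithmetic into the same standalone sketch partitions all 17 elements with no gaps and no duplicates:

length, num_processes = 17, 8
items = list(range(length))
num_samples_per_process = length // num_processes  # 17 // 8 = 2
num_extras = length % num_processes                # 17 % 8 = 1 leftover element

for process_index in range(num_processes):
    # The first num_extras processes each take one extra element.
    start_index = process_index * num_samples_per_process + min(process_index, num_extras)
    end_index = start_index + num_samples_per_process + (1 if process_index < num_extras else 0)
    print(process_index, items[start_index:end_index])

This prints [0, 1, 2], [3, 4], [5, 6], ..., [15, 16]: each of the 17 elements appears exactly once, matching the expected output below.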

Expected behavior

  • Expected output:
cuda:2, [5, 6]
cuda:7, [15, 16]
cuda:1, [3, 4]
cuda:4, [9, 10]
cuda:3, [7, 8]
cuda:6, [13, 14]
cuda:0, [0, 1, 2]
cuda:5, [11, 12]
@hkunzhe
Contributor Author

hkunzhe commented May 9, 2024

@muellerzr, could you check it out?

@hammoudhasan

hammoudhasan commented May 14, 2024

I also observed the same thing on my end! Splitting a list of 100 elements across 8 GPUs would give me a total of roughly 113 elements. The fix proposed by @hkunzhe worked for me.

@muellerzr
Collaborator

Thanks for the flag! Would you like to make a PR with your fix? :)

@hkunzhe
Contributor Author

hkunzhe commented May 15, 2024

> Thanks for the flag! Would you like to make a PR with your fix? :)

Sure!
