Unable to launch DeepSpeed multinode training with a heterogeneous mix of # devices per node. #2780
Comments
It looks like the issue stems from this part of the accelerate launcher: accelerate/src/accelerate/utils/launch.py, line 309 at 4ad4d28.
It doesn't consider each machine's individual device count or the slots specified in the DeepSpeed hostfile.
Correct. We don't support that currently. At least on the torch side, I believe they require the same number of devices per node. Does DeepSpeed not?
I had looked into it, and I don't believe an equal number of devices per node is strictly required by either torch or DeepSpeed; nothing in NCCL, MPI, or DeepSpeed's launcher makes me immediately think it would be. With that said, I'm more than happy to implement the changes to support this and contribute them back; I just want to make sure that's desirable from your end.
@muellerzr It works fine if I just comment out that line in a fork; I'd like your opinion on how to proceed here. Attached is a screenshot showing the training session in the top right, running a custom loop with accelerate; the one-GPU machine is printing nvidia-smi on the left, and the 4-GPU box is on the bottom right. For more context, DeepSpeed can handle arbitrary arrangements of GPUs during training; it will respect whatever number of slots is set in the provided hostfile.
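For reference, a DeepSpeed hostfile describing the heterogeneous layout above lists one host per line with a `slots=` count (the hostnames here are placeholders, not from the report):

```
node-a slots=4
node-b slots=1
```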
System Info
Information
Tasks
One of the scripts in the examples folder of Accelerate, or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
My snippet of code for loading model resources:
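The original snippet is not reproduced in this report; a minimal sketch of what loading model resources under Accelerate typically looks like (the checkpoint name below is a placeholder) would be:

```python
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer

# The Accelerator picks up the DeepSpeed settings from the launch config (deep.cfg).
accelerator = Accelerator()

# "some-checkpoint" is a placeholder, not the model used in the report.
tokenizer = AutoTokenizer.from_pretrained("some-checkpoint")
model = AutoModelForCausalLM.from_pretrained("some-checkpoint")

# prepare() wraps the model for DeepSpeed and moves it to this process's device;
# with a mis-computed device ordinal, this is where the failure would surface.
model = accelerator.prepare(model)
```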
Launched with: `accelerate launch --config_file deep.cfg <SCRIPT>`
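The contents of deep.cfg are not shown in the report; an Accelerate config for multi-node DeepSpeed usually has this general shape (all paths, addresses, and counts below are placeholders):

```yaml
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_hostfile: /path/to/hostfile   # hostfile with per-node slots
  deepspeed_multinode_launcher: pdsh      # or "standard"
  zero_stage: 2
num_machines: 2
num_processes: 5            # 4 GPUs + 1 GPU across the two nodes
main_process_ip: 10.0.0.1   # placeholder
main_process_port: 29500
mixed_precision: fp16
```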
Reproduction
Have two connected nodes with differing numbers of GPUs, preferably an even and an odd count (in my case one node has 4 and the other has 1), launch DeepSpeed with either the standard or pdsh launcher, and watch the device ordinals come out incorrect on one of the machines when attempting to load model resources.
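A rough illustration of the failure mode, not the actual accelerate code and using placeholder hostnames: an even split of processes across machines cannot match a 4+1 layout, while a slot-aware split keeps every local device ordinal within range.

```python
num_processes = 5   # total GPUs across the cluster
num_machines = 2

# Even split (what the launcher effectively assumes): every machine gets the
# same share, so the 1-GPU node is asked to run more local processes (and
# address more device ordinals) than it has GPUs.
per_machine = num_processes // num_machines
print("even split per machine:", per_machine)

# Slot-aware split, as a DeepSpeed hostfile would express it:
slots = {"node-a": 4, "node-b": 1}   # placeholder hostnames
for host, n in slots.items():
    # local ordinals 0..n-1 never exceed the devices each host actually has
    print(host, "local device ordinals:", list(range(n)))
```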
Expected behavior
In both cases, standard and pdsh, but pdsh in particular since it takes a hostfile with the number of slots per host, I would expect accelerate to launch as many processes as there are available devices on each host, rather than attempting to split the processes evenly across hosts.
For further context, my full cluster will have 9 devices: two nodes with 4 and one node with 1. To answer why the weird setup: it just seems like such a waste to let my RTX A6000 sit out of all the fun my 3090s are having, plus the additional system RAM and VRAM it brings. 😄