Tasks
One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
My own task or dataset (give details below)
Reproduction
I am running multi-node, multi-GPU training code on two nodes, each with one A100-40GB. NCCL is not installed on this cluster, so I am trying to use the gloo backend for training instead. However, I could not find any documentation on how to specify the backend when using accelerate launch. Any help would be much appreciated!
Here is my launching script.
Here is my accelerate config on each node.
Here is the main body of my training code.
I tried to run this directly, but it failed with an NCCL error like this:
torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1704987394225/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
I think NCCL was not installed system-wide by the system administrator, but there is an nccl package in my conda environment, probably pulled in as a dependency of another library. I am not familiar with NCCL, but my understanding is that this will not work because NCCL has to be installed at the system level. Am I right?
# Name Version Build Channel
nccl 2.21.5.1 h3a97aeb_0 conda-forge
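Note that the error above reports NCCL version 2.19.3 while conda lists 2.21.5.1, which suggests PyTorch is using its own bundled NCCL rather than the conda package. A minimal sketch for checking which backends the installed PyTorch build reports as usable follows; the function name `backend_report` is hypothetical, and the sketch assumes only the public `torch.distributed` availability checks.

```python
def backend_report():
    """Summarize which distributed backends this PyTorch build exposes."""
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        # PyTorch is not importable in this environment.
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "distributed_available": dist.is_available(),
    }
    if dist.is_available():
        report["nccl_available"] = dist.is_nccl_available()
        report["gloo_available"] = dist.is_gloo_available()
    # The NCCL version query is only attempted on CUDA machines.
    if report.get("nccl_available") and torch.cuda.is_available():
        report["nccl_version"] = torch.cuda.nccl.version()
    return report


if __name__ == "__main__":
    for key, value in backend_report().items():
        print(f"{key}: {value}")
```

Running this on each node shows whether gloo is available and which NCCL version PyTorch itself was built against. It does not diagnose the "unhandled system error" itself; for that, rerunning with NCCL_DEBUG=INFO, as the error message suggests, is the likely next step.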
Expected behavior
I would like to know how to use the 'gloo' backend with Trainer, and whether Trainer's DeepSpeed integration can be used with the gloo backend.
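As a starting point, here is a minimal single-process sketch that checks whether the gloo backend can be initialized at all on a node, using plain `torch.distributed`. The helper names `free_port` and `gloo_smoke_test` are hypothetical, and the loopback TCP rendezvous with `world_size=1` is an assumption made only so the check runs without a launcher; it is not the multi-node setup described above.

```python
import socket


def free_port():
    """Ask the OS for an unused TCP port on the loopback interface."""
    with socket.socket() as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def gloo_smoke_test():
    """Initialize a one-process gloo group and run a single all_reduce.

    Returns (backend_name, reduced_value), or None if PyTorch or gloo
    is not available in this environment.
    """
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        return None
    if not dist.is_available() or not dist.is_gloo_available():
        return None

    dist.init_process_group(
        backend="gloo",  # explicitly request gloo instead of NCCL
        init_method=f"tcp://127.0.0.1:{free_port()}",
        rank=0,
        world_size=1,
    )
    try:
        t = torch.ones(1)
        dist.all_reduce(t)  # sum over one rank leaves the value unchanged
        return dist.get_backend(), t.item()
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    print(gloo_smoke_test())
```

For Trainer itself, recent transformers releases expose a `ddp_backend` field on `TrainingArguments` (for example `ddp_backend="gloo"`), and Accelerate provides `InitProcessGroupKwargs(backend="gloo")`, which can be passed through `Accelerator(kwargs_handlers=[...])`. Whether these are present depends on the installed versions, so it is worth checking them against the documentation for your release. Whether the DeepSpeed integration works over gloo is not something this sketch establishes.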