How to specify the backend of Trainer #2759

Open

Orion-Zheng opened this issue May 10, 2024 · 0 comments
System Info

accelerate 0.28.0

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

I am running multi-node, multi-GPU training code on two nodes, each with one A100-40GB. NCCL is not installed system-wide on this cluster, so I am trying to fall back to the gloo backend to start training. But I couldn't find any documentation on how to specify the backend with accelerate launch. Any help would be much appreciated!
Here is my launching script.

srun -N 2 -n 2 -w xgpg2,xgpg3 accelerate launch --config_file /tmp/my_dist_config.yaml --gradient_accumulation_steps 8 --gradient_clipping 1.0 --mixed_precision bf16 train.py    ...my training arguments..

Here is my accelerate config on each node.

# `/tmp/my_dist_config.yaml` on xgpg2
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_process_ip: xgpg2
main_process_port: 9999
main_training_function: main
mixed_precision: bf16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

# `/tmp/my_dist_config.yaml` on xgpg3
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 1
main_process_ip: xgpg2
main_process_port: 9999
main_training_function: main
mixed_precision: bf16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Here is the main body of my training code.

...
tokenizer = load_tokenizer(model_args.tokenizer_dir, train_mode=model_args.do_train)
model = load_model(model_args, quant_config, peft_config)
logger.info(f"Model Architecture:\n{model}")
print_trainable_parameters(model)

trainer = Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=eval_data, 
    args=trainer_config,
    data_collator=PaddToMaxLenCollator(tokenizer, model_args.max_length), 
)

# Training
if model_args.do_train:
    train_result = trainer.train(resume_from_checkpoint=model_args.resume_from_checkpoint)
    trainer.log_metrics("train", train_result.metrics)  
    trainer.save_metrics("train", train_result.metrics) 
...
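
Looking at the TrainingArguments documentation, there seems to be a ddp_backend option that controls which torch.distributed backend the Trainer initializes. A minimal sketch of what I have in mind, assuming ddp_backend="gloo" is the right knob and that my trainer_config above is a TrainingArguments instance (the other values here are placeholders):

from transformers import TrainingArguments

# Sketch only: ask Trainer to initialize the process group with gloo
# instead of NCCL. Everything except ddp_backend is a placeholder.
trainer_config = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    ddp_backend="gloo",
)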

I tried to run this directly, but it failed with an NCCL error like this:

torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1704987394225/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3

I think NCCL wasn't installed at the system level by the administrator, but there is an nccl package in my conda environment, which was probably pulled in as a dependency of some other library. I am not familiar with NCCL, but my understanding is that this won't work because NCCL has to be installed at the system level. Am I right?

# Name                    Version                   Build  Channel
nccl                      2.21.5.1             h3a97aeb_0    conda-forge
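
For reference, here is a quick probe I can run to check which process-group backends my PyTorch build reports (just a sanity check, separate from the training code; I am not sure whether a system-level NCCL is still required on top of this):

import torch
import torch.distributed as dist

# Check which process-group backends this PyTorch build supports.
print("gloo available:", dist.is_gloo_available())
print("nccl available:", dist.is_nccl_available())
if dist.is_nccl_available():
    # Version of the NCCL that PyTorch was built/shipped with.
    print("NCCL version:", torch.cuda.nccl.version())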

Expected behavior

I would like to know how to use the 'gloo' backend with Trainer, and also whether I can use Trainer's DeepSpeed integration with the gloo backend.
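
In case it helps clarify what I am after: with plain Accelerate (no Trainer), I would have expected to be able to select the backend roughly like this, assuming InitProcessGroupKwargs exposes a backend field in accelerate 0.28 (which I have not verified):

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Sketch: pass the desired process-group backend to Accelerator via a
# kwargs handler; assumes the `backend` field exists in this version.
pg_kwargs = InitProcessGroupKwargs(backend="gloo")
accelerator = Accelerator(kwargs_handlers=[pg_kwargs])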
