How to specify the backend of Trainer #2759

Open

Orion-Zheng opened this issue May 10, 2024 · 0 comments
System Info

accelerate 0.28.0

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

I am running multi-node, multi-GPU training code on two nodes, each with one A100-40GB. NCCL is not installed system-wide on this cluster, so I am trying to fall back to the gloo backend to start training. But I couldn't find any documentation on how to specify the backend with accelerate launch. Any help would be much appreciated!
Here is my launching script.

srun -N 2 -n 2 -w xgpg2,xgpg3 accelerate launch --config_file /tmp/my_dist_config.yaml --gradient_accumulation_steps 8 --gradient_clipping 1.0 --mixed_precision bf16 train.py    ...my training arguments..

Here is my accelerate config on each node.

# `/tmp/my_dist_config.yaml` on xgpg2
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_process_ip: xgpg2
main_process_port: 9999
main_training_function: main
mixed_precision: bf16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

# `/tmp/my_dist_config.yaml` on xgpg3
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 1
main_process_ip: xgpg2
main_process_port: 9999
main_training_function: main
mixed_precision: bf16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Here is the main body of my training code.

...
tokenizer = load_tokenizer(model_args.tokenizer_dir, train_mode=model_args.do_train)
model = load_model(model_args, quant_config, peft_config)
logger.info(f"Model Architecture:\n{model}")
print_trainable_parameters(model)

trainer = Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=eval_data, 
    args=trainer_config,
    data_collator=PaddToMaxLenCollator(tokenizer, model_args.max_length), 
)

# Training
if model_args.do_train:
    train_result = trainer.train(resume_from_checkpoint=model_args.resume_from_checkpoint)
    trainer.log_metrics("train", train_result.metrics)  
    trainer.save_metrics("train", train_result.metrics) 
...
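
Looking at the TrainingArguments documentation, there seems to be a ddp_backend option that controls which torch.distributed backend the Trainer initializes. A minimal sketch of what I have in mind, assuming ddp_backend="gloo" is the right knob and that my trainer_config above is a TrainingArguments instance (the other values here are placeholders):

from transformers import TrainingArguments

# Sketch only: ask Trainer to initialize the process group with gloo
# instead of NCCL. Everything except ddp_backend is a placeholder.
trainer_config = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    ddp_backend="gloo",
)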

I tried to run this directly, but it failed with an NCCL error like this:

torch.distributed.DistBackendError: NCCL error in: /opt/conda/conda-bld/pytorch_1704987394225/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3

I think NCCL wasn't installed at the system level by the administrator, but there is an nccl package in my conda environment, which was probably pulled in as a dependency of some other library. I am not familiar with NCCL, but my understanding is that this won't work because NCCL has to be installed at the system level. Am I right?

# Name                    Version                   Build  Channel
nccl                      2.21.5.1             h3a97aeb_0    conda-forge
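
For reference, here is a quick probe I can run to check which process-group backends my PyTorch build reports (just a sanity check, separate from the training code; I am not sure whether a system-level NCCL is still required on top of this):

import torch
import torch.distributed as dist

# Check which process-group backends this PyTorch build supports.
print("gloo available:", dist.is_gloo_available())
print("nccl available:", dist.is_nccl_available())
if dist.is_nccl_available():
    # Version of the NCCL that PyTorch was built/shipped with.
    print("NCCL version:", torch.cuda.nccl.version())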

Expected behavior

I would like to know how to use the 'gloo' backend with Trainer, and also whether I can use Trainer's DeepSpeed integration with the gloo backend.
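
In case it helps clarify what I am after: with plain Accelerate (no Trainer), I would have expected to be able to select the backend roughly like this, assuming InitProcessGroupKwargs exposes a backend field in accelerate 0.28 (which I have not verified):

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Sketch: pass the desired process-group backend to Accelerator via a
# kwargs handler; assumes the `backend` field exists in this version.
pg_kwargs = InitProcessGroupKwargs(backend="gloo")
accelerator = Accelerator(kwargs_handlers=[pg_kwargs])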
