
Unable to launch DeepSpeed multinode training with a heterogeneous mix of # devices per node. #2780

iantbutler01 opened this issue May 14, 2024 · 5 comments

iantbutler01 commented May 14, 2024

System Info

- `Accelerate` version: 0.30.0
- Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /home/crow/venvs/bismuth/bin/accelerate
- Python version: 3.10.12
- Numpy version: 1.26.2
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 124.95 GB
- GPU type: NVIDIA RTX 6000 Ada Generation
- `Accelerate` config passed:
	- compute_environment: LOCAL_MACHINE
	- distributed_type: DEEPSPEED
	- mixed_precision: bf16
	- use_cpu: False
	- debug: False
	- num_processes: 5
	- machine_rank: 0
	- num_machines: 2
	- main_process_ip: localhost
	- main_process_port: 3141
	- rdzv_backend: static
	- same_network: True
	- main_training_function: main
	- enable_cpu_affinity: False
	- deepspeed_config: {'deepspeed_hostfile': '/home/crow/SoftwareProjects/rwkv-raven-lora-instruct/hostfile', 'deepspeed_multinode_launcher': 'pdsh', 'gradient_accumulation_steps': 8, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': True, 'zero3_save_16bit_model': False, 'zero_stage': 3}
	- downcast_bf16: no
	- tpu_use_cluster: False
	- tpu_use_sudo: False
	- tpu_env: []

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

My snippet of code for loading model resources:

        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])

        print(f"I am rank: {rank}!")

        acc_kwargs = {
            "gradient_accumulation_steps": self.grad_accum,
            "project_dir": self.output_dir,
            "project_config": ProjectConfiguration(
                project_dir=self.output_dir,
                automatic_checkpoint_naming=True,
            ),
            "mixed_precision": "bf16" if self.use_bfloat16 else "fp16"
        } 

        acc_kwargs = {**acc_kwargs, **self.accelerator_kwargs}

        self.accelerator = Accelerator(**acc_kwargs)

        modified_load_kwargs = self.model_load_kwargs

        if self.use_fsdp or self.use_deep_speed:
            if self.use_fsdp:
                modified_load_kwargs["low_cpu_mem_usage"] = True

            del modified_load_kwargs["device_map"]
            modified_load_kwargs["torch_dtype"] = torch.bfloat16 if self.use_bfloat16 else torch.float16

        if (self.use_fsdp or self.use_deep_speed) and rank != 0:
            with init_empty_weights():
                self.model = AutoModelForCausalLM.from_pretrained(
                    self.model_name_or_path,
                    quantization_config=self.bnb_config,
                    **modified_load_kwargs,
                )
        else:
            self.model = AutoModelForCausalLM.from_pretrained(
                self.model_name_or_path,
                quantization_config=self.bnb_config,
                **modified_load_kwargs,
            )

Launched with accelerate launch --config_file deep.cfg <SCRIPT>

Reproduction

1. Set up two connected nodes with differing numbers of GPUs, preferably even and odd (in my case one node has 4 and one has 1).
2. Launch DeepSpeed with either the standard or the pdsh launcher.
3. Watch the device ordinals be incorrect on one of the machines when it attempts to load model resources.

Expected behavior

In both cases, standard and pdsh (but pdsh in particular, since it takes a hostfile with the number of slots per host), I would expect accelerate to launch as many processes per host as there are available devices on that host, rather than just splitting the processes evenly across hosts.

For further context, my full cluster will have 9 devices: two nodes of 4 and one node of 1. As for why the weird setup: it just seems like such a waste to let my RTX A6000 sit out of all the fun my 3090s are having, plus it brings extra system RAM and VRAM. 😄
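
For illustration, the DeepSpeed hostfile for the two-node setup above would look roughly like this (hostnames are placeholders; the slots match each node's GPU count):

    node-a slots=4
    node-b slots=1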

iantbutler01 commented:

[Screenshot 2024-05-15 at 9:09:26 PM]

I'm not entirely sure where it's getting these values from, but it seems like it's ignoring the config file entirely.

iantbutler01 commented May 16, 2024

It looks like the issue stems from this part of the accelerate launcher:

cmd.extend(["--num_gpus", str(args.num_processes // args.num_machines)])

It's not considering each machine's actual device count or the slots specified in the DeepSpeed hostfile.
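
Something along these lines is what I would have expected (a rough, hypothetical sketch, not the actual launcher code): derive the per-host process count from the hostfile slots instead of dividing num_processes evenly.

    import re

    def parse_hostfile(path):
        # Read a DeepSpeed hostfile and return {hostname: slot_count}.
        # Lines are expected to look like: "<hostname> slots=<n>".
        slots = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                match = re.match(r"^(\S+)\s+slots=(\d+)$", line)
                if match:
                    slots[match.group(1)] = int(match.group(2))
        return slots

    # Hypothetical usage: size each host from its hostfile slots instead of
    # applying num_processes // num_machines to every machine.
    slots_per_host = parse_hostfile("/home/crow/SoftwareProjects/rwkv-raven-lora-instruct/hostfile")
    total_processes = sum(slots_per_host.values())  # 5 in my 4 + 1 setup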

muellerzr (Collaborator) commented:

Correct. We don't support that currently. At least on the torch side, I believe they require the same number of devices per node.

Does DeepSpeed not?

iantbutler01 commented May 17, 2024

I had looked into it, and I don't believe an equal number of devices per node is strictly required by either torch or DeepSpeed. Nothing about NCCL, MPI, or DeepSpeed's launcher would make me immediately think so.

With that said, I'm more than happy to implement the changes needed to support this and contribute them back; I just want to make sure that's desirable from your end.

iantbutler01 commented May 27, 2024

@muellerzr It works fine if I just comment out that line in a fork; I'd like your opinion on how to proceed here.

Attached is a screenshot showing the training session (a custom loop using Accelerate) in the top right.

The one-GPU machine is printing nvidia-smi output on the left, and the 4-GPU box is on the bottom right.

For more context, DeepSpeed can handle arbitrary arrangements of GPUs during training; it will respect whatever number of slots is set in the provided hostfile.
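
For reference, DeepSpeed's own launcher consumes the same hostfile directly, with something along the lines of deepspeed --hostfile=/home/crow/SoftwareProjects/rwkv-raven-lora-instruct/hostfile <SCRIPT>.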

[Screenshot 2024-05-27 at 1:30:22 AM]
