
Unable to launch DeepSpeed multinode training with a heterogeneous mix of # devices per node. #2780

iantbutler01 opened this issue May 14, 2024 · 5 comments

iantbutler01 commented May 14, 2024

System Info

- `Accelerate` version: 0.30.0
- Platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /home/crow/venvs/bismuth/bin/accelerate
- Python version: 3.10.12
- Numpy version: 1.26.2
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 124.95 GB
- GPU type: NVIDIA RTX 6000 Ada Generation
- `Accelerate` config passed:
	- compute_environment: LOCAL_MACHINE
	- distributed_type: DEEPSPEED
	- mixed_precision: bf16
	- use_cpu: False
	- debug: False
	- num_processes: 5
	- machine_rank: 0
	- num_machines: 2
	- main_process_ip: localhost
	- main_process_port: 3141
	- rdzv_backend: static
	- same_network: True
	- main_training_function: main
	- enable_cpu_affinity: False
	- deepspeed_config: {'deepspeed_hostfile': '/home/crow/SoftwareProjects/rwkv-raven-lora-instruct/hostfile', 'deepspeed_multinode_launcher': 'pdsh', 'gradient_accumulation_steps': 8, 'offload_optimizer_device': 'cpu', 'offload_param_device': 'cpu', 'zero3_init_flag': True, 'zero3_save_16bit_model': False, 'zero_stage': 3}
	- downcast_bf16: no
	- tpu_use_cluster: False
	- tpu_use_sudo: False
	- tpu_env: []

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

My snippet of code for loading model resources:

        rank = int(os.environ["RANK"])
        world_size = int(os.environ["WORLD_SIZE"])

        print(f"I am rank: {rank}!")

        acc_kwargs = {
            "gradient_accumulation_steps": self.grad_accum,
            "project_dir": self.output_dir,
            "project_config": ProjectConfiguration(
                project_dir=self.output_dir,
                automatic_checkpoint_naming=True,
            ),
            "mixed_precision": "bf16" if self.use_bfloat16 else "fp16"
        } 

        acc_kwargs = {**acc_kwargs, **self.accelerator_kwargs}

        self.accelerator = Accelerator(**acc_kwargs)

        modified_load_kwargs = self.model_load_kwargs

        if self.use_fsdp or self.use_deep_speed:
            if self.use_fsdp:
                modified_load_kwargs["low_cpu_mem_usage"] = True

            del modified_load_kwargs["device_map"]
            modified_load_kwargs["torch_dtype"] = torch.bfloat16 if self.use_bfloat16 else torch.float16

        if (self.use_fsdp or self.use_deep_speed) and rank != 0:
            with init_empty_weights():
                self.model = AutoModelForCausalLM.from_pretrained(
                    self.model_name_or_path,
                    quantization_config=self.bnb_config,
                    **modified_load_kwargs,
                )
        else:
            self.model = AutoModelForCausalLM.from_pretrained(
                self.model_name_or_path,
                quantization_config=self.bnb_config,
                **modified_load_kwargs,
            )

Launched with accelerate launch --config_file deep.cfg <SCRIPT>

Reproduction

1. Set up two connected nodes with differing numbers of GPUs, preferably even and odd (in my case one node has 4 and one has 1).
2. Launch DeepSpeed with either the standard or the pdsh launcher.
3. Watch the device ordinals be incorrect on one of the machines when it attempts to load model resources.

Expected behavior

In both cases, standard and pdsh (but pdsh in particular, since it takes a hostfile with the number of slots per host), I would expect accelerate to launch as many processes per host as there are available devices on that host, rather than just splitting the processes evenly across hosts.

For further context, my full cluster will have 9 devices: two nodes of 4 and one node of 1. As for why the weird setup: it just seems like such a waste to let my RTX A6000 sit out of all the fun my 3090s are having, plus it brings extra system RAM and VRAM. 😄
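
For illustration, the DeepSpeed hostfile for the two-node setup above would look roughly like this (hostnames are placeholders; the slots match each node's GPU count):

    node-a slots=4
    node-b slots=1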

iantbutler01 commented:

[Screenshot 2024-05-15 at 9:09:26 PM]

I'm not entirely sure where it's getting these values from, but it seems like it's ignoring the config file entirely.

iantbutler01 commented May 16, 2024

It looks like the issue stems from this part of the accelerate launcher:

cmd.extend(["--num_gpus", str(args.num_processes // args.num_machines)])

It's not considering each machine's actual device count or the slots specified in the DeepSpeed hostfile.
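
Something along these lines is what I would have expected (a rough, hypothetical sketch, not the actual launcher code): derive the per-host process count from the hostfile slots instead of dividing num_processes evenly.

    import re

    def parse_hostfile(path):
        # Read a DeepSpeed hostfile and return {hostname: slot_count}.
        # Lines are expected to look like: "<hostname> slots=<n>".
        slots = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                match = re.match(r"^(\S+)\s+slots=(\d+)$", line)
                if match:
                    slots[match.group(1)] = int(match.group(2))
        return slots

    # Hypothetical usage: size each host from its hostfile slots instead of
    # applying num_processes // num_machines to every machine.
    slots_per_host = parse_hostfile("/home/crow/SoftwareProjects/rwkv-raven-lora-instruct/hostfile")
    total_processes = sum(slots_per_host.values())  # 5 in my 4 + 1 setup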

muellerzr (Collaborator) commented:

Correct. We don't support that currently. At least on the torch side, I believe they require the same number of devices per node.

Does DeepSpeed not?

iantbutler01 commented May 17, 2024

I had looked into it, and I don't believe an equal number of devices per node is strictly required by either torch or DeepSpeed. Nothing about NCCL, MPI, or DeepSpeed's launcher would make me immediately think so.

With that said, I'm more than happy to implement the changes needed to support this and contribute them back; I just want to make sure that's desirable from your end.

iantbutler01 commented May 27, 2024

@muellerzr It works fine if I just comment out that line in a fork; I'd like your opinion on how to proceed here.

Attached is a screenshot showing the training session (a custom loop using Accelerate) in the top right.

The one-GPU machine is printing nvidia-smi output on the left, and the 4-GPU box is on the bottom right.

For more context, DeepSpeed can handle arbitrary arrangements of GPUs during training; it will respect whatever number of slots is set in the provided hostfile.
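
For reference, DeepSpeed's own launcher consumes the same hostfile directly, with something along the lines of deepspeed --hostfile=/home/crow/SoftwareProjects/rwkv-raven-lora-instruct/hostfile <SCRIPT>.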

[Screenshot 2024-05-27 at 1:30:22 AM]
