
[DeepSpeed] Asking for feedback when training with zero2 with accelerate and diffusers #2787

Open
sayakpaul opened this issue May 16, 2024 · 2 comments

TL;DR: This is not really an issue report; it's more of a request for feedback on the changes I made to get accelerate and DeepSpeed working together.

So, I have this training script: https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py. The following patch reflects the changes I made so that it works with DeepSpeed:

ds.patch
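
Purely for context, and not the contents of that patch: below is a minimal sketch of the checkpoint-saving pattern this relies on, assuming the `accelerator`, `args`, and `global_step` names from the training script. With ZeRO-2 the optimizer state is sharded across ranks, so `accelerator.save_state()` has to run on every process, not just the main one.

import os

# Minimal sketch, not the actual ds.patch. `accelerator`, `args`, and
# `global_step` are assumed to come from the training script.
if global_step % args.checkpointing_steps == 0:
    save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
    # Under DeepSpeed ZeRO-2 the optimizer state is partitioned across ranks,
    # so every process must participate in save_state (no is_main_process guard).
    accelerator.save_state(save_path)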

With these changes, I am able to successfully save and resume from checkpoints when using the following accelerate config (in a multi-GPU single-node setup):

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
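
For reference, the ZeRO-2 settings in this YAML can also be expressed programmatically via a DeepSpeedPlugin when constructing the Accelerator. A minimal, equivalent sketch (the YAML route via accelerate launch is what I actually used):

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Mirrors the deepspeed_config section of the YAML above:
# ZeRO stage 2, no optimizer/param offload, 1 gradient accumulation step.
deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,
    gradient_accumulation_steps=1,
    offload_optimizer_device="none",
    offload_param_device="none",
    zero3_init_flag=False,
)
accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=deepspeed_plugin)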

Example training commands

Start an initial run:

CUDA_VISIBLE_DEVICES=2,3 accelerate launch --config_file ds2.yaml train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --mixed_precision="fp16" \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_checkpointing \
  --max_train_steps=4 \
  --learning_rate=1e-05 \
  --checkpointing_steps=1 \
  --output_dir=/raid/.cache/huggingface/finetuning-ds

Then resume from a checkpoint:

CUDA_VISIBLE_DEVICES=2,3 accelerate launch --config_file ds2.yaml train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --mixed_precision="fp16" \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_checkpointing \
  --max_train_steps=6 \
  --learning_rate=1e-05 \
  --resume_from_checkpoint="latest" \
  --checkpointing_steps=1 \
  --output_dir=/raid/.cache/huggingface/finetuning-ds
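
For anyone reproducing this: --resume_from_checkpoint="latest" is resolved inside the training script to the checkpoint-* directory with the highest step before accelerator.load_state is called. Roughly (a simplified sketch, not the exact script code):

import os

# Simplified sketch of resolving "latest" to a concrete checkpoint directory;
# `args` and `accelerator` are assumed to come from the training script.
dirs = [d for d in os.listdir(args.output_dir) if d.startswith("checkpoint")]
dirs = sorted(dirs, key=lambda d: int(d.split("-")[1]))
path = dirs[-1] if dirs else None  # e.g. "checkpoint-4" after the first run
if path is not None:
    accelerator.load_state(os.path.join(args.output_dir, path))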

Python dependencies to run the script can be found here: https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/requirements.txt.

Output of `accelerate env`:

Copy-and-paste the text below in your GitHub issue

- `Accelerate` version: 0.30.1
- Platform: Linux-5.4.0-166-generic-x86_64-with-glibc2.31
- `accelerate` bash location: /home/sayak/.pyenv/versions/3.10.12/envs/diffusers/bin/accelerate
- Python version: 3.10.12
- Numpy version: 1.24.1
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 503.54 GB
- GPU type: NVIDIA A100-SXM4-80GB
- `Accelerate` default config:
        Not found

LMK if you need more information here.

@sayakpaul (Member, Author) commented:

Cc: @muellerzr @SunMarc

@SunMarc (Member) commented May 16, 2024:

cc @pacman100
