
[DeepSpeed] Asking for feedback when training with zero2 with accelerate and diffusers #2787

Open
sayakpaul opened this issue May 16, 2024 · 2 comments

TL;DR: This is not really an issue report; it's more of a request for feedback on the changes I made to get accelerate and DeepSpeed working together.

So, I have this training script: https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py. The following patch reflects the changes I made so that it works with DeepSpeed:

ds.patch
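
Purely for context, and not the contents of that patch: below is a minimal sketch of the checkpoint-saving pattern this relies on, assuming the `accelerator`, `args`, and `global_step` names from the training script. With ZeRO-2 the optimizer state is sharded across ranks, so `accelerator.save_state()` has to run on every process, not just the main one.

import os

# Minimal sketch, not the actual ds.patch. `accelerator`, `args`, and
# `global_step` are assumed to come from the training script.
if global_step % args.checkpointing_steps == 0:
    save_path = os.path.join(args.output_dir, f"checkpoint-{global_step}")
    # Under DeepSpeed ZeRO-2 the optimizer state is partitioned across ranks,
    # so every process must participate in save_state (no is_main_process guard).
    accelerator.save_state(save_path)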

With these changes, I am able to successfully save and resume from checkpoints when using the following accelerate config (in a multi-GPU single-node setup):

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
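
For reference, the ZeRO-2 settings in this YAML can also be expressed programmatically via a DeepSpeedPlugin when constructing the Accelerator. A minimal, equivalent sketch (the YAML route via accelerate launch is what I actually used):

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Mirrors the deepspeed_config section of the YAML above:
# ZeRO stage 2, no optimizer/param offload, 1 gradient accumulation step.
deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,
    gradient_accumulation_steps=1,
    offload_optimizer_device="none",
    offload_param_device="none",
    zero3_init_flag=False,
)
accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=deepspeed_plugin)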

Example training commands

Start an initial run:

CUDA_VISIBLE_DEVICES=2,3 accelerate launch --config_file ds2.yaml train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --mixed_precision="fp16" \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_checkpointing \
  --max_train_steps=4 \
  --learning_rate=1e-05 \
  --checkpointing_steps=1 \
  --output_dir=/raid/.cache/huggingface/finetuning-ds

Then resume from a checkpoint:

CUDA_VISIBLE_DEVICES=2,3 accelerate launch --config_file ds2.yaml train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$DATASET_NAME \
  --mixed_precision="fp16" \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_checkpointing \
  --max_train_steps=6 \
  --learning_rate=1e-05 \
  --resume_from_checkpoint="latest" \
  --checkpointing_steps=1 \
  --output_dir=/raid/.cache/huggingface/finetuning-ds
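
For anyone reproducing this: --resume_from_checkpoint="latest" is resolved inside the training script to the checkpoint-* directory with the highest step before accelerator.load_state is called. Roughly (a simplified sketch, not the exact script code):

import os

# Simplified sketch of resolving "latest" to a concrete checkpoint directory;
# `args` and `accelerator` are assumed to come from the training script.
dirs = [d for d in os.listdir(args.output_dir) if d.startswith("checkpoint")]
dirs = sorted(dirs, key=lambda d: int(d.split("-")[1]))
path = dirs[-1] if dirs else None  # e.g. "checkpoint-4" after the first run
if path is not None:
    accelerator.load_state(os.path.join(args.output_dir, path))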

Python dependencies to run the script can be found here: https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/requirements.txt.

Output of `accelerate env`:

Copy-and-paste the text below in your GitHub issue

- `Accelerate` version: 0.30.1
- Platform: Linux-5.4.0-166-generic-x86_64-with-glibc2.31
- `accelerate` bash location: /home/sayak/.pyenv/versions/3.10.12/envs/diffusers/bin/accelerate
- Python version: 3.10.12
- Numpy version: 1.24.1
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 503.54 GB
- GPU type: NVIDIA A100-SXM4-80GB
- `Accelerate` default config:
        Not found

LMK if you need more information here.

@sayakpaul (Member, Author) commented:

Cc: @muellerzr @SunMarc

@SunMarc (Member) commented May 16, 2024:

cc @pacman100
