TL;DR: This is not really an issue report. It is more a request for feedback on the changes I made to make `accelerate` and `deepspeed` work together.

So, I have this training script: https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py. The following patch reflects the changes I made to it to make it work with DeepSpeed:
ds.patch
With these changes, I am able to successfully save and resume from checkpoints when using the following `accelerate` config (in a multi-GPU, single-node setup):

Example training commands
Start an initial run:
Then resume from a checkpoint:
Python dependencies to run the script can be found here: https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/requirements.txt.
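For context on the resume step: as far as I can tell, the diffusers example scripts save checkpoints as `checkpoint-<step>` directories under the output directory, and `--resume_from_checkpoint latest` picks the one with the highest step. Below is a minimal sketch of that selection logic under that naming assumption. `find_latest_checkpoint` is a hypothetical helper name used for illustration; it is not part of the training script, `accelerate`, or `deepspeed`.

```python
import os
import re
from typing import Optional

# Assumed naming convention: checkpoint-<global_step>, e.g. checkpoint-500.
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the path of the highest-step checkpoint directory, or None.

    Hypothetical helper: it only inspects directory names and does not
    check that the checkpoint contents are complete or loadable.
    """
    if not os.path.isdir(output_dir):
        return None

    best_step = -1
    best_name = None
    for name in os.listdir(output_dir):
        match = _CHECKPOINT_RE.match(name)
        # Skip anything that is not a checkpoint-<step> directory.
        if match is None or not os.path.isdir(os.path.join(output_dir, name)):
            continue
        step = int(match.group(1))
        # Compare steps numerically so checkpoint-1000 beats checkpoint-500.
        if step > best_step:
            best_step, best_name = step, name

    if best_name is None:
        return None
    return os.path.join(output_dir, best_name)
```

The returned path is what one would typically hand to `accelerator.load_state(...)`, with the global step recovered from the directory name. Comparing the step numerically rather than as a string matters here, since a plain string sort would rank `checkpoint-500` above `checkpoint-1000`.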
accelerate-cli env
LMK if you need more information here.