GPU Memory Imbalance and OOM Errors During Training #2789
Comments
I only changed the model to Llama2, and although the memory imbalance issue still exists, the training works well under the following conditions. What is the issue with the Llama3 series models? How on earth can I fix this issue?
MODEL_ID = "meta-llama/Llama-2-7b-chat-hf" # change a model
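For context, here is a minimal sketch of the kind of 4-bit load being discussed; the quantization settings below are assumptions, not the exact values from the original script.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Only the checkpoint name is swapped; the rest of the loading code stays the same.
MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # previously a Llama3 checkpoint

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,                      # QLoRA-style 4-bit weights
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",  # lets accelerate split layers across both GPUs
)
```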
...
I believe this is directly related to PEFT/LoRA: when I did Llama3 full fine-tuning (FFT) without PEFT I did not get a CUDA OOM on 2x4090s, and usage was balanced (using FSDP). cc @SunMarc @Titus-von-Koeller @BenjaminBossan
Hi @DONGRYEOLLEE1, this is most probably a PEFT issue. After loading the model, is the model distributed evenly across the 2 GPUs?
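A quick way to check this, sketched under the assumption that the model was loaded with a device_map so that hf_device_map is populated; the snippet only uses standard torch.cuda calls.

```python
import torch

# Where did accelerate place each module? (set only when device_map was used)
print(getattr(model, "hf_device_map", "no hf_device_map attribute"))

# Memory actually allocated by this process on each GPU right after loading.
for i in range(torch.cuda.device_count()):
    allocated_gib = torch.cuda.memory_allocated(i) / 1024**3
    reserved_gib = torch.cuda.memory_reserved(i) / 1024**3
    print(f"cuda:{i} allocated={allocated_gib:.2f} GiB reserved={reserved_gib:.2f} GiB")
```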
First of all, thank you very much for your reply. The following shows the GPU memory status right after loading the Llama3 model.
Could you let me know the version of ...? In my case, I used a ...
@DONGRYEOLLEE1 I did not use PEFT; that is what I meant by full fine-tuning with FSDP.
I tried to reproduce but still have very little experience with DeepSpeed, so I may be doing something wrong. When I try to start the script with ...
So @DONGRYEOLLEE1, did you just launch with ...?
Did you change anything else? As the model is bnb quantized, full fine-tuning should not work, right?
@BenjaminBossan I needed CPU offloading to get it working, so quite slow, but no bnb/quantization was used.
I just launched it in a Jupyter notebook instead of a Python script for ... In the end, I solved the issue using DeepSpeed + QLoRA, for example (a rough sketch follows below). I also tried actions such as changing the versions of ... The following shows the GPU memory status when using the DS + QLoRA method. (batch_size = 2)
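For anyone landing here with the same problem, a rough sketch of how the DeepSpeed + QLoRA combination plugs into the Trainer. The config path and hyperparameters are placeholders, model and train_dataset are assumed to come from a QLoRA setup like the ones sketched elsewhere in this thread, and the job would normally be launched with accelerate launch or deepspeed rather than plain python.

```python
from transformers import Trainer, TrainingArguments

# `model` is assumed to be the 4-bit base model with LoRA adapters attached;
# `train_dataset` is assumed to be your tokenized dataset.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=2,      # matches the batch_size mentioned above
    gradient_accumulation_steps=4,      # placeholder
    bf16=True,
    deepspeed="ds_config_zero2.json",   # placeholder path to a ZeRO-2 config
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```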
Hmm, I'm confused, is the issue solved or not? :)
Oh, this issue wasn't solved for my script.
Could you show us how you launch the script? Also, from the last nvidia-smi output you posted, memory usage is 13532MiB and 12328MiB. This looks rather fine to me; I wouldn't expect usage to be 100% identical. Or is that referring to something else?
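If eyeballing nvidia-smi gets tedious, here is a small sketch of a logging callback that prints per-GPU memory at each logging step; the class name is made up, but TrainerCallback and torch.cuda.mem_get_info are standard APIs.

```python
import torch
from transformers import TrainerCallback

class GPUMemoryLogger(TrainerCallback):
    """Hypothetical helper: print per-GPU memory use whenever the Trainer logs."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        for i in range(torch.cuda.device_count()):
            free, total = torch.cuda.mem_get_info(i)  # device-wide, like nvidia-smi
            used_gib = (total - free) / 1024**3
            print(f"step {state.global_step} cuda:{i} used={used_gib:.2f} GiB")

# Usage: trainer = Trainer(..., callbacks=[GPUMemoryLogger()])
```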
My training script is provided in the reproduction section above.
Yes, I mean how do you launch the training script exactly?
Thanks for clarifying. In that case, I don't think it's PEFT related. @muellerzr any idea why this could be? Is some setting not being passed correctly?
System Info
Information
Tasks
An officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
Reproduction
I was training a Llama3-8B-IT model with QLoRA. I successfully proceeded with the training, but the GPU memory was not allocated evenly. As a result, I encountered an OOM error before completing even 100 steps. Upon checking the GPU memory during training, the imbalance appeared to be even more severe. In my case, GPU 1 used more memory than GPU 0.
I have trained with evenly balanced memory on an A100*8 server before, but I don't know whether that makes a difference in this case.
Below are the results of checking the GPU memory using nvidia-smi during training. The memory allocation imbalance issue is very serious!
This is my script:
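(The script itself did not survive in this copy of the issue. As a stand-in, here is a rough, self-contained sketch of the kind of multi-GPU QLoRA setup described above, with device_map="auto" splitting the quantized model across both GPUs; the checkpoint name, every hyperparameter, and the toy dataset are assumptions, not the reporter's actual code.)

```python
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, Trainer, TrainingArguments)

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    ),
    device_map="auto",  # naive layer splitting across the two GPUs
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
))

# Toy dataset so the sketch runs end to end; the real dataset is unknown.
enc = tokenizer(["Hello, world!"] * 32, padding="max_length",
                truncation=True, max_length=64)
train_dataset = Dataset.from_dict({**enc, "labels": enc["input_ids"]})

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="out",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=train_dataset,
)
trainer.train()
```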
Expected behavior
How can I resolve the GPU memory imbalance issue?