GPU Memory Imbalance and OOM Errors During Training #2789

Open · 2 of 4 tasks
DONGRYEOLLEE1 opened this issue May 17, 2024 · 14 comments

@DONGRYEOLLEE1

System Info

- `Accelerate` version: 0.30.0
- Platform: Linux-6.5.0-27-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /data/envs/tt/bin/accelerate
- Python version: 3.10.12
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.2+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 125.62 GB
- GPU type: NVIDIA RTX A6000
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - use_cpu: False
        - debug: True
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - deepspeed_config: {'deepspeed_config_file': '/data/dev/', 'zero3_init_flag': True}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []
        - dynamo_config: {'dynamo_backend': 'EAGER', 'dynamo_mode': 'default', 'dynamo_use_dynamic': False, 'dynamo_use_fullgraph': False}

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

I was training a Llama3-8B-Instruct model with QLoRA. Training started successfully, but GPU memory was not allocated evenly, and I hit an OOM error before completing even 100 steps. Checking GPU memory during training, the imbalance became even more severe; in my case, GPU 1 used far more memory than GPU 0.

On a previous 8×A100 server, memory was allocated evenly during training, so I am not sure what is different in this case.

Below is the nvidia-smi output during training. The memory imbalance is severe!

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 30%   58C    P2             145W / 300W |  13224MiB / 49140MiB |     40%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 47%   71C    P2             221W / 300W |  32908MiB / 49140MiB |     73%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    207219      C   /data/envs/tt/bin/python                  13090MiB |
|    1   N/A  N/A    207219      C   /data/envs/tt/bin/python                  32774MiB |
+---------------------------------------------------------------------------------------+

This is my script:

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

quantization_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_compute_dtype = torch.bfloat16,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_use_double_quant = True
)

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
tok.pad_token_id = tok.eos_token_id
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config = quantization_config,
    device_map = 'auto'
)

data = load_dataset("...")

# `process` is my own preprocessing function (defined elsewhere) that builds a "text" column.
proc_data = data.map(process, remove_columns = data['train'].column_names)

tokenized_proc_data = proc_data.map(lambda x: tok(x['text'], truncation = True, max_length = 2048), batched = True)
tokenized_proc_data = tokenized_proc_data.remove_columns("text")

lora_config = LoraConfig(
    r = 32,
    lora_alpha = 32,
    lora_dropout = 0.01,
    target_modules = "all-linear"
)

model = get_peft_model(model, lora_config)

args = TrainingArguments(
    num_train_epochs = 1,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,
    learning_rate = 2e-8,
    logging_steps = 100,
    warmup_steps = 100,
    save_steps = 100,
    save_total_limit = 3,
    output_dir = "llama3-Ko-test",
    optim = "paged_adamw_8bit",
    bf16 = True,
    report_to = "wandb",
    run_name = "llama3-Ko-test",
)

def formatting_func(x):
    return [x]

# Mark the model as model-parallel so the Trainer does not try to move or replicate it.
model.is_parallelizable = True
model.model_parallel = True

trainer = SFTTrainer(
    model = model,
    args = args,
    train_dataset = tokenized_proc_data['train'],
    formatting_func = formatting_func,
)

trainer.train()

Expected behavior

How can I resolve the GPU memory imbalance issue?
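For reference, the per-GPU usage can also be checked from inside the training process rather than only with nvidia-smi; a minimal sketch in plain PyTorch (not part of my script above):

import torch

# Print allocated / reserved memory for every visible GPU, in GiB.
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 2**30
    reserved = torch.cuda.memory_reserved(i) / 2**30
    print(f"cuda:{i}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")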

@DONGRYEOLLEE1 (Author) commented May 17, 2024

I changed only the model to Llama2; although the memory imbalance issue still exists, training runs well under the conditions below.

What is different about the Llama3 series models?

How can I fix this issue?

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # changed the model
...
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 34%   60C    P2              93W / 300W |  35556MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 46%   74C    P2             289W / 300W |  46250MiB / 49140MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    278782      C   python                                    35422MiB |
|    1   N/A  N/A    278782      C   python                                    45988MiB |
+---------------------------------------------------------------------------------------+

@muellerzr (Collaborator) commented May 17, 2024

I believe this is directly related to PEFT/LoRA, as when I did llama-3-FFT w/o that I did not get a CUDA OOM on 2x4090's, and usage was balanced. (Using FSDP). cc @SunMarc @Titus-von-Koeller @BenjaminBossan

@SunMarc (Member) commented May 17, 2024

Hi @DONGRYEOLLEE1, this is most probably a PEFT issue. After loading the model, is the model distributed evenly across the 2 GPUs?
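If you loaded it with device_map='auto', the placement should be recorded in model.hf_device_map; a quick sketch to check how the submodules were distributed (for illustration only):

from collections import Counter

# `model` is assumed to be the object returned by
# AutoModelForCausalLM.from_pretrained(..., device_map="auto")
print(model.hf_device_map)                    # maps each submodule name to a device
print(Counter(model.hf_device_map.values()))  # how many submodules landed on each device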

@DONGRYEOLLEE1 (Author)

@SunMarc

Hi @DONGRYEOLLEE1, this is most probably a PEFT issue. After loading the model, is the model distributed evenly across the 2 GPUs?

First of all, thank you very much for your reply.

The following shows the GPU memory status right after loading the Llama3 model.

| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 30%   39C    P8              27W / 300W |   2212MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 30%   42C    P8              33W / 300W |   3990MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    314290      C   /data/envs/llm_t/bin/python                2206MiB |
|    1   N/A  N/A    314290      C   /data/envs/llm_t/bin/python                3984MiB |
+---------------------------------------------------------------------------------------+

@DONGRYEOLLEE1 (Author)

@muellerzr

I believe this is directly related to PEFT/LoRA, as when I did llama-3-FFT w/o that I did not get a CUDA OOM on 2x4090's, and usage was balanced. (Using FSDP). cc @SunMarc @Titus-von-Koeller @BenjaminBossan

Could you let me know the version of peft you used for fine-tuning?

In my case, I used peft==0.10.0.

@muellerzr (Collaborator)

@DONGRYEOLLEE1 I did not use PEFT; that is what I meant by full fine-tuning with FSDP.

@BenjaminBossan (Member)

I tried to reproduce but still have very little experience with DeepSpeed, so I may be doing something wrong. When I try to start the script with accelerate launch, I get:

ValueError: You can't train a model that has been loaded with device_map='auto' in any distributed mode

So @DONGRYEOLLEE1 did you just launch with python ...? If I do that, I also get imbalanced memory, but I'm not sure if this is using DS correctly.
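For what it's worth: if each process is supposed to hold its own full copy of the quantized model (the usual setup for data-parallel QLoRA, as far as I know), the loading would look roughly like this instead of device_map='auto'. A sketch only, reusing MODEL_ID and quantization_config from the reproduction script:

from accelerate import PartialState
from transformers import AutoModelForCausalLM

# Place the whole quantized model on this process's own GPU instead of
# letting device_map="auto" split the layers across GPU 0 and GPU 1.
device_map = {"": PartialState().process_index}

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,                                   # from the reproduction script above
    quantization_config = quantization_config,  # from the reproduction script above
    device_map = device_map,
)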

when I did llama-3-FFT w/o that I did not get a CUDA OOM on 2x4090's, and usage was balanced.

Did you change anything else? As the model is bnb quantized, full fine-tuning should not work, right?

@muellerzr (Collaborator)

@BenjaminBossan I needed CPU offloading to get it working, so it was quite slow, but no bnb/quantization was used.

@DONGRYEOLLEE1 (Author) commented May 22, 2024

@BenjaminBossan

I launched it from a Jupyter notebook rather than as a Python script with python ....

In the end, I solved the issue by using DeepSpeed + QLoRA.

I also tried things such as changing the versions of PEFT and Accelerate, but the memory imbalance issue still exists when training in a Jupyter notebook.

The following shows the GPU memory status when using the DeepSpeed + QLoRA method (batch_size = 2).

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 43%   67C    P2             212W / 300W |  13532MiB / 49140MiB |     84%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 47%   71C    P2             203W / 300W |  12328MiB / 49140MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      4026      C   /data/envs/llm_test/bin/python            13526MiB |
|    1   N/A  N/A      4027      C   /data/envs/llm_test/bin/python            12322MiB |
+---------------------------------------------------------------------------------------+

@BenjaminBossan (Member)

In the end, I solved the issue by using DeepSpeed + QLoRA.

I also tried things such as changing the versions of PEFT and Accelerate, but the memory imbalance issue still exists when training in a Jupyter notebook.

Hmm, I'm confused, is the issue solved or not? :)

@DONGRYEOLLEE1 (Author)

In the end, I solved the issue by using DeepSpeed + QLoRA.
I also tried things such as changing the versions of PEFT and Accelerate, but the memory imbalance issue still exists when training in a Jupyter notebook.

Hmm, I'm confused, is the issue solved or not? :)

Oh, the issue wasn't solved for my own script.

@BenjaminBossan (Member)

Could you show us how you launch the script? Also, from the last nvidia-smi output you posted, memory usage is 13532MiB and 12328MiB. This looks rather fine to me, I wouldn't expect usage to be 100% identical. Or is that referring to something else?

@DONGRYEOLLEE1 (Author)

Could you show us how you launch the script? Also, from the last nvidia-smi output you posted, memory usage is 13532MiB and 12328MiB. This looks rather fine to me, I wouldn't expect usage to be 100% identical. Or is that referring to something else?

My training script is provided in the Reproduction section above.

  1. This is the state of the GPU shortly after the start of training.
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 30%   58C    P2             145W / 300W |  13224MiB / 49140MiB |     40%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 47%   71C    P2             221W / 300W |  32908MiB / 49140MiB |     73%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
  2. This is the state of the GPU just before OOM.
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 34%   60C    P2              93W / 300W |  35556MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 46%   74C    P2             289W / 300W |  46250MiB / 49140MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
  3. The GPU state of 13.3 GB / 12.4 GB reflects the GPU status during training with the DeepSpeed + QLoRA approach (it works well, without any imbalance!). While training with DeepSpeed did not show memory imbalance issues, running my script in a Jupyter notebook does result in such imbalance.

@BenjaminBossan (Member)

My training script is provided in the Reproduction section above.

Yes, I mean how do you launch the training script exactly?

3. While training with DeepSpeed did not show memory imbalance issues, running my script in a Jupyter notebook does result in such imbalance.

Thanks for clarifying. In that case, I don't think it's PEFT related. @muellerzr any idea why this could be? Is some setting not being passed correctly?
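As a side note, multi-GPU training from a notebook is usually started through accelerate's notebook_launcher rather than by running the training cells directly; a minimal sketch (the training_function wrapper is assumed, not taken from your code):

from accelerate import notebook_launcher

def training_function():
    # Build the tokenizer, model, datasets and SFTTrainer here,
    # then call trainer.train().
    ...

# Spawns one process per GPU (2 here); each process sees its own device.
notebook_launcher(training_function, args=(), num_processes=2)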
