GPU Memory Imbalance and OOM Errors During Training #2789

Open · 2 of 4 tasks
DONGRYEOLLEE1 opened this issue May 17, 2024 · 14 comments

@DONGRYEOLLEE1

System Info

- `Accelerate` version: 0.30.0
- Platform: Linux-6.5.0-27-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /data/envs/tt/bin/accelerate
- Python version: 3.10.12
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.2+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 125.62 GB
- GPU type: NVIDIA RTX A6000
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - use_cpu: False
        - debug: True
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - deepspeed_config: {'deepspeed_config_file': '/data/dev/', 'zero3_init_flag': True}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []
        - dynamo_config: {'dynamo_backend': 'EAGER', 'dynamo_mode': 'default', 'dynamo_use_dynamic': False, 'dynamo_use_fullgraph': False}

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

I was training a Llama3-8B-Instruct model with QLoRA. Training started successfully, but GPU memory was not allocated evenly, and I hit an OOM error before completing even 100 steps. Checking GPU memory during training, the imbalance became even more severe; in my case, GPU 1 used far more memory than GPU 0.

On a previous 8×A100 server, memory was allocated evenly during training, so I am not sure what is different in this case.

Below is the nvidia-smi output during training. The memory imbalance is severe!

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 30%   58C    P2             145W / 300W |  13224MiB / 49140MiB |     40%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 47%   71C    P2             221W / 300W |  32908MiB / 49140MiB |     73%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    207219      C   /data/envs/tt/bin/python                  13090MiB |
|    1   N/A  N/A    207219      C   /data/envs/tt/bin/python                  32774MiB |
+---------------------------------------------------------------------------------------+

This is my script:

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

quantization_config = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_compute_dtype = torch.bfloat16,
    bnb_4bit_quant_type = "nf4",
    bnb_4bit_use_double_quant = True
)

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
tok.pad_token_id = tok.eos_token_id
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config = quantization_config,
    device_map = 'auto'
)

data = load_dataset("...")

# `process` is my own preprocessing function (defined elsewhere) that builds a "text" column.
proc_data = data.map(process, remove_columns = data['train'].column_names)

tokenized_proc_data = proc_data.map(lambda x: tok(x['text'], truncation = True, max_length = 2048), batched = True)
tokenized_proc_data = tokenized_proc_data.remove_columns("text")

lora_config = LoraConfig(
    r = 32,
    lora_alpha = 32,
    lora_dropout = 0.01,
    target_modules = "all-linear"
)

model = get_peft_model(model, lora_config)

args = TrainingArguments(
    num_train_epochs = 1,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 4,
    learning_rate = 2e-8,
    logging_steps = 100,
    warmup_steps = 100,
    save_steps = 100,
    save_total_limit = 3,
    output_dir = "llama3-Ko-test",
    optim = "paged_adamw_8bit",
    bf16 = True,
    report_to = "wandb",
    run_name = "llama3-Ko-test",
)

def formatting_func(x):
    return [x]

# Mark the model as model-parallel so the Trainer does not try to move or replicate it.
model.is_parallelizable = True
model.model_parallel = True

trainer = SFTTrainer(
    model = model,
    args = args,
    train_dataset = tokenized_proc_data['train'],
    formatting_func = formatting_func,
)

trainer.train()

Expected behavior

How can I resolve the GPU memory imbalance issue?
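For reference, the per-GPU usage can also be checked from inside the training process rather than only with nvidia-smi; a minimal sketch in plain PyTorch (not part of my script above):

import torch

# Print allocated / reserved memory for every visible GPU, in GiB.
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 2**30
    reserved = torch.cuda.memory_reserved(i) / 2**30
    print(f"cuda:{i}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")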

@DONGRYEOLLEE1 (Author) commented May 17, 2024

I changed only the model to Llama2; although the memory imbalance issue still exists, training runs well under the conditions below.

What is different about the Llama3 series models?

How can I fix this issue?

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # changed the model
...
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 34%   60C    P2              93W / 300W |  35556MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 46%   74C    P2             289W / 300W |  46250MiB / 49140MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    278782      C   python                                    35422MiB |
|    1   N/A  N/A    278782      C   python                                    45988MiB |
+---------------------------------------------------------------------------------------+

@muellerzr (Collaborator) commented May 17, 2024

I believe this is directly related to PEFT/LoRA, as when I did llama-3-FFT w/o that I did not get a CUDA OOM on 2x4090's, and usage was balanced. (Using FSDP). cc @SunMarc @Titus-von-Koeller @BenjaminBossan

@SunMarc (Member) commented May 17, 2024

Hi @DONGRYEOLLEE1, this is most probably a PEFT issue. After loading the model, is the model distributed evenly across the 2 GPUs?
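If you loaded it with device_map='auto', the placement should be recorded in model.hf_device_map; a quick sketch to check how the submodules were distributed (for illustration only):

from collections import Counter

# `model` is assumed to be the object returned by
# AutoModelForCausalLM.from_pretrained(..., device_map="auto")
print(model.hf_device_map)                    # maps each submodule name to a device
print(Counter(model.hf_device_map.values()))  # how many submodules landed on each device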

@DONGRYEOLLEE1 (Author)

@SunMarc

Hi @DONGRYEOLLEE1, this is most probably a PEFT issue. After loading the model, is the model distributed evenly across the 2 GPUs?

First of all, thank you very much for your reply.

The following shows the GPU memory status right after loading the Llama3 model.

| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 30%   39C    P8              27W / 300W |   2212MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 30%   42C    P8              33W / 300W |   3990MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    314290      C   /data/envs/llm_t/bin/python                2206MiB |
|    1   N/A  N/A    314290      C   /data/envs/llm_t/bin/python                3984MiB |
+---------------------------------------------------------------------------------------+

@DONGRYEOLLEE1 (Author)

@muellerzr

I believe this is directly related to PEFT/LoRA, as when I did llama-3-FFT w/o that I did not get a CUDA OOM on 2x4090's, and usage was balanced. (Using FSDP). cc @SunMarc @Titus-von-Koeller @BenjaminBossan

Could you let me know the version of peft you used for fine-tuning?

In my case, I used peft==0.10.0.

@muellerzr (Collaborator)

@DONGRYEOLLEE1 I did not use PEFT; that is what I meant by full fine-tuning with FSDP.

@BenjaminBossan (Member)

I tried to reproduce but still have very little experience with DeepSpeed, so I may be doing something wrong. When I try to start the script with accelerate launch, I get:

ValueError: You can't train a model that has been loaded with device_map='auto' in any distributed mode

So @DONGRYEOLLEE1 did you just launch with python ...? If I do that, I also get imbalanced memory, but I'm not sure if this is using DS correctly.
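For what it's worth: if each process is supposed to hold its own full copy of the quantized model (the usual setup for data-parallel QLoRA, as far as I know), the loading would look roughly like this instead of device_map='auto'. A sketch only, reusing MODEL_ID and quantization_config from the reproduction script:

from accelerate import PartialState
from transformers import AutoModelForCausalLM

# Place the whole quantized model on this process's own GPU instead of
# letting device_map="auto" split the layers across GPU 0 and GPU 1.
device_map = {"": PartialState().process_index}

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,                                   # from the reproduction script above
    quantization_config = quantization_config,  # from the reproduction script above
    device_map = device_map,
)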

when I did llama-3-FFT w/o that I did not get a CUDA OOM on 2x4090's, and usage was balanced.

Did you change anything else? As the model is bnb quantized, full fine-tuning should not work, right?

@muellerzr (Collaborator)

@BenjaminBossan I needed CPU offloading to get it working, so it was quite slow, but no bnb/quantization was used.

@DONGRYEOLLEE1 (Author) commented May 22, 2024

@BenjaminBossan

I launched it from a Jupyter notebook rather than as a Python script with python ....

In the end, I solved the issue by using DeepSpeed + QLoRA.

I also tried things such as changing the versions of PEFT and Accelerate, but the memory imbalance issue still exists when training in a Jupyter notebook.

The following shows the GPU memory status when using the DeepSpeed + QLoRA method (batch_size = 2).

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 43%   67C    P2             212W / 300W |  13532MiB / 49140MiB |     84%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 47%   71C    P2             203W / 300W |  12328MiB / 49140MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      4026      C   /data/envs/llm_test/bin/python            13526MiB |
|    1   N/A  N/A      4027      C   /data/envs/llm_test/bin/python            12322MiB |
+---------------------------------------------------------------------------------------+

@BenjaminBossan (Member)

In the end, I solved the issue by using DeepSpeed + QLoRA.

I also tried things such as changing the versions of PEFT and Accelerate, but the memory imbalance issue still exists when training in a Jupyter notebook.

Hmm, I'm confused, is the issue solved or not? :)

@DONGRYEOLLEE1 (Author)

In the end, I solved the issue by using DeepSpeed + QLoRA.
I also tried things such as changing the versions of PEFT and Accelerate, but the memory imbalance issue still exists when training in a Jupyter notebook.

Hmm, I'm confused, is the issue solved or not? :)

Oh, the issue wasn't solved for my own script.

@BenjaminBossan (Member)

Could you show us how you launch the script? Also, from the last nvidia-smi output you posted, memory usage is 13532MiB and 12328MiB. This looks rather fine to me, I wouldn't expect usage to be 100% identical. Or is that referring to something else?

@DONGRYEOLLEE1 (Author)

Could you show us how you launch the script? Also, from the last nvidia-smi output you posted, memory usage is 13532MiB and 12328MiB. This looks rather fine to me, I wouldn't expect usage to be 100% identical. Or is that referring to something else?

My training script is provided in the Reproduction section above.

  1. This is the state of the GPU shortly after the start of training.
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 30%   58C    P2             145W / 300W |  13224MiB / 49140MiB |     40%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 47%   71C    P2             221W / 300W |  32908MiB / 49140MiB |     73%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
  2. This is the state of the GPU just before OOM.
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:1F:00.0 Off |                  Off |
| 34%   60C    P2              93W / 300W |  35556MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               Off | 00000000:8B:00.0 Off |                  Off |
| 46%   74C    P2             289W / 300W |  46250MiB / 49140MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
  3. The GPU state of 13.3 GB / 12.4 GB reflects the GPU status during training with the DeepSpeed + QLoRA approach (it works well, without any imbalance!). While training with DeepSpeed did not show memory imbalance issues, running my script in a Jupyter notebook does result in such imbalance.

@BenjaminBossan (Member)

My training script is provided in the Reproduction section above.

Yes, I mean how do you launch the training script exactly?

3. While training with DeepSpeed did not show memory imbalance issues, running my script in a Jupyter notebook does result in such imbalance.

Thanks for clarifying. In that case, I don't think it's PEFT related. @muellerzr any idea why this could be? Is some setting not being passed correctly?
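As a side note, multi-GPU training from a notebook is usually started through accelerate's notebook_launcher rather than by running the training cells directly; a minimal sketch (the training_function wrapper is assumed, not taken from your code):

from accelerate import notebook_launcher

def training_function():
    # Build the tokenizer, model, datasets and SFTTrainer here,
    # then call trainer.train().
    ...

# Spawns one process per GPU (2 here); each process sees its own device.
notebook_launcher(training_function, args=(), num_processes=2)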
