CUDA out of memory while loading PEFT weights using Accelerate on multi-GPU #2760

Open

sidtandon2014 opened this issue May 10, 2024 · 2 comments

System Info

bitsandbytes==0.43.0
peft==0.10.0
trl==0.7.10
accelerate==0.27.1
transformers==4.38.1

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • My own task or dataset (give details below)

Reproduction

I have fine-tuned a Gemma 7B model, but I get a "CUDA out of memory" error while running inference on 8 GPUs. According to the error trace, this happens while loading the PEFT weights. While debugging, I observed that the main GPU ends up with one process per child GPU loading the PEFT weights, and that is when the out-of-memory error occurs. If I reduce the number of GPUs to 2, everything works fine.

Script for loading the model

import os

from accelerate import Accelerator
from peft import PeftModel
from transformers import AutoModelForCausalLM

accelerator = Accelerator()

# Load the quantized base model entirely onto this process's GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    token=os.environ["HF_TOKEN"],
    device_map={"": accelerator.process_index},
)

# Attach the fine-tuned LoRA adapter for inference only.
model = PeftModel.from_pretrained(
    model,
    adapter_checkpoint_dir,
    is_trainable=False,
)
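
For context, a minimal sketch of one possible mitigation, assuming the failure comes from every process resolving the plain "cuda" device to GPU 0 when load_peft_weights copies the adapter tensors onto the GPU (this is an untested guess against these exact versions, not a confirmed fix):

import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Make this process's assigned GPU the current CUDA device *before* attaching
# the adapter, so that a bare "cuda" device string resolves to the local GPU
# rather than to GPU 0.
torch.cuda.set_device(accelerator.device)

# ...then load the base model and call PeftModel.from_pretrained exactly as in
# the snippet above.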

Error trace (the tracebacks of two processes are interleaved)

Traceback (most recent call last):
  File "/home/jupyter/llm-supervised-finetuning/inference.py", line 262, in <module>
    fire.Fire(main)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/jupyter/llm-supervised-finetuning/inference.py", line 233, in main
Traceback (most recent call last):
      File "/home/jupyter/llm-supervised-finetuning/inference.py", line 262, in <module>
model = get_model(model_name, adapter_checkpoint_dir, model_cfg, bnb_config,
  File "/home/jupyter/llm-supervised-finetuning/inference.py", line 140, in get_model
    model = PeftModel.from_pretrained(model
  File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 356, in from_pretrained
    fire.Fire(main)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    model.load_adapter(model_id, adapter_name, is_trainable=is_trainable, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 727, in load_adapter
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
        adapters_weights = load_peft_weights(model_id, device=torch_device, **hf_hub_download_kwargs)component = fn(*varargs, **kwargs)

  File "/opt/conda/lib/python3.10/site-packages/peft/utils/save_and_load.py", line 326, in load_peft_weights
  File "/home/jupyter/llm-supervised-finetuning/inference.py", line 233, in main
    model = get_model(model_name, adapter_checkpoint_dir, model_cfg, bnb_config,
  File "/home/jupyter/llm-supervised-finetuning/inference.py", line 140, in get_model
    model = PeftModel.from_pretrained(model
  File "/opt/conda/lib/python3.10/site-packages/peft/peft_model.py", line 356, in from_pretrained
    adapters_weights = safe_load_file(filename, device=device)
  File "/opt/conda/lib/python3.10/site-packages/safetensors/torch.py", line 310, in load_file
    result[k] = f.get_tensor(k)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.98 GiB. GPU 0 has a total capacity of 21.96 GiB of which 1.06 GiB is free. Process 3491179 has 13.83 GiB memory in use. Process 3491186 has 3.16 GiB memory in use. Process 3491180 has 3.16 GiB memory in use. Including non-PyTorch memory, this process has 184.00 MiB memory in use. Process 3491183 has 184.00 MiB memory in use. Process 3491185 has 184.00 MiB memory in use. Process 3491182 has 184.00 MiB memory in use. Of the allocated memory 0 bytes is allocated by PyTorch, and 0 bytes is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.

nvidia-smi output while loading the base model
[screenshot]

nvidia-smi output with PEFT weights
[screenshot]

Expected behavior

The inference script should run on all 8 GPUs without running out of memory.
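
For reference, this is roughly the inference pattern being aimed for: each process keeps its own copy of the model on its own GPU and handles a shard of the inputs via accelerator.split_between_processes. A sketch only; prompts and tokenizer are placeholders not shown in the original script.

prompts = ["..."] * 64  # placeholder inputs

with accelerator.split_between_processes(prompts) as local_prompts:
    for prompt in local_prompts:
        # Run generation for this process's shard on its own GPU.
        inputs = tokenizer(prompt, return_tensors="pt").to(accelerator.device)
        outputs = model.generate(**inputs, max_new_tokens=128)
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))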

jweihe commented Jun 6, 2024

Same problem here, have you solved it?

BenjaminBossan (Member) commented

Could you show the LoRA config that you used for this model? How large is the LoRA adapter (the adapter_model.safetensors or adapter_model.bin file)?
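
For example, something along these lines would gather the requested details; adapter_checkpoint_dir and the adapter_model.safetensors filename are taken from the discussion above (adjust if the adapter was saved as adapter_model.bin):

import os

from peft import PeftConfig

# Print the LoRA configuration (r, lora_alpha, target_modules, ...).
config = PeftConfig.from_pretrained(adapter_checkpoint_dir)
print(config)

# Report the size of the saved adapter weights on disk.
adapter_file = os.path.join(adapter_checkpoint_dir, "adapter_model.safetensors")
print(f"Adapter size: {os.path.getsize(adapter_file) / 1024**3:.2f} GiB")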
