fix jamba slow forward for multi-gpu #30418

Merged: 4 commits into huggingface:main on Apr 24, 2024

Conversation

@SunMarc (Member) commented Apr 23, 2024

What does this PR do?

This PR makes the slow forward of the Jamba model compatible with multi-GPU setups when use_cache=True. The issue is that the cache is not initialized on the right device. I am not sure we want to refactor the cache, since key_cache and value_cache are already handled correctly; this is specific to ssm_states. Moving the tensor manually should only impact the first forward: after that, the cache is correctly initialized.
Fixes #30367
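
For context, here is a minimal, self-contained sketch of the kind of device move described above. The shapes, device comments, and the slow_forward_step helper are hypothetical and are not the actual Jamba modeling code:

import torch

# Hypothetical sketch: before the slow SSM update, move the layer's cached state
# to the device of the incoming activations, since device_map="auto" may have
# placed them on different GPUs.
def slow_forward_step(hidden_states: torch.Tensor, ssm_state: torch.Tensor) -> torch.Tensor:
    ssm_state = ssm_state.to(hidden_states.device)  # only needed on the first forward
    # ... the actual SSM recurrence would update ssm_state here ...
    return ssm_state

hidden_states = torch.randn(1, 16, 32)  # e.g. lives on cuda:1 in a multi-GPU split
ssm_state = torch.zeros(1, 32, 8)       # e.g. was initialized on cuda:0
ssm_state = slow_forward_step(hidden_states, ssm_state)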

Code:

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig

prompt = "def quicksort(arr):\n"
tokenizer = AutoTokenizer.from_pretrained("ai21labs/Jamba-tiny-random")
tokenizer.pad_token = tokenizer.eos_token
config = AutoConfig.from_pretrained("ai21labs/Jamba-tiny-random")
# change the following to switch between the fast (mamba kernels) and slow forward
config.use_mamba_kernels = False
model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-tiny-random",
    device_map="auto",
    config=config,
)
print(model.hf_device_map)
print(model.config)
inputs = tokenizer.encode(prompt, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0]))

@SunMarc requested a review from ArthurZucker on April 23, 2024, 11:06
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment

Seems like it's the best tradeoff:

could be changed but would need to accommodate every layer? (or is the PKV handled by accelerate separately?)

@SunMarc (Member, Author) commented Apr 24, 2024

could be changed but would need to accommodate every layer? (or is the PKV handled by accelerate separately?)

Yes, we would need to accommodate every layer, but that can become quite complicated if we start to introduce layers that are offloaded to disk or CPU. Hence, I don't think it is worth investing too much effort in that. The PKV is not handled by accelerate separately.
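
For illustration only (the module names and device assignments below are made up, not taken from the PR): refactoring the cache initialization would require a per-layer device lookup like this for every layer, and it gets messier once accelerate offloads some layers to CPU or disk.

# hypothetical device map that device_map="auto" might produce on two GPUs
example_device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 0,
    "model.layers.2": 1,
    "model.layers.3": 1,
    "lm_head": 1,
}

def device_for_layer(device_map, layer_idx):
    # a per-layer cache would need a lookup like this for each layer,
    # and real offloading adds "cpu" / "disk" entries on top of GPU indices
    return device_map.get(f"model.layers.{layer_idx}", "cpu")

print(device_for_layer(example_device_map, 2))  # -> 1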

@SunMarc merged commit 37fa1f6 into huggingface:main on Apr 24, 2024
18 checks passed
Development

Successfully merging this pull request may close these issues.

model.generate fail for Jamba with device_map="auto"