
[Bugfix] fix rope error when load models with different dtypes #4835

Merged
4 commits merged into vllm-project:main on May 17, 2024

Conversation

jinzhen-lin (Contributor):

Currently, if we load models with different dtypes in the same process, we get an error like:

File ~/.miniconda3/lib/python3.8/site-packages/vllm/_custom_ops.py:89, in rotary_embedding(positions, query, key, head_size, cos_sin_cache, is_neox)
     81 def rotary_embedding(
     82     positions: torch.Tensor,
     83     query: torch.Tensor,
   (...)
     87     is_neox: bool,
     88 ) -> None:
---> 89     vllm_ops.rotary_embedding(positions, query, key, head_size, cos_sin_cache,
     90                               is_neox)

RuntimeError: expected scalar type BFloat16 but found Half

To reproduce:

import torch
from vllm import LLM

model_fp16 = LLM("Qwen/Qwen1.5-0.5B", dtype=torch.half, gpu_memory_utilization=0.4)
model_bf16 = LLM("Qwen/Qwen1.5-0.5B", dtype=torch.bfloat16, gpu_memory_utilization=0.4)

The bug is caused by the rope cache: models with different dtypes end up sharing the same cached rope module. This PR adds the dtype to the cache key to fix this.
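To make the failure mode concrete, here is a minimal sketch of the caching pattern described above. All names here (_ROPE_CACHE, ToyRotaryEmbedding, get_rope_cached) are illustrative, not vLLM's actual internals; the point is only that keying the cache on dtype prevents a cos/sin cache built for fp16 from being handed to a bf16 model.

import torch
from typing import Dict, Tuple

# Hypothetical module-level cache, keyed the way the PR description suggests.
_ROPE_CACHE: Dict[Tuple, "ToyRotaryEmbedding"] = {}

class ToyRotaryEmbedding(torch.nn.Module):
    # Toy rope module: its cos/sin cache is materialized in a single dtype.
    def __init__(self, head_size: int, max_position: int, base: float,
                 dtype: torch.dtype):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, head_size, 2).float() / head_size))
        freqs = torch.outer(torch.arange(max_position).float(), inv_freq)
        # A cache built for fp16 fails when the kernel later receives bf16 tensors.
        cache = torch.cat([freqs.cos(), freqs.sin()], dim=-1).to(dtype)
        self.register_buffer("cos_sin_cache", cache)

def get_rope_cached(head_size: int, max_position: int, base: float,
                    dtype: torch.dtype) -> ToyRotaryEmbedding:
    # Including dtype in the key keeps fp16 and bf16 models from sharing a module.
    key = (head_size, max_position, base, dtype)
    if key not in _ROPE_CACHE:
        _ROPE_CACHE[key] = ToyRotaryEmbedding(head_size, max_position, base, dtype)
    return _ROPE_CACHE[key]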

@@ -474,7 +474,7 @@ def get_rope(
     else:
         rope_scaling_args = None
     key = (head_size, rotary_dim, max_position, base, is_neox_style,
-           rope_scaling_args)
+           rope_scaling_args, torch.get_default_dtype())
Collaborator:

can we pass the dtype as an argument instead?

jinzhen-lin (Contributor, Author):

done.

@@ -463,7 +468,10 @@ def get_rope(
     base: int,
     is_neox_style: bool = True,
     rope_scaling: Optional[Dict[str, Any]] = None,
+    dtype: Optional[torch.dtype] = None,
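To illustrate the new parameter, here is a hedged usage sketch. It assumes get_rope is imported from vLLM's rotary_embedding module and that the returned module exposes its cos_sin_cache buffer; the argument values are made up, not taken from any real model config.

import torch
from vllm.model_executor.layers.rotary_embedding import get_rope

# Illustrative arguments only; a real model passes values from its config.
rope = get_rope(
    head_size=64,
    rotary_dim=64,
    max_position=2048,
    base=10000,
    is_neox_style=True,
    rope_scaling=None,
    dtype=torch.bfloat16,  # pin the rope cache to the model's dtype
)
print(rope.cos_sin_cache.dtype)  # expected: torch.bfloat16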
Collaborator:

QQ: is it difficult to always require passing the dtype instead?

jinzhen-lin (Contributor, Author):

I noticed that the linear module in vLLM sets param_dtype as an optional argument, so I think it may be better to keep this consistent.
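For reference, the optional-argument pattern being discussed usually reduces to a default-dtype fallback inside the function. A minimal sketch of that pattern (not the exact merged code):

import torch
from typing import Optional

def resolve_dtype(dtype: Optional[torch.dtype] = None) -> torch.dtype:
    # Callers that know the model dtype pass it explicitly; everyone else
    # falls back to the process-wide default dtype.
    return dtype if dtype is not None else torch.get_default_dtype()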

rkooo567 (Collaborator) left a comment:

import torch
from vllm import LLM

model_fp16 = LLM("Qwen/Qwen1.5-0.5B", dtype=torch.half, gpu_memory_utilization=0.4)
model_bf16 = LLM("Qwen/Qwen1.5-0.5B", dtype=torch.bfloat16, gpu_memory_utilization=0.4)

Can you add this as a regression test? And then it lgtm

jinzhen-lin (Contributor, Author):

> import torch
> from vllm import LLM
>
> model_fp16 = LLM("Qwen/Qwen1.5-0.5B", dtype=torch.half, gpu_memory_utilization=0.4)
> model_bf16 = LLM("Qwen/Qwen1.5-0.5B", dtype=torch.bfloat16, gpu_memory_utilization=0.4)
>
> Can you add this as a regression test? And then it lgtm

I added a rope module cache test instead of a model test, is that ok?
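For readers following along, a regression test along these lines might look roughly like the sketch below. It is not the test added in this PR; it assumes the get_rope import path used in the earlier sketch and that the cos_sin_cache buffer is stored in the requested dtype.

import torch
from vllm.model_executor.layers.rotary_embedding import get_rope

def test_rope_cache_keyed_by_dtype():
    # The same rope configuration built under two dtypes must not return
    # one shared cached module.
    rope_fp16 = get_rope(64, 64, 2048, 10000, True, None, torch.float16)
    rope_bf16 = get_rope(64, 64, 2048, 10000, True, None, torch.bfloat16)
    assert rope_fp16 is not rope_bf16
    assert rope_fp16.cos_sin_cache.dtype == torch.float16
    assert rope_bf16.cos_sin_cache.dtype == torch.bfloat16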

rkooo567 (Collaborator) left a comment:

Yeah test lgtm!

rkooo567 merged commit 33e0823 into vllm-project:main on May 17, 2024
55 checks passed
tybalex pushed a commit to tybalex/vllm-function-call that referenced this pull request on May 25, 2024