
[Kernel] add bfloat16 support for gptq marlin kernel #4788

Merged

Conversation

jinzhen-lin
Contributor

Some models overflow when using fp16 inference (e.g. Deepseek-V2), so we should add bfloat16 support to the quantization kernels. This PR adds bfloat16 support to the gptq marlin kernel.

Unlike the gptq kernel in #4781, the gptq marlin kernel doesn't use atomicAdd, so bfloat16 performance is close to that of float16.

Related issue: #2149

Main changes:

  • add bfloat16 input/output support to the CUDA kernels
  • dequantize qweight to bfloat16 in the proper way
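To illustrate the overflow issue mentioned above, here is a minimal host-side sketch (not part of this PR; it assumes a CUDA 11+ toolkit for cuda_bf16.h). A value well inside bfloat16's range already saturates float16, whose maximum finite value is 65504:

#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_bf16.h>

int main() {
  // float16 has a max finite value of 65504, so large intermediate activations
  // round to +inf; bfloat16 keeps float32's 8-bit exponent (max ~3.4e38) at the
  // cost of mantissa precision.
  float big = 1e5f;                          // representative large activation
  __half h = __float2half(big);              // rounds to +inf in fp16
  __nv_bfloat16 b = __float2bfloat16(big);   // still finite in bf16
  printf("fp16: %f  bf16: %f\n", __half2float(h), __bfloat162float(b));
  return 0;
}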

@robertgshaw2-neuralmagic
Collaborator

@alexm-nm can you review this?

Contributor

@alexm-neuralmagic alexm-neuralmagic left a comment

Thanks for doing the work of adding bfloat16 to marlin. Left some comments.

@@ -9,6 +9,10 @@
#include <cuda_runtime.h>
#include <iostream>

#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
#include <cuda_bf16.h>
Contributor

Why is it necessary to check here that SM >= 8.0? Shouldn't the #include <cuda_bf16.h> work regardless?
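For context, a common alternative (a sketch, not the code in this PR) is to include the header unconditionally, since it ships with any CUDA 11+ toolkit, and to guard only the device code paths that need SM >= 8.0 instructions:

#include <cuda_bf16.h>  // the include itself compiles on any pass

// __CUDA_ARCH__ is only defined during device compilation, so this branch is
// taken by host code and by sm_80+ device passes; older device passes skip it.
#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ >= 800
  // nv_bfloat16 arithmetic / conversion intrinsics go here
#endif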

@@ -38,6 +42,7 @@ constexpr int div_ceil(int a, int b) { return (a + b - 1) / b; }
// No support for async
#else


Contributor

nit: formatting

C_ptr += 16 * thread_m_blocks * (prob_n / 8) * par;
}
}
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
Contributor

This is a problematic way to add bfloat16 support to marlin, since we should be able to compile the marlin module for both float16 and bfloat16 at the same time. Could you restructure the code to pass a template parameter to the Marlin<...> kernel instead, and use that template parameter for all of the functions that need a templated type? If you don't have time, I can take over and fix it for you. Tell me what works.
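A minimal sketch of the restructuring being requested (names and signatures are illustrative, not the actual Marlin kernel): make the element type a template parameter so both instantiations exist in the same build, and pick one at dispatch time.

#include <cuda_fp16.h>
#include <cuda_bf16.h>

// Hypothetical, heavily simplified signature: the real Marlin kernel takes many
// more parameters, but the point is that scalar_t flows into every helper
// (dequant_4bit<scalar_t>, mma<scalar_t>, ...) instead of a single global type.
template <typename scalar_t>  // half or nv_bfloat16
__global__ void marlin_gemm_kernel(const scalar_t* __restrict__ A,
                                   const int* __restrict__ B_quant,
                                   scalar_t* __restrict__ C,
                                   int prob_m, int prob_n, int prob_k) {
  // ... shared pipeline, templated on scalar_t ...
}

// Host-side dispatch selects the instantiation from the tensor dtype, e.g.
//   fp16: marlin_gemm_kernel<half><<<grid, block>>>(...);
//   bf16: marlin_gemm_kernel<nv_bfloat16><<<grid, block>>>(...);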

Contributor Author

OK, I will restructure it soon.

@jinzhen-lin
Contributor Author

@alexm-nm I have restructured the code. Can you review it again?

Contributor

@alexm-neuralmagic alexm-neuralmagic left a comment

@jinzhen-lin this looks much better with the template param! I left some minor comments. Could you also add a test to test_gptq_marlin.py with some models that run with dtype.bfloat16 (so we have correctness verified on every change going forward). Again, thanks for the help!

size_k, workspace.data_ptr(), num_bits, has_act_order, is_k_full,
num_groups, group_size, dev, at::cuda::getCurrentCUDAStream(dev),
thread_k, thread_n, sms, gptq_marlin::max_par);
} else if (a.scalar_type() == at::ScalarType::BFloat16) {
Contributor

You had an #ifdef above to check for CUDA_ARCH >= 8 where you access nv_bfloat16. I suppose it generates a compilation error if you don't have the ifdef. I think you should have an ifdef here as well to disable the bfloat16 case, so the code compiles for SM < 8.
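One possible shape for this (a sketch under the assumption that the dispatch lives in host code, where __CUDA_ARCH__ is never defined; the helper name is hypothetical): a runtime check on the device's compute capability, as an alternative to compiling the branch out.

#include <ATen/cuda/CUDAContext.h>
#include <torch/extension.h>

// Reject bfloat16 requests on pre-Ampere GPUs with a clear error instead of
// relying on an #ifdef in host code.
void check_bf16_supported() {
  const cudaDeviceProp* props = at::cuda::getCurrentDeviceProperties();
  TORCH_CHECK(props->major >= 8,
              "bfloat16 marlin kernels require compute capability >= 8.0, got ",
              props->major, ".", props->minor);
}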

};

template <>
class ScalarType<half> {
Contributor

This looks much better! Thanks for doing this.

fp32_intermediates_casted[2] = __byte_perm(q, fp32_base, 0x7651);
fp32_intermediates_casted[3] = __byte_perm(q, fp32_base, 0x7653);

fp32_intermediates[0] -= 8388736.f;
Contributor

What code is this dequant_8bit based on? Maybe you can document the reference you used.
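For context, the snippet above looks like the well-known magic-number trick for converting packed bytes to fp32 without integer-to-float instructions (the kind of interleaved conversion found in FasterTransformer/CUTLASS dequant code, though that attribution is an assumption here). A standalone sketch of the idea, with illustrative names, assuming nvcc compilation:

// Each __byte_perm splices one byte of q into the low byte of 0x4B000000 (the
// bit pattern of 8388608.f = 2^23), producing a float whose value is exactly
// 2^23 + byte. Subtracting 8388736.f (= 2^23 + 128) removes the offset and
// re-centers the unsigned byte around zero.
__device__ inline float4 u8x4_to_f32x4(unsigned int q) {
  const unsigned int fp32_base = 0x4B000000u;
  unsigned int casted[4];
  casted[0] = __byte_perm(q, fp32_base, 0x7650);  // byte 0 of q
  casted[1] = __byte_perm(q, fp32_base, 0x7652);  // byte 2 of q
  casted[2] = __byte_perm(q, fp32_base, 0x7651);  // byte 1 of q
  casted[3] = __byte_perm(q, fp32_base, 0x7653);  // byte 3 of q (interleaved order)
  float4 out;
  out.x = __uint_as_float(casted[0]) - 8388736.f;
  out.y = __uint_as_float(casted[1]) - 8388736.f;
  out.z = __uint_as_float(casted[2]) - 8388736.f;
  out.w = __uint_as_float(casted[3]) - 8388736.f;
  return out;
}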

@alexm-neuralmagic
Contributor

alexm-neuralmagic commented May 14, 2024

@bnellnm could you do a quick pass on the template changes?

__device__ inline FragB dequant_4bit(int q) {
template <typename scalar_t>
__device__ inline typename ScalarType<scalar_t>::FragB dequant_4bit(int q) {
throw std::runtime_error("unsupported");
Contributor

I'm not sure what the standard is but I think most checks in the code use TORCH_CHECK rather than throw.

: "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3])
: "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]),
"r"(b[1]), "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
} else if constexpr (std::is_same<scalar_t, nv_bfloat16>::value) {
Contributor

I think it would be safer to make the else clause a static_assert so if a new type were added, this function would not silently compile with an empty body, i.e.

} else {
    static_assert(std::is_same<scalar_t, half>::value);
    asm volatile(...);
}


template <typename scalar_t>
__device__ inline typename ScalarType<scalar_t>::FragB dequant_8bit(int q) {
throw std::runtime_error("unsupported");
Contributor

TORCH_CHECK?

num_groups, group_size, dev, at::cuda::getCurrentCUDAStream(dev),
thread_k, thread_n, sms, gptq_marlin::max_par);
} else {
throw std::runtime_error("gpt_marlin_gemm only supports bfloat16 and float16");
Contributor

TORCH_CHECK here too?
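For reference, the idiom being suggested in these comments (a sketch of the pattern, not the exact line that landed):

#include <torch/extension.h>  // provides TORCH_CHECK

// TORCH_CHECK(false, ...) raises a c10::Error that PyTorch surfaces to Python as
// a RuntimeError carrying the message, instead of an uncaught C++ exception.
void reject_unsupported_dtype(const at::Tensor& a) {
  TORCH_CHECK(false,
              "gptq_marlin_gemm only supports bfloat16 and float16, got ",
              a.scalar_type());
}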

@bnellnm
Contributor

bnellnm commented May 14, 2024

@bnellnm could you do a quick pass on the template changes?

The template changes look good. I had a few minor comments, mostly about the use of TORCH_CHECK over throw (which I think is more "standard").

@alexm-neuralmagic
Contributor

@jinzhen-lin I think your code is in good shape to land after addressing the last comments.

@jinzhen-lin
Contributor Author

@alexm-nm @bnellnm All previous comments have been fixed.

As for the test in test_gptq_marlin.py:

@alexm-neuralmagic
Contributor

alexm-neuralmagic commented May 16, 2024

@jinzhen-lin thanks for adding the tests and addressing all the comments. @robertgshaw2-neuralmagic looks good to me to proceed.

@robertgshaw2-neuralmagic robertgshaw2-neuralmagic merged commit 99caa49 into vllm-project:main May 16, 2024
55 checks passed
@robertgshaw2-neuralmagic
Collaborator

Thanks all!

tybalex pushed a commit to tybalex/vllm-function-call that referenced this pull request May 25, 2024