Add GPTQ Marlin 2:4 sparse structured support #4790
Conversation
Benchmark results on A100 for the Yi-34B Chat model with marlin_24 serialized weights (the actual weight values are not real yet). This is just to show preliminary results and give a feel for how it compares against the original Marlin, GPTQ, and fp16. Original PDF attached.
vllm/config.py (Outdated)

```python
@@ -160,6 +160,9 @@ def _verify_quantization(self) -> None:
        is_format_marlin = (quant_cfg.get("checkpoint_format") == "marlin"
                            or quant_cfg.get("is_marlin_format", False))

        is_format_marlin_24 = (
```
We should think about how to clean this up and not have this marlin-specific code in vllm/config.py. One way to do it that doesn't require more registries: have an optional class variable checkpoint_format in the gptq-compatible QuantizationConfigs, and then in this code iterate through QUANTIZATION_METHODS and check whether one of them declares the associated checkpoint format.
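A minimal sketch of how that could look (the checkpoint_format class attribute and the detection loop are illustrative assumptions here, not code from this PR):

```python
# (1) A gptq-compatible quantization config declares the serialized
#     checkpoint format it can load (the class attribute is an assumption):
class MarlinConfig(QuantizationConfig):
    checkpoint_format = "marlin"

# (2) In ModelConfig._verify_quantization, detection then stays generic:
def _verify_quantization(self) -> None:
    ...
    for name, method in QUANTIZATION_METHODS.items():
        fmt = getattr(method, "checkpoint_format", None)
        if fmt is not None and quant_cfg.get("checkpoint_format") == fmt:
            self.quantization = name
            break
```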
Good suggestion. I have changed the code to encapsulate the marlin-specific checkpoint checks into the marlin config classes. Let me know if it looks good now.
I'm thinking about something a little more radical, like replacing this whole code block with

```python
for name, method in QUANTIZATION_METHODS.items():
    if method.supports_checkpoint(quant_cfg):
        self.quantization = name
```

and you would have a default implementation of supports_checkpoint for QuantizationConfig that returns False, and Marlin would implement the method, print the appropriate warnings, and return True if the quantization should be overridden.

That way you can remove all occurrences of marlin from the config.py file, and this mechanism can also be used by other quantization schemes :)
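For concreteness, a rough sketch of that default plus a Marlin override (the method bodies here are illustrative assumptions, not the code that ended up in the PR; logger is assumed to be the module-level logger):

```python
# Simplified stand-in for the base class, just to show the default:
class QuantizationConfig:
    @classmethod
    def supports_checkpoint(cls, quant_cfg: dict) -> bool:
        # Default: a quantization method does not claim foreign checkpoints.
        return False

class MarlinConfig(QuantizationConfig):
    @classmethod
    def supports_checkpoint(cls, quant_cfg: dict) -> bool:
        # Claim the checkpoint only if it is serialized in marlin format.
        is_marlin_format = (quant_cfg.get("checkpoint_format") == "marlin"
                            or quant_cfg.get("is_marlin_format", False))
        if is_marlin_format:
            logger.info("The model checkpoint is in marlin format; "
                        "overriding the quantization method to marlin.")
        return is_marlin_format
```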
I see, I can try it
@pcmoritz I have redone the config.py part that you proposed to change. It looks cleaner now :)
The PR looks good to me (I didn't review the kernel code in detail though). Do you know how much it adds to the binary size? We need to be careful not to increase that too much due to PyPI limitations. I think we have a test for this in the CI.
Nice, thanks for making these changes, this looks a bunch cleaner now! Optional suggestion that would be even cleaner: rename supports_checkpoint to override_quantization_method and have it return the quantization method to use (or None). The # Detect which checkpoint is it block would then become:

```python
# Detect which checkpoint is it
for name, method in QUANTIZATION_METHODS.items():
    quantization_override = method.override_quantization_method(quant_cfg)
    if quantization_override:
        self.quantization = quantization_override
        break
```

This would enable you to shift the following logic into the override_quantization_method implementations:

```python
# Allow override of gptq_marlin to gptq (if set explicitly)
if self.quantization == "gptq" and quant_method == "gptq_marlin":
    logger.warning(
        "Detected that the model can run with gptq_marlin"
        ", however you specified quantization=gptq explicitly,"
        " so forcing gptq. Use quantization=gptq_marlin for"
        " faster inference")
    quant_method = "gptq"

# Choose gptq_marlin if marlin is specified
if self.quantization == "marlin" and quant_method == "gptq_marlin":
    self.quantization = quant_method

# Choose marlin if gptq is specified
if self.quantization == "gptq" and quant_method == "marlin":
    self.quantization = quant_method
```
@pcmoritz This is a good idea. Changed the API to return str or None and moved the gptq-specific override logic to the override funcs.
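As a rough illustration of that shape (the signature, the user_quant argument, and the is_gptq_marlin_compatible helper are assumptions, not necessarily the exact code in the PR), an override function for gptq_marlin might look roughly like:

```python
class GPTQMarlinConfig(QuantizationConfig):
    @classmethod
    def override_quantization_method(cls, quant_cfg, user_quant):
        # is_gptq_marlin_compatible() is a hypothetical helper that checks
        # bits / group_size / sym against what the marlin kernel supports.
        if not cls.is_gptq_marlin_compatible(quant_cfg):
            return None
        if user_quant == "gptq":
            # User explicitly asked for gptq: warn and do not override.
            logger.warning(
                "Detected that the model can run with gptq_marlin, however "
                "you specified quantization=gptq explicitly, so forcing gptq. "
                "Use quantization=gptq_marlin for faster inference")
            return None
        # Otherwise, override to the faster marlin-based method.
        return "gptq_marlin"
```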
"fp8": Fp8Config, | ||
# The order of gptq methods is important for config.py iteration over | ||
# supports_checkpoint(..) |
Nit: This is called override_quantization_method now :)
Wonderful! Small nit and then it looks good to go if the tests pass :)
Cool, fixed the nit and some other little things.
Thanks for the suggestions!
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
2:4 sparsity without quantization is currently not supported in vLLM yet, right?
@yzlnew That is correct, currently GPTQ quantization is required.
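For reference, a minimal usage sketch (the model name is a placeholder, and the quantization method name "gptq_marlin_24" is assumed from this PR; if the checkpoint's quantization config declares the marlin_24 format, the method may also be detected automatically):

```python
from vllm import LLM

# Placeholder checkpoint: a GPTQ model serialized with 2:4 sparse marlin_24 weights.
llm = LLM(model="org/Yi-34B-Chat-GPTQ-marlin-24",
          quantization="gptq_marlin_24")

outputs = llm.generate(["Hello, my name is"])
print(outputs[0].outputs[0].text)
```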
This PR adds a new GPTQ Marlin 2:4 structured-sparse GPU kernel and support for running 2:4 sparse models in vLLM. Currently supported configs are:
The new 2:4 sparse marlin GPU kernel is based on the great work of @LopezCastroRoberto and @dalistarh from @IST-Das. More information will be provided in their upcoming publication.
TODO: