
Add GPTQ Marlin 2:4 sparse structured support #4790

Merged

Conversation

alexm-neuralmagic
Contributor

This PR adds a new GPTQ Marlin 2:4 structured-sparse GPU kernel and support for running 2:4 sparse models in vLLM. The currently supported configurations are:

  1. group_size of 128 or -1 (channel-wise)
  2. 4-bit or 8-bit

The new 2:4 sparse Marlin GPU kernel is based on the great work of @LopezCastroRoberto and @dalistarh from @IST-Das. More information will be provided in their upcoming publication.

TODO:

  1. Provide a tutorial that describes how to generate 2:4 sparse models.
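
For illustration, a minimal usage sketch (not part of the PR): the model path below is a placeholder, and the quantization method name "gptq_marlin_24" is an assumption, so check the vLLM docs for the exact name.

# Hedged usage sketch: assumes a GPTQ checkpoint serialized in the 2:4 sparse
# marlin_24 format, and that the method is registered as "gptq_marlin_24".
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/your-24-sparse-gptq-model",  # placeholder path
    quantization="gptq_marlin_24",              # assumed method name
)
outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(temperature=0.0, max_tokens=32))
print(outputs[0].outputs[0].text)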

@alexm-neuralmagic
Contributor Author

Benchmark results on A100 for the Yi-34B Chat model with marlin_24-serialized weights (the actual weight values are not real yet). These are preliminary results to give a feel for how it compares against the original Marlin, GPTQ, and fp16.

[Benchmark chart: Yi-34B Chat, Marlin 2:4 vs. original Marlin vs. GPTQ vs. fp16]

Original PDF:

Yi-34B-Chat-GPTQ-vs-Marlin-vs-Marlin24.pdf

vllm/config.py Outdated
@@ -160,6 +160,9 @@ def _verify_quantization(self) -> None:
is_format_marlin = (quant_cfg.get("checkpoint_format") == "marlin"
or quant_cfg.get("is_marlin_format", False))

is_format_marlin_24 = (
@pcmoritz
Collaborator
pcmoritz commented May 14, 2024

We should think about how to clean this up so that we don't have this Marlin-specific code in vllm/config.py.

One way to do it that doesn't require more registries: have an optional class variable checkpoint_format in the GPTQ-compatible QuantizationConfigs, and then in this code iterate through QUANTIZATION_METHODS and check whether one of them has the associated checkpoint format.
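
For illustration, a minimal sketch of this idea; the class variable and lookup below are hypothetical stand-ins, not the code that was merged.

from typing import Dict, Optional, Type

# Hypothetical sketch: each GPTQ-compatible config advertises the checkpoint
# format it can consume, and config.py scans the registry instead of
# hard-coding Marlin checks.
class QuantizationConfig:
    checkpoint_format: Optional[str] = None  # default: no special format

class MarlinConfig(QuantizationConfig):
    checkpoint_format = "marlin"

class GPTQMarlin24Config(QuantizationConfig):
    checkpoint_format = "marlin_24"

QUANTIZATION_METHODS: Dict[str, Type[QuantizationConfig]] = {
    "marlin": MarlinConfig,
    "gptq_marlin_24": GPTQMarlin24Config,
}

def detect_quantization(quant_cfg: dict) -> Optional[str]:
    # Return the registry name whose declared format matches the checkpoint.
    for name, method in QUANTIZATION_METHODS.items():
        if method.checkpoint_format == quant_cfg.get("checkpoint_format"):
            return name
    return None

print(detect_quantization({"checkpoint_format": "marlin_24"}))  # gptq_marlin_24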

@alexm-neuralmagic
Contributor Author

Good suggestion. I have changed the code to encapsulate the Marlin-specific checkpoint checks in the Marlin config classes. Let me know if it looks good now.

@pcmoritz
Collaborator
pcmoritz commented May 14, 2024

I'm thinking about something a little more radical, like replacing this whole code block with:

for name, method in QUANTIZATION_METHODS.items():
    if method.supports_checkpoint(quant_cfg):
        self.quantization = name

and you would have a default implementation of supports_checkpoint for QuantizationConfig that returns False, and Marlin would implement the method, print the appropriate warnings and return True if the quantization should be overridden.

That way you can remove all occurrences of marlin from the config.py file, and this mechanism can also be used by other quantization schemes :)
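
For illustration, a minimal sketch of that shape; the method bodies below are stand-ins, not the vLLM implementation.

# Illustrative sketch: the base class opts out by default, and only the
# Marlin config inspects the checkpoint metadata.
class QuantizationConfig:
    @classmethod
    def supports_checkpoint(cls, quant_cfg: dict) -> bool:
        # Default: never override the user-specified quantization method.
        return False

class MarlinConfig(QuantizationConfig):
    @classmethod
    def supports_checkpoint(cls, quant_cfg: dict) -> bool:
        is_marlin_format = (quant_cfg.get("checkpoint_format") == "marlin"
                            or quant_cfg.get("is_marlin_format", False))
        if is_marlin_format:
            print("Detected a Marlin-serialized checkpoint; "
                  "overriding quantization to marlin.")
        return is_marlin_format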

@alexm-neuralmagic
Contributor Author

I see, I can try it

@alexm-neuralmagic
Contributor Author

@pcmoritz I have redone the config.py part you proposed to change. It looks cleaner now :)

@pcmoritz
Collaborator
pcmoritz left a comment

The PR looks good to me (I didn't review the kernel code in detail, though). Do you know how much it adds to the binary size? We need to be careful not to increase that too much due to PyPI size limits.

@robertgshaw2-neuralmagic
Collaborator

The PR looks good to me (I didn't review the kernel code in detail, though). Do you know how much it adds to the binary size? We need to be careful not to increase that too much due to PyPI size limits.

I think we have a test for this in the CI

@pcmoritz
Collaborator

pcmoritz commented May 15, 2024

Nice, thanks for making these changes, this looks a bunch cleaner now!

Optional suggestion that would be even cleaner: rename supports_checkpoint to override_quantization_method with a different signature (see below), and make it clearer in the docstring that this is not something other quantization configs necessarily need to implement. Something like:

"""Detects if this quantization method can support a given checkpoint format by overriding the user specified quantization method -- this method should only be overwritten by subclasses in exceptional circumstances"""

The override_quantization_method function should normally return None, or, if an override occurs, a str with the new quantization method, e.g.:

            # Detect which checkpoint format this is
            for name, method in QUANTIZATION_METHODS.items():
                quantization_override = method.override_quantization_method(quant_cfg)
                if quantization_override:
                    self.quantization = quantization_override
                    break

This would enable you to shift the following logic into override_quantization_method too :)

            # Allow override of gptq_marlin to gptq (if set explicitly)
            if self.quantization == "gptq" and quant_method == "gptq_marlin":
                logger.warning(
                    "Detected that the model can run with gptq_marlin"
                    ", however you specified quantization=gptq explicitly,"
                    " so forcing gptq. Use quantization=gptq_marlin for"
                    " faster inference")
                quant_method = "gptq"

            # Choose gptq_marlin if marlin is specified
            if self.quantization == "marlin" and quant_method == "gptq_marlin":
                self.quantization = quant_method

            # Choose marlin if gptq is specified
            if self.quantization == "gptq" and quant_method == "marlin":
                self.quantization = quant_method

@alexm-neuralmagic
Contributor Author

@pcmoritz This is a good idea. I changed the API to return a str or None and moved the GPTQ-specific override logic into the override functions.
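
For illustration, a hedged sketch of what the resulting API could look like; the compatibility check, the user_quant parameter, and the exact strings below are assumptions, and the merged code may differ.

from typing import Optional

class QuantizationConfig:
    @classmethod
    def override_quantization_method(cls, hf_quant_cfg: dict,
                                     user_quant: Optional[str]) -> Optional[str]:
        # Default: do not override; config.py keeps the user-specified method.
        return None

class GPTQMarlinConfig(QuantizationConfig):
    @classmethod
    def _is_marlin_compatible(cls, hf_quant_cfg: dict) -> bool:
        # Placeholder check: 4-bit GPTQ with a supported group size.
        return (hf_quant_cfg.get("bits") == 4
                and hf_quant_cfg.get("group_size") in (128, -1))

    @classmethod
    def override_quantization_method(cls, hf_quant_cfg: dict,
                                     user_quant: Optional[str]) -> Optional[str]:
        if not cls._is_marlin_compatible(hf_quant_cfg):
            return None
        if user_quant in (None, "marlin", "gptq_marlin"):
            # Upgrade to the faster gptq_marlin path.
            return "gptq_marlin"
        if user_quant == "gptq":
            print("Detected that the model can run with gptq_marlin, but "
                  "quantization=gptq was set explicitly, so keeping gptq. "
                  "Use quantization=gptq_marlin for faster inference.")
        return None

Passing the user-specified method in as a second argument is what lets the gptq/gptq_marlin precedence logic quoted above live inside the config class rather than in config.py.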

"fp8": Fp8Config,
# The order of gptq methods is important for config.py iteration over
# supports_checkpoint(..)
@pcmoritz
Collaborator

Nit: This is called override_quantization_method now :)

@pcmoritz
Collaborator

Wonderful! Small nit and then it looks good to go if the tests pass :)

@alexm-neuralmagic
Contributor Author

Cool, fixed the nit and some other little things.

@alexm-neuralmagic
Contributor Author

Thanks for the suggestions!

@robertgshaw2-neuralmagic robertgshaw2-neuralmagic merged commit 6979ade into vllm-project:main May 16, 2024
55 checks passed
@robertgshaw2-neuralmagic robertgshaw2-neuralmagic deleted the marlin_24_sparse branch May 16, 2024 16:56
robertgshaw2-neuralmagic added a commit to neuralmagic/nm-vllm that referenced this pull request May 19, 2024
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request May 21, 2024
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
tybalex pushed a commit to tybalex/vllm-function-call that referenced this pull request May 25, 2024
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
mawong-amd pushed a commit to ROCm/vllm that referenced this pull request Jun 3, 2024
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
@yzlnew

yzlnew commented Jun 3, 2024

2:4 sparsity without quantization is not supported in vLLM yet, right?

@mgoin
Collaborator

mgoin commented Jun 3, 2024

@yzlnew That is correct; currently GPTQ quantization is required.
