[docs] LLM inference #29791

stevhliu · 2024-03-21T21:32:09Z

This PR adds a lightweight LLM inference optimization guide. It gives a brief explanation of each technique plus copy-pasteable code examples, so users can get started right away without digging into the underlying concepts. It covers:

static kv-cache + torch.compile
speculative and prompt lookup decoding
attention optimizations (FlashAttention-2 and SDPA)
quantization
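
To make one item on this list concrete, here is a minimal, library-free sketch of the candidate-drafting step in prompt lookup decoding as the technique is commonly described: the trailing n-gram of the sequence is matched against earlier tokens, and the tokens that followed the match are proposed as a draft for the model to verify. It is not taken from the guide; the function name `find_candidate_tokens` and its parameters are hypothetical, and the verification step is omitted.

```python
from typing import List


def find_candidate_tokens(
    input_ids: List[int], ngram_size: int = 2, num_candidates: int = 3
) -> List[int]:
    """Draft tokens by reusing what followed an earlier copy of the trailing n-gram.

    Hypothetical helper for illustration only; not a transformers API.
    """
    if len(input_ids) < ngram_size:
        return []

    tail = input_ids[-ngram_size:]

    # Scan backwards so the most recent earlier match wins,
    # starting one position before the trailing n-gram itself.
    for start in range(len(input_ids) - ngram_size - 1, -1, -1):
        if input_ids[start:start + ngram_size] == tail:
            follow = start + ngram_size
            return input_ids[follow:follow + num_candidates]

    # No earlier occurrence: nothing to draft, fall back to normal decoding.
    return []


# Token IDs are made up; the trailing [5, 6] also appears at the start.
draft = find_candidate_tokens([5, 6, 7, 8, 9, 5, 6])
```

In a full decoding loop the drafted tokens would be checked by the model in a single forward pass and only the agreeing prefix kept, which is why the approach tends to help most when the output repeats spans of the prompt.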

HuggingFaceDocBuilderDev · 2024-03-21T21:52:46Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

docs/source/en/llm_optims.md

stevhliu · 2024-03-26T19:58:01Z

For each of the topics covered, I was wondering if:

@ArthurZucker, would you mind having a look at the static kv + torch.compile section?
@gante, would you mind having a look at the speculative decoding section?
@younesbelkada or @SunMarc, would you mind having a look at the attention optimization and quantization sections?

SunMarc

Thanks for the great work @stevhliu ! I left a couple of suggestions.

docs/source/en/llm_optims.md

ArthurZucker

Thanks!

docs/source/en/llm_optims.md

gante

Very cool 🙌

docs/source/en/llm_optims.md

younesbelkada

Fantastic work ! 🚀

amyeroberts

Looks great - thanks for adding!

docs/source/en/llm_optims.md

ArthurZucker

Sorry for the late review here!

docs/source/en/llm_optims.md

first draft

d75dbfb

tengomucho reviewed Mar 22, 2024

feedback

c3a4092

khipp reviewed Mar 25, 2024

BlackSamorez mentioned this pull request Mar 25, 2024

33B llama quantization post-inference time Vahe1994/AQLM#56

Closed

SunMarc approved these changes Mar 27, 2024

ArthurZucker reviewed Mar 27, 2024

gante approved these changes Mar 28, 2024

learning-chip mentioned this pull request Apr 4, 2024

No speed-up of model.generate() with StaticCache + torch.compile in 4.39.3 #30055

Closed

4 tasks

younesbelkada approved these changes Apr 5, 2024

static cache snippet

5e6f1de

stevhliu requested review from ArthurZucker and amyeroberts April 10, 2024 21:57

amyeroberts approved these changes Apr 19, 2024

feedback

1c352fa

ArthurZucker approved these changes Apr 22, 2024

tengomucho approved these changes Apr 22, 2024

feedback

2b61fb2

stevhliu merged commit e74d793 into huggingface:main Apr 22, 2024
8 checks passed

stevhliu deleted the llm-optim branch April 22, 2024 19:41
