mlp_only_layers is more flexible than decoder_sparse_step #30552
Conversation
@amyeroberts hi
Hi @eigen2017,
From the commit history, it looks like you might have rebased and not force pushed the branch. When rebasing, it's necessary to force push to properly update the remote, as rebasing is effectively re-writing history.
Hi @amyeroberts, thanks for your guidance.
@amyeroberts @ArthurZucker
Hi @eigen2017, thanks for iterating such that the tests are passing and for opening a PR!
I've added general review comments.
Just so you know, we don't normally accept changes like this to the models: adding new config arguments for features that weren't present in the original architecture, particularly if there isn't an associated issue or feature request.
I'll let @ArthurZucker decide if this is something that would be wanted for this model.
utils/check_docstrings.py (Outdated)
@@ -434,6 +434,7 @@
     "QDQBertConfig",
     "QDQBertModel",
     "QuestionAnsweringPipeline",
+    "Qwen2MoeConfig",
This shouldn't be added here - needing this is an indication that the docstring isn't correctly formatted
isUseQwen2MoeSparseMoeBlock = True
if layer_idx in config.mlp_only_layers:
    isUseQwen2MoeSparseMoeBlock = False
elif config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0:
    isUseQwen2MoeSparseMoeBlock = True
else:
    isUseQwen2MoeSparseMoeBlock = False

if isUseQwen2MoeSparseMoeBlock:
var name here isn't pythonic and there's a useless else
Suggested change:
- isUseQwen2MoeSparseMoeBlock = True
- if layer_idx in config.mlp_only_layers:
-     isUseQwen2MoeSparseMoeBlock = False
- elif config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0:
-     isUseQwen2MoeSparseMoeBlock = True
- else:
-     isUseQwen2MoeSparseMoeBlock = False
- if isUseQwen2MoeSparseMoeBlock:
+ if not (layer_idx in config.mlp_only_layers) and (config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0):
@@ -95,6 +95,11 @@ class Qwen2MoeConfig(PretrainedConfig):
             allow the model to output the auxiliary loss, including load balancing loss and router z-loss.
         router_aux_loss_coef (`float`, *optional*, defaults to 0.001):
             The aux loss factor for the total loss.
+        mlp_only_layers ([`int`], *optional*, defaults to []):
Suggested change:
- mlp_only_layers ([`int`], *optional*, defaults to []):
+ mlp_only_layers (`int`, *optional*, defaults to `[]`):
            Indicate which layers use Qwen2MoeMLP rather than Qwen2MoeSparseMoeBlock
            integers in list is layer index, from 0 to 23 if we have 24 layers
            when mlp_only_layers is empty, decoder_sparse_step decides Qwen2MoeMLP or Qwen2MoeSparseMoeBlock
            when mlp_only_layers is not empty, decoder_sparse_step becomes invalid
This should be reworded - it doesn't parse
@amyeroberts Hi, many thanks for all the reviews! I saw many associated feature requests and issues during my research on this; someone even asked vLLM to support CPU-offload inference just like DeepSpeed does. By the way, DeepSpeed recently merged my code. I have 10 years of experience in AI research, and recently I made the Hugging Face framework work on Ascend NPUs (Huawei GPUs). Please give this PR and the conversation some deep thought before deciding whether to merge; I sincerely want to contribute to the great Hugging Face project. Thanks again! ^_^
@amyeroberts Hi~ the code has been modified according to your suggestions, thanks!! The CI errors are cleared again.
As far as I know, this model was built by Alibaba; confirmation from Alibaba members would be welcome too.
@huybery Hi! As far as I can tell you are a Qwen team member; please help check this PR and give your opinion, thanks.
@ArthurZucker Hi, please give this a review, thanks.
I actually think this is a lot simpler than the steps we used. Could be added for all MoE models, but this one is the only one that needs it for now!
Many thanks for this confirmation!!
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
@ArthurZucker @amyeroberts All your review comments have been addressed, and CI has passed as well; please help merge this PR.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
@ArthurZucker Hi, the code has been updated per your review suggestions.
@ArthurZucker @amyeroberts Hi~ I finished my scenario test and below is the report; I think it's helpful to record it here.
overview:
concepts:
conclusions:
In the table above, "cut exp" means cutting 4 layers: [12, 14, 16, 18]. I also tried --enforce-eager true for vLLM, which disables the CUDA graph feature to trade generation speed for a lower HBM requirement; with it I could cut only one layer's experts (only the 12th layer). Generation speed declined to 60 tok/s and accuracy rose to 53.43% on 2 GPUs. This is only my scenario using mlp_only_layers; I think this flexible config field can enable many other usages.
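A minimal sketch of how such a run could be set up, assuming a local copy of the checkpoint whose config.json is edited before handing it to vLLM (the path and layer indices below are illustrative, not taken from the report):

import json
from pathlib import Path

# Illustrative local checkpoint directory; adjust to your own path.
config_path = Path("./Qwen1.5-MoE-A2.7B/config.json")
config = json.loads(config_path.read_text())

# Layers whose sparse MoE block is replaced by the plain Qwen2MoeMLP,
# dropping their expert weights to shrink the HBM footprint.
config["mlp_only_layers"] = [12, 14, 16, 18]

config_path.write_text(json.dumps(config, indent=2))

The modified checkpoint directory can then be passed as the model path to vLLM or transformers.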
Thanks for iterating!
It's my honor and pleasure!
🤗
Before this PR, the config field decoder_sparse_step decided which layers don't have experts, i.e. whether each layer uses Qwen2MoeSparseMoeBlock or Qwen2MoeMLP.
However, this selection policy is not flexible enough. For example, when decoder_sparse_step = 2:
layers 2, 4, 6, 8... will use Qwen2MoeSparseMoeBlock and layers 1, 3, 5, 7... will use Qwen2MoeMLP.
In this PR, the layer-index list "mlp_only_layers" can choose, for any layer, whether it uses Qwen2MoeSparseMoeBlock or Qwen2MoeMLP; for example, only layer 12 uses Qwen2MoeMLP while all other layers use Qwen2MoeSparseMoeBlock.
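As a rough sketch of the selection rule (not a copy of the PR diff; the helper name here is just illustrative), each decoder layer's block can be chosen like this:

def uses_sparse_moe_block(layer_idx: int, config) -> bool:
    """Return True if decoder layer `layer_idx` should use Qwen2MoeSparseMoeBlock.

    Layers listed in mlp_only_layers always fall back to the dense Qwen2MoeMLP;
    the remaining layers follow the existing decoder_sparse_step cadence.
    """
    if layer_idx in config.mlp_only_layers:
        return False
    return config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0

So with decoder_sparse_step = 1 and mlp_only_layers = [12], every layer except layer 12 keeps its sparse MoE block.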
This support for "mlp_only_layers" is significant on memory-poor GPU devices like the V100-16G.
In my tests, Qwen1.5-MoE needed only a little more HBM than was available before hitting CUDA OOM; after setting only layer 12 to use Qwen2MoeMLP rather than Qwen2MoeSparseMoeBlock, the model loaded successfully (2 pieces of V100, with vLLM inference).
It's true that setting some layers to Qwen2MoeMLP loses some weights and causes a decline in accuracy, but there are solutions:
This PR does not change the model design; it just makes the current field decoder_sparse_step more flexible.
Setting only 1, 2, or 3 layers to Qwen2MoeMLP can smoothly fit the model into limited HBM while causing "minimal damage" to the model.
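For illustration, a hedged sketch of how the new field might be used from transformers; the checkpoint name is the public Qwen1.5-MoE repo, and the exact memory savings depend on the serving stack:

from transformers import AutoConfig, AutoModelForCausalLM

# Load the stock config and mark layer 12 as a dense-MLP-only layer.
config = AutoConfig.from_pretrained("Qwen/Qwen1.5-MoE-A2.7B")
config.mlp_only_layers = [12]

# Layer 12 is now built with Qwen2MoeMLP instead of Qwen2MoeSparseMoeBlock,
# so its expert weights are never materialised. The checkpoint has no plain-MLP
# weights for that layer, so it is freshly initialised; this is the accuracy
# trade-off discussed above.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-MoE-A2.7B",
    config=config,
    torch_dtype="auto",
)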