Add JetMoE model #30005
Conversation
Feel free to ping me whenever for another review! 🤗
Thanks @ArthurZucker. I have updated the code according to your suggestions. I hope the extra comments will make the code clearer.
Thanks! Having a look 😉
🔥 Looks great! Thanks a lot for addressing all the comments and taking them into account! Left two nits, but good to merge!
The failing test is unrelated; should I merge? 🔥
Yes! I have tested offline with the following command:
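(The command itself did not survive the page capture. For reference, here is a sketch of how transformers slow model tests are typically run; the test file path is an assumption based on the repo's standard tests/models/<model_name>/ layout, not taken from this thread.)

```python
# Hypothetical reconstruction: the original command was not captured.
# transformers gates its heavy integration tests behind the RUN_SLOW env var.
import os

import pytest

os.environ["RUN_SLOW"] = "1"  # enable tests marked with @slow
# Assumed path, following the repo's usual tests/models/<model_name>/ layout.
pytest.main(["tests/models/jetmoe/test_modeling_jetmoe.py", "-v"])
```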
Congrats on this great work! We'll do a release on Thursday!
Thanks a lot for the review and comments! @ArthurZucker @gante @younesbelkada |
What does this PR do?
Add support for the JetMoE architecture by Yikang Shen and MyShell AI.
JetMoE is a new sparsely activated architecture inspired by ModuleFormer. Each JetMoE block consists of two MoE layers: a Mixture of Attention Heads and a Mixture of MLP Experts. Given the input tokens, JetMoE activates only a subset of its experts to process them. This sparse activation scheme lets JetMoE achieve much better training throughput than similar-sized dense models.
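For reference, a minimal usage sketch with the transformers API once this PR is in: the Hub checkpoint id `jetmoe/jetmoe-8b` is an assumption, since no checkpoint is named in this description.

```python
# Minimal usage sketch. The Hub id "jetmoe/jetmoe-8b" is assumed here;
# substitute the actual released checkpoint name.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jetmoe/jetmoe-8b")
model = AutoModelForCausalLM.from_pretrained("jetmoe/jetmoe-8b")

inputs = tokenizer("JetMoE is a sparsely activated model that", return_tensors="pt")
# Routing happens inside the model: per token, only the top-scoring
# attention-head and MLP experts are activated, so compute stays well
# below that of a dense model with the same parameter count.
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the expert routing is internal to the model, the public API is the same as for any other causal LM in the library.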
Who can review?
@ArthurZucker and @younesbelkada