[Kernel] GGUF MoeVec kernel #16780
Signed-off-by: SzymonOzog <[email protected]>
Sorry for the delay. This looks reasonable to me once we update the GGUF kernel tests to cover the MoeVec kernel!
torch::Tensor ggml_moe_a8_vec(torch::Tensor X,  // input
                              torch::Tensor W,  // expert weights
                              torch::Tensor topk_ids, int64_t top_k,
                              int64_t type, int64_t row, int64_t tokens) {
Can we also update the GGUF kernel tests to cover I-Quants with the MoeVec kernel?
Good idea. Updated the tests to cover I-Quants.
When I set max_model_len to 8192, the service crashes on startup.
error log
It works well. Below are the benchmark results on 8x L40S 45GB.
Benchmark result:
@SzymonOzog Very fast, thanks for your work.
This pull request has merge conflicts that must be resolved before it can be merged.
Can you merge from main to fix the docker build?
@DarkLight1337 Merged main.
PTAL at the failing installation test. It seems related to this PR.
@DarkLight1337 The test seems to use a precompiled nightly wheel in which the kernel from this PR is not yet present, which is why it fails.
@khluu @mgoin @LucasWilkinson any ideas?
@SzymonOzog Can you merge from main to see if the python-only-installation test still fails?
Updated to main.
Signed-off-by: Isotr0py <[email protected]>
Signed-off-by: SzymonOzog <[email protected]> Signed-off-by: SzymonOzog <[email protected]> Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]> Signed-off-by: Mu Huai <[email protected]>
Signed-off-by: SzymonOzog <[email protected]> Signed-off-by: SzymonOzog <[email protected]> Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]>
Signed-off-by: SzymonOzog <[email protected]> Signed-off-by: SzymonOzog <[email protected]> Signed-off-by: Isotr0py <[email protected]> Co-authored-by: Isotr0py <[email protected]> Signed-off-by: Yuqi Zhang <[email protected]>
When expert utilisation is low, this kernel is much faster than the matmul-style MoE kernel. It also adds better support for I-Quants.
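To make the "vec" idea concrete, here is a minimal CPU sketch of what a vector-style MoE pass computes: one matrix-vector product per (token, selected expert) pair, with no batching of tokens per expert. It is not the kernel from this PR. The function name `moe_vec_reference`, the unquantized float weights, the explicit `cols` parameter, and the tensor layouts are all assumptions made for illustration; only the parameter names `X`, `W`, `topk_ids`, `top_k`, and `tokens` mirror the signature quoted in the review above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference, NOT the vLLM kernel: plain float weights instead of
// GGUF-quantized blocks, and assumed row-major layouts:
//   X:        [tokens, cols]
//   W:        [num_experts, rows, cols]
//   topk_ids: [tokens, top_k]
//   returns:  [tokens, top_k, rows]
std::vector<float> moe_vec_reference(const std::vector<float>& X,
                                     const std::vector<float>& W,
                                     const std::vector<int32_t>& topk_ids,
                                     std::size_t top_k, std::size_t rows,
                                     std::size_t cols, std::size_t tokens) {
  assert(rows > 0 && cols > 0);
  assert(X.size() == tokens * cols);
  assert(topk_ids.size() == tokens * top_k);
  assert(W.size() % (rows * cols) == 0);
  const std::size_t num_experts = W.size() / (rows * cols);

  std::vector<float> Y(tokens * top_k * rows, 0.0f);
  for (std::size_t t = 0; t < tokens; ++t) {
    const float* x = X.data() + t * cols;
    for (std::size_t k = 0; k < top_k; ++k) {
      const int32_t expert = topk_ids[t * top_k + k];
      assert(expert >= 0 && static_cast<std::size_t>(expert) < num_experts);
      // Each (token, expert) pair is an independent matrix-vector product.
      const float* w = W.data() + static_cast<std::size_t>(expert) * rows * cols;
      float* y = Y.data() + (t * top_k + k) * rows;
      for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
          acc += w[r * cols + c] * x[c];
        }
        y[r] = acc;
      }
    }
  }
  return Y;
}
```

The work here scales with `tokens * top_k` and does not depend on how many tokens land on each expert. A matmul-style kernel instead groups tokens by expert, and that grouping pays off less when each expert receives only a few tokens. That is one plausible reading of the performance claim above, not a measurement.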