[Model][MiniMaxText01] Support MiniMaxText01 model inference #13454
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs will not trigger a full CI run by default; instead, only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Why do you introduce
vllm/config.py
Outdated
# Handle minimax model
if hasattr(self.hf_config, "attn_type_list"):
    # 1 represents flash attention and 0 represents linear attention
    return sum(t == 1 for t in self.hf_config.attn_type_list)
This should be handled in the hybrid model case a few lines down
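For context, a minimal sketch of folding the attn_type_list check into the layer-counting logic for hybrid models; this is an illustrative standalone helper under assumed config attributes, not vLLM's actual ModelConfig code:

```python
def num_full_attention_layers(hf_config) -> int:
    # Hypothetical helper, for illustration only.
    attn_type_list = getattr(hf_config, "attn_type_list", None)
    if attn_type_list is not None:
        # Hybrid model: 1 marks a flash-attention layer, 0 a linear-attention layer.
        return sum(t == 1 for t in attn_type_list)
    # Non-hybrid model: every decoder layer uses full attention.
    return hf_config.num_hidden_layers
```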
return hidden_states

class MiniMaxText01ForCausalLM(nn.Module, HasInnerState):  # add IsHybrid later
IIUC, this should be:
class MiniMaxText01ForCausalLM(nn.Module, HasInnerState, IsHybrid):
Did you hit some issue when adding the IsHybrid interface?
Yes, there were some issues with the earlier vLLM.
Thanks for your above suggestion! We added the logic in the hybrid model case, and it works!
Please review the new commit 530d99a.
Because the internal data structure
Could you please support the MiniMax VL model as well? I would greatly appreciate it.
Sorry, this may be a silly question, but is the model int8-quantized to achieve the 2-million-token context with H800 TP8 inference?
This pull request has merge conflicts that must be resolved before it can be merged.
@zwc163
@zifengdexiatian
Thanks for the answer. I understand that a single machine can only run the quantized version and can handle a maximum of 2 million tokens at a time.
@ZZBoom just checking - are there any blockers on this PR? I plan to review it but it's still marked as draft.
Is there any progress?
Can you merge this please?
Force-pushed from e863d81 to 1bd32bc
I had a couple more small questions and comments, but overall I think the PR is looking pretty good and ready to land once those are addressed.
Will there be a follow-up to simplify the weight loading?
Adding the ready label to see how the mamba and hybrid integration tests do.
Yes. We will simplify the weight loading in follow-up work.
- Removed redundant loops for tensor value assignments in the tests, enhancing readability and maintainability. - Streamlined the initialization of key-value caches and input tensors, focusing on essential configurations for clarity. Signed-off-by: qscqesze <[email protected]>
…itespace - Eliminated trailing whitespace in the test file to enhance code cleanliness and maintain consistency in formatting. - This minor adjustment contributes to overall code quality without affecting functionality. Signed-off-by: qscqesze <[email protected]>
… functionality - Removed unused parameter from current_run_tensors method in ConstantSizeCache to simplify its interface. - Updated slope_rate calculation in MiniMaxText01 to handle single-layer scenarios more clearly, enhancing readability. - Adjusted calls to current_run_tensors in MiniMaxText01Model to reflect the updated method signature. Signed-off-by: qscqesze <[email protected]>
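As a rough illustration of the single-layer guard described in the commit above, here is a hedged sketch; the ALiBi-style slope formula and the depth scaling are assumptions for illustration, not necessarily the exact MiniMaxText01 code:

```python
import torch

def build_slope_rate(num_heads: int, layer_idx: int, num_layers: int) -> torch.Tensor:
    # ALiBi-style per-head decay slopes (assumed form, for illustration only).
    slopes = torch.tensor(
        [1.0 / 2 ** (8.0 * (i + 1) / num_heads) for i in range(num_heads)])
    # Guard the single-layer case so the depth-dependent scaling cannot
    # divide by zero when num_layers == 1.
    depth_scale = 1.0 if num_layers <= 1 else 1.0 - layer_idx / (num_layers - 1) + 1e-5
    return slopes * depth_scale
```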
Yeah, this looks good to me and aligns with expectations.
@tlrmchlsmth Hi! I believe our code passes all the tests except for [buildkite/ci/pr/v1-test], which failed due to a
tests/kernels/test_lightning_attn.py
Outdated
q = torch.zeros(batch_size, num_heads, 1, head_size, dtype=dtype)
k = torch.zeros(batch_size, num_heads, 1, head_size, dtype=dtype)
v = torch.zeros(batch_size, num_heads, 1, head_size, dtype=dtype)

kv_caches = torch.zeros(batch_size,
                        num_heads,
                        head_size,
                        head_size,
                        dtype=dtype,
                        device="cuda")
Now that you've removed the old initialization code, these should all be torch.randn instead of torch.zeros. Since these tensors are initialized to all zeros, we're not testing anything.
Ditto for the other unit tests.
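For illustration, the suggested change could look like the sketch below; the shapes, dtype, and device here are placeholder assumptions rather than the test's actual parametrization:

```python
import torch

# Placeholder test parameters; the real ones come from pytest parametrization.
batch_size, num_heads, head_size = 2, 8, 64
dtype = torch.bfloat16

# Random inputs instead of all-zeros, so the kernel output actually depends
# on the data being fed in.
q = torch.randn(batch_size, num_heads, 1, head_size, dtype=dtype)
k = torch.randn(batch_size, num_heads, 1, head_size, dtype=dtype)
v = torch.randn(batch_size, num_heads, 1, head_size, dtype=dtype)

kv_caches = torch.randn(batch_size,
                        num_heads,
                        head_size,
                        head_size,
                        dtype=dtype,
                        device="cuda")  # mirrors the original snippet; needs a GPU
```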
Thanks. Fixed it.
…om values - Changed tensor initialization from zeros to random values in the lightning attention test cases to better simulate realistic input scenarios. - This adjustment enhances the robustness of the tests by ensuring varied input distributions. Signed-off-by: qscqesze <[email protected]>
… remove scale factor - Changed the initialization of the key-value cache tensor from random values to zeros for consistency in test scenarios. - Removed the scale factor from the KV outer product calculation to simplify the implementation and enhance clarity. Signed-off-by: qscqesze <[email protected]>
…aled random values - Updated the initialization of query, key, and value tensors in the lightning attention tests to use a base scale factor for random values, enhancing consistency across test scenarios. - Adjusted the initialization of key-value caches to align with the new scaling approach, improving the robustness of the tests. Signed-off-by: qscqesze <[email protected]>
…lity - Adjusted the indentation and formatting of tensor initialization in the lightning attention test cases to enhance code clarity and maintain consistency. - This change focuses on improving the overall structure of the tests without altering their functionality. Signed-off-by: qscqesze <[email protected]>
Hi @tlrmchlsmth.
I'll take another look at the code tomorrow morning! In the meantime, I think you need to merge in main for the failing
Thanks. I updated the branch already.
Looks good to me now! Thank you for the contribution!
Running one more sanity check on my end and then ready to merge
…oject#13454) Signed-off-by: qscqesze <[email protected]> Co-authored-by: qingjun <[email protected]> Co-authored-by: qscqesze <[email protected]> Signed-off-by: xinyuxiao <[email protected]>
…oject#13454) Signed-off-by: qscqesze <[email protected]> Co-authored-by: qingjun <[email protected]> Co-authored-by: qscqesze <[email protected]> Signed-off-by: Louis Ulmer <[email protected]>
…oject#13454) Signed-off-by: qscqesze <[email protected]> Co-authored-by: qingjun <[email protected]> Co-authored-by: qscqesze <[email protected]>
…oject#13454) Signed-off-by: qscqesze <[email protected]> Co-authored-by: qingjun <[email protected]> Co-authored-by: qscqesze <[email protected]> Signed-off-by: Mu Huai <[email protected]>
Purpose
This PR is intended to support the MiniMaxText01 model inference.
It can run on a single machine with 8xH800 and 8xH20, where a single H800 machine can handle a maximum context input of 2 million tokens, and a single H20 machine can handle a maximum context input of 5 million tokens.
Modifications
- request_ids_to_seq_ids and finished_requests_ids
- finished_requests_ids issue in consecutive multi-batch inferences: this is a temporary solution for a specific problem, which likely involves state management during multi-batch inferences.

Deployment
Default Parameter Startup
python3 -m vllm.entrypoints.api_server \
    --model ${MiniMaxText01-Model-Path} \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --quantization experts_int8 \
    --max_model_len 1000000 \
    --dtype bfloat16
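Once the server is up, a quick smoke test could look like this; it assumes the api_server defaults (port 8000, /generate route) with nothing in front of the server:

```python
import requests

# Send a short prompt to the server started above and print the JSON reply.
response = requests.post(
    "http://localhost:8000/generate",
    json={
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0.0,
    },
)
print(response.json())
```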
H800 TP8, maximum context length 2 million
python3 -m vllm.entrypoints.api_server \
    --model ${MiniMax-Text-01-Path} \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --quantization experts_int8 \
    --max_model_len 2048000 \
    --gpu_memory_utilization 0.95 \
    --max_num_seqs 1 \
    --dtype bfloat16
H20 TP8, maximum context length 5 million