
[Model][MiniMaxText01] Support MiniMaxText01 model inference #13454


Merged: 108 commits into vllm-project:main on Apr 1, 2025

Conversation

@ZZBoom (Contributor) commented Feb 18, 2025

Purpose

This PR adds inference support for the MiniMaxText01 model.
It can run on a single machine with 8xH800 or 8xH20 GPUs: a single H800 machine can handle a maximum context of 2 million tokens, and a single H20 machine can handle a maximum context of 5 million tokens.

Modifications

  1. Add the MiniMaxText01 model inference implementation, along with a separate cache manager dedicated to linear attention.
  2. Adapt to the same inputs as the mamba models, including request_ids_to_seq_ids and finished_requests_ids.
  3. Add a temporary fix for the finished_requests_ids issue in consecutive multi-batch inference; this is a state-management workaround for a specific problem (the general idea is sketched below).
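
For readers unfamiliar with the mamba-style interface, the sketch below shows the general idea behind such a per-request cache manager: state slots keyed by request ID, reclaimed from finished_requests_ids on every step. The class, names, and shapes are hypothetical and simplified; this is not the actual MiniMaxText01 cache manager added by this PR.

# Hypothetical, simplified sketch of per-request linear-attention state management;
# names and shapes are illustrative, not the vLLM cache-manager API.
from typing import Dict, List

import torch


class LinearAttnStateCache:
    """Maps request IDs to fixed-size linear-attention state slots."""

    def __init__(self, num_slots: int, num_heads: int, head_size: int,
                 dtype: torch.dtype = torch.bfloat16, device: str = "cpu"):
        # Linear attention keeps one running (num_heads, head_size, head_size)
        # state per request instead of a growing per-token KV cache.
        self.states = torch.zeros(num_slots, num_heads, head_size, head_size,
                                  dtype=dtype, device=device)
        self.free_slots: List[int] = list(range(num_slots))
        self.request_to_slot: Dict[str, int] = {}

    def slot_for(self, request_id: str) -> int:
        # Reuse the slot of an ongoing request; otherwise allocate a fresh one.
        if request_id not in self.request_to_slot:
            slot = self.free_slots.pop()
            self.states[slot].zero_()
            self.request_to_slot[request_id] = slot
        return self.request_to_slot[request_id]

    def release_finished(self, finished_request_ids: List[str]) -> None:
        # Called once per step with that batch's finished_requests_ids so that
        # slots are recycled correctly across consecutive multi-batch runs.
        for request_id in finished_request_ids:
            slot = self.request_to_slot.pop(request_id, None)
            if slot is not None:
                self.free_slots.append(slot)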

Deployment

Default Parameter Startup

python3 -m vllm.entrypoints.api_server \
--model ${MiniMaxText01-Model-Path} \
--tensor-parallel-size 8 \
--trust-remote-code \
--quantization experts_int8  \
--max_model_len 1000000 \
--dtype bfloat16
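
Once the server is up, a quick smoke test is to POST a prompt to the demo /generate endpoint; the snippet below is a minimal example and assumes the default port 8000 and the plain api_server (not the OpenAI-compatible frontend).

# Minimal smoke-test client; assumes the demo /generate endpoint of
# vllm.entrypoints.api_server listening on the default port 8000.
import requests

response = requests.post(
    "http://localhost:8000/generate",
    json={
        "prompt": "Summarize the idea behind linear attention in one sentence.",
        "max_tokens": 64,
        "temperature": 0.0,
    },
    timeout=600,
)
response.raise_for_status()
print(response.json())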

H800 TP8, maximum context length 2 million

python3 -m vllm.entrypoints.api_server \
--model ${MiniMax-Text-01-Path} \
--tensor-parallel-size 8 \
--trust-remote-code \
--quantization experts_int8  \
--max_model_len 2048000 \
--gpu_memory_utilization 0.95 \
--max_num_seqs 1 \
--dtype bfloat16

H20 TP8, maximum context length 5 million

python -m vllm.entrypoints.api_server \
--model MiniMaxAI/MiniMax-Text-01 \
--tensor-parallel-size 8 \
--trust-remote-code \
--quantization experts_int8 \
--max_model_len 5120000 \
--gpu_memory_utilization 0.95 \
--max_num_seqs 1 \
--dtype bfloat16


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@heheda12345 (Collaborator)

Why do you introduce minimax_cache.py instead of reusing mamba_cache.py?

vllm/config.py Outdated
Comment on lines 860 to 863
# Handle minimax model
if hasattr(self.hf_config, "attn_type_list"):
# 1 represents flash attention and 0 represents linear attention
return sum(t == 1 for t in self.hf_config.attn_type_list)
Collaborator:

This should be handled in the hybrid model case a few lines down

return hidden_states


class MiniMaxText01ForCausalLM(nn.Module, HasInnerState):  # IsHybrid to be added later
Collaborator:

IIUC, this should be:
class MiniMaxText01ForCausalLM(nn.Module, HasInnerState, IsHybrid):

Did you hit some issue when adding the IsHybrid interface?

Contributor (author):

Yes, there were some issues with an earlier version of vLLM.

Thanks for the suggestion above! We added the logic in the hybrid model case, and it works.
Please review the new commit 530d99a.

@ZZBoom (Contributor, author) commented Feb 21, 2025

Why do you introduce minimax_cache.py instead of reusing mamba_cache.py?

Because the internal data structure self.mamba_cache in mamba_cache.py is not suitable for the MiniMaxText01 linear-attention cache, and that parameter is coupled into the current_run_tensors method.
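
Roughly speaking, the two caches have different layouts, which is why they are managed separately; the shapes below are made-up illustrations, not the exact tensors used by either cache manager.

# Illustrative shapes only (hypothetical sizes); the point is the layout difference.
import torch

batch, num_heads, head_size = 4, 64, 128
conv_dim, conv_kernel, ssm_state_size = 1024, 4, 16

# Mamba-style cache: a (conv_state, ssm_state) pair per layer.
mamba_cache = (
    torch.zeros(batch, conv_dim, conv_kernel - 1),
    torch.zeros(batch, conv_dim, ssm_state_size),
)

# MiniMaxText01 linear attention: a single running KV outer-product per layer.
linear_attn_cache = torch.zeros(batch, num_heads, head_size, head_size)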

@zwc163 commented Feb 24, 2025

Could you please support the MiniMax VL model as well? I would greatly appreciate it

@zifengdexiatian

Sorry, this may be a silly question, but is the model int8-quantized to achieve the 2-million-token context with H800 TP8 inference?


mergify bot commented Feb 25, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ZZBoom.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 25, 2025
@ZZBoom (Contributor, author) commented Feb 25, 2025

Could you please support the MiniMax VL model as well? I would greatly appreciate it

@zwc163
Thank you for your interest. We do not have plans for that in the near future.

@ZZBoom (Contributor, author) commented Feb 25, 2025

Sorry, this may be a silly question, but is the model int8-quantized to achieve the 2-million-token context with H800 TP8 inference?

@zifengdexiatian
Two million tokens is not the goal in itself. To run this model on a single machine with 8xH800, you can only use int8 weight-only quantization or lower precision, and two million tokens is the maximum context that fits in this environment.
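
For a rough sense of the arithmetic (assuming the publicly reported ~456B total parameters and 80 GB of HBM per H800; activations, KV/state cache, and framework overhead are ignored):

# Back-of-the-envelope weight-memory estimate; numbers are approximate.
total_params = 456e9   # publicly reported total parameter count
gpu_mem_gb = 80        # HBM per H800
num_gpus = 8

weights_bf16_gb = total_params * 2 / 1e9   # ~912 GB, exceeds 8 x 80 = 640 GB
weights_int8_gb = total_params * 1 / 1e9   # ~456 GB, leaves headroom for caches

print(f"cluster HBM: {gpu_mem_gb * num_gpus} GB")
print(f"bf16 weights: {weights_bf16_gb:.0f} GB, int8 weights: {weights_int8_gb:.0f} GB")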

@tlrmchlsmth tlrmchlsmth self-assigned this Feb 25, 2025
@zifengdexiatian

Sorry, this may be a silly question, but is the model int8-quantized to achieve the 2-million-token context with H800 TP8 inference?

@zifengdexiatian

Two million tokens is not the goal in itself. To run this model on a single machine with 8xH800, you can only use int8 weight-only quantization or lower precision, and two million tokens is the maximum context that fits in this environment.

Thanks for the answer. I understand that a single machine can only run the quantized version, with a maximum of 2 million tokens of context.

@tlrmchlsmth (Collaborator)

@ZZBoom just checking - are there any blockers on this PR? I plan to review it but it's still marked as draft

@shuxiaobo

Is there any progress?

@tugot17 commented Mar 7, 2025

Can you merge this please?

@mergify mergify bot added the documentation (Improvements or additions to documentation), ci/build, frontend, multi-modality (Related to multi-modality (#4194)), structured-output, speculative-decoding, and v1 labels Mar 13, 2025
@qscqesze qscqesze force-pushed the qinggangying/vllm branch from e863d81 to 1bd32bc Compare March 13, 2025 03:41
@mergify mergify bot removed the needs-rebase label Mar 13, 2025
@tlrmchlsmth (Collaborator) left a comment

I had a couple more small questions and comments, but overall I think the PR is looking pretty good and ready to land once those are addressed.

Will there be a followup to simplify the weight loading?

@tlrmchlsmth tlrmchlsmth added the ready label (ONLY add when PR is ready to merge/full CI is needed) Mar 30, 2025
@tlrmchlsmth (Collaborator)

Adding ready to see how the mamba and hybrid integration tests do

@qscqesze (Contributor)

I had a couple more small questions and comments, but overall I think the PR is looking pretty good and ready to land once those are addressed.

Will there be a followup to simplify the weight loading?

Yes. We will simplify the weight loading in follow-up work.

- Removed redundant loops for tensor value assignments in the tests, enhancing readability and maintainability.
- Streamlined the initialization of key-value caches and input tensors, focusing on essential configurations for clarity.

Signed-off-by: qscqesze <[email protected]>
@qscqesze qscqesze requested a review from DarkLight1337 as a code owner March 31, 2025 02:56
…itespace

- Eliminated trailing whitespace in the test file to enhance code cleanliness and maintain consistency in formatting.
- This minor adjustment contributes to overall code quality without affecting functionality.

Signed-off-by: qscqesze <[email protected]>
… functionality

- Removed unused parameter from current_run_tensors method in ConstantSizeCache to simplify its interface.
- Updated slope_rate calculation in MiniMaxText01 to handle single-layer scenarios more clearly, enhancing readability.
- Adjusted calls to current_run_tensors in MiniMaxText01Model to reflect the updated method signature.

Signed-off-by: qscqesze <[email protected]>
@qscqesze (Contributor)

Some gsm8k evals on my end. Do these look good to you @qscqesze and @ZZBoom? (Using experts_int8 to fit on a single 8xA100 machine)

Running the following:

vllm serve MiniMaxAI/MiniMax-Text-01 \
--tensor-parallel-size 8 \
--trust-remote-code \
--quantization experts_int8  \
--max_model_len 1000000 \
--dtype bfloat16

lm_eval --model local-completions --tasks gsm8k --model_args model=MiniMaxAI/MiniMax-Text-01,base_url=http://127.0.0.1:8000/v1/completions --limit 100
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.94|±  |0.0239|
|     |       |strict-match    |     5|exact_match|↑  | 0.94|±  |0.0239|

GSM8K results reported in https://huggingface.co/MiniMaxAI/MiniMax-Text-01#3-evaluation are 0.948, so this looks good to me, especially since quantization will drop accuracy a bit.

Yeah, this looks good to me and aligns with expectations.

@qscqesze (Contributor)

@tlrmchlsmth Hi! I believe our code passes all the tests except for [buildkite/ci/pr/v1-test], which failed due to a torch.OutOfMemoryError: CUDA out of memory. This issue doesn’t seem related to our code. Could you take a look and see if it’s ready to be merged?

Comment on lines 136 to 146
q = torch.zeros(batch_size, num_heads, 1, head_size, dtype=dtype)
k = torch.zeros(batch_size, num_heads, 1, head_size, dtype=dtype)
v = torch.zeros(batch_size, num_heads, 1, head_size, dtype=dtype)

kv_caches = torch.zeros(batch_size,
                        num_heads,
                        head_size,
                        head_size,
                        dtype=dtype,
                        device="cuda")
device="cuda")

Collaborator:

Now that you've removed the old initialization code, these should all be torch.randn instead of torch.zeros. Since these tensors are initialized to all zeros, we're not testing anything.

Ditto for the other unit tests.
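
For illustration, scaled random initialization along these lines might look as follows (shapes, dtype, and scale are placeholders, not the exact test code):

# Sketch of randomized test inputs so the kernel output is actually exercised;
# shapes, dtype, and scale are illustrative.
import torch

batch_size, num_heads, head_size = 2, 8, 64
dtype = torch.float32
device = "cuda" if torch.cuda.is_available() else "cpu"
scale = head_size ** -0.5

q = torch.randn(batch_size, num_heads, 1, head_size, dtype=dtype, device=device) * scale
k = torch.randn(batch_size, num_heads, 1, head_size, dtype=dtype, device=device) * scale
v = torch.randn(batch_size, num_heads, 1, head_size, dtype=dtype, device=device) * scale

kv_caches = torch.zeros(batch_size, num_heads, head_size, head_size,
                        dtype=dtype, device=device)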

Contributor (author):

Thanks. Fixed it.

qscqesze added 4 commits April 1, 2025 10:21
…om values

- Changed tensor initialization from zeros to random values in the lightning attention test cases to better simulate realistic input scenarios.
- This adjustment enhances the robustness of the tests by ensuring varied input distributions.

Signed-off-by: qscqesze <[email protected]>
… remove scale factor

- Changed the initialization of the key-value cache tensor from random values to zeros for consistency in test scenarios.
- Removed the scale factor from the KV outer product calculation to simplify the implementation and enhance clarity.

Signed-off-by: qscqesze <[email protected]>
…aled random values

- Updated the initialization of query, key, and value tensors in the lightning attention tests to use a base scale factor for random values, enhancing consistency across test scenarios.
- Adjusted the initialization of key-value caches to align with the new scaling approach, improving the robustness of the tests.

Signed-off-by: qscqesze <[email protected]>
…lity

- Adjusted the indentation and formatting of tensor initialization in the lightning attention test cases to enhance code clarity and maintain consistency.
- This change focuses on improving the overall structure of the tests without altering their functionality.

Signed-off-by: qscqesze <[email protected]>
@qscqesze (Contributor)

qscqesze commented Apr 1, 2025

Hi @tlrmchlsmth,
I've addressed the review comments; thank you for the feedback! However, the test failed due to a missing image. Would you mind restarting the test? When you have a moment, could you also take another look at the code to see if it's ready to be merged?
Thanks again!

@tlrmchlsmth (Collaborator)

Hi @tlrmchlsmth, I've addressed the review comments; thank you for the feedback! However, the test failed due to a missing image. Would you mind restarting the test? When you have a moment, could you also take another look at the code to see if it's ready to be merged? Thanks again!

I'll take another look at the code tomorrow morning! In the meantime I think you need to merge in main for the failing docker-build-image test (related to #14549)

@qscqesze (Contributor)

qscqesze commented Apr 1, 2025

Hi @tlrmchlsmth, I've addressed the review comments; thank you for the feedback! However, the test failed due to a missing image. Would you mind restarting the test? When you have a moment, could you also take another look at the code to see if it's ready to be merged? Thanks again!

I'll take another look at the code tomorrow morning! In the meantime I think you need to merge in main for the failing docker-build-image test (related to #14549)

Thanks. I updated the branch already.

@tlrmchlsmth (Collaborator) left a comment

Looks good to me now! Thank you for the contribution!

Running one more sanity check on my end and then ready to merge

@tlrmchlsmth tlrmchlsmth merged commit 9ef98d5 into vllm-project:main Apr 1, 2025
41 checks passed
Alex4210987 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Apr 5, 2025
…oject#13454)

Signed-off-by: qscqesze <[email protected]>
Co-authored-by: qingjun <[email protected]>
Co-authored-by: qscqesze <[email protected]>
Signed-off-by: xinyuxiao <[email protected]>
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
…oject#13454)

Signed-off-by: qscqesze <[email protected]>
Co-authored-by: qingjun <[email protected]>
Co-authored-by: qscqesze <[email protected]>
Signed-off-by: Louis Ulmer <[email protected]>
nishith-fujitsu pushed a commit to nishith-fujitsu/vllm that referenced this pull request Apr 9, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
…oject#13454)

Signed-off-by: qscqesze <[email protected]>
Co-authored-by: qingjun <[email protected]>
Co-authored-by: qscqesze <[email protected]>
Signed-off-by: Mu Huai <[email protected]>