[Model] Deepseek GGUF support #13167
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run full CI to test the changes comprehensively before merging.
Commits aec8431 to 1038380
# GGUF layer map assumes that we will have merged expert weights,
# so we need to map them manually
for idx in range(config.num_hidden_layers):
    gguf_to_hf_name_map[f"blk.{idx}.exp_probs_b.bias"] = \
        f"model.layers.{idx}.mlp.gate.e_score_correction_bias"
    gguf_to_hf_name_map[f"blk.{idx}.ffn_down_exps.weight"] = \
        f"model.layers.{idx}.mlp.experts.$EXP_ID$.down_proj.weight"
    gguf_to_hf_name_map[f"blk.{idx}.ffn_gate_exps.weight"] = \
        f"model.layers.{idx}.mlp.experts.$EXP_ID$.gate_proj.weight"
    gguf_to_hf_name_map[f"blk.{idx}.ffn_up_exps.weight"] = \
        f"model.layers.{idx}.mlp.experts.$EXP_ID$.up_proj.weight"
I think we can try to avoid this manual mapping for each MoE weight; perhaps you can refer to how transformers handles GGUF MoE weight name mapping:
https://github.com/huggingface/transformers/blob/847854b023a637caa18e6860dc2bdd47f7c05eb5/src/transformers/modeling_gguf_pytorch_utils.py#L314-L317
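For illustration, a minimal pattern-based sketch in plain Python (not the actual transformers or vLLM code; the name patterns are copied from the diff above, everything else is assumed):

import re

# Rename GGUF tensor names to HF-style names via regex patterns instead of
# enumerating every layer by hand.
GGUF_TO_HF_PATTERNS = [
    (r"blk\.(\d+)\.exp_probs_b\.bias",
     r"model.layers.\1.mlp.gate.e_score_correction_bias"),
    (r"blk\.(\d+)\.ffn_(gate|up|down)_exps\.weight",
     r"model.layers.\1.mlp.experts.$EXP_ID$.\2_proj.weight"),
]

def map_gguf_name(gguf_name):
    # Return the HF-style name for a GGUF tensor name, or None if unmapped.
    for pattern, repl in GGUF_TO_HF_PATTERNS:
        if re.fullmatch(pattern, gguf_name):
            return re.sub(pattern, repl, gguf_name)
    return None

print(map_gguf_name("blk.3.ffn_down_exps.weight"))
# -> model.layers.3.mlp.experts.$EXP_ID$.down_proj.weight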
Yeah the problem is that the weight loader expects the experts to be passed in one by one, trying to overcome it atm
Okay, I managed to add an option to load full expert weights at once into the fused MoE. I'm still using the experts.0 mapping because this is what deepseek_v2::load_weights expects; not sure if that's an issue.
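A minimal sketch of the idea with toy shapes and plain dictionaries (hypothetical stand-ins, not the actual vLLM weight-loader interface):

import torch

# A GGUF checkpoint stores all experts of a layer stacked in one tensor,
# while the HF-style checkpoint has one tensor per expert.
num_experts, out_dim, in_dim = 4, 8, 16
stacked = torch.randn(num_experts, out_dim, in_dim)  # e.g. ffn_down_exps.weight

# Option A: slice and feed experts one by one (slow when there are many
# experts and layers).
per_expert = {
    f"model.layers.0.mlp.experts.{e}.down_proj.weight": stacked[e]
    for e in range(num_experts)
}

# Option B (the approach described above): keep the stacked tensor and hand it
# over in one shot, still keyed under "experts.0" so the existing deepseek_v2
# load_weights name matching picks it up; the fused-MoE weight loader is then
# told it received all experts at once.
full_load = {"model.layers.0.mlp.experts.0.down_proj.weight": stacked}

assert per_expert["model.layers.0.mlp.experts.2.down_proj.weight"].shape == (out_dim, in_dim)
assert full_load["model.layers.0.mlp.experts.0.down_proj.weight"].shape == (num_experts, out_dim, in_dim)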
I followed your instructions but I got an error.
Which of the quantized models are you trying to load?
Just tested and it works for me in a freshly checked out repo. Are you sure that you merged the GGUF weights into one file? Could you share the scripts you are testing with?
Met an error:
(base) ubuntu@localhost:/media/data/xgp/scripts$ pip show vllm
base_dir="/data/llm/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S_MERGE"
/data/llm/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S_MERGE/
@chuangzhidan Did you check out and install vllm from this PR? You seem to have a different version:
Please share your detailed environment.
Is this faster than llama.cpp for the Unsloth quants? The llama.cpp version is also very unoptimized; the GPUs sit mostly idle. Very eager to see it running on vLLM.
When will this be merged?
You are right, it has something to do with the vllm version and this PR's environment. Thank you!
I tried to reproduce this PR and got the same error as @seven1122:
[rank0]: File "/home/X/new-vllm/vllm/worker/worker.py", line 183, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/home/X/new-vllm/vllm/worker/model_runner.py", line 1112, in load_model
[rank0]: self.model = get_model(vllm_config=self.vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/X/new-vllm/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]: return loader.load_model(vllm_config=vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/X/new-vllm/vllm/model_executor/model_loader/loader.py", line 1320, in load_model
[rank0]: model.load_weights(
[rank0]: File "/home/X/new-vllm/vllm/model_executor/models/deepseek_v2.py", line 808, in load_weights
[rank0]: param = params_dict[name]
[rank0]: ~~~~~~~~~~~^^^^^^
[rank0]: KeyError: 'model.embed_tokens.qweight_type'
The checkpoint I used is DeepSeek-R1-UD-IQ1_S. I merged multiple .gguf files into a single one with:
./llama-gguf-split --merge ~/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf single.gguf
The path ~/DeepSeek-R1-UD-IQ1_S includes: DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf, DeepSeek-R1-UD-IQ1_S-00002-of-00003.gguf, DeepSeek-R1-UD-IQ1_S-00003-of-00003.gguf
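As a side note, one way to sanity-check a merged file is to list its tensors with the gguf-py package (a hedged diagnostic sketch, not part of this PR; the file name matches the merge command above):

from gguf import GGUFReader

reader = GGUFReader("single.gguf")
names = [t.name for t in reader.tensors]
print(f"{len(names)} tensors found in the merged file")
print(names[:5])                        # first few tensor names
print(any("exps" in n for n in names))  # True if stacked expert tensors are present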
I get the same error as @seven1122.
I'm having trouble reproducing the issue, could you share:
Hello @SzymonOzog.
@zh-jp
I want to run this, but unfortunately I only have 14x 3090 GPUs, so for tensor parallelism I need another 2 GPUs to get to 16. It would be great to see any kind of benchmarks on this compared to llama.cpp. Thank you!
@SzymonOzog thanks for your valuable suggestions. I built vllm from
Do you have a benchmark of performance?
@zh-jp Did you test the speed compared with llama.cpp? And how much memory does it need at least?
INFO 02-19 22:08:59 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
Based on 8 x A100 GPUs, it's showing around 7 tokens/s.
I also had issues with long context, but that got resolved after switching to bfloat16. This was caused by the model outputting NaN after some accumulation, which caused token 0 (beginning of sequence) to get emitted.
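For reference, a hedged sketch of forcing bfloat16 through the offline Python API, equivalent to passing --dtype bfloat16 to vllm serve (paths are placeholders taken from commands elsewhere in this thread, and the HF-config override added by this PR is omitted because its Python-side spelling is not shown here):

from vllm import LLM, SamplingParams

llm = LLM(
    model="./merged_file.gguf",   # merged single-file GGUF checkpoint
    tokenizer="../config_file/",  # original DeepSeek tokenizer directory
    dtype="bfloat16",             # avoids the NaN / token-0 issue described above
    tensor_parallel_size=8,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)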
No plans at the moment, I'm using Q_4_K and plan to invest time mostly in improving Q_K quants.
Seems to still be happening for me with bfloat16. Could you give me your run command, and what commit you're on? It seems to happen later in the context. I have tried a couple of PRs as well, with no luck, and have tried re-downloading the quant, using my own quant, etc.
Hey @SzymonOzog - I'm still having the same problem unfortunately, without being any closer to resolving exactly why. Would it be possible to get a gentle nudge in the right direction on what I could look for, or what commit I could run to test? It's exactly as you're saying: it seems like a 0 token is emitted after a while and the topic changes to something completely different towards the end of a response.
@davidsyoung I'm running on #14666 with no changes to the default except
What do I need to run this with the native 0.8.0 version?
@SzymonOzog any ideas? Already passing the HF config files.
@joshuakoh1
Signed-off-by: Louis Ulmer <[email protected]>
@SzymonOzog hello, I encountered some issues while loading DeepSeek-R1-UD-IQ1_S:
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve ./merged_file.gguf --tokenizer ../config_file/ --hf-config-path ../config_file/ --tensor-parallel-size 8 --max-model-len 102400 --gpu-memory-utilization 0.5 --port 8000 --dtype auto
Me too. Have you solved this problem?
I think this is happening because I-quants use a very slow MoE implementation, and at a sequence length of 102400 it processes for a long time. #16780 should add better support for MoE I-quants.
@SzymonOzog Is there a solution to this problem?
Yes, the PR I mentioned in the comment above should speed up I-quants; that might resolve the issue.
Thank you for your reply, looking forward to the code being merged.
After loading, the following problem occurred; I saw someone report this issue before. If I don't change the I-quants, how should I deal with this problem?
For now you can run with
Thank you. Will the next version of vllm solve this problem?
That depends on when the PR will get merged into main.
Hello, could you provide a Docker image URL? The network here is not good and docker build always fails.
When I set max_model_len to 8192, the service crashes when it starts.
Error log:
When I set max_model_len to 8192, the specific parameters for the command that reports the error are as follows:
@SzymonOzog
This adds support for quantized DeepSeek versions from Unsloth:
Currently Hugging Face does not support DeepSeek GGUF, so I added an option to provide an override path from which we can read the correct config.
To run, at the moment one needs to:
When initializing our DeepSeek model, we need to pass the paths to our Hugging Face config and tokenizer:
Current issues:
- Model loading is very slow as we load experts one by one (Fixed)
- GGUF MoE is a very naive implementation and is very slow
I plan to continue working on solving the aforementioned issues; I can do this in this PR or future ones. Sharing already because there seems to be demand for running this.
Closes #12436