[Model] Deepseek GGUF support #13167


Merged (23 commits) on Feb 27, 2025

Conversation

@SzymonOzog (Contributor) commented Feb 12, 2025

This adds support for quantized DeepSeek versions from Unsloth.

Currently, Hugging Face does not support DeepSeek GGUF models, so I added an option to provide an override path from which the correct config can be read.

To run it at the moment, initialize the DeepSeek model with the paths to the Hugging Face config and tokenizer:

    from vllm import LLM, SamplingParams
    llm = LLM(model="/YOUR_PATH/DeepSeek_Unsloth/DeepSeek-R1-Q2_K/DeepSeek-R1-Q2_K.gguf",
              tokenizer="/YOUR_PATH/DeepSeek_Unsloth",
              hf_config_path="/YOUR_PATH/DeepSeek_Unsloth",
              enforce_eager=True, tensor_parallel_size=4, trust_remote_code=True,
              max_model_len=10000)
    sampling_params = SamplingParams(temperature=0.5, max_tokens=2000)


    def print_outputs(outputs):
        for output in outputs:
            prompt = output.prompt
            generated_text = output.outputs[0].text
            print(f"Prompt: {prompt!r}, Generated text\n: {generated_text}")
        print("-" * 80)
    conversation = [
        {
            "role": "system",
            "content": "You are a helpful assistant"
        },
        {
            "role": "user",
            "content": "Why did the Roman Empire fall?",
        },
    ]
    outputs = llm.chat(conversation,
                       sampling_params=sampling_params,
                       use_tqdm=False)
    print_outputs(outputs)

Current issues:

  • Model loading is very slow as we load experts one by one (Fixed)
  • GGUF MoE is a very naive implementation and is very slow (see the sketch below)
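
To illustrate the second issue, here is a minimal sketch of what a naive per-expert MoE forward amounts to (this is not the actual kernel in this PR; `dequantize`, the weight layouts, and the function names are placeholders). Every expert pays its own dequantization plus matmuls instead of going through one fused kernel:

    import torch
    import torch.nn.functional as F

    def dequantize(qweight: torch.Tensor) -> torch.Tensor:
        # Stand-in (hypothetical) for GGUF block dequantization to float.
        return qweight.float()

    def naive_gguf_moe(x, topk_ids, topk_weights, gate_up_q, down_q):
        # x: [tokens, hidden]; topk_ids/topk_weights: [tokens, k];
        # gate_up_q/down_q: one quantized weight tensor per expert.
        out = torch.zeros_like(x)
        for e in range(len(gate_up_q)):
            token_idx, slot = (topk_ids == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            xe = x[token_idx]
            # A separate dequant + two matmuls per expert is the bottleneck.
            gate, up = (xe @ dequantize(gate_up_q[e]).T).chunk(2, dim=-1)
            h = F.silu(gate) * up
            expert_out = h @ dequantize(down_q[e]).T
            out.index_add_(0, token_idx,
                           expert_out * topk_weights[token_idx, slot, None])
        return out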

I plan to continue working on solving the aforementioned issues, either in this PR or in future ones; sharing already because there seems to be demand for running this.

Closes #12436


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

Comment on lines 1266 to 1276
# GGUF layer map assumes that we will have a merged expert weights
# so we need to map them manually
for idx in range(config.num_hidden_layers):
gguf_to_hf_name_map[f"blk.{idx}.exp_probs_b.bias"] = \
f"model.layers.{idx}.mlp.gate.e_score_correction_bias"
gguf_to_hf_name_map[f"blk.{idx}.ffn_down_exps.weight"] = \
f"model.layers.{idx}.mlp.experts.$EXP_ID$.down_proj.weight"
gguf_to_hf_name_map[f"blk.{idx}.ffn_gate_exps.weight"] = \
f"model.layers.{idx}.mlp.experts.$EXP_ID$.gate_proj.weight"
gguf_to_hf_name_map[f"blk.{idx}.ffn_up_exps.weight"] = \
f"model.layers.{idx}.mlp.experts.$EXP_ID$.up_proj.weight"
Collaborator

I think we can try to avoid this manual mapping for each weight in MoE; perhaps you can refer to how transformers handles GGUF MoE weight name mapping:
https://github.com/huggingface/transformers/blob/847854b023a637caa18e6860dc2bdd47f7c05eb5/src/transformers/modeling_gguf_pytorch_utils.py#L314-L317

Contributor Author

Yeah, the problem is that the weight loader expects the experts to be passed in one by one; trying to overcome it at the moment.

Contributor Author

Okay, I managed to add an option to load full expert weights at once into the fused MoE. It still uses the experts.0 mapping because that is what deepseek_v2::load_weights expects; not sure if that's an issue.
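
For illustration, a minimal sketch (not the code in this PR; the helper name and the return convention are assumptions) of how the $EXP_ID$ placeholder from the map above could be expanded into the per-expert names that deepseek_v2::load_weights expects:

    def expand_expert_names(gguf_to_hf_name_map: dict, num_experts: int) -> dict:
        # Map each per-expert HF name to (gguf_name, expert_id) so a loader
        # can slice the merged GGUF expert tensor when that weight is requested.
        expanded = {}
        for gguf_name, hf_name in gguf_to_hf_name_map.items():
            if "$EXP_ID$" in hf_name:
                for exp_id in range(num_experts):
                    per_expert = hf_name.replace("$EXP_ID$", str(exp_id))
                    expanded[per_expert] = (gguf_name, exp_id)
            else:
                expanded[hf_name] = (gguf_name, None)
        return expanded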

@junuMoon

I followed your instructions but got an error. Though I'm reading your PR, and it's a great job 👍

(VllmWorkerProcess pid=379357) /home/fran/vllm/.venv/lib/python3.12/site-packages/torch/nn/parameter.py:167: UserWarning: The use of `x.T` on tensors of dimension other than 2 to reverse their shape is deprecated and it will throw an error in a future release. Consider `x.mT` to transpose batches of matrices or `x.permute(*torch.arange(x.ndim - 1, -1, -1))` to reverse the dimensions of a tensor. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3683.)
(VllmWorkerProcess pid=379357)   return super().__torch_function__(func, types, args, kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242] Exception in worker VllmWorkerProcess while processing method determine_num_available_blocks.
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242] Traceback (most recent call last):
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/executor/multiproc_worker_utils.py", line 236, in _run_worker_process
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     output = run_method(worker, method, args, kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/utils.py", line 2224, in run_method
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return func(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return func(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/worker/worker.py", line 229, in determine_num_available_blocks
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     self.model_runner.profile_run()
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return func(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/worker/model_runner.py", line 1234, in profile_run
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     self._dummy_run(max_num_batched_tokens, max_num_seqs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/worker/model_runner.py", line 1345, in _dummy_run
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     self.execute_model(model_input, kv_caches, intermediate_tensors)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return func(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/worker/model_runner.py", line 1718, in execute_model
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     hidden_or_intermediate_states = model_executable(
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]                                     ^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/model_executor/models/deepseek_v2.py", line 677, in forward
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     hidden_states = self.model(input_ids, positions, kv_caches,
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/compilation/decorators.py", line 172, in __call__
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return self.forward(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/model_executor/models/deepseek_v2.py", line 633, in forward
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     hidden_states, residual = layer(positions, hidden_states,
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/model_executor/models/deepseek_v2.py", line 560, in forward
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     hidden_states = self.mlp(hidden_states)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]                     ^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/model_executor/models/deepseek_v2.py", line 159, in forward
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     shared_output = self.shared_experts(hidden_states)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/model_executor/models/deepseek_v2.py", line 90, in forward
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     gate_up, _ = self.gate_up_proj(x)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]                  ^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return self._call_impl(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return forward_call(*args, **kwargs)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/model_executor/layers/linear.py", line 400, in forward
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     output_parallel = self.quant_method.apply(self, input_, bias)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/model_executor/layers/quantization/gguf.py", line 185, in apply
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     out = _fuse_mul_mat(x, qweight, qweight_type)
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]   File "/home/fran/vllm/vllm/model_executor/layers/quantization/gguf.py", line 98, in _fuse_mul_mat
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]     return x @ qweight.T
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242]            ~~^~~~~~~~~~~
(VllmWorkerProcess pid=379357) ERROR 02-13 14:23:10 multiproc_worker_utils.py:242] RuntimeError: size mismatch, got input (10000), mat (10000x7168), vec (0)

@SzymonOzog
Contributor Author

I followed your instructions but got an error. Though I'm reading your PR, and it's a great job 👍


Which of the quantized models are you trying to load?

@junuMoon

@SzymonOzog
Contributor Author

https://huggingface.co/unsloth/DeepSeek-R1-GGUF/tree/main/DeepSeek-R1-UD-IQ1_S this one

Just tested, and it works for me in a freshly checked-out repo. Are you sure that you merged the GGUF weights into one file? Could you share the scripts you are testing with?

@chuangzhidan

chuangzhidan commented Feb 14, 2025

Met an error:
(base) ubuntu@localhost:/media/data/scripts$ python start_gguf.py
INFO 02-14 10:44:20 __init__.py:190] Automatically detected platform cuda.
Traceback (most recent call last):
  File "/media/data/xgp/scripts/start_gguf.py", line 3, in <module>
    llm = LLM(
          ^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/vllm/utils.py", line 1051, in inner
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/.local/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 212, in __init__
    engine_args = EngineArgs(
                  ^^^^^^^^^^^
TypeError: EngineArgs.__init__() got an unexpected keyword argument 'hf_config_path'

Not sure what went wrong.

(base) ubuntu@localhost:/media/data/xgp/scripts$ pip show vllm
Name: vllm
Version: 0.7.2

base_dir = "/data/llm/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S_MERGE"
from vllm import LLM, SamplingParams
llm = LLM(
    # model="/YOUR_PATH/DeepSeek_Unsloth/DeepSeek-R1-Q2_K/DeepSeek-R1-Q2_K.gguf",
    model="/data/llm/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S_MERGE/DeepSeek-R1-UD-IQ1_S-merge.gguf",
    tokenizer=base_dir,
    hf_config_path=base_dir,
    enforce_eager=True,
    tensor_parallel_size=2,
    trust_remote_code=True,
    max_model_len=10000
)

/data/llm/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S_MERGE/
-rw-rw-r-- 1 ubuntu ubuntu 1.6K Feb 14 10:38 config.json
-rw-rw-r-- 1 ubuntu ubuntu 11K Feb 14 10:37 configuration_deepseek.py
-rwxrwxrwx 1 root root 131G Feb 12 17:47 DeepSeek-R1-UD-IQ1_S-merge.gguf*
-rw-rw-r-- 1 ubuntu ubuntu 171 Feb 14 10:37 generation_config.json
-rw-rw-r-- 1 ubuntu ubuntu 74K Feb 14 10:37 modeling_deepseek.py
-rw-rw-r-- 1 ubuntu ubuntu 3.6K Feb 14 10:37 tokenizer_config.json
-rw-rw-r-- 1 ubuntu ubuntu 7.5M Feb 14 10:37 tokenizer.json

@SzymonOzog
Contributor Author

@chuangzhidan Did you check out and install vllm from this PR? You seem to have a different version:

root@8b74d742fc51:~/vllm# pip show vllm
Name: vllm
Version: 0.7.3.dev4+ge152f295.precompiled

@zlh1992

zlh1992 commented Feb 15, 2025

@chuangzhidan Did you check out and install vllm from this PR? You seem to have a different version:

root@8b74d742fc51:~/vllm# pip show vllm
Name: vllm
Version: 0.7.3.dev4+ge152f295.precompiled

Could you show your detailed environment?

@accupham

Is this faster than llama.cpp for the Unsloth quants? The llama.cpp version is also very unoptimized -- the GPUs sit mostly idle. Very eager to see it running on vLLM.

@seven1122

When will this be merged?

@chuangzhidan

@chuangzhidan Did you check out and install vllm from this PR? You seem to have a different version:

root@8b74d742fc51:~/vllm# pip show vllm
Name: vllm
Version: 0.7.3.dev4+ge152f295.precompiled

Could you show your detailed environment?

You are right, it has something to do with the vllm version and this PR's environment. Thank you.

@seven1122

seven1122 commented Feb 18, 2025

Met a KeyError: 'model.embed_tokens.qweight_type'

Exception in worker VllmWorkerProcess while processing method load_model.
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/multiproc_worker_utils.py", line
    output = run_method(worker, method, args, kwargs)

  File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2220, in run_method
    return func(*args, **kwargs)

  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 183, in load_model
    self.model_runner.load_model()
  File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1112, in
    self.model = get_model(vllm_config=self.vllm_config)

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/__init__.py",
    return loader.load_model(vllm_config=vllm_config)

  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/loader.py",
    model.load_weights()
  File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py",
    param = params_dict[name]

KeyError: 'model.embed_tokens.qweight_type'

@zh-jp

zh-jp commented Feb 18, 2025

I tried to reproduce this PR and got the same error as @seven1122.

[rank0]:   File "/home/X/new-vllm/vllm/worker/worker.py", line 183, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/home/X/new-vllm/vllm/worker/model_runner.py", line 1112, in load_model
[rank0]:     self.model = get_model(vllm_config=self.vllm_config)
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/X/new-vllm/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]:     return loader.load_model(vllm_config=vllm_config)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/X/new-vllm/vllm/model_executor/model_loader/loader.py", line 1320, in load_model
[rank0]:     model.load_weights(
[rank0]:   File "/home/X/new-vllm/vllm/model_executor/models/deepseek_v2.py", line 808, in load_weights
[rank0]:     param = params_dict[name]
[rank0]:             ~~~~~~~~~~~^^^^^^
[rank0]: KeyError: 'model.embed_tokens.qweight_type'

The checkpoint I used is DeepSeek-R1-UD-IQ1_S

I merged the multiple .gguf files into a single one with:

./llama-gguf-split --merge ~/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf single.gguf

The directory ~/DeepSeek-R1-UD-IQ1_S contains:

DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf  DeepSeek-R1-UD-IQ1_S-00002-of-00003.gguf  DeepSeek-R1-UD-IQ1_S-00003-of-00003.gguf

@leolmj

leolmj commented Feb 18, 2025

I get the same error as @seven1122.

@SzymonOzog
Contributor Author

@leolmj @seven1122 @zh-jp

I'm having trouble reproducing the issue, could you share:

  • your config.json
  • your vllm version + the commit hash that you checked out
  • the script that you are running

@zh-jp

zh-jp commented Feb 18, 2025

Hello @SzymonOzog.

  • config.json
  • The version of vllm is 0.7.2, and I have replaced the files that were specified in this PR.
  • Regarding the script, I used the demo you provided and only changed the model parameter to pass in the .gguf file path.

@SzymonOzog
Contributor Author

SzymonOzog commented Feb 18, 2025

@zh-jp
You also need to change the dtype in your config from bfloat16 to float16. Also, could you check out this PR and run through it? There have been changes in vllm since 0.7.2 and I cannot promise backwards compatibility.
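
In case it helps, a minimal sketch of that config change (the path is a placeholder; point it at the directory you pass as hf_config_path):

    import json
    from pathlib import Path

    # Placeholder path; use your own config directory.
    config_path = Path("/YOUR_PATH/DeepSeek_Unsloth/config.json")
    config = json.loads(config_path.read_text())
    config["torch_dtype"] = "float16"  # was "bfloat16"
    config_path.write_text(json.dumps(config, indent=2))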

@davidsyoung

I want to run this, but unfortunately I only have 14x 3090 GPUs, so for tensor parallelism I need another 2 GPUs to get to 16. It would be great to see any kind of benchmark on this compared to llama.cpp. Thank you!

@zh-jp

zh-jp commented Feb 19, 2025

@SzymonOzog thanks for your valuable suggestions. I built vllm from the deepseek-gguf branch of your repo and successfully ran DeepSeek-R1-UD-IQ1_S on 4x NVIDIA A800-SXM4-80GB.

@davidsyoung

@SzymonOzog thanks for your valuable suggestions. I built vllm from the deepseek-gguf branch of your repo and successfully ran DeepSeek-R1-UD-IQ1_S on 4x NVIDIA A800-SXM4-80GB.

Do you have a benchmark of performance?

@slr1997

slr1997 commented Feb 19, 2025

@zh-jp Did you test the speed compared with llama.cpp? And how much memory does it need at least?

@junuMoon

@zh-jp Did you test the speed compared with llama.cpp? And how much memory does it need at least?

INFO 02-19 22:08:59 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.

Based on 8 x A100 GPUs, it's showing around 7 tokens/s

@SzymonOzog
Contributor Author

I also had issues with long context, but that got resolved after switching to bfloat16. It was caused by the model outputting NaN after some accumulation, which made a token 0 (beginning of sequence) get emitted.

@SzymonOzog
Contributor Author

@SzymonOzog Would it be possible to support i-matrix quants? It would be really useful to squeeze as much performance as possible! Thank you

No plans at the moment; I'm using Q_4_K and plan to invest time mostly in improving Q_K quants.

@davidsyoung

I also had issues with long context, but that got resolved after switching to bfloat16. It was caused by the model outputting NaN after some accumulation, which made a token 0 (beginning of sequence) get emitted.

It still seems to be happening to me with bfloat16.

Could you give me your run command, and what commit you’re on?

It seems to happen later in the context. I have tried a couple of PRs as well, with no luck, and have tried redownloading the quant, my own quant, etc.

@davidsyoung

Hey @SzymonOzog - I'm still having the same problem unfortunately without being any closer to resolving exactly why. Would it be possible to get a gentle nudge in the right direction on what I could look for, or what commit I could run to test?

It's exactly the same as you're saying: it seems like a 0 token is emitted after a while, and the topic changes to something completely different towards the end of a response.

@SzymonOzog
Contributor Author

@davidsyoung I'm running on #14666 with no changes to the defaults except tensor_parallel=8 and max_model_len=20000, but that's just down to my memory limitations.
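
For reference, that corresponds to roughly the following setup (a sketch based on the example at the top of this PR; the paths and quant file name are placeholders):

    from vllm import LLM

    llm = LLM(model="/YOUR_PATH/DeepSeek_Unsloth/DeepSeek-R1-Q4_K/DeepSeek-R1-Q4_K.gguf",
              tokenizer="/YOUR_PATH/DeepSeek_Unsloth",
              hf_config_path="/YOUR_PATH/DeepSeek_Unsloth",
              tensor_parallel_size=8,
              max_model_len=20000)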

@joshuakoh1

What do I need to run this with the native 0.8.0 version?

2025-03-19T20:59:41+00:00 - gpustack.worker.backends.vllm - ERROR - Failed to derive max model length: 
2025-03-19T20:59:41+00:00 - gpustack.worker.backends.vllm - INFO - Starting vllm server
INFO 03-19 20:59:46 [__init__.py:256] Automatically detected platform cuda.
INFO 03-19 20:59:47 [api_server.py:977] vLLM API server version 0.8.0
INFO 03-19 20:59:47 [api_server.py:978] args: Namespace(subparser='serve', model_tag='/mnt/shared/merged_models/DeepSeek-R1-UD-IQ1_M/DeepSeek-R1-UD-IQ1_M.gguf', config='', host='0.0.0.0', port=40600, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/mnt/shared/merged_models/DeepSeek-R1-UD-IQ1_M/DeepSeek-R1-UD-IQ1_M.gguf', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', max_model_len=15000, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=12, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=None, swap_space=4, cpu_offload_gb=5.0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization='gguf', rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, use_tqdm_on_load=True, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['deepseek-r1-iq1m-merged'], qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, 
override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, additional_config=None, enable_reasoning=False, reasoning_parser=None, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd at 0x7e0df3a7e480>)
Traceback (most recent call last):
  File "/var/lib/gpustack/bin/vllm_v0.8.0", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/.local/share/pipx/venvs/vllm-v0-8-0/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 75, in main
    args.dispatch_function(args)
  File "/root/.local/share/pipx/venvs/vllm-v0-8-0/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 33, in cmd
    uvloop.run(run_server(args))
  File "/root/.local/share/pipx/venvs/vllm-v0-8-0/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/root/.local/share/pipx/venvs/vllm-v0-8-0/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/root/.local/share/pipx/venvs/vllm-v0-8-0/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1012, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/pipx/venvs/vllm-v0-8-0/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 141, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/pipx/venvs/vllm-v0-8-0/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 161, in build_async_engine_client_from_engine_args
    vllm_config = engine_args.create_engine_config(usage_context=usage_context)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/pipx/venvs/vllm-v0-8-0/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1206, in create_engine_config
    model_config = self.create_model_config()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/pipx/venvs/vllm-v0-8-0/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1121, in create_model_config
    return ModelConfig(
           ^^^^^^^^^^^^
  File "/root/.local/share/pipx/venvs/vllm-v0-8-0/lib/python3.12/site-packages/vllm/config.py", line 333, in __init__
    hf_config = get_config(self.hf_config_path or self.model,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/pipx/venvs/vllm-v0-8-0/lib/python3.12/site-packages/vllm/transformers_utils/config.py", line 280, in get_config
    config_dict, _ = PretrainedConfig.get_config_dict(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/pipx/venvs/vllm-v0-8-0/lib/python3.12/site-packages/transformers/configuration_utils.py", line 594, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/pipx/venvs/vllm-v0-8-0/lib/python3.12/site-packages/transformers/configuration_utils.py", line 685, in _get_config_dict
    config_dict = load_gguf_checkpoint(resolved_config_file, return_tensors=False)["config"]
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.local/share/pipx/venvs/vllm-v0-8-0/lib/python3.12/site-packages/transformers/modeling_gguf_pytorch_utils.py", line 399, in load_gguf_checkpoint
    raise ValueError(f"GGUF model with architecture {architecture} is not supported yet.")
ValueError: GGUF model with architecture deepseek2 is not supported yet.

@joshuakoh1

What do I need to run this with the native 0.8.0 version?


@SzymonOzog any ideas? I'm already passing the HF config files

@SzymonOzog
Copy link
Contributor Author

@joshuakoh1
How are you passing the hf_config_path variable? It's set to None in your logs
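
For reference, the traceback above shows why this matters: the config source is resolved as hf_config_path or self.model, so when hf_config_path is None the GGUF file itself is handed to transformers, which cannot parse the deepseek2 architecture. A minimal sketch of that resolution logic (simplified, not the literal vLLM code; paths are placeholders):

    # Simplified sketch of the lookup the traceback goes through; when
    # hf_config_path is None, transformers receives the GGUF file and raises
    # "GGUF model with architecture deepseek2 is not supported yet."
    def resolve_config_source(model: str, hf_config_path: str | None) -> str:
        # hf_config_path should point at a directory containing config.json
        return hf_config_path or model

    print(resolve_config_source("/models/DeepSeek-R1-Q2_K.gguf", None))
    # -> the .gguf path itself; transformers then fails on deepseek2
    print(resolve_config_source("/models/DeepSeek-R1-Q2_K.gguf", "/models/config_dir"))
    # -> /models/config_dir; config.json is read normally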

lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
@lv03
Copy link

lv03 commented Apr 16, 2025

@SzymonOzog hello, I encountered some issues while loading DeepSeek R1-UD-IQ1_S

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve ./merged_file.gguf --tokenizer ../config_file/ --hf-config-path ../config_file/ --tensor-parallel-size 8 --max-model-len 102400 --gpu-memory-utilization 0.5 --port 8000 --dtype auto

It seems to be stuck here (screenshots):

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.124.04 Driver Version: 570.124.04 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:17:00.0 Off | Off |
| 49% 37C P0 78W / 450W | 33746MiB / 49140MiB | 36% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 Off | 00000000:3D:00.0 Off | Off |
| 62% 36C P0 84W / 450W | 33746MiB / 49140MiB | 36% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 4090 Off | 00000000:63:00.0 Off | Off |
| 64% 36C P0 81W / 450W | 33746MiB / 49140MiB | 34% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce RTX 4090 Off | 00000000:99:00.0 Off | Off |
| 62% 38C P0 94W / 450W | 33746MiB / 49140MiB | 33% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA A100 80GB PCIe Off | 00000000:AB:00.0 Off | 0 |
| N/A 43C P0 79W / 300W | 33795MiB / 81920MiB | 48% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA GeForce RTX 4090 Off | 00000000:BD:00.0 Off | Off |
| 62% 36C P0 89W / 450W | 33746MiB / 49140MiB | 37% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA GeForce RTX 4090 Off | 00000000:CF:00.0 Off | Off |
| 65% 40C P0 79W / 450W | 33746MiB / 49140MiB | 37% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA A100 80GB PCIe Off | 00000000:E1:00.0 Off | 0 |
| N/A 44C P0 85W / 300W | 33795MiB / 81920MiB | 49% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 124582 C ...niconda3/envs/vLLM/bin/python 33736MiB |
| 1 N/A N/A 124955 C ...niconda3/envs/vLLM/bin/python 33736MiB |
| 2 N/A N/A 124956 C ...niconda3/envs/vLLM/bin/python 33736MiB |
| 3 N/A N/A 124957 C ...niconda3/envs/vLLM/bin/python 33736MiB |
| 4 N/A N/A 124958 C ...niconda3/envs/vLLM/bin/python 33786MiB |
| 5 N/A N/A 124959 C ...niconda3/envs/vLLM/bin/python 33736MiB |
| 6 N/A N/A 124960 C ...niconda3/envs/vLLM/bin/python 33736MiB |
| 7 N/A N/A 124961 C ...niconda3/envs/vLLM/bin/python 33786MiB |
+-----------------------------------------------------------------------------------------+

@zhaotyer
Copy link
Contributor

@SzymonOzog hello, I encountered some issues while loading DeepSeek R1-UD-IQ1_S

CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve ./merged_file.gguf --tokenizer ../config_file/ --hf-config-path ../config_file/ --tensor-parallel-size 8 --max-model-len 102400 --gpu-memory-utilization 0.5 --port 8000 --dtype auto

It seems to be stuck here (screenshots and nvidia-smi output as quoted above)

Me too. Have you solved this problem?

@zhaotyer
Copy link
Contributor

@joshuakoh1 How are you passing the hf_config_path variable? It's set to None in your logs

vLLM is stuck at "There is no support for fast MoE kernel for current quantization method. Falling back to slow implementation."
vLLM version: 0.8.4

@SzymonOzog
Copy link
Contributor Author

I think this is happening because I-quants use a very slow MoE implementation, and at a sequence length of 102400 it processes for a long time. #16780 should add better support for MoE I-quants.
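
To see why long prompts make this so visible, here is an illustrative naive MoE fallback; this is a sketch for intuition only, not vLLM's actual GGUF kernel. Every token loops over its top-k experts one at a time, so cost scales with tokens × top_k dense expert calls:

    import torch

    # Illustrative naive MoE loop (an assumption for intuition, not vLLM's
    # actual GGUF kernel): run each token's top-k experts sequentially and
    # weight-sum the results. At 102400 tokens this loop runs for a very
    # long time compared to a fused kernel.
    def naive_moe(x: torch.Tensor,                 # [tokens, hidden]
                  experts: list[torch.nn.Module],  # one module per expert
                  topk_ids: torch.Tensor,          # [tokens, top_k]
                  topk_weights: torch.Tensor       # [tokens, top_k]
                  ) -> torch.Tensor:
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):
            for k in range(topk_ids.shape[1]):
                e = int(topk_ids[t, k])
                out[t] += topk_weights[t, k] * experts[e](x[t])
        return out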

@zhaotyer
Copy link
Contributor

vLLM is stuck at "There is no support for fast MoE kernel for current quantization method. Falling back to slow implementation." (vLLM version: 0.8.4)

@SzymonOzog Is there a solution to this problem?

@SzymonOzog
Copy link
Contributor Author

Yes, the PR I mentioned in the comment above should speed up I-quants; that might resolve the issue

@zhaotyer
Copy link
Contributor

Yes, the PR I mentioned in the comment above should speed up I-quants; that might resolve the issue

Thank you for your reply; looking forward to the code being merged.

@lv03
Copy link

lv03 commented Apr 17, 2025

I think this is happening because I-quants use a very slow MoE implementation, and at a sequence length of 102400 it processes for a long time. #16780 should add better support for MoE I-quants.

After loading, the following problem occurred (screenshot of the error):

I saw someone report this issue before. If I don't change the I-quants, how should I deal with this problem? (screenshot)

@SzymonOzog
Copy link
Contributor Author

For now you can run with enforce_eager=True, although it will be slow; the PR I mentioned above should also fix this issue
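
A minimal sketch of that workaround with the Python API (paths are placeholders mirroring the serve commands in this thread; enforce_eager=True disables CUDA graph capture, trading speed for stability):

    from vllm import LLM

    # Eager-mode workaround sketch; paths are placeholders.
    llm = LLM(
        model="/models/DeepSeek-R1-UD-IQ1_S/merged_file.gguf",
        tokenizer="/models/DeepSeek-R1-UD-IQ1_S/",
        hf_config_path="/models/DeepSeek-R1-UD-IQ1_S/",
        tensor_parallel_size=8,
        enforce_eager=True,  # skip CUDA graphs; slower but sidesteps the issue
        max_model_len=8192,
    )

With vllm serve, the equivalent is the --enforce-eager flag, as in the commands quoted elsewhere in this thread.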

@lv03
Copy link

lv03 commented Apr 17, 2025

For now you can run with enforce_eager=True, although it will be slow; the PR I mentioned above should also fix this issue

Thank you. Will the next version of vLLM solve this problem?

@SzymonOzog
Copy link
Contributor Author

That depends on when the PR gets merged into main

@zhaotyer
Copy link
Contributor

Yes, the PR I mentioned in the comment above should speed up I-quants; that might resolve the issue

Hello, could you provide a Docker image URL? The network here is poor and docker build always fails

@zhaotyer
Copy link
Contributor

I think this is happening because I-quants use a very slow MoE implementation, and at a sequence length of 102400 it processes for a long time. #16780 should add better support for MoE I-quants.

When I set max_model_len to 8192, the service crashes on startup.
I tested it on 2xA100x80GB and 8xL40Sx45GB, and both showed errors.

vllm serve /models/DeepSeek-R1-UD-IQ1_S/merged_file.gguf -tp 2 --trust-remote-code --enforce-eager --trust-remote-code --tokenizer /models/DeepSeek-R1-UD-IQ1_S/ --hf-config-path /models/DeepSeek-R1-UD-IQ1_S/ --dtype bfloat16 --max-model-len 8192 --served-model-name atom --port 8160 --gpu-memory-utilization 0.95

error log

INFO 04-22 04:13:35 [model_runner.py:1146] Model loading took 66.5477 GiB and 216.481170 seconds
ERROR 04-22 04:13:40 [engine.py:448] CUDA error: invalid configuration argument
ERROR 04-22 04:13:40 [engine.py:448] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 04-22 04:13:40 [engine.py:448] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 04-22 04:13:40 [engine.py:448] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 04-22 04:13:40 [engine.py:448] Traceback (most recent call last):
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 436, in run_mp_engine
ERROR 04-22 04:13:40 [engine.py:448]     engine = MQLLMEngine.from_vllm_config(
ERROR 04-22 04:13:40 [engine.py:448]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 128, in from_vllm_config
ERROR 04-22 04:13:40 [engine.py:448]     return cls(
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/multiprocessing/engine.py", line 82, in __init__
ERROR 04-22 04:13:40 [engine.py:448]     self.engine = LLMEngine(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 285, in __init__
ERROR 04-22 04:13:40 [engine.py:448]     self._initialize_kv_caches()
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/llm_engine.py", line 434, in _initialize_kv_caches
ERROR 04-22 04:13:40 [engine.py:448]     self.model_executor.determine_num_available_blocks())
ERROR 04-22 04:13:40 [engine.py:448]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 103, in determine_num_available_blocks
ERROR 04-22 04:13:40 [engine.py:448]     results = self.collective_rpc("determine_num_available_blocks")
ERROR 04-22 04:13:40 [engine.py:448]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 331, in collective_rpc
ERROR 04-22 04:13:40 [engine.py:448]     return self._run_workers(method, *args, **(kwargs or {}))
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/mp_distributed_executor.py", line 185, in _run_workers
ERROR 04-22 04:13:40 [engine.py:448]     driver_worker_output = run_method(self.driver_worker, sent_method,
ERROR 04-22 04:13:40 [engine.py:448]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils.py", line 2428, in run_method
ERROR 04-22 04:13:40 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-22 04:13:40 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
ERROR 04-22 04:13:40 [engine.py:448]     self.model_runner.profile_run()
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-22 04:13:40 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1243, in profile_run
ERROR 04-22 04:13:40 [engine.py:448]     self._dummy_run(max_num_batched_tokens, max_num_seqs)
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1369, in _dummy_run
ERROR 04-22 04:13:40 [engine.py:448]     self.execute_model(model_input, kv_caches, intermediate_tensors)
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-22 04:13:40 [engine.py:448]     return func(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/worker/model_runner.py", line 1770, in execute_model
ERROR 04-22 04:13:40 [engine.py:448]     hidden_or_intermediate_states = model_executable(
ERROR 04-22 04:13:40 [engine.py:448]                                     ^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 703, in forward
ERROR 04-22 04:13:40 [engine.py:448]     hidden_states = self.model(input_ids, positions, intermediate_tensors,
ERROR 04-22 04:13:40 [engine.py:448]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 172, in __call__
ERROR 04-22 04:13:40 [engine.py:448]     return self.forward(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 660, in forward
ERROR 04-22 04:13:40 [engine.py:448]     hidden_states, residual = layer(positions, hidden_states, residual)
ERROR 04-22 04:13:40 [engine.py:448]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 580, in forward
ERROR 04-22 04:13:40 [engine.py:448]     hidden_states = self.mlp(hidden_states)
ERROR 04-22 04:13:40 [engine.py:448]                     ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 159, in forward
ERROR 04-22 04:13:40 [engine.py:448]     final_hidden_states = self.experts(
ERROR 04-22 04:13:40 [engine.py:448]                           ^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return self._call_impl(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1750, in _call_impl
ERROR 04-22 04:13:40 [engine.py:448]     return forward_call(*args, **kwargs)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 842, in forward
ERROR 04-22 04:13:40 [engine.py:448]     return self.forward_impl(hidden_states, router_logits)
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 861, in forward_impl
ERROR 04-22 04:13:40 [engine.py:448]     final_hidden_states = self.quant_method.apply(
ERROR 04-22 04:13:40 [engine.py:448]                           ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/gguf.py", line 377, in apply
ERROR 04-22 04:13:40 [engine.py:448]     return _fused_moe_gguf(x, layer.w13_qweight, layer.w2_qweight,
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/gguf.py", line 176, in _fused_moe_gguf
ERROR 04-22 04:13:40 [engine.py:448]     out = ops.ggml_moe_a8_vec(out, w2, topk_ids, 1, qweight_type2,
ERROR 04-22 04:13:40 [engine.py:448]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/vllm/_custom_ops.py", line 1179, in ggml_moe_a8_vec
ERROR 04-22 04:13:40 [engine.py:448]     return torch.ops._C.ggml_moe_a8_vec(X, W, topk_ids, top_k, quant_type, row,
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1123, in __call__
ERROR 04-22 04:13:40 [engine.py:448]     return self._op(*args, **(kwargs or {}))
ERROR 04-22 04:13:40 [engine.py:448]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-22 04:13:40 [engine.py:448] RuntimeError: CUDA error: invalid configuration argument
ERROR 04-22 04:13:40 [engine.py:448] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 04-22 04:13:40 [engine.py:448] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 04-22 04:13:40 [engine.py:448] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

@zhaotyer
Copy link
Contributor

(quoting the previous comment and the full error log above)

When I set max_model_len to 8192, the specific parameter values at the point where the following call fails are as follows:

    return torch.ops._C.ggml_moe_a8_vec(X, W, topk_ids, top_k, quant_type, row,
                                        tokens)
(VllmWorkerProcess pid=3264) INFO 04-22 05:19:17 [model_runner.py:1146] Model loading took 66.5477 GiB and 241.765538 seconds
INFO 04-22 05:19:32 [model_runner.py:1146] Model loading took 66.5477 GiB and 257.481310 seconds
ERROR 04-22 05:19:37 [_custom_ops.py:1179] x is:torch.Size([8192, 7168]), w is:torch.Size([256, 2048, 1400]), topk_ids is:torch.Size([8192, 8]),top_k is:8, quant_type is:19, row is:<class 'torch.SymInt'>, tokens is:8192
ERROR 04-22 05:19:37 [_custom_ops.py:1179] x is:torch.Size([65536, 1024]), w is:torch.Size([256, 7168, 264]), topk_ids is:torch.Size([8192, 8]),top_k is:1, quant_type is:16, row is:<class 'torch.SymInt'>, tokens is:65536
ERROR 04-22 05:19:37 [engine.py:448] CUDA error: invalid configuration argument
ERROR 04-22 05:19:37 [engine.py:448] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 04-22 05:19:37 [engine.py:448] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 04-22 05:19:37 [engine.py:448] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 04-22 05:19:37 [engine.py:448] Traceback (most recent call last):
ERROR 04-22 0
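
A hedged reading of the logged shapes (the numbers are from the log above; the interpretation is an assumption, not confirmed):

    # Numbers taken from the log above; interpretation is an assumption.
    max_num_batched_tokens = 8192   # the profile run batches max_model_len tokens
    top_k = 8                       # topk_ids is [8192, 8]
    expanded_rows = max_num_batched_tokens * top_k
    assert expanded_rows == 65536   # matches x: [65536, 1024] in the failing call

So the second ggml_moe_a8_vec call sees 65536 expanded rows at once; if the kernel maps those rows onto a CUDA grid dimension capped at 65535, that would be consistent with the "invalid configuration argument" error, though this is speculation, and #16780 reworks these kernels anyway.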

@SzymonOzog

shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
@hahmad2008
Copy link

@SzymonOzog
Can we serve this model using vLLM?
NoelJacob/Meta-Llama-3-8B-Instruct-Q4_K_M-GGUF

Labels
ci/build, documentation, frontend, ready, speculative-decoding, structured-output, v1
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[Feature]: Deepseek R1 GGUF 4bit(Q4KM) support