[Model] Deepseek GGUF support #13167
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run full CI to test the changes comprehensively before merging.
Commits aec8431 to 1038380
# GGUF layer map assumes that we will have merged expert weights,
# so we need to map them manually
for idx in range(config.num_hidden_layers):
    gguf_to_hf_name_map[f"blk.{idx}.exp_probs_b.bias"] = \
        f"model.layers.{idx}.mlp.gate.e_score_correction_bias"
    gguf_to_hf_name_map[f"blk.{idx}.ffn_down_exps.weight"] = \
        f"model.layers.{idx}.mlp.experts.$EXP_ID$.down_proj.weight"
    gguf_to_hf_name_map[f"blk.{idx}.ffn_gate_exps.weight"] = \
        f"model.layers.{idx}.mlp.experts.$EXP_ID$.gate_proj.weight"
    gguf_to_hf_name_map[f"blk.{idx}.ffn_up_exps.weight"] = \
        f"model.layers.{idx}.mlp.experts.$EXP_ID$.up_proj.weight"
I think we can try to avoid this manual mapping for each MoE weight; perhaps you can refer to how transformers handles GGUF MoE weight name mapping:
https://github.com/huggingface/transformers/blob/847854b023a637caa18e6860dc2bdd47f7c05eb5/src/transformers/modeling_gguf_pytorch_utils.py#L314-L317
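For illustration, a minimal pattern-based sketch in plain Python (not the actual transformers or vLLM code; the name patterns are copied from the diff above, everything else is assumed):

import re

# Rename GGUF tensor names to HF-style names via regex patterns instead of
# enumerating every layer by hand.
GGUF_TO_HF_PATTERNS = [
    (r"blk\.(\d+)\.exp_probs_b\.bias",
     r"model.layers.\1.mlp.gate.e_score_correction_bias"),
    (r"blk\.(\d+)\.ffn_(gate|up|down)_exps\.weight",
     r"model.layers.\1.mlp.experts.$EXP_ID$.\2_proj.weight"),
]

def map_gguf_name(gguf_name):
    # Return the HF-style name for a GGUF tensor name, or None if unmapped.
    for pattern, repl in GGUF_TO_HF_PATTERNS:
        if re.fullmatch(pattern, gguf_name):
            return re.sub(pattern, repl, gguf_name)
    return None

print(map_gguf_name("blk.3.ffn_down_exps.weight"))
# -> model.layers.3.mlp.experts.$EXP_ID$.down_proj.weight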
Yeah the problem is that the weight loader expects the experts to be passed in one by one, trying to overcome it atm
Okay, I managed to add an option to load full expert weights at once into the fused MoE. I'm still using the experts.0 mapping because this is what deepseek_v2::load_weights expects; not sure if that's an issue.
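A minimal sketch of the idea with toy shapes and plain dictionaries (hypothetical stand-ins, not the actual vLLM weight-loader interface):

import torch

# A GGUF checkpoint stores all experts of a layer stacked in one tensor,
# while the HF-style checkpoint has one tensor per expert.
num_experts, out_dim, in_dim = 4, 8, 16
stacked = torch.randn(num_experts, out_dim, in_dim)  # e.g. ffn_down_exps.weight

# Option A: slice and feed experts one by one (slow when there are many
# experts and layers).
per_expert = {
    f"model.layers.0.mlp.experts.{e}.down_proj.weight": stacked[e]
    for e in range(num_experts)
}

# Option B (the approach described above): keep the stacked tensor and hand it
# over in one shot, still keyed under "experts.0" so the existing deepseek_v2
# load_weights name matching picks it up; the fused-MoE weight loader is then
# told it received all experts at once.
full_load = {"model.layers.0.mlp.experts.0.down_proj.weight": stacked}

assert per_expert["model.layers.0.mlp.experts.2.down_proj.weight"].shape == (out_dim, in_dim)
assert full_load["model.layers.0.mlp.experts.0.down_proj.weight"].shape == (num_experts, out_dim, in_dim)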
I followed your instructions but I got an error.
Which of the quantized models are you trying to load?
Just tested and it works for me in a freshly checked out repo. Are you sure that you merged the GGUF weights into one file? Could you share the scripts you are testing with?
Met an error:
(base) ubuntu@localhost:/media/data/xgp/scripts$ pip show vllm
base_dir="/data/llm/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S_MERGE"
/data/llm/DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S_MERGE/
@chuangzhidan Did you check out and install vllm from this PR? You seem to have a different version:
Please share your detailed environment.
Is this faster than llama.cpp for the Unsloth quants? The llama.cpp version is also very unoptimized; the GPUs sit mostly idle. Very eager to see it running on vLLM.
When will this be merged?
You are right, it has something to do with the vllm version and this PR's environment. Thank you!
I tried to reproduce this PR and got the same error as @seven1122:
[rank0]: File "/home/X/new-vllm/vllm/worker/worker.py", line 183, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/home/X/new-vllm/vllm/worker/model_runner.py", line 1112, in load_model
[rank0]: self.model = get_model(vllm_config=self.vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/X/new-vllm/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
[rank0]: return loader.load_model(vllm_config=vllm_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/X/new-vllm/vllm/model_executor/model_loader/loader.py", line 1320, in load_model
[rank0]: model.load_weights(
[rank0]: File "/home/X/new-vllm/vllm/model_executor/models/deepseek_v2.py", line 808, in load_weights
[rank0]: param = params_dict[name]
[rank0]: ~~~~~~~~~~~^^^^^^
[rank0]: KeyError: 'model.embed_tokens.qweight_type'
The checkpoint I used is DeepSeek-R1-UD-IQ1_S. I merged multiple .gguf files into a single one with:
./llama-gguf-split --merge ~/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf single.gguf
The path ~/DeepSeek-R1-UD-IQ1_S includes: DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf, DeepSeek-R1-UD-IQ1_S-00002-of-00003.gguf, DeepSeek-R1-UD-IQ1_S-00003-of-00003.gguf
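As a side note, one way to sanity-check a merged file is to list its tensors with the gguf-py package (a hedged diagnostic sketch, not part of this PR; the file name matches the merge command above):

from gguf import GGUFReader

reader = GGUFReader("single.gguf")
names = [t.name for t in reader.tensors]
print(f"{len(names)} tensors found in the merged file")
print(names[:5])                        # first few tensor names
print(any("exps" in n for n in names))  # True if stacked expert tensors are present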
I get the same error as @seven1122.
I'm having trouble reproducing the issue, could you share:
Hello @SzymonOzog.
@zh-jp
I want to run this, but unfortunately I only have 14x 3090 GPUs, so for tensor parallelism I need another 2 GPUs to get to 16. It would be great to see any kind of benchmarks on this compared to llama.cpp. Thank you!
@SzymonOzog thanks for your valuable suggestions. I built vllm from
Do you have a benchmark of performance?
@zh-jp Did you test the speed compared with llama.cpp? And how much memory does it need at least?
INFO 02-19 22:08:59 metrics.py:455] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
Based on 8 x A100 GPUs, it's showing around 7 tokens/s.
I also had issues with long context, but that got resolved after switching to bfloat16. This was caused by the model outputting NaN after some accumulation, which caused token 0 (beginning of sequence) to get emitted.
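For reference, a hedged sketch of forcing bfloat16 through the offline Python API, equivalent to passing --dtype bfloat16 to vllm serve (paths are placeholders taken from commands elsewhere in this thread, and the HF-config override added by this PR is omitted because its Python-side spelling is not shown here):

from vllm import LLM, SamplingParams

llm = LLM(
    model="./merged_file.gguf",   # merged single-file GGUF checkpoint
    tokenizer="../config_file/",  # original DeepSeek tokenizer directory
    dtype="bfloat16",             # avoids the NaN / token-0 issue described above
    tensor_parallel_size=8,
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)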
No plans at the moment, I'm using Q_4_K and plan to invest time mostly in improving Q_K quants.
Seems to still be happening for me with bfloat16. Could you give me your run command, and what commit you're on? It seems to happen later in the context. I have tried a couple of PRs as well, with no luck, and have tried re-downloading the quant, using my own quant, etc.
Hey @SzymonOzog - I'm still having the same problem unfortunately, without being any closer to resolving exactly why. Would it be possible to get a gentle nudge in the right direction on what I could look for, or what commit I could run to test? It's exactly as you're saying: it seems like a 0 token is emitted after a while and the topic changes to something completely different towards the end of a response.
@davidsyoung I'm running on #14666 with no changes to the default except
What do I need to run this with the native 0.8.0 version?
@SzymonOzog any ideas? Already passing the HF config files.
@joshuakoh1
Signed-off-by: Louis Ulmer <[email protected]>
@SzymonOzog hello, I encountered some issues while loading DeepSeek-R1-UD-IQ1_S:
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve ./merged_file.gguf --tokenizer ../config_file/ --hf-config-path ../config_file/ --tensor-parallel-size 8 --max-model-len 102400 --gpu-memory-utilization 0.5 --port 8000 --dtype auto
Me too. Have you solved this problem?
I think this is happening because I-quants use a very slow MoE implementation, and at a sequence length of 102400 it processes for a long time. #16780 should add better support for MoE I-quants.
@SzymonOzog Is there a solution to this problem?
Yes, the PR I mentioned in the comment above should speed up I-quants; that might resolve the issue.
Thank you for your reply, looking forward to the code being merged.
After loading, the following problem occurred; I saw someone report this issue before. If I don't change the I-quants, how should I deal with this problem?
For now you can run with
Thank you. Will the next version of vllm solve this problem?
That depends on when the PR will get merged into main.
Hello, could you provide a Docker image URL? The network here is not good and docker build always fails.
When I set max_model_len to 8192, the service crashes when it starts.
Error log:
When I set max_model_len to 8192, the specific parameters for the command that reports the error are as follows:
@SzymonOzog
This adds support for quantized DeepSeek versions from Unsloth:
Currently Hugging Face does not support DeepSeek GGUF, so I added an option to provide an override path from which we can read the correct config.
To run, at the moment one needs to:
When initializing our DeepSeek model, we need to pass the paths to our Hugging Face config and tokenizer:
Current issues:
- Model loading is very slow as we load experts one by one (Fixed)
- GGUF MoE is a very naive implementation and is very slow
I plan to continue working on solving the aforementioned issues; I can do this in this PR or future ones. Sharing already because there seems to be demand for running this.
Closes #12436