VLLM for Qwen 2.5 72B produces all !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! outputs, regardless of prompt given GPTQ 4 bits quantization #14126
Comments
Is vLLM compatible with GPTQ 4-bit quantization of Qwen Instruct? Has anyone run this successfully? |
maybe fixed by #11493 |
Sorry, can you elaborate? I looked at the PR and I do not know what I should do to fix the problem. I am not passing the quantization parameter to LLM, so I think I am using the GPTQ Marlin kernel, but I still have the error. Technically, based on the PR, I should not even have the issue in the first place since the PR is merged. |
May I ask what version of vllm you have? |
I use |
Please take a look and help |
|
I will try the Qwen2 one and let you know. I do not think I have NaN in my parameters. |
Any GPTQ-Int4 model from the official hf repository is fine, it doesn't have to be qwen2, it can be qwen2.5. |
Could you please provide your running script? |
#13035 possibly related? |
I tried Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4 from Hugging Face, and this worked. To be honest, the performance was extremely weak. I see the configuration for that was desc_order = False, group_size = 128. |
|
After more testing, I realize the bad performance is due to the long context length, which is strange since the same config was working great with Llama. @jeejeelee @noooop Do you think this is because of using the AutoGPTQ package, or because of the desc_order and group_size configuration I explained above? |
According to #11493, NaN results in the Qwen model will lead to the !!!!! output. The GPU-to-CPU conversion actually happens in the sampler; adding NaN detection before this point would synchronize the CUDA stream, resulting in performance degradation. There is no particularly good place to run a runtime NaN check on hidden_or_intermediate_states.
I see, but why would it produce a NaN result? Just trying to understand the action I need to take to resolve this: should I requantize the model, or should I change the model parameters? |
Let's first locate and confirm the problem, then try to solve it. |
Is there any way you can modify the vllm code (in your Python site-packages) to output a NaN check on hidden_or_intermediate_states
before vllm/vllm/worker/model_runner.py line 1788 in bb5b640?
It is a bit hacky, but reinstalling vllm from source takes a long time. |
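A minimal sketch of the kind of debug check being suggested here (a hypothetical helper, not actual vLLM code; the tensor name mirrors the hidden_or_intermediate_states variable mentioned above):

```python
import torch

def debug_check_nan(hidden_or_intermediate_states: torch.Tensor) -> None:
    # Branching on .any() copies the result back to the CPU and synchronizes
    # the CUDA stream, so this is only suitable for debugging, not serving.
    if torch.isnan(hidden_or_intermediate_states).any():
        print("NaN detected in hidden_or_intermediate_states")
    if torch.isinf(hidden_or_intermediate_states).any():
        print("Inf detected in hidden_or_intermediate_states")
```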
Sure. Given my setup it is pretty difficult to modify a package and install it from source, but I will try to test this idea. |
@noooop Thank you for providing this very useful information. I will verify the NaN output ASAP |
@noooop @manitadayon I can reproduce this issue by using Qwen1.5-14B-Chat-GPTQ, and I've now implemented a temporary solution which could fix this issue locally; please see: https://github.com/jeejeelee/vllm/blob/qwen2-overflow-clamp/vllm/model_executor/models/qwen2.py#L237-L246.

```python
import vllm
from vllm import SamplingParams

MODEL_PATH = "/model/Qwen1.5-14B-Chat-GPTQ"

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
template = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"
prompts = [template.format(question=prompt) for prompt in prompts]

sampling_params = SamplingParams(temperature=0.0, top_p=0.95)
llm = vllm.LLM(
    MODEL_PATH,
    max_num_seqs=2,
    trust_remote_code=True,
    max_model_len=1024,
    tensor_parallel_size=2,
    enforce_eager=True,
)
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
|
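For reference, a minimal sketch of the overflow-clamp idea used in that branch (an illustration of the technique only, not the exact helper in the linked code):

```python
import torch

def clamp_fp16_overflow(x: torch.Tensor) -> torch.Tensor:
    # Map +/-inf produced by fp16 overflow back to the largest finite fp16
    # values so later ops do not turn them into NaN; torch.nan_to_num also
    # replaces any NaN that has already appeared with 0.0 by default.
    if x.dtype == torch.float16:
        finfo = torch.finfo(torch.float16)
        x = torch.nan_to_num(x, posinf=finfo.max, neginf=finfo.min)
    return x
```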
I didn't expect that; I thought this code had been optimized over several years and was already bug-free. I think all GPTQ models, and even all quantized models using fp16, will be affected. This means we either need to add cast_overflow_tensors to all quantized layers, or we need to modify the CUDA kernel to solve this problem. In fact, I don't know why quantized models use fp16 as the default dtype and convert bf16 models to fp16; AWQ Marlin and GPTQ Marlin both support bf16.
|
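For readers following along, a small self-contained illustration of the overflow behaviour being described (plain PyTorch, nothing vLLM-specific):

```python
import torch

# float16 can only represent magnitudes up to 65504, so an activation that
# exceeds this overflows to inf, and a later operation on infs (for example
# inside a normalization) yields NaN. bfloat16 has the fp32 exponent range.
x = torch.tensor([60000.0], dtype=torch.float16)
y = x * 2
print(y)       # tensor([inf], dtype=torch.float16)
print(y - y)   # tensor([nan], dtype=torch.float16) -- inf - inf is NaN
print(torch.finfo(torch.float16).max)    # 65504.0
print(torch.finfo(torch.bfloat16).max)   # ~3.39e38
```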
@noooop I agree, I just want to give you feedback on the results of my testing |
@mgoin Could you please look at this thread? Thanks. |
Is there a model uploaded to HF that I can reproduce with? I would assume this issue is specific to I found a Qwen GPTQ model with
So I'm going to say I haven't been able to reproduce this yet. |
I just reproduced the error again. Unfortunately I cannot upload the model to HF. This time my config was to use HF GPTQ quantization (as opposed to AutoGPTQ) with group_size = 32, desc_order = False (I have tried True as well with no luck). May I know if you pass any external parameters to LLM besides model_id and max_model_len? |
@mgoin Thanks for your response, could you test with TP=2? I tested locally and TP=1 produced reasonable results. |
After more testing, I don't think NaN is causing this. See https://huggingface.co/bartowski/Llama-3.1-Nemotron-70B-Instruct-HF-GGUF/discussions/5 for a similar problem where the issue was corrupted files, and that is not even a Qwen model. |
@manitadayon Could #13750 be related? |
@rainkert no, that PR is strictly fixing a bug with gptq_marlin for MoE layers |
may be related |
One thing I don't understand is that with whatever config I try (besides trying different data or a very large group size, which I can try later on), Qwen 72B-Instruct produces all !!!!!!!! as the output. |
may be related |
Just want to let you all know the problem is solved; it was the data type and an overflow issue with float16. |
We need to think of a systematic way to solve the float16 overflow problem |
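A sketch of one workaround along these lines, assuming your hardware supports bfloat16 (Ampere or newer): vLLM's LLM constructor accepts a dtype argument, so the GPTQ checkpoint can be run in bfloat16 instead of the float16 default. The model id below is the public Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4 checkpoint mentioned earlier; substitute your own path.

```python
from vllm import LLM

# Running the quantized model in bfloat16 gives fp32-like dynamic range,
# so activations that would overflow fp16 stay finite.
llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4",
    dtype="bfloat16",
    max_model_len=20000,
)
```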
may be related |
I'm the reporter of #14715 mentioned above. Sorry for the possible duplicate, "outputs all !!!!!" is hard to search for, and I expected it to be ROCm related. Key differences: I see it only on ROCm but not CUDA for the same setup. I also only see it for short prompts. Long prompts are successful. |
I think your problem is similar to #13232. Please help to try:
|
This prints true in the failing case:
If I put in more input (in my issue in #14715 it's input length dependent) I get good output and:
However, the patch you mention with Diff with suggested patch
|
Please help locate which part the NaN occurs in; you need to run in --enforce-eager mode. Check self_attn (vllm/vllm/model_executor/models/qwen2.py, line 243 in bb5b640)
or mlp (vllm/vllm/model_executor/models/qwen2.py, line 251 in bb5b640).
I believe the NaN comes from the MLP. You can also further locate the NaN in gate_up_proj or down_proj; one way to do this is sketched after this comment.
|
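One possible way to do this localization without editing vLLM's model code is with plain PyTorch forward hooks. A sketch under the assumptions that you can reach the underlying torch module and that --enforce-eager is used so the hooks actually fire:

```python
import torch

def register_nan_hooks(model: torch.nn.Module) -> None:
    # Attach a forward hook to the submodules named in the comment above and
    # report where a NaN shows up in their outputs.
    def make_hook(name: str):
        def hook(module, inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            if torch.is_tensor(out) and torch.isnan(out).any():
                print(f"NaN observed in output of {name}")
        return hook

    for name, module in model.named_modules():
        if name.endswith(("self_attn", "mlp", "gate_up_proj", "down_proj")):
            module.register_forward_hook(make_hook(name))
```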
During the query:
I include more of the gate_up_proj/down_proj output because of the repeating pattern (maybe I didn't let the above run long enough to see it): an alternating pattern of not-NaN, NaN.
|
Can adding cast_overflow_tensors in the MLP solve this problem temporarily? Specifically:
|
So I looked at the definition of
So I tried again. With that, and the original patch (the casts in self-attention) and/or the second suggestion (the casts with gate_up/down_proj), I no longer get the !!!! output. Sample of different garbage:
|
You should only clamp the value instead of setting it to 0. cast_overflow_tensors will cause a GPU sync, which will definitely be very, very slow. @jeejeelee @mgoin |
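A sketch of the distinction being pointed out: an unconditional clamp keeps everything on the GPU, whereas branching on an isnan/isinf check forces a device-to-host sync. The function names here are illustrative, not the actual vLLM helpers.

```python
import torch

def clamp_no_sync(x: torch.Tensor) -> torch.Tensor:
    # No .any()/.item() branch, so nothing is copied back to the CPU and the
    # CUDA stream never synchronizes; overflowed +/-inf values are pulled
    # back into the finite fp16 range instead of being zeroed out.
    if x.dtype == torch.float16:
        finfo = torch.finfo(torch.float16)
        x = torch.clamp(x, min=finfo.min, max=finfo.max)
    return x

def clamp_with_sync(x: torch.Tensor) -> torch.Tensor:
    # Branching on a data-dependent condition forces a GPU->CPU copy and
    # stalls the stream; this is the slow pattern being warned about.
    if torch.isinf(x).any() or torch.isnan(x).any():
        finfo = torch.finfo(x.dtype)
        x = torch.clamp(x, min=finfo.min, max=finfo.max)
    return x
```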
Your current environment
I performed GPTQ quantization on Qwen 72B instruct using AutoGPTQ package, with the following configuration:
group_size = 32, desc_order= 32.
Then I use the model inside the VLLM using the following configuration:
llm = LLM(model = model_path, max_model_len = 20000)
However, regardless of the prompt, the output is always !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
The same code works perfectly fine for Llama 3.3 and Llama 3.1 70B.
Is Qwen 2.5 72B not compatible with vLLM?
I have the latest versions of vLLM and Transformers, installed using
Any help would be appreciated.
🐛 Describe the bug
The output is always !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! no matter the input, the prompt, or other configuration.