[Bug]: xgrammar crashes with speculative decoding #11484
Comments
Encountered the same issue with Llama 3.3 70B on v0.6.6.
Same issue here with qwen32b_AWQ, also on v0.6.6. However, for me it failed with guided decoding; I tried all three guided decoding options.
I see a script for starting vllm. Do you also have a sample API request that demonstrates the problem?
For me, any guided json decoding fails whenever speculative ngram decoding is enabled. Here's an example:

```python
system_prompt = """
Fill the following json schema for a character creator in D&D:
{
"name": "string",
"race": "string",
"class": "string",
"level": "int",
"background": "string",
"alignment": "string",
"backstory": "string"
}
"""
user_prompt = "Make me a bunch of characters from Jason Bourne movies. Output a list (array) of character json objects."
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt}
]
json_schema_multiple = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "race": {"type": "string"},
            "class": {"type": "string"},
            "level": {"type": "integer"},
            "background": {"type": "string"},
            "alignment": {"type": "string"},
            "backstory": {"type": "string"}
        },
        "required": ["name", "race", "class", "level", "background", "alignment", "backstory"]
    }
}
result = client.chat.completions.create(
    model=model_id,
    messages=messages,
    max_tokens=3000,
    temperature=0.0,
    stream=False,
    extra_body={"guided_json": json_schema_multiple}
)
```

The client is initialised as usual and the serve command is:

```shell
vllm serve "Qwen/Qwen2.5-7B-Instruct-AWQ" \
    --max-model-len 4096 \
    --dtype "auto" \
    --trust-remote-code \
    --enable-prefix-caching \
    --speculative-model "[ngram]" --num-speculative-tokens 5 --ngram-prompt-lookup-max 4
```

Tried on both.
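"The client is initialised as usual" would look roughly like this — a minimal sketch assuming a default local vLLM server (the URL, port, and placeholder API key are assumptions, not taken from the report):

```python
from openai import OpenAI

# Point the OpenAI-compatible client at the local vLLM server started above.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed default vllm serve address
    api_key="EMPTY",                      # placeholder; only checked if the server uses --api-key
)
model_id = "Qwen/Qwen2.5-7B-Instruct-AWQ"  # must match the served model name
```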
Thanks! It's easy to reproduce with this example. I'll look into it.
I've looked into this and it seems the two features just fundamentally do not work together. I'm going to update the feature compatibility matrix in the docs to reflect this.
Update the feature compatibility matrix to reflect that speculative decoding and structured output do not currently work together. Related to issue vllm-project#11484.
Signed-off-by: Russell Bryant <[email protected]>
@russellb thanks for looking into it. Do you think it would be possible to disable speculative decoding whenever a guided request is submitted? In some pipelines it would be nice to still have it available for plain generation requests.
That's a good idea. My first thought was to at least make vllm behave better, perhaps respond with a 400 error of some type instead of just crashing! Your idea sounds like a good step to look at after that.
PR #12484 will make vllm handle this failure more gracefully. vllm won't crash and you'll get a "409 Conflict" response from the API server.
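A rough client-side sketch of how that could be handled once the change lands — it assumes the openai Python client, which exposes the HTTP status via `APIStatusError.status_code`; the fallback behaviour here is only an illustration, not something vLLM does for you:

```python
import openai

def guided_chat(client, model_id, messages, schema):
    """Sketch: fall back to an unconstrained request if the server returns 409 Conflict."""
    try:
        return client.chat.completions.create(
            model=model_id,
            messages=messages,
            extra_body={"guided_json": schema},
        )
    except openai.APIStatusError as e:
        if e.status_code == 409:
            # The server rejected the feature combination (e.g. guided decoding while
            # speculative decoding is enabled); retry without the schema constraint.
            return client.chat.completions.create(model=model_id, messages=messages)
        raise
```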
@russellb I noticed that xgrammar crashes with speculative decoding, but outlines works (though it's slower). Another solution would be to switch to outlines when needed to keep everything working. This is already done for certain features that xgrammar doesn't support, like list selection.
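For anyone needing a workaround in the meantime, a sketch of forcing the outlines backend — the per-request `guided_decoding_backend` override and the `--guided-decoding-backend` flag are assumptions about the installed vLLM version, so check `vllm serve --help` for the exact option names:

```python
# Ask vLLM to use outlines for this guided request instead of the default backend.
result = client.chat.completions.create(
    model=model_id,
    messages=messages,
    extra_body={
        "guided_json": json_schema_multiple,
        "guided_decoding_backend": "outlines",  # assumed per-request override
    },
)
# Alternatively (assumed flag), set it server-wide at launch:
#   vllm serve ... --guided-decoding-backend outlines
```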
Your current environment
The output of `python collect_env.py`
Model Input Dumps
No response
🐛 Describe the bug
When I use xgrammar as the guided decoding backend, it crashes with speculative decoding. It works well without speculative decoding.
shell script:
output: