[Frontend] Add backend-specific options for guided decoding #13505

joerunde · 2025-02-19T00:19:42Z

This PR extends the guided decoding backend name to support a list of backend-specific options. These are specified as a comma-separated list after a colon. This lets us easily add support for backend-specific features, like xgrammar:json-any-whitespace.
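As a rough illustration of the `name:option,option` format described above, here is a minimal sketch of how such a string could be split. The helper name `parse_backend_spec` is hypothetical and is not taken from the vLLM code in this PR.

```python
from typing import List, Tuple


def parse_backend_spec(spec: str) -> Tuple[str, List[str]]:
    """Split 'name:opt1,opt2' into ('name', ['opt1', 'opt2']).

    Hypothetical helper for illustration only; the actual parsing in
    vLLM may differ.
    """
    # Everything before the first colon is the backend name.
    name, _, raw_options = spec.partition(":")
    # Options are comma separated; empty entries are ignored.
    options = [opt.strip() for opt in raw_options.split(",") if opt.strip()]
    return name.strip(), options


print(parse_backend_spec("xgrammar"))
print(parse_backend_spec("xgrammar:no-fallback,json-any-whitespace"))
```

A bare backend name yields an empty option list, so existing values such as `xgrammar` keep working unchanged under this scheme.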

Specifically, this PR also implements the no-fallback option. When it is set, vLLM returns a clearly formatted 400 response explaining that the selected guided decoding backend does not support the request, instead of silently switching to a different backend. This is useful for deployments that want to support only a single guided decoding backend.
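The no-fallback behaviour can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: the `SUPPORTED` capability table and the `resolve_backend` function are hypothetical, and the translation of the `ValueError` into an HTTP 400 response by the server is not shown.

```python
from typing import Dict, List, Set

# Hypothetical capability table, for illustration only.
SUPPORTED: Dict[str, Set[str]] = {
    "xgrammar": {"json", "grammar"},
    "outlines": {"json", "grammar", "regex"},
}


def resolve_backend(backend: str, options: List[str], feature: str) -> str:
    """Return the backend that will serve a request for `feature`."""
    if feature in SUPPORTED[backend]:
        return backend
    if "no-fallback" in options:
        # Fail loudly instead of silently switching backends.
        raise ValueError(
            f"{backend} does not support {feature} guided decoding")
    # Default behaviour: fall back to any backend that supports the feature.
    for candidate, features in SUPPORTED.items():
        if feature in features:
            return candidate
    raise ValueError(f"no backend supports {feature} guided decoding")


print(resolve_backend("xgrammar", [], "regex"))
try:
    resolve_backend("xgrammar", ["no-fallback"], "regex")
except ValueError as err:
    print(f"rejected: {err}")
```

Raising at the point where the fallback would otherwise happen keeps the default path untouched for users who do not set the option.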

One motivation for this: a product that intended to support only the xgrammar backend had a user find a query that fell back to outlines and then caused outlines to hang indefinitely. (See dottxt-ai/outlines-core#180)

Works around #12005 😉

Signed-off-by: Joe Runde <[email protected]>

github-actions · 2025-02-19T00:19:54Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

markmc · 2025-02-19T15:08:22Z

vllm/envs.py

@@ -585,6 +586,8 @@ def maybe_convert_int(value: Optional[str]) -> Optional[int]:
    # specify the path through environment variable VLLM_CUDART_SO_PATH.
    "VLLM_CUDART_SO_PATH":
    lambda: os.getenv("VLLM_CUDART_SO_PATH", None),
+    "VLLM_DISABLE_GUIDED_DECODING_FALLBACK":
+    lambda: bool(int(os.getenv("VLLM_DISABLE_GUIDED_DECODING_FALLBACK", "0"))),


Have we an established pattern on what should be config vs env variables? Why wouldn't this be in DecodingConfig? Maybe we could encode "don't fallback" in something like --guided-decoding-backend=outlines:nofallback if we were worried about a proliferation of CLI arguments.

I'm okay either way, but I do think Mark's suggestion would be nicer. I like calling more attention to the --guided-decoding-backend argument if users want to be explicit about their backend.

i like the cli arg approach ... before seeing this comment I was thinking about another "backend" like xgrammar-only or something like that. xgrammar:nofallback leaves it open to a bit more flexibility to specify additional options if necessary later, like xgrammar:nofallback,json-any-whitespace to support the case covered in #12744

Have we an established pattern on what should be config vs env variables?

Yeah ... I keep thinking about this. It's going to be a big project, but we're due for significant cleanup here. I'd really like a system that supports both config files and command line args (and fewer env vars, unless they're just an alternative for setting the same set of options).

... but I have no idea when that's going to feel like the most important thing to work on!

Have we an established pattern on what should be config vs env variables?

Yeah ... I keep thinking about this.

Me too. I like to think that environment variables change the 'behavior' of the system, such as enabling a deprecated, experimental, or workaround feature, while config covers the other 'common features' that are up to users to set or tune for their environment.

However, there is also something relevant to this discussion. When we set something in the config, we have a chance to log the system setup, like the log Initializing a V0 LLM engine (v%s) with config: [...]. Sometimes it is tricky to get the exact setup of the system when we get a crash and the only thing we have is a stack trace (which may be truncated as well 😄).

We should probably prefer args over envs, but when it makes sense to use envs, we could log to users (at least once) that vLLM has this feature on and what its implications are for the system.

Either way: I like the idea of --guided-decoding-backend=outlines:nofallback for this PR. And I'm pretty sure that it would be logged at system initialization, which is nice for debugging purposes.

Thanks for the discussion everybody 👍

I stuck this in the environment to avoid the proliferation of cli args, but I love the suggestion of encoding the fallback behavior in the name of the backend. Best of both worlds!

I'll update the implementation

Code is updated if y'all wanna take a second look 🙏

mergify · 2025-02-19T15:08:57Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @joerunde.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

russellb · 2025-02-20T14:21:47Z

tests/entrypoints/llm/test_guided_generate.py

+
+    with pytest.raises(
+            ValueError,
+            match="xgrammar does not support regex guided decoding"):


It will soon! :-). #13228

This is totally fine, though. I can tweak this once we're able to turn it on.

russellb

looks great to me! Thank you!

It would be good to update the text that shows up under vllm serve --help as well.

...
  --guided-decoding-backend {outlines,lm-format-enforcer,xgrammar}
                        Which engine will be used for guided decoding (JSON schema / regex etc) by default. Currently
                        support https://github.com/outlines-dev/outlines, https://github.com/mlc-ai/xgrammar, and
                        https://github.com/noamgat/lm-format-enforcer. Can be overridden per request via
                        guided_decoding_backend parameter.
...

joerunde · 2025-02-20T16:30:22Z

@russellb good catch, the CLI parser needed a bit of relaxing too.
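For readers wondering what "relaxing" a parser can look like: an `argparse` argument restricted with `choices` would reject a value such as `xgrammar:no-fallback`. One possible approach, sketched below, is to validate only the part before the colon with a custom `type` callable. The `backend_spec` function is hypothetical and not necessarily what this PR does; the backend names are taken from the help text quoted above.

```python
import argparse

KNOWN_BACKENDS = ("outlines", "lm-format-enforcer", "xgrammar")


def backend_spec(value: str) -> str:
    """Accept 'name' or 'name:opt1,opt2' where name is a known backend.

    Hypothetical validator for illustration only.
    """
    name = value.partition(":")[0]
    if name not in KNOWN_BACKENDS:
        raise argparse.ArgumentTypeError(
            f"unknown guided decoding backend {name!r}; "
            f"expected one of {', '.join(KNOWN_BACKENDS)}")
    # Return the full string so the options stay available downstream.
    return value


parser = argparse.ArgumentParser()
parser.add_argument("--guided-decoding-backend", type=backend_spec)
args = parser.parse_args(["--guided-decoding-backend", "xgrammar:no-fallback"])
print(args.guided_decoding_backend)
```

Validating the name while passing the whole string through keeps typos in the backend name an early CLI error, while leaving option validation to the backend itself.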

@wallashss This now boots and logs the backend and option(s):

$ vllm serve Qwen/Qwen2.5-3B-Instruct --guided-decoding-backend xgrammar:no-fallback
... 
INFO 02-20 16:27:30 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.2.dev56+g49ad2b1b7.d20250211) with config: model='Qwen/Qwen2.5-3B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar:no-fallback'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-3B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=True,

mgoin

Nice work, LGTM!

🔧 Add env var to disable guided decoding fallbacks

65cff01

Signed-off-by: Joe Runde <[email protected]>

joerunde requested review from DarkLight1337, robertgshaw2-redhat, simon-mo and mgoin as code owners February 19, 2025 00:19

mergify bot added the structured-output label Feb 19, 2025

markmc reviewed Feb 19, 2025

mergify bot added the needs-rebase label Feb 19, 2025

russellb mentioned this pull request Feb 19, 2025

[Bugfix] Backend option to disable xgrammar any_whitespace #12744

Merged

mgoin self-assigned this Feb 19, 2025

joerunde added 2 commits February 19, 2025 14:29

⏪ revert envs change

c971da7

Signed-off-by: Joe Runde <[email protected]>

✨ add guided decoding backend options

15cac0c

Signed-off-by: Joe Runde <[email protected]>

mergify bot removed the needs-rebase label Feb 19, 2025

joerunde added 3 commits February 19, 2025 15:07

🐛 handle missing backend name

f9d0e9d

Signed-off-by: Joe Runde <[email protected]>

🐛 fixup options

c64df44

Signed-off-by: Joe Runde <[email protected]>

📝 add docs and example

a8e73c3

Signed-off-by: Joe Runde <[email protected]>

mergify bot added the documentation Improvements or additions to documentation label Feb 19, 2025

joerunde changed the title ~~[Frontend] Add environment variable to disable guided decoding fallbacks~~ [Frontend] Add backend-specific options for guided decoding Feb 20, 2025

joerunde added the ready ONLY add when PR is ready to merge/full CI is needed label Feb 20, 2025

russellb reviewed Feb 20, 2025

russellb approved these changes Feb 20, 2025

✨ add CLI support

85b1558

Signed-off-by: Joe Runde <[email protected]>

wallashss added a commit to wallashss/vllm that referenced this pull request Feb 20, 2025

updated to use backend options from vllm-project#13505

2c29f4e

Signed-off-by: Wallas Santos <[email protected]>

aarnphm approved these changes Feb 20, 2025

mgoin approved these changes Feb 20, 2025

mgoin merged commit bfbc0b3 into vllm-project:main Feb 20, 2025
48 checks passed

joerunde deleted the no-gd-fallback branch February 20, 2025 20:22

Akshat-Tripathi pushed a commit to krai/vllm that referenced this pull request Mar 3, 2025

[Frontend] Add backend-specific options for guided decoding (vllm-pro…

07f3c1e

…ject#13505) Signed-off-by: Joe Runde <[email protected]>

lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025

[Frontend] Add backend-specific options for guided decoding (vllm-pro…

0a41517

…ject#13505) Signed-off-by: Joe Runde <[email protected]> Signed-off-by: Louis Ulmer <[email protected]>

ckhordiasma mentioned this pull request Apr 17, 2025

[do not merge] pr test for nm changes into 2.20 red-hat-data-services/vllm#107

Closed

shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025

[Frontend] Add backend-specific options for guided decoding (vllm-pro…

5a84d61

…ject#13505) Signed-off-by: Joe Runde <[email protected]>
