
[Sampler] Adapt to FlashInfer 0.2.3 sampler API #15777


Merged

16 commits merged into vllm-project:main on May 16, 2025

Conversation

abmfy
Member

@abmfy abmfy commented Mar 30, 2025

FlashInfer 0.2.3 introduced some breaking changes to its sampler API; this PR updates the call sites in vLLM to adapt to the update.

FIX #14815
FIX #15666
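
For readers following along, here is a rough before/after sketch of the kind of call-site change involved. It is only an illustration under assumptions: the pre-0.2.3 form (pre-generated uniform samples plus a success mask) and the exact post-0.2.3 keyword arguments are reconstructed from memory of the FlashInfer docs, not copied from this PR's diff.

```python
import torch
import flashinfer.sampling as fs

# Toy batch of normalized probabilities (batch 4, vocab 32000).
probs = torch.softmax(torch.randn(4, 32000, device="cuda"), dim=-1)
top_k = torch.full((4,), 20, dtype=torch.int32, device="cuda")
top_p = torch.full((4,), 0.95, device="cuda")

# Before 0.2.3 (assumed old signature): the caller pre-generated uniform
# samples and checked a success mask returned by the rejection sampler.
#   uniform = torch.rand(32, 4, device="cuda")  # (rounds, batch)
#   samples, success = fs.top_k_top_p_sampling_from_probs(
#       probs, uniform, top_k, top_p, deterministic=True)

# From 0.2.3 on (assumed new signature): the kernel draws its own
# randomness and returns only the sampled token IDs.
samples = fs.top_k_top_p_sampling_from_probs(
    probs, top_k, top_p, deterministic=True)
print(samples.shape)  # (4,)
```

If an old call site also branched on the success mask, that fallback resampling path would go away with the new API.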


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which covers a small and essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Mar 30, 2025
@yzh119

yzh119 commented Mar 30, 2025

Hi @youkaichao, are the failed tests related to this PR?

I saw some error messages like this:


[2025-03-30T18:53:05Z] INFO 03-30 11:53:05 [backends.py:144] Compiling a graph for general shape takes 25.10 s
[2025-03-30T18:53:05Z] DEBUG 03-30 11:53:05 [backends.py:469] Computation graph saved to /root/.cache/vllm/torch_compile_cache/da20f97f50/rank_0_0/computation_graph.py
[2025-03-30T18:53:05Z] DEBUG 03-30 11:53:05 [wrapper.py:105] Dynamo transformed code saved to /root/.cache/vllm/torch_compile_cache/da20f97f50/rank_0_0/transformed_code.py
[2025-03-30T18:53:16Z] INFO 03-30 11:53:16 [monitor.py:33] torch.compile takes 30.91 s in total
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367] EngineCore hit an exception: Traceback (most recent call last):
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 355, in run_engine_core
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]     engine_core = EngineCoreProc(*args, **kwargs)
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 296, in __init__
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]     super().__init__(vllm_config, executor_class, log_stats)
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 67, in __init__
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]     num_gpu_blocks, num_cpu_blocks = self._initialize_kv_caches(
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 133, in _initialize_kv_caches
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]     get_kv_cache_config(vllm_config, kv_cache_spec_one_worker,
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 615, in get_kv_cache_config
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]     check_enough_kv_cache_memory(vllm_config, kv_cache_spec, available_memory)
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 489, in check_enough_kv_cache_memory
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]     raise ValueError(
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367] ValueError: To serve at least one request with the models's max seq len (8192), (0.81 GB KV cache is needed, which is larger than the available KV cache memory (0.55 GB). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]
[2025-03-30T18:53:17Z] CRITICAL 03-30 11:53:17 [core_client.py:319] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
[2025-03-30T18:53:17Z] bash: line 1:   510 Killed                  pytest -v -s basic_correctness/test_basic_correctness.py


Member

@youkaichao youkaichao left a comment


do you need to update the dockerfile to install the new flashinfer wheel?

@abmfy
Member Author

abmfy commented Mar 31, 2025

> do you need to update the dockerfile to install the new flashinfer wheel?

Let me give it a try.

@mergify mergify bot added the ci/build label Mar 31, 2025

mergify bot commented Apr 1, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @abmfy.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 1, 2025
@mgoin
Member

mgoin commented Apr 22, 2025

Can we revive this? I would like to update FlashInfer to the latest version now that we have it integrated with V1 as an attention backend.

@WoosukKwon
Collaborator

@mgoin +1. Let's update this.

@abmfy Sorry for the late review. Could you please add an accuracy test for the new kernel?
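
For context, here is a minimal sketch of what such an accuracy check could look like: draw repeatedly from the FlashInfer kernel and verify that every sampled token lies inside a torch-computed top-k set. This is illustrative only; the exact signature of top_k_sampling_from_probs is an assumption from the FlashInfer docs, and this is not the test that was added in this PR.

```python
import torch
import flashinfer.sampling as fs

def torch_top_k_mask(probs: torch.Tensor, k: int) -> torch.Tensor:
    # Boolean mask marking the k most probable tokens in each row.
    topk_idx = torch.topk(probs, k, dim=-1).indices
    mask = torch.zeros_like(probs, dtype=torch.bool)
    return mask.scatter(-1, topk_idx, True)

torch.manual_seed(0)
probs = torch.softmax(torch.randn(8, 1024, device="cuda"), dim=-1)
k = 50
mask = torch_top_k_mask(probs, k)
top_k = torch.full((8,), k, dtype=torch.int32, device="cuda")

# Draw many times; every sampled token must fall inside the top-k set.
for _ in range(100):
    samples = fs.top_k_sampling_from_probs(probs, top_k, deterministic=True)
    assert mask.gather(-1, samples.long().unsqueeze(-1)).all()
```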

@mergify mergify bot removed the needs-rebase label Apr 22, 2025
@abmfy
Member Author

abmfy commented Apr 23, 2025

Sure, I'll add some tests soon

@chenyang78
Contributor

FYI: synced with @mgoin @luccafong @houseroad offline; there was a numerical issue with FlashInfer 0.2.5. Fortunately, we found that the issue was already fixed upstream (flashinfer-ai/flashinfer#1029).

before the fix:

$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
vllm (pretrained=meta-llama/Llama-3.1-8B-Instruct,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.0675|±  |0.0069|
|     |       |strict-match    |     5|exact_match|↑  |0.0629|±  |0.0067|

after the fix:

$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7892|±  |0.0112|
|     |       |strict-match    |     5|exact_match|↑  |0.7650|±  |0.0117|

@mgoin
Member

mgoin commented Apr 29, 2025

Hi @abmfy, do you plan to have the test updates soon? We can help make them if you don't have time right now.

@abmfy
Member Author

abmfy commented Apr 30, 2025

> Hi @abmfy, do you plan to have the test updates soon? We can help make them if you don't have time right now.

Hi, sorry for the delayed response. I recently settled in the U.S. and have been dealing with limited bandwidth. I'll focus on adding the test today.


mergify bot commented May 2, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @abmfy.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 2, 2025
@abmfy abmfy force-pushed the flashinfer-sampling-api branch 2 times, most recently from 9a5040b to c159f4e Compare May 2, 2025 03:51
Collaborator

@WoosukKwon WoosukKwon left a comment


LGTM. Thanks a lot for the PR!

@WoosukKwon WoosukKwon enabled auto-merge (squash) May 14, 2025 15:47
# TESTING: install FlashInfer from source to test 2.7.0 final RC
FLASHINFER_ENABLE_AOT=1 TORCH_CUDA_ARCH_LIST='7.5 8.0 8.6 8.9 9.0+PTX' \
uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/[email protected]" ; \
uv pip install --system https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.4/flashinfer_python-0.2.4+cu124torch2.6-cp38-abi3-linux_x86_64.whl ; \
Member


I don't think we can use this wheel since it is built for cu124torch2.6. We need cu128 and torch 2.8

Member Author


Sure. However, FlashInfer 0.2.4 only provides wheels up to cu124torch2.6, so we may need to build from source in CI for now—at least until a new release of FlashInfer becomes available.

auto-merge was automatically disabled May 14, 2025 16:15

Head branch was pushed to by a user without write access

@abmfy
Member Author

abmfy commented May 14, 2025

Some of the v0 sampler tests seem to be failing because FlashInfer 0.2.3 removed the ability to pass pre-generated uniform samples to its kernels.

v1 sampler tests all passed. Will get back to this later.
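
For what it's worth, here is a minimal sketch of one way a test could keep reproducibility once uniform samples can no longer be injected. It assumes the kernel's internal randomness follows torch's global seed (if it instead takes an explicit generator argument, the seed would be threaded through that), and the sampling_from_probs signature is likewise an assumption; this is not necessarily the change that landed in this PR.

```python
import torch
import flashinfer.sampling as fs

def sample_with_seed(probs: torch.Tensor, seed: int) -> torch.Tensor:
    # With the uniform_samples argument gone, reproducibility has to come
    # from seeding the RNG the kernel draws from, rather than from
    # injecting fixed uniforms (assumption: the kernel honors this seed).
    torch.manual_seed(seed)
    return fs.sampling_from_probs(probs, deterministic=True)

probs = torch.softmax(torch.randn(2, 128, device="cuda"), dim=-1)
a = sample_with_seed(probs, seed=0)
b = sample_with_seed(probs, seed=0)
assert torch.equal(a, b)  # same seed, same draws
```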

@simon-mo simon-mo added this to the v0.9.0 milestone May 15, 2025
@simon-mo
Collaborator

@abmfy Thank you. Can you fix this PR? This is part of the release blocker now.

abmfy added 2 commits May 15, 2025 23:15
Since FlashInfer 0.2.3 removed the ability to pass in uniform samples.

Signed-off-by: Bowen Wang <[email protected]>
@abmfy
Member Author

abmfy commented May 16, 2025

> @abmfy Thank you. Can you fix this PR? This is part of the release blocker now.

Sure, I'll fix the tests that seem to be related to this PR.

@simon-mo simon-mo merged commit 7fdfa01 into vllm-project:main May 16, 2025
89 of 93 checks passed
markmc added a commit to markmc/vllm that referenced this pull request May 21, 2025
@DarkLight1337
Member

Thanks for the contribution. However, your PR appears to have broken many tests (see #18462); can those involved fix them to unblock the release?

@abmfy
Member Author

abmfy commented May 21, 2025

> Thanks for the contribution. However, your PR appears to have broken many tests (see #18462); can those involved fix them to unblock the release?

That's strange, I'll look into it.

@DarkLight1337
Member

See #18416 and #18417 for more info.

huachenheli pushed a commit to huachenheli/vllm that referenced this pull request May 22, 2025
Signed-off-by: Bowen Wang <[email protected]>
Co-authored-by: mgoin <[email protected]>
Signed-off-by: Chenheli Hua <[email protected]>
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
Signed-off-by: Bowen Wang <[email protected]>
Co-authored-by: mgoin <[email protected]>
Signed-off-by: Yuqi Zhang <[email protected]>
Labels: ci/build, ready, v1

Successfully merging this pull request may close these issues.

[Feature]: update to flashinfer 0.2.3
10 participants