
[Sampler] Adapt to FlashInfer 0.2.3 sampler API #15777


Merged

16 commits merged into vllm-project:main on May 16, 2025

Conversation

abmfy
Member

@abmfy abmfy commented Mar 30, 2025

FlashInfer 0.2.3 introduced some breaking changes to its sampler API; this PR updates the call sites in vLLM to adapt to the update.

FIX #14815
FIX #15666
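
For readers following along, here is a rough before/after sketch of the kind of call-site change involved. It is only an illustration under assumptions: the pre-0.2.3 form (pre-generated uniform samples plus a success mask) and the exact post-0.2.3 keyword arguments are reconstructed from memory of the FlashInfer docs, not copied from this PR's diff.

```python
import torch
import flashinfer.sampling as fs

# Toy batch of normalized probabilities (batch 4, vocab 32000).
probs = torch.softmax(torch.randn(4, 32000, device="cuda"), dim=-1)
top_k = torch.full((4,), 20, dtype=torch.int32, device="cuda")
top_p = torch.full((4,), 0.95, device="cuda")

# Before 0.2.3 (assumed old signature): the caller pre-generated uniform
# samples and checked a success mask returned by the rejection sampler.
#   uniform = torch.rand(32, 4, device="cuda")  # (rounds, batch)
#   samples, success = fs.top_k_top_p_sampling_from_probs(
#       probs, uniform, top_k, top_p, deterministic=True)

# From 0.2.3 on (assumed new signature): the kernel draws its own
# randomness and returns only the sampled token IDs.
samples = fs.top_k_top_p_sampling_from_probs(
    probs, top_k, top_p, deterministic=True)
print(samples.shape)  # (4,)
```

If an old call site also branched on the success mask, that fallback resampling path would go away with the new API.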


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which covers a small and essential subset of CI tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Mar 30, 2025
@yzh119

yzh119 commented Mar 30, 2025

Hi @youkaichao, are the failed tests related to this PR?

I saw some error messages like this:


[2025-03-30T18:53:05Z] INFO 03-30 11:53:05 [backends.py:144] Compiling a graph for general shape takes 25.10 s
[2025-03-30T18:53:05Z] DEBUG 03-30 11:53:05 [backends.py:469] Computation graph saved to /root/.cache/vllm/torch_compile_cache/da20f97f50/rank_0_0/computation_graph.py
[2025-03-30T18:53:05Z] DEBUG 03-30 11:53:05 [wrapper.py:105] Dynamo transformed code saved to /root/.cache/vllm/torch_compile_cache/da20f97f50/rank_0_0/transformed_code.py
[2025-03-30T18:53:16Z] INFO 03-30 11:53:16 [monitor.py:33] torch.compile takes 30.91 s in total
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367] EngineCore hit an exception: Traceback (most recent call last):
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 355, in run_engine_core
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]     engine_core = EngineCoreProc(*args, **kwargs)
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 296, in __init__
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]     super().__init__(vllm_config, executor_class, log_stats)
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 67, in __init__
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]     num_gpu_blocks, num_cpu_blocks = self._initialize_kv_caches(
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 133, in _initialize_kv_caches
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]     get_kv_cache_config(vllm_config, kv_cache_spec_one_worker,
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 615, in get_kv_cache_config
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]     check_enough_kv_cache_memory(vllm_config, kv_cache_spec, available_memory)
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 489, in check_enough_kv_cache_memory
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]     raise ValueError(
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367] ValueError: To serve at least one request with the models's max seq len (8192), (0.81 GB KV cache is needed, which is larger than the available KV cache memory (0.55 GB). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
[2025-03-30T18:53:17Z] ERROR 03-30 11:53:17 [core.py:367]
[2025-03-30T18:53:17Z] CRITICAL 03-30 11:53:17 [core_client.py:319] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
[2025-03-30T18:53:17Z] bash: line 1:   510 Killed                  pytest -v -s basic_correctness/test_basic_correctness.py


Member

@youkaichao youkaichao left a comment


do you need to update the dockerfile to install the new flashinfer wheel?

@abmfy
Member Author

abmfy commented Mar 31, 2025

> do you need to update the dockerfile to install the new flashinfer wheel?

Let me give it a try.

@mergify mergify bot added the ci/build label Mar 31, 2025

mergify bot commented Apr 1, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @abmfy.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 1, 2025
@mgoin
Member

mgoin commented Apr 22, 2025

Can we revive this? I would like to update FlashInfer to the latest version now that we have it integrated with V1 as an attention backend.

@WoosukKwon
Collaborator

@mgoin +1. Let's update this.

@abmfy Sorry for the late review. Could you please add an accuracy test for the new kernel?
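
For context, here is a minimal sketch of what such an accuracy check could look like: draw repeatedly from the FlashInfer kernel and verify that every sampled token lies inside a torch-computed top-k set. This is illustrative only; the exact signature of top_k_sampling_from_probs is an assumption from the FlashInfer docs, and this is not the test that was added in this PR.

```python
import torch
import flashinfer.sampling as fs

def torch_top_k_mask(probs: torch.Tensor, k: int) -> torch.Tensor:
    # Boolean mask marking the k most probable tokens in each row.
    topk_idx = torch.topk(probs, k, dim=-1).indices
    mask = torch.zeros_like(probs, dtype=torch.bool)
    return mask.scatter(-1, topk_idx, True)

torch.manual_seed(0)
probs = torch.softmax(torch.randn(8, 1024, device="cuda"), dim=-1)
k = 50
mask = torch_top_k_mask(probs, k)
top_k = torch.full((8,), k, dtype=torch.int32, device="cuda")

# Draw many times; every sampled token must fall inside the top-k set.
for _ in range(100):
    samples = fs.top_k_sampling_from_probs(probs, top_k, deterministic=True)
    assert mask.gather(-1, samples.long().unsqueeze(-1)).all()
```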

@mergify mergify bot removed the needs-rebase label Apr 22, 2025
@abmfy
Member Author

abmfy commented Apr 23, 2025

Sure, I'll add some tests soon

@chenyang78
Contributor

FYI: synced with @mgoin @luccafong @houseroad offline; there was a numerical issue with FlashInfer 0.2.5. Fortunately, we found that the issue was already fixed upstream (flashinfer-ai/flashinfer#1029).

before the fix:

$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
vllm (pretrained=meta-llama/Llama-3.1-8B-Instruct,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.0675|±  |0.0069|
|     |       |strict-match    |     5|exact_match|↑  |0.0629|±  |0.0067|

after the fix:

$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7892|±  |0.0112|
|     |       |strict-match    |     5|exact_match|↑  |0.7650|±  |0.0117|

@mgoin
Member

mgoin commented Apr 29, 2025

Hi @abmfy, do you plan to have the test updates soon? We can help make them if you don't have time right now.

@abmfy
Member Author

abmfy commented Apr 30, 2025

> Hi @abmfy, do you plan to have the test updates soon? We can help make them if you don't have time right now.

Hi, sorry for the delayed response. I recently settled in the U.S. and have been dealing with limited bandwidth. I'll focus on adding the test today.


mergify bot commented May 2, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @abmfy.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 2, 2025
@abmfy abmfy force-pushed the flashinfer-sampling-api branch 2 times, most recently from 9a5040b to c159f4e Compare May 2, 2025 03:51
Collaborator

@WoosukKwon WoosukKwon left a comment


LGTM. Thanks a lot for the PR!

@WoosukKwon WoosukKwon enabled auto-merge (squash) May 14, 2025 15:47
# TESTING: install FlashInfer from source to test 2.7.0 final RC
FLASHINFER_ENABLE_AOT=1 TORCH_CUDA_ARCH_LIST='7.5 8.0 8.6 8.9 9.0+PTX' \
uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/[email protected]" ; \
uv pip install --system https://github.com/flashinfer-ai/flashinfer/releases/download/v0.2.4/flashinfer_python-0.2.4+cu124torch2.6-cp38-abi3-linux_x86_64.whl ; \
Member


I don't think we can use this wheel since it is built for cu124torch2.6. We need cu128 and torch 2.8

Member Author


Sure. However, FlashInfer 0.2.4 only provides wheels up to cu124torch2.6, so we may need to build from source in CI for now—at least until a new release of FlashInfer becomes available.

auto-merge was automatically disabled May 14, 2025 16:15

Head branch was pushed to by a user without write access

@abmfy
Member Author

abmfy commented May 14, 2025

Some of the v0 sampler tests seem to be failing because FlashInfer 0.2.3 removed the ability to pass pre-generated uniform samples to its kernels.

v1 sampler tests all passed. Will get back to this later.
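
For what it's worth, here is a minimal sketch of one way a test could keep reproducibility once uniform samples can no longer be injected. It assumes the kernel's internal randomness follows torch's global seed (if it instead takes an explicit generator argument, the seed would be threaded through that), and the sampling_from_probs signature is likewise an assumption; this is not necessarily the change that landed in this PR.

```python
import torch
import flashinfer.sampling as fs

def sample_with_seed(probs: torch.Tensor, seed: int) -> torch.Tensor:
    # With the uniform_samples argument gone, reproducibility has to come
    # from seeding the RNG the kernel draws from, rather than from
    # injecting fixed uniforms (assumption: the kernel honors this seed).
    torch.manual_seed(seed)
    return fs.sampling_from_probs(probs, deterministic=True)

probs = torch.softmax(torch.randn(2, 128, device="cuda"), dim=-1)
a = sample_with_seed(probs, seed=0)
b = sample_with_seed(probs, seed=0)
assert torch.equal(a, b)  # same seed, same draws
```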

@simon-mo simon-mo added this to the v0.9.0 milestone May 15, 2025
@simon-mo
Collaborator

@abmfy Thank you. Can you fix this PR? This is part of the release blocker now.

abmfy added 2 commits May 15, 2025 23:15
Since FlashInfer 0.2.3 removed the ability to pass in uniform samples.

Signed-off-by: Bowen Wang <[email protected]>
@abmfy
Member Author

abmfy commented May 16, 2025

> @abmfy Thank you. Can you fix this PR? This is part of the release blocker now.

Sure, I'll fix the tests that seem to be related to this PR.

@simon-mo simon-mo merged commit 7fdfa01 into vllm-project:main May 16, 2025
89 of 93 checks passed
markmc added a commit to markmc/vllm that referenced this pull request May 21, 2025
@DarkLight1337
Member

Thanks for the contribution. However, your PR appears to have broken many tests (see #18462); can those involved fix them to unblock the release?

@abmfy
Member Author

abmfy commented May 21, 2025

> Thanks for the contribution. However, your PR appears to have broken many tests (see #18462); can those involved fix them to unblock the release?

That's strange, I'll look into it.

@DarkLight1337
Member

See #18416 and #18417 for more info.

huachenheli pushed a commit to huachenheli/vllm that referenced this pull request May 22, 2025
Signed-off-by: Bowen Wang <[email protected]>
Co-authored-by: mgoin <[email protected]>
Signed-off-by: Chenheli Hua <[email protected]>
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
Signed-off-by: Bowen Wang <[email protected]>
Co-authored-by: mgoin <[email protected]>
Signed-off-by: Yuqi Zhang <[email protected]>
Labels: ci/build, ready, v1

Successfully merging this pull request may close these issues.

[Feature]: update to flashinfer 0.2.3
10 participants