
[Perf] Optimize M-RoPE position preparation performance with numba #16881


Open · wants to merge 13 commits into main from feat/mrope-pos-perf
Conversation

imkero
Contributor

@imkero imkero commented Apr 19, 2025

What this PR does

This PR optimizes M-RoPE position sequence generation for Qwen2/2.5-VL/Omni using numba.

  • Optimizing get_input_positions reduces the CPU overhead of scheduling a new request for these models, especially with prefix caching: M-RoPE position ids are not cached, so the full position sequence must be regenerated even when the prefix key-values are already cached.
  • Optimizing get_next_input_positions reduces the CPU overhead of scheduling new tokens for running requests. (Many thanks to @vadiklyutiy for pointing out the time consumption of get_next_input_positions_tensor.)

The rewritten numba implementations are also more readable, because they can be written in an element-by-element style and can freely call sub-routines.
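As an illustration of that style (a minimal sketch, not the PR's actual code; names are hypothetical), numba lets the position filling be written as plain nested loops that compile to native code:

```python
import numpy as np
from numba import njit


@njit(cache=True)
def fill_text_positions(positions: np.ndarray, start: int) -> None:
    # positions has shape (3, num_tokens); for plain text tokens the
    # temporal / height / width components share the same running index.
    num_tokens = positions.shape[1]
    for i in range(num_tokens):
        for dim in range(3):
            positions[dim, i] = start + i


positions = np.empty((3, 8), dtype=np.int64)
fill_text_positions(positions, 5)
print(positions)  # each of the 3 rows is [5, 6, ..., 12]
```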

Besides:

  • To make the code clearer and easier to read and test, the M-RoPE position code is moved from rotary_embedding.py::MRotaryEmbedding into a new, separate file mrope_positions.py (it is not really related to the rotary-embedding nn modules).
  • To fully utilize numba, the V1 GPU model runner now calls get_input_positions for M-RoPE later, once the list[int] input_ids have been converted to a numpy ndarray, because numba operates on numpy arrays quickly but not on Python builtin lists (see the sketch after this list).
  • The numba dependency has been moved to requirements/common.txt.
  • Unit tests are added to check the functionality and correctness of the new numba implementation.
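A hedged sketch of the call-order change from the second bullet (compute_mrope_positions is a placeholder name, not the PR's actual function):

```python
import numpy as np
from numba import njit


@njit
def compute_mrope_positions(token_ids: np.ndarray) -> np.ndarray:
    # Toy stand-in: emit one (t, h, w) position triple per token.
    out = np.empty((3, token_ids.shape[0]), dtype=np.int64)
    for i in range(token_ids.shape[0]):
        for dim in range(3):
            out[dim, i] = i
    return out


token_ids_list = [101, 102, 103, 104]                   # what the scheduler holds
token_ids = np.asarray(token_ids_list, dtype=np.int64)  # convert once to ndarray...
positions = compute_mrope_positions(token_ids)          # ...then call the numba routine
```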

Notes

  • The numba-optimized functions are lazily JIT-compiled; they are expected to compile the first time they are called, during profiling.
  • The original torch implementation of MRotaryEmbedding::get_input_positions_tensor is kept for reference and for unit tests; a use_numba kwarg controls the dispatch (use_numba defaults to True), as sketched below.
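A hedged sketch of what such a dispatch can look like (signatures and helper names are assumptions, not the PR's code):

```python
import numpy as np
from numba import njit


@njit
def _positions_numba(num_tokens: int) -> np.ndarray:
    out = np.empty(num_tokens, dtype=np.int64)
    for i in range(num_tokens):
        out[i] = i
    return out


def _positions_torch_reference(num_tokens: int) -> np.ndarray:
    # Stand-in for the original torch implementation kept for tests.
    return np.arange(num_tokens, dtype=np.int64)


def get_input_positions(num_tokens: int, use_numba: bool = True) -> np.ndarray:
    # Default to the numba path; unit tests can pass use_numba=False to
    # compare against the reference implementation.
    if use_numba:
        return _positions_numba(num_tokens)
    return _positions_torch_reference(num_tokens)


# The first call triggers JIT compilation; doing it during the profiling /
# warmup phase keeps that one-time cost out of the serving path.
get_input_positions(1)
```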

Benchmark

Piecewise get_input_positions

benchmark script: https://gist.github.com/imkero/5329df6fe2929ff7de210e689889036d

The following results were measured on an AMD EPYC 7763 (4 cores) with vLLM V1.

Timings are for the mrope_get_input_positions_and_delta function.
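For reference, a minimal harness in the spirit of the linked gist (this is an assumed sketch, not the gist itself); note the warm-up call, so numba's one-time JIT compilation is excluded from the measurement:

```python
import time


def bench_ms(fn, *args, repeats: int = 100) -> float:
    fn(*args)  # warm-up call: triggers numba JIT compilation once
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats * 1e3  # mean ms per call
```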

Qwen2/2.5-VL

| test case | num tokens | main branch | this PR |
| --- | --- | --- | --- |
| plain text | 360 tokens | 0.130 ms | 0.007 ms |
| 1 image | 324 image tokens | 0.232 ms | 0.007 ms |
| 10 images | 3240 image tokens | 1.581 ms | 0.013 ms |
| 1 large image | 1296 image tokens | 0.364 ms | 0.008 ms |
| 1 video (20 frames) | 3240 video tokens | 0.584 ms | 0.011 ms |
| 1 video (200 frames) | 32400 video tokens | 4.451 ms | 0.047 ms |

Qwen2.5-Omni

| test case | num tokens | main branch | this PR |
| --- | --- | --- | --- |
| plain text | 360 tokens | 4.757 ms | 0.010 ms |
| 1 image | 324 image tokens | 0.271 ms | 0.010 ms |
| 10 images | 3240 image tokens | 2.073 ms | 0.016 ms |
| 1 large image | 1296 image tokens | 0.268 ms | 0.011 ms |
| 1 video (20 frames) | 3240 video tokens | 0.277 ms | 0.014 ms |
| 1 video (200 frames) | 32400 video tokens | 0.535 ms | 0.050 ms |
| 1 audio | 200 audio tokens | 0.149 ms | 0.018 ms |
| 1 video with audio | 1296 video tokens + 200 audio tokens | 6.742 ms | 0.012 ms |

Piecewise get_next_input_positions

benchmark script: https://gist.github.com/imkero/9ba1cd44e1d1ba53c4b2f3e00f6a2363

result:

```
run_torch:     1.990 s
run_numba:     0.066 s
run_torch_lru: 0.872 s
```

E2E (Qwen2.5-VL-3B)

Tested on NVIDIA A10

benchmark command:

```bash
vllm serve Qwen/Qwen2.5-VL-3B-Instruct --disable-log-requests --max-num-seqs 128 --block-size 16 --max-num-batched-tokens 8192 --max-model-len 8192 --no-enable-prefix-caching
fib benchmark -rps 50 --input-token-distribution uniform 200 200 --output-token-distribution uniform 200 200 --num-of-imgs-per-req 1 --img-ratios-per-req 512x512 -n 1024 --base-url http://localhost:8000 --endpoint v1/chat/completions --backend openai-chat
```
| code base | duration (s) | req/s | input tokens/s | output tokens/s |
| --- | --- | --- | --- | --- |
| main branch | 172.86 | 5.92 | 3231.29 | 1184.74 |
| PR #17617 🚀 | 171.74 | 5.96 | 3252.25 | 1192.53 |
| this PR (input_positions optimized) | 172.63 | 5.93 | 3235.66 | 1186.34 |
| this PR 🚀🚀 (input_positions and next_input_positions optimized) | 170.06 | 6.02 | 3284.35 | 1204.25 |


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@imkero imkero marked this pull request as draft April 19, 2025 21:24
@mergify mergify bot added the v1 label Apr 19, 2025

mergify bot commented Apr 24, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @imkero.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@WoosukKwon
Collaborator

@imkero Thanks for the PR! This is amazing 🚀
Could you please resolve the merge conflict and the lint error?

@imkero
Contributor Author

imkero commented Apr 26, 2025

> @imkero Thanks for the PR! This is amazing 🚀
> Could you please resolve the merge conflict and the lint error?

Sure! I will update this PR soon.

TODO:

  • fix merge conflicts & lint error
  • complete the PR description
  • enhance the error messages in assertions
  • add unit tests for assertions
  • refresh the benchmark result
  • e2e correctness test

I will also share my experience of using numba later.

@imkero imkero force-pushed the feat/mrope-pos-perf branch from abb1d2d to fc1c397 Compare April 27, 2025 16:57
Signed-off-by: imkero <[email protected]>
@imkero imkero force-pushed the feat/mrope-pos-perf branch from 61b5d69 to 59ee3c4 Compare April 27, 2025 17:32
imkero added 2 commits April 27, 2025 17:52
Signed-off-by: imkero <[email protected]>
Signed-off-by: imkero <[email protected]>
@WoosukKwon
Collaborator

@imkero Just so you know: To fix the CI failure, we should move numba from requirements/cuda.txt and requirements/rocm.txt to requirements/common.txt.

@mergify mergify bot added the ci/build label Apr 29, 2025
imkero added 2 commits April 29, 2025 20:09
Signed-off-by: imkero <[email protected]>
Signed-off-by: imkero <[email protected]>
@imkero
Contributor Author

imkero commented May 1, 2025

> @imkero Just so you know: To fix the CI failure, we should move numba from requirements/cuda.txt and requirements/rocm.txt to requirements/common.txt.

Thanks for the reminder; I have moved it to requirements/common.txt now.

@imkero
Contributor Author

imkero commented May 1, 2025

@WoosukKwon I think this PR is ready for review now

@imkero imkero marked this pull request as ready for review May 1, 2025 16:17

mergify bot commented May 2, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @imkero.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 2, 2025
@imkero
Contributor Author

imkero commented May 5, 2025

I will take get_next_input_positions_tensor into consideration as well, because it is reported to be time-consuming in #17617.

@imkero
Contributor Author

imkero commented May 5, 2025

I have written an optimized version of get_next_input_positions_tensor that is zero-copy and needs no LRU cache.

Will add an e2e benchmark result like #17617.
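A hedged sketch of the zero-copy idea (names, shapes, and signature are assumptions, not the PR's code): write the next M-RoPE positions straight into a preallocated buffer instead of building a fresh tensor per step.

```python
import numpy as np
from numba import njit


@njit
def write_next_input_positions(out: np.ndarray, out_offset: int,
                               mrope_position_delta: int,
                               context_len: int, num_new_tokens: int) -> None:
    # out has shape (3, max_tokens); each newly decoded token gets position
    # delta + absolute token index, identical across the 3 M-RoPE components.
    for i in range(num_new_tokens):
        pos = mrope_position_delta + context_len + i
        for dim in range(3):
            out[dim, out_offset + i] = pos


out = np.zeros((3, 16), dtype=np.int64)
write_next_input_positions(out, 4, -2, 10, 3)  # fills out[:, 4:7] with 8, 9, 10
```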

@imkero imkero changed the title from [Perf] Optimize MRotaryEmbedding::get_input_positions performance by numba to [Perf] Optimize M-RoPE position preparation performance with numba May 5, 2025
Signed-off-by: imkero <[email protected]>
@imkero imkero force-pushed the feat/mrope-pos-perf branch from fa2cf59 to 0e7f6e0 Compare May 5, 2025 11:55
@imkero
Contributor Author

imkero commented May 5, 2025

This PR continues the idea of #17617. Thanks @vadiklyutiy

Could you please take a look? @ywang96

Signed-off-by: imkero <[email protected]>
@imkero imkero force-pushed the feat/mrope-pos-perf branch from cb1f02f to fdf5463 Compare May 6, 2025 05:59

mergify bot commented May 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @imkero.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 8, 2025
@mergify mergify bot removed the needs-rebase label May 10, 2025

mergify bot commented May 13, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @imkero.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 13, 2025