
[Perf] Optimize M-RoPE position preparation performance with numba #16881


Open · wants to merge 13 commits into main from feat/mrope-pos-perf
Conversation

imkero
Contributor

@imkero imkero commented Apr 19, 2025

What this PR does

This PR optimizes M-RoPE position sequence generation for Qwen2/2.5-VL/Omni using numba.

  • Optimizing get_input_positions reduces the CPU overhead of scheduling a new request for these models, especially with prefix caching: M-RoPE position ids are not cached, so the full position sequence must be regenerated even when the prefix key-values are already cached.
  • Optimizing get_next_input_positions reduces the CPU overhead of scheduling new tokens for running requests. (Many thanks to @vadiklyutiy for pointing out the time consumption of get_next_input_positions_tensor.)

The rewritten numba implementations are also more readable, because they can be written in an element-by-element style and can freely call sub-routines.
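As an illustration of that style (a minimal sketch, not the PR's actual code; names are hypothetical), numba lets the position filling be written as plain nested loops that compile to native code:

```python
import numpy as np
from numba import njit


@njit(cache=True)
def fill_text_positions(positions: np.ndarray, start: int) -> None:
    # positions has shape (3, num_tokens); for plain text tokens the
    # temporal / height / width components share the same running index.
    num_tokens = positions.shape[1]
    for i in range(num_tokens):
        for dim in range(3):
            positions[dim, i] = start + i


positions = np.empty((3, 8), dtype=np.int64)
fill_text_positions(positions, 5)
print(positions)  # each of the 3 rows is [5, 6, ..., 12]
```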

Besides:

  • To make the code clearer and easier to read and test, the M-RoPE position code is moved from rotary_embedding.py::MRotaryEmbedding into a new, separate file mrope_positions.py (it is not really related to the rotary-embedding nn modules).
  • To fully utilize numba, the V1 GPU model runner now calls get_input_positions for M-RoPE later, once the list[int] input_ids have been converted to a numpy ndarray, because numba operates on numpy arrays quickly but not on Python builtin lists (see the sketch after this list).
  • The numba dependency has been moved to requirements/common.txt.
  • Unit tests are added to check the functionality and correctness of the new numba implementation.
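A hedged sketch of the call-order change from the second bullet (compute_mrope_positions is a placeholder name, not the PR's actual function):

```python
import numpy as np
from numba import njit


@njit
def compute_mrope_positions(token_ids: np.ndarray) -> np.ndarray:
    # Toy stand-in: emit one (t, h, w) position triple per token.
    out = np.empty((3, token_ids.shape[0]), dtype=np.int64)
    for i in range(token_ids.shape[0]):
        for dim in range(3):
            out[dim, i] = i
    return out


token_ids_list = [101, 102, 103, 104]                   # what the scheduler holds
token_ids = np.asarray(token_ids_list, dtype=np.int64)  # convert once to ndarray...
positions = compute_mrope_positions(token_ids)          # ...then call the numba routine
```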

Notes

  • The numba-optimized functions are lazily JIT-compiled; they are expected to compile the first time they are called, during profiling.
  • The original torch implementation of MRotaryEmbedding::get_input_positions_tensor is kept for reference and for unit tests; a use_numba kwarg controls the dispatch (use_numba defaults to True), as sketched below.
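A hedged sketch of what such a dispatch can look like (signatures and helper names are assumptions, not the PR's code):

```python
import numpy as np
from numba import njit


@njit
def _positions_numba(num_tokens: int) -> np.ndarray:
    out = np.empty(num_tokens, dtype=np.int64)
    for i in range(num_tokens):
        out[i] = i
    return out


def _positions_torch_reference(num_tokens: int) -> np.ndarray:
    # Stand-in for the original torch implementation kept for tests.
    return np.arange(num_tokens, dtype=np.int64)


def get_input_positions(num_tokens: int, use_numba: bool = True) -> np.ndarray:
    # Default to the numba path; unit tests can pass use_numba=False to
    # compare against the reference implementation.
    if use_numba:
        return _positions_numba(num_tokens)
    return _positions_torch_reference(num_tokens)


# The first call triggers JIT compilation; doing it during the profiling /
# warmup phase keeps that one-time cost out of the serving path.
get_input_positions(1)
```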

Benchmark

Piecewise get_input_positions

benchmark script: https://gist.github.com/imkero/5329df6fe2929ff7de210e689889036d

The following results were measured on an AMD EPYC 7763 (4 cores) with vLLM V1.

Timings are for the mrope_get_input_positions_and_delta function.
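For reference, a minimal harness in the spirit of the linked gist (this is an assumed sketch, not the gist itself); note the warm-up call, so numba's one-time JIT compilation is excluded from the measurement:

```python
import time


def bench_ms(fn, *args, repeats: int = 100) -> float:
    fn(*args)  # warm-up call: triggers numba JIT compilation once
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats * 1e3  # mean ms per call
```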

Qwen2/2.5-VL

| test case | num tokens | main branch | this PR |
| --- | --- | --- | --- |
| plain text | 360 tokens | 0.130 ms | 0.007 ms |
| 1 image | 324 image tokens | 0.232 ms | 0.007 ms |
| 10 images | 3240 image tokens | 1.581 ms | 0.013 ms |
| 1 large image | 1296 image tokens | 0.364 ms | 0.008 ms |
| 1 video (20 frames) | 3240 video tokens | 0.584 ms | 0.011 ms |
| 1 video (200 frames) | 32400 video tokens | 4.451 ms | 0.047 ms |

Qwen2.5-Omni

| test case | num tokens | main branch | this PR |
| --- | --- | --- | --- |
| plain text | 360 tokens | 4.757 ms | 0.010 ms |
| 1 image | 324 image tokens | 0.271 ms | 0.010 ms |
| 10 images | 3240 image tokens | 2.073 ms | 0.016 ms |
| 1 large image | 1296 image tokens | 0.268 ms | 0.011 ms |
| 1 video (20 frames) | 3240 video tokens | 0.277 ms | 0.014 ms |
| 1 video (200 frames) | 32400 video tokens | 0.535 ms | 0.050 ms |
| 1 audio | 200 audio tokens | 0.149 ms | 0.018 ms |
| 1 video with audio | 1296 video tokens + 200 audio tokens | 6.742 ms | 0.012 ms |

Piecewise get_next_input_positions

benchmark script: https://gist.github.com/imkero/9ba1cd44e1d1ba53c4b2f3e00f6a2363

result:

```
run_torch:     1.990 s
run_numba:     0.066 s
run_torch_lru: 0.872 s
```

E2E (Qwen2.5-VL-3B)

Tested on NVIDIA A10

benchmark command:

```bash
vllm serve Qwen/Qwen2.5-VL-3B-Instruct --disable-log-requests --max-num-seqs 128 --block-size 16 --max-num-batched-tokens 8192 --max-model-len 8192 --no-enable-prefix-caching
fib benchmark -rps 50 --input-token-distribution uniform 200 200 --output-token-distribution uniform 200 200 --num-of-imgs-per-req 1 --img-ratios-per-req 512x512 -n 1024 --base-url http://localhost:8000 --endpoint v1/chat/completions --backend openai-chat
```
| code base | duration (s) | req/s | input tokens/s | output tokens/s |
| --- | --- | --- | --- | --- |
| main branch | 172.86 | 5.92 | 3231.29 | 1184.74 |
| PR #17617 🚀 | 171.74 | 5.96 | 3252.25 | 1192.53 |
| this PR (input_positions optimized) | 172.63 | 5.93 | 3235.66 | 1186.34 |
| this PR 🚀🚀 (input_positions and next_input_positions optimized) | 170.06 | 6.02 | 3284.35 | 1204.25 |


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@imkero imkero marked this pull request as draft April 19, 2025 21:24
@mergify mergify bot added the v1 label Apr 19, 2025

mergify bot commented Apr 24, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @imkero.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@WoosukKwon
Collaborator

@imkero Thanks for the PR! This is amazing 🚀
Could you please resolve the merge conflict and the lint error?

@imkero
Contributor Author

imkero commented Apr 26, 2025

> @imkero Thanks for the PR! This is amazing 🚀
> Could you please resolve the merge conflict and the lint error?

Sure! I will update this PR soon.

TODO:

  • fix merge conflicts & lint error
  • complete the PR description
  • enhance the error messages in assertions
  • add unit tests for assertions
  • refresh the benchmark result
  • e2e correctness test

I will also share my experience of using numba later.

@imkero imkero force-pushed the feat/mrope-pos-perf branch from abb1d2d to fc1c397 Compare April 27, 2025 16:57
Signed-off-by: imkero <[email protected]>
@imkero imkero force-pushed the feat/mrope-pos-perf branch from 61b5d69 to 59ee3c4 Compare April 27, 2025 17:32
imkero added 2 commits April 27, 2025 17:52
Signed-off-by: imkero <[email protected]>
Signed-off-by: imkero <[email protected]>
@WoosukKwon
Collaborator

@imkero Just so you know: To fix the CI failure, we should move numba from requirements/cuda.txt and requirements/rocm.txt to requirements/common.txt.

@mergify mergify bot added the ci/build label Apr 29, 2025
imkero added 2 commits April 29, 2025 20:09
Signed-off-by: imkero <[email protected]>
Signed-off-by: imkero <[email protected]>
@imkero
Contributor Author

imkero commented May 1, 2025

> @imkero Just so you know: To fix the CI failure, we should move numba from requirements/cuda.txt and requirements/rocm.txt to requirements/common.txt.

Thanks for the reminder; I have moved it to requirements/common.txt now.

@imkero
Contributor Author

imkero commented May 1, 2025

@WoosukKwon I think this PR is ready for review now

@imkero imkero marked this pull request as ready for review May 1, 2025 16:17

mergify bot commented May 2, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @imkero.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 2, 2025
@imkero
Contributor Author

imkero commented May 5, 2025

I will take get_next_input_positions_tensor into consideration as well, because it is reported to be time-consuming in #17617.

@imkero
Contributor Author

imkero commented May 5, 2025

I have written an optimized version of get_next_input_positions_tensor that is zero-copy and needs no LRU cache.

Will add an e2e benchmark result like #17617.
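A hedged sketch of the zero-copy idea (names, shapes, and signature are assumptions, not the PR's code): write the next M-RoPE positions straight into a preallocated buffer instead of building a fresh tensor per step.

```python
import numpy as np
from numba import njit


@njit
def write_next_input_positions(out: np.ndarray, out_offset: int,
                               mrope_position_delta: int,
                               context_len: int, num_new_tokens: int) -> None:
    # out has shape (3, max_tokens); each newly decoded token gets position
    # delta + absolute token index, identical across the 3 M-RoPE components.
    for i in range(num_new_tokens):
        pos = mrope_position_delta + context_len + i
        for dim in range(3):
            out[dim, out_offset + i] = pos


out = np.zeros((3, 16), dtype=np.int64)
write_next_input_positions(out, 4, -2, 10, 3)  # fills out[:, 4:7] with 8, 9, 10
```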

@imkero imkero changed the title from [Perf] Optimize MRotaryEmbedding::get_input_positions performance by numba to [Perf] Optimize M-RoPE position preparation performance with numba May 5, 2025
Signed-off-by: imkero <[email protected]>
@imkero imkero force-pushed the feat/mrope-pos-perf branch from fa2cf59 to 0e7f6e0 Compare May 5, 2025 11:55
@imkero
Contributor Author

imkero commented May 5, 2025

This PR continues the idea of #17617. Thanks @vadiklyutiy

Could you please take a look? @ywang96

Signed-off-by: imkero <[email protected]>
@imkero imkero force-pushed the feat/mrope-pos-perf branch from cb1f02f to fdf5463 Compare May 6, 2025 05:59

mergify bot commented May 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @imkero.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 8, 2025
@mergify mergify bot removed the needs-rebase label May 10, 2025

mergify bot commented May 13, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @imkero.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 13, 2025