[Perf] Optimize MRoPE position preparing performance with numba #16881
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; instead, only a small subset of checks runs. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
This pull request has merge conflicts that must be resolved before it can be |
@imkero Thanks for the PR! This is amazing 🚀 |
Sure! I will update this PR soon. TODO:
And I will share my experience of using numba later. |
Signed-off-by: imkero <[email protected]>
imkero force-pushed the branch from abb1d2d to fc1c397
imkero force-pushed the branch from 61b5d69 to 59ee3c4
@imkero Just so you know: To fix the CI failure, we should move |
Thanks for the reminder, I have moved it to |
@WoosukKwon I think this PR is ready for review now |
I will take get_next_input_positions_tensor into consideration as well, because it is reported to be time-consuming in #17617
I have written an optimized version of it. Will add e2e benchmark results like #17617
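For readers following along: once positions live in a NumPy array, the decode-time case is simple, because during decode all three M-RoPE components advance together and next-step positions reduce to a broadcast arange. A minimal sketch (the function name and shapes are my own illustration, not the PR's actual API):

```python
import numpy as np

def next_mrope_positions(last_pos: np.ndarray, num_new_tokens: int) -> np.ndarray:
    """Hypothetical helper: positions for the next decode-step tokens.

    last_pos has shape (3,): the (temporal, height, width) indices of the
    last scheduled token. For decode/text tokens all three components
    advance together, so the result is just an offset arange broadcast
    across the three M-RoPE rows.
    """
    steps = np.arange(1, num_new_tokens + 1, dtype=np.int64)
    return last_pos[:, None] + steps[None, :]  # shape (3, num_new_tokens)
```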
imkero force-pushed the branch from fa2cf59 to 0e7f6e0
This PR continues the idea of #17617 (thanks @vadiklyutiy). Could you please take a look? @ywang96
imkero force-pushed the branch from cb1f02f to fdf5463
What this PR does
This PR aims at optimizing Qwen2/2.5-VL/Omni's M-RoPE position sequence generation using numba.
- get_input_positions: reduces the CPU overhead of scheduling a new request for those models (especially when prefix caching is used, because M-RoPE position ids are not cached and we need to generate the full position sequence even though the prefix key-values have been cached).
- get_next_input_positions: reduces the CPU overhead of scheduling new tokens for running requests. (Many thanks to @vadiklyutiy for pointing out the time consumption of get_next_input_positions_tensor.)

Also, the rewritten numba implementation provides better readability, because it is free to be written in an element-by-element style and free to call sub-routines.

Besides:
- Moved the M-RoPE position logic from rotary_embedding.py::MRotaryEmbedding to a new separate file, mrope_positions.py (it is indeed not related to the rotary-position nn modules).
- get_input_positions for mrope in the V1 GPU model runner is delayed to the point where we convert the list[int] input_ids to a numpy ndarray, because numba can operate on numpy arrays fast, but not on the Python builtin list.
- The numba dependency has been moved to requirements/common.txt.
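To make the "element-by-element style" point concrete, here is a hedged sketch of the text-only part of M-RoPE position filling over a preallocated NumPy array. The name and signature are illustrative, the real vLLM kernel also handles image/video grid segments, and in the PR the loop would be compiled with numba.njit (shown commented out here):

```python
import numpy as np
# from numba import njit  # in the PR this loop would be @njit-compiled

# @njit(cache=True)
def fill_text_mrope_positions(out: np.ndarray, start: int) -> int:
    """Fill a preallocated (3, n) int64 array for a pure-text token span.

    For text tokens, all three M-RoPE components (temporal, height, width)
    share the same monotonically increasing index. Plain loops like this
    are slow in CPython but compile to tight native code under numba.
    """
    n = out.shape[1]
    for i in range(n):
        v = start + i
        out[0, i] = v
        out[1, i] = v
        out[2, i] = v
    return start + n  # next start index for the following span
```

Sub-spans of the prompt (text, image, text, ...) can then each call such a sub-routine in sequence, which is the readability win the description mentions.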
Notes
The original MRotaryEmbedding::get_input_positions_tensor implementations are kept for reference and unit-test usage. A use_numba kwarg controls this dispatch (the default value of use_numba is True).

Benchmark
Piecewise get_input_positions benchmark script: https://gist.github.com/imkero/5329df6fe2929ff7de210e689889036d
Qwen2/2.5-VL
Qwen2.5-Omni
Piecewise get_next_input_positions benchmark script: https://gist.github.com/imkero/9ba1cd44e1d1ba53c4b2f3e00f6a2363
result:
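The linked gists are the authoritative benchmark scripts and numbers. For orientation only, a generic piecewise timing harness for this kind of comparison might look like the following sketch (entirely illustrative, not the PR's script; the two implementations compared here are stand-ins, not the actual vLLM functions):

```python
import timeit
import numpy as np

def bench(fn, *args, repeat=5, number=1000):
    # Best-of-N wall time, in microseconds per call.
    t = min(timeit.repeat(lambda: fn(*args), repeat=repeat, number=number))
    return t / number * 1e6

def list_impl(n):
    # Baseline stand-in: building (3, n) positions with Python lists.
    return [[i for i in range(n)] for _ in range(3)]

def numpy_impl(n):
    # Array stand-in: the same positions as a broadcast ndarray.
    return np.broadcast_to(np.arange(n), (3, n))

# e.g. print(bench(list_impl, 1024), bench(numpy_impl, 1024))
```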
E2E (Qwen2.5-VL-3B)
benchmark command:
(input_positions optimized)
(input_positions and next_input_positions optimized)