[V1] Add disable_chunked_mm_input arg to disable partial mm input prefill #15837
Conversation
Signed-off-by: mgoin <[email protected]>
What is this change for?
@WoosukKwon This is needed because the _gather_encoder_outputs function poses a problem in the TPU model runner when chunking through a multimodal item (vllm/vllm/v1/worker/tpu_model_runner.py, lines 530 to 537 at 46c759c).
The last line, encoder_output[start_idx:end_idx], slices an on-device tensor with a varying shape, triggering recompilation. Padding here is non-obvious because the image features have to be aligned with the image placeholders in input_ids for merge_multimodal_embeddings. So I think it is natural to allow disabling chunking within multimodal items.
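A plain-Python stand-in (not the actual vLLM code; the sizes and chunk ranges are illustrative) of why this slice recompiles on TPU: start_idx and end_idx vary between scheduler steps, so encoder_output[start_idx:end_idx] has a different shape each time, and XLA traces and compiles a fresh graph per shape.

```python
# Illustrative stand-in for the on-device encoder output: one row of
# hidden states per image token. Sizes are made up for the example.
HIDDEN_SIZE = 4096
NUM_IMAGE_TOKENS = 10
encoder_output = [[0.0] * HIDDEN_SIZE for _ in range(NUM_IMAGE_TOKENS)]

# Three different chunkings of the same multimodal item across steps.
shapes = set()
for start_idx, end_idx in [(0, 3), (3, 10), (0, 10)]:
    chunk = encoder_output[start_idx:end_idx]
    shapes.add((len(chunk), HIDDEN_SIZE))

# Three distinct chunk shapes -> three distinct XLA graphs on device.
print(sorted(shapes))  # [(3, 4096), (7, 4096), (10, 4096)]
```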
This pull request has merge conflicts that must be resolved before it can be merged.
@mgoin Thanks for the explanation.
Signed-off-by: mgoin <[email protected]>
Signed-off-by: mgoin <[email protected]>
I think we need to add a check that …
So even if #15712 is merged, it will trigger recompilation on TPU? Can we exclude …
@DarkLight1337 cc @NickLucche is this good with you?
This appears to be working really well to address the TPU issue we have. Great job @mgoin !
@DarkLight1337 _gather_encoder_outputs will create its own graph separate from the model forward pass. Anything that deals with tensors on device will end up creating an XLA graph, but we gain a lot by separating tricky operations that often create recompilation into smaller graphs.
Thanks for the explanation!
Sorry I'm late to this PR, but what happens if the embedding of a multimodal data item is bigger than the …
From the test it appears it's not scheduling, so we need to add that check today.
I meant to add that before landing, sorry I didn't realize that auto-merge was on. Will push this up today.
My bad, I thought the PR was already ready.
[V1] Add disable_chunked_mm_input arg to disable partial mm input prefill (vllm-project#15837) Signed-off-by: mgoin <[email protected]>
Introduces a disable_chunked_mm_input argument to SchedulerConfig that can prevent partial scheduling of tokens from a multimodal input item, used in V1. If the scheduled range would only cover part of the mm input, roll back to schedule only the tokens before the mm item. This ensures that if a request has a mixed prompt (like text tokens TTTT followed by image tokens IIIIIIIIII) where only some image tokens can be scheduled (like TTTTIIIII, leaving IIIII for the next step), it will instead be scheduled as TTTT in one step and IIIIIIIIII in the next.

EDIT, added context:

This is needed because the _gather_encoder_outputs function poses a problem in the TPU model runner when chunking through a multimodal item (vllm/vllm/v1/worker/tpu_model_runner.py, lines 530 to 537 at 46c759c). The last line, encoder_output[start_idx:end_idx], slices an on-device tensor with a varying shape, triggering recompilation. Padding here is non-obvious because the image features have to be aligned with the image placeholders in input_ids for merge_multimodal_embeddings. So I think it is natural to allow disabling chunking within multimodal items.
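The rollback behavior described in the summary can be sketched as follows. This is a hypothetical helper, not the actual vLLM scheduler code: the function name, signature, and placeholder-range representation are illustrative assumptions.

```python
def rollback_to_mm_boundary(num_computed: int, num_new: int,
                            mm_ranges: list[tuple[int, int]]) -> int:
    """Return an adjusted number of new tokens to schedule so that the
    scheduled range [num_computed, num_computed + num_new) never ends
    strictly inside a multimodal placeholder range (start, stop).
    """
    end = num_computed + num_new
    for start, stop in mm_ranges:
        if start < end < stop:
            # The schedule would split this mm item: roll back so only
            # the tokens before the item are scheduled this step.
            end = start
            break
    return max(end - num_computed, 0)

# Example: text tokens occupy [0, 4), image tokens occupy [4, 14).
# Scheduling 9 tokens would split the image, so roll back to 4 tokens;
# scheduling all 14 tokens covers the whole image and is left alone.
print(rollback_to_mm_boundary(0, 9, [(4, 14)]))   # -> 4
print(rollback_to_mm_boundary(0, 14, [(4, 14)]))  # -> 14
```

Note that when the schedule already starts inside the mm item (e.g. resuming mid-image), this sketch returns 0 new tokens, which mirrors the "not scheduling" concern raised in the review: an mm item larger than the per-step token budget needs a separate check so the request does not stall.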