
[V1] Add disable_chunked_mm_input arg to disable partial mm input prefill #15837


Merged
merged 4 commits into vllm-project:main on Apr 8, 2025

Conversation

mgoin (Member) commented Mar 31, 2025

Introduces a disable_chunked_mm_input argument to SchedulerConfig (used in V1) that prevents partial scheduling of the tokens belonging to a single multimodal input item. If the scheduled range would cover only part of an mm item, the scheduler rolls back and schedules only the tokens before that item.

This ensures that if a request has a mixed prompt (like text tokens TTTT followed by image tokens IIIIIIIIII) where only some image tokens can be scheduled (like TTTTIIIII, leaving IIIII for the next step), it will be scheduled as TTTT in one step and IIIIIIIIII in the next.
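
As an illustration of the rollback idea, here is a minimal, hypothetical sketch (the function name and signature are invented for this example and are not the PR's actual scheduler code):

def clamp_to_mm_boundary(num_computed_tokens: int, num_new_tokens: int,
                         mm_start: int, mm_len: int) -> int:
    """If the scheduled window would cover only part of a multimodal item,
    shrink it so it stops right before that item (hypothetical helper)."""
    end = num_computed_tokens + num_new_tokens
    mm_end = mm_start + mm_len
    if mm_start < end < mm_end:
        # Partial overlap with the mm item: roll back to just before it.
        num_new_tokens = max(mm_start - num_computed_tokens, 0)
    return num_new_tokens

# Prompt TTTT IIIIIIIIII (4 text + 10 image tokens), token budget of 9:
# only the 4 text tokens are scheduled; the whole image goes in the next step.
assert clamp_to_mm_boundary(0, 9, mm_start=4, mm_len=10) == 4
assert clamp_to_mm_boundary(4, 10, mm_start=4, mm_len=10) == 10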

EDIT added context:
This is needed because the _gather_encoder_outputs function poses a problem in the TPU model runner when chunking through a multimodal item:

end_idx = min(
    num_computed_tokens - start_pos + num_scheduled_tokens,
    num_encoder_tokens)
assert start_idx < end_idx
assert req_id in self.encoder_cache
assert i in self.encoder_cache[req_id]
encoder_output = self.encoder_cache[req_id][i]
# Slice length depends on the scheduled token count, so the shape varies.
encoder_outputs.append(encoder_output[start_idx:end_idx])

The last line encoder_output[start_idx:end_idx] will slice an on-device tensor with varying shape, triggering recompilation. Padding here is non-obvious because image features have to be aligned with image placeholders in input_ids for merge_multimodal_embeddings. So I think it is natural to allow for the disabling of chunking within multimodal items.
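
To make the recompilation behavior concrete, here is a toy, standalone sketch (assuming a TPU/XLA environment with torch_xla installed; this is not code from the PR). Each distinct slice length yields a different output shape, and each new shape traces and compiles a new XLA graph:

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
encoder_output = torch.randn(10, 512, device=device)  # e.g. 10 image tokens

for end_idx in (3, 7, 10):
    chunk = encoder_output[0:end_idx]  # shape (end_idx, 512) varies per step
    xm.mark_step()  # each new shape results in a fresh compilation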


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Mar 31, 2025
WoosukKwon (Collaborator)

What is this change for?

mgoin (Member, Author) commented Apr 1, 2025

@WoosukKwon This is needed because the _gather_encoder_outputs function poses a problem in the TPU model runner when chunking through a multimodal item:

end_idx = min(
    num_computed_tokens - start_pos + num_scheduled_tokens,
    num_encoder_tokens)
assert start_idx < end_idx
assert req_id in self.encoder_cache
assert i in self.encoder_cache[req_id]
encoder_output = self.encoder_cache[req_id][i]
# Slice length depends on the scheduled token count, so the shape varies.
encoder_outputs.append(encoder_output[start_idx:end_idx])

The last line encoder_output[start_idx:end_idx] will slice an on-device tensor with varying shape, triggering recompilation. Padding here is non-obvious because image features have to be aligned with image placeholders in input_ids for merge_multimodal_embeddings. So I think it is natural to allow for the disabling of chunking within multimodal items.

mergify bot commented Apr 1, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @mgoin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 1, 2025
WoosukKwon (Collaborator)

@mgoin Thanks for the explanation.
I feel like the idea itself makes sense, but I might have to think more about whether there are any edge cases.
BTW, I think we shouldn't add an env variable for this kind of case. vLLM should do this automatically, or if we still want to provide an option, we should provide an engine arg (like disable_custom_all_reduce) rather than an env variable.
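
For reference, a hedged usage sketch of how such an engine arg could be passed. The kwarg and CLI flag spellings below are assumptions based on the SchedulerConfig field name and vLLM's usual underscore-to-dash convention; they are not verified against the merged code:

from vllm import LLM

# Assumed kwarg name, mirroring the SchedulerConfig field added in this PR.
llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    disable_chunked_mm_input=True,
)

# Assumed equivalent CLI form:
#   vllm serve llava-hf/llava-1.5-7b-hf --disable-chunked-mm-input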

@mergify mergify bot removed the needs-rebase label Apr 3, 2025
@mgoin mgoin changed the title Add flag to disable partial mm input chunked prefill Add disable_chunked_mm_input arg to disable partial mm input prefill Apr 3, 2025
@mgoin mgoin changed the title Add disable_chunked_mm_input arg to disable partial mm input prefill [V1] Add disable_chunked_mm_input arg to disable partial mm input prefill Apr 3, 2025
@mgoin mgoin marked this pull request as ready for review April 3, 2025 14:46
mgoin (Member, Author) commented Apr 3, 2025

I think we need to add a check that max_num_batched_tokens is large enough to fit the largest single multimodal item, but otherwise this should be ready for consideration. cc @ywang96 @DarkLight1337
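
A minimal, hypothetical sketch of what such a validation could look like (the function and parameter names are invented for illustration and are not the check that was eventually added):

def validate_mm_budget(max_num_batched_tokens: int,
                       max_tokens_per_mm_item: int,
                       disable_chunked_mm_input: bool) -> None:
    # If chunking within mm items is disabled, every single mm item must
    # fit into one scheduling step, or it can never be scheduled at all.
    if disable_chunked_mm_input and max_tokens_per_mm_item > max_num_batched_tokens:
        raise ValueError(
            f"max_num_batched_tokens ({max_num_batched_tokens}) must be >= "
            f"the largest multimodal item ({max_tokens_per_mm_item} tokens) "
            "when disable_chunked_mm_input is set.")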

DarkLight1337 (Member)

So even if #15712 is merged, it will trigger recompilation on TPU? Can we exclude _gather_encoder_outputs from the graph?

mgoin (Member, Author) commented Apr 7, 2025

@DarkLight1337 _gather_encoder_outputs will create its own graph separate from the model forward pass. Anything that deals with tensors on device will end up creating an XLA graph, but we gain a lot by separating tricky operations that often create recompilation into smaller graphs.

cc @NickLucche, is this good with you?

NickLucche (Contributor) left a comment

This appears to be working really well to address the TPU issue we have. Great job @mgoin !

DarkLight1337 (Member) left a comment

> @DarkLight1337 _gather_encoder_outputs will create its own graph separate from the model forward pass. Anything that deals with tensors on device will end up creating an XLA graph, but we gain a lot by separating tricky operations that often create recompilation into smaller graphs.

Thanks for the explanation!

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) April 8, 2025 01:59
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge / full CI is needed) Apr 8, 2025
@vllm-bot vllm-bot merged commit 8e5314a into vllm-project:main Apr 8, 2025
42 of 44 checks passed
ywang96 (Member) commented Apr 8, 2025

Sorry I'm late to this PR, but what happens if the embedding of a multimodal data item is bigger than the max_num_batched_tokens?

NickLucche (Contributor)

From the test, it appears the request is simply not scheduled, so we need to add that check today.

mgoin (Member, Author) commented Apr 8, 2025

I meant to add that before landing; sorry, I didn't realize that auto-merge was on. Will push this up today.

DarkLight1337 (Member)

My bad, I thought the PR was already ready

nishith-fujitsu pushed a commit to nishith-fujitsu/vllm that referenced this pull request Apr 9, 2025
Labels: ready (ONLY add when PR is ready to merge / full CI is needed), v1