
[V1] Scatter and gather placeholders in the model runner #15712


Merged

merged 26 commits into vllm-project:main from v1-is-embed on Apr 4, 2025

Conversation

DarkLight1337 (Member) commented on Mar 28, 2025

This PR is an attempt to move scatter_patch_features and gather_patch_feature into the model runner (outside of the model) to avoid interfering with TPU graph compilation.
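
In rough terms, the runner can now take the boolean is_embed mask produced by the multimodal processor and overwrite only the masked positions of the text embeddings with the encoder outputs. A minimal sketch of that scatter step, with made-up function and tensor names rather than the actual vLLM code:

```python
import torch

def scatter_mm_embeddings(
    inputs_embeds: torch.Tensor,  # (num_tokens, hidden_size) text embeddings
    mm_embeds: torch.Tensor,      # (num_embed_tokens, hidden_size) encoder outputs
    is_embed: torch.Tensor,       # (num_tokens,) bool mask from the processor
) -> torch.Tensor:
    # Only the positions flagged by is_embed receive multimodal embeddings;
    # placeholder tokens that carry no embedding (e.g. row-break tokens) keep
    # their text embeddings, so the model itself needs no scatter logic.
    inputs_embeds[is_embed] = mm_embeds
    return inputs_embeds
```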

Breaking change for model developers:

  • PromptUpdateDetails.features has been replaced with PromptUpdateDetails.is_embed. You can use the newly added factories PromptUpdateDetails.select_text and PromptUpdateDetails.select_token_id to generate is_embed based on the target text/token ID.
  • BaseProcessingInfo.get_num_image_tokens should now return the equivalent of PromptUpdateDetails.is_embed.sum() instead of the number of tokens in PromptUpdateDetails.features.
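
A minimal migration sketch, assuming an image placeholder made of repeated image tokens followed by an end token; the token IDs and patch count below are made up for illustration:

```python
from vllm.multimodal.processing import PromptUpdateDetails

image_token_id = 32000      # hypothetical patch-token ID
image_end_token_id = 32001  # hypothetical end-of-image token ID
num_patches = 576

full = [image_token_id] * num_patches + [image_end_token_id]

# Before: PromptUpdateDetails(full=full, features=full[:-1])
# After: derive is_embed from the token that actually carries embeddings.
details = PromptUpdateDetails.select_token_id(full, image_token_id)

# get_num_image_tokens should now report the number of embedding positions
# (num_patches here), i.e. the equivalent of is_embed.sum(), not len(full).
```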


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the documentation, multi-modality (#4194), v1, and tpu labels Mar 28, 2025
njhill (Member) left a comment

Thanks as always for the great work @DarkLight1337! I'm not an expert in most of what's changed here but did notice one thing.

mgoin (Member) left a comment

This looks great and even removes a lot of complex code. It should fix the immediate issue we have with llava on TPU since we get to skip the logic now in the non-Pixtral case. Even if we are stuck with dynamic creation, we can isolate it in a smaller graph with this refactor.

Does this still have many issues to resolve or do you think it could be tractable this week?

mergify bot commented on Mar 31, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @DarkLight1337.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 31, 2025
DarkLight1337 (Member, Author) commented on Mar 31, 2025

I am working on making sure all of our existing multi-modal models on V1 return a sequence of 2D embeddings. Turns out quite a few models don't follow this... Done in #15816
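
For reference, "a sequence of 2D embeddings" means roughly one (num_patches_i, hidden_size) tensor per multimodal item rather than a single batched 3D tensor; the shapes below are made up for illustration:

```python
import torch

hidden_size = 4096  # illustrative value

# One 2D tensor per image; items may have different lengths, which is why a
# single padded/batched 3D tensor is not what the runner expects.
image_embeds = [
    torch.randn(576, hidden_size),
    torch.randn(1024, hidden_size),
]
assert all(e.ndim == 2 for e in image_embeds)
```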

DarkLight1337 (Member, Author) commented on Mar 31, 2025

@mgoin if you have time, can you help check the following models locally? I have verified a few models on my end but there are still many to go:

tests/models/decoder_only/vision_language/test_models.py:

  • aya_vision
  • chameleon
  • fuyu
  • gemma3
  • h2ovl
  • idefics3
  • internvl
  • llava
  • minicpmo
  • minicpmv
  • molmo
  • nvlm_d
  • phi3v
  • qwen_vl
  • skywork_r1v
  • (Upcoming) qwen2_5_omni

Run the example scripts:

  • pixtral_hf
  • mistral3
  • pixtral (Mistral format)
  • qwen2_audio

mgoin (Member) commented on Apr 1, 2025

  • Fuyu fails as it still has from .vision import scatter_patch_features, select_patch_features at the top of its model definition. Tested with pytest tests/models/decoder_only/vision_language/test_models.py -k "[fuyu-"

  • Llava works fine, passes pytest tests/models/decoder_only/vision_language/test_models.py -k "[llava-"

  • NVLM-D failed using the example script python examples/offline_inference/vision_language.py -m NVLM_D

ERROR 04-01 12:12:46 [core.py:377] Traceback (most recent call last):
ERROR 04-01 12:12:46 [core.py:377]   File "/home/mgoin/code/vllm/vllm/v1/executor/multiproc_executor.py", line 376, in worker_busy_loop
ERROR 04-01 12:12:46 [core.py:377]     output = func(*args, **kwargs)
ERROR 04-01 12:12:46 [core.py:377]              ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 12:12:46 [core.py:377]   File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-01 12:12:46 [core.py:377]     return func(*args, **kwargs)
ERROR 04-01 12:12:46 [core.py:377]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 12:12:46 [core.py:377]   File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_worker.py", line 157, in determine_available_memory
ERROR 04-01 12:12:46 [core.py:377]     self.model_runner.profile_run()
ERROR 04-01 12:12:46 [core.py:377]   File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 1504, in profile_run
ERROR 04-01 12:12:46 [core.py:377]     dummy_mm_kwargs = self.mm_registry.get_decoder_dummy_data(
ERROR 04-01 12:12:46 [core.py:377]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 12:12:46 [core.py:377]   File "/home/mgoin/code/vllm/vllm/multimodal/registry.py", line 470, in get_decoder_dummy_data
ERROR 04-01 12:12:46 [core.py:377]     dummy_data = profiler.get_decoder_dummy_data(seq_len, mm_counts)
ERROR 04-01 12:12:46 [core.py:377]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 12:12:46 [core.py:377]   File "/home/mgoin/code/vllm/vllm/multimodal/profiling.py", line 224, in get_decoder_dummy_data
ERROR 04-01 12:12:46 [core.py:377]     ) = self.get_and_validate_mm_inputs(seq_len, mm_counts)
ERROR 04-01 12:12:46 [core.py:377]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 12:12:46 [core.py:377]   File "/home/mgoin/code/vllm/vllm/multimodal/profiling.py", line 179, in get_and_validate_mm_inputs
ERROR 04-01 12:12:46 [core.py:377]     mm_inputs = self._get_dummy_mm_inputs(seq_len, mm_counts)
ERROR 04-01 12:12:46 [core.py:377]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 12:12:46 [core.py:377]   File "/home/mgoin/code/vllm/vllm/multimodal/profiling.py", line 154, in _get_dummy_mm_inputs
ERROR 04-01 12:12:46 [core.py:377]     return self.processor.apply(
ERROR 04-01 12:12:46 [core.py:377]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 12:12:46 [core.py:377]   File "/home/mgoin/code/vllm/vllm/multimodal/processing.py", line 1639, in apply
ERROR 04-01 12:12:46 [core.py:377]     self._validate_mm_placeholders(mm_placeholders, mm_item_counts)
ERROR 04-01 12:12:46 [core.py:377]   File "/home/mgoin/code/vllm/vllm/multimodal/processing.py", line 1562, in _validate_mm_placeholders
ERROR 04-01 12:12:46 [core.py:377]     raise RuntimeError(
ERROR 04-01 12:12:46 [core.py:377] RuntimeError: Expected there to be 1 prompt updates corresponding to 1 image items, but instead found 0 prompt updates! Either the prompt text has missing/incorrect tokens for multi-modal inputs, or there is a problem with your implementation of merged multi-modal processor for this model (usually arising from an inconsistency between `_call_hf_processor` and `_get_prompt_updates`).

  • Pixtral/Mistral Small runs into an error with the mistral-small.py example script:

python examples/offline_inference/mistral-small.py simple

INFO 04-01 11:29:43 [gpu_model_runner.py:1498] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 2 image items of the maximum feature size.
ERROR 04-01 11:29:43 [core.py:377] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-01 11:29:43 [core.py:377]   File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 365, in run_engine_core
ERROR 04-01 11:29:43 [core.py:377]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-01 11:29:43 [core.py:377]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 11:29:43 [core.py:377]   File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 306, in __init__
ERROR 04-01 11:29:43 [core.py:377]     super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-01 11:29:43 [core.py:377]   File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 69, in __init__
ERROR 04-01 11:29:43 [core.py:377]     num_gpu_blocks, num_cpu_blocks = self._initialize_kv_caches(
ERROR 04-01 11:29:43 [core.py:377]                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 11:29:43 [core.py:377]   File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 130, in _initialize_kv_caches
ERROR 04-01 11:29:43 [core.py:377]     available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 04-01 11:29:43 [core.py:377]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 11:29:43 [core.py:377]   File "/home/mgoin/code/vllm/vllm/v1/executor/abstract.py", line 66, in determine_available_memory
ERROR 04-01 11:29:43 [core.py:377]     output = self.collective_rpc("determine_available_memory")
ERROR 04-01 11:29:43 [core.py:377]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 11:29:43 [core.py:377]   File "/home/mgoin/code/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 04-01 11:29:43 [core.py:377]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-01 11:29:43 [core.py:377]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 11:29:43 [core.py:377]   File "/home/mgoin/code/vllm/vllm/utils.py", line 2329, in run_method
ERROR 04-01 11:29:43 [core.py:377]     return func(*args, **kwargs)
ERROR 04-01 11:29:43 [core.py:377]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 11:29:43 [core.py:377]   File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-01 11:29:43 [core.py:377]     return func(*args, **kwargs)
ERROR 04-01 11:29:43 [core.py:377]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 11:29:43 [core.py:377]   File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_worker.py", line 157, in determine_available_memory
ERROR 04-01 11:29:43 [core.py:377]     self.model_runner.profile_run()
ERROR 04-01 11:29:43 [core.py:377]   File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 1504, in profile_run
ERROR 04-01 11:29:43 [core.py:377]     dummy_mm_kwargs = self.mm_registry.get_decoder_dummy_data(
ERROR 04-01 11:29:43 [core.py:377]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 11:29:43 [core.py:377]   File "/home/mgoin/code/vllm/vllm/multimodal/registry.py", line 470, in get_decoder_dummy_data
ERROR 04-01 11:29:43 [core.py:377]     dummy_data = profiler.get_decoder_dummy_data(seq_len, mm_counts)
ERROR 04-01 11:29:43 [core.py:377]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 11:29:43 [core.py:377]   File "/home/mgoin/code/vllm/vllm/multimodal/profiling.py", line 224, in get_decoder_dummy_data
ERROR 04-01 11:29:43 [core.py:377]     ) = self.get_and_validate_mm_inputs(seq_len, mm_counts)
ERROR 04-01 11:29:43 [core.py:377]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 11:29:43 [core.py:377]   File "/home/mgoin/code/vllm/vllm/multimodal/profiling.py", line 191, in get_and_validate_mm_inputs
ERROR 04-01 11:29:43 [core.py:377]     raise AssertionError(
ERROR 04-01 11:29:43 [core.py:377] AssertionError: The processed dummy data has a total of {'image': 3025} placeholder tokens, which is not the expected {'image': 3080} tokens.
ERROR 04-01 11:29:43 [core.py:377] 
CRITICAL 04-01 11:29:43 [core_client.py:343] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.

  • Pixtral HF runs into a similar error:

python examples/offline_inference/vision_language.py -m pixtral_hf

INFO 04-01 11:50:46 [gpu_model_runner.py:1498] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 2 image items of the maximum feature size.
ERROR 04-01 11:50:48 [core.py:377] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-01 11:50:48 [core.py:377]   File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 365, in run_engine_core
ERROR 04-01 11:50:48 [core.py:377]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-01 11:50:48 [core.py:377]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 11:50:48 [core.py:377]   File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 306, in __init__
ERROR 04-01 11:50:48 [core.py:377]     super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-01 11:50:48 [core.py:377]   File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 69, in __init__
ERROR 04-01 11:50:48 [core.py:377]     num_gpu_blocks, num_cpu_blocks = self._initialize_kv_caches(
ERROR 04-01 11:50:48 [core.py:377]                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 11:50:48 [core.py:377]   File "/home/mgoin/code/vllm/vllm/v1/engine/core.py", line 130, in _initialize_kv_caches
ERROR 04-01 11:50:48 [core.py:377]     available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 04-01 11:50:48 [core.py:377]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 11:50:48 [core.py:377]   File "/home/mgoin/code/vllm/vllm/v1/executor/abstract.py", line 66, in determine_available_memory
ERROR 04-01 11:50:48 [core.py:377]     output = self.collective_rpc("determine_available_memory")
ERROR 04-01 11:50:48 [core.py:377]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 11:50:48 [core.py:377]   File "/home/mgoin/code/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 04-01 11:50:48 [core.py:377]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-01 11:50:48 [core.py:377]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 11:50:48 [core.py:377]   File "/home/mgoin/code/vllm/vllm/utils.py", line 2329, in run_method
ERROR 04-01 11:50:48 [core.py:377]     return func(*args, **kwargs)
ERROR 04-01 11:50:48 [core.py:377]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 11:50:48 [core.py:377]   File "/home/mgoin/venvs/vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-01 11:50:48 [core.py:377]     return func(*args, **kwargs)
ERROR 04-01 11:50:48 [core.py:377]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 11:50:48 [core.py:377]   File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_worker.py", line 157, in determine_available_memory
ERROR 04-01 11:50:48 [core.py:377]     self.model_runner.profile_run()
ERROR 04-01 11:50:48 [core.py:377]   File "/home/mgoin/code/vllm/vllm/v1/worker/gpu_model_runner.py", line 1504, in profile_run
ERROR 04-01 11:50:48 [core.py:377]     dummy_mm_kwargs = self.mm_registry.get_decoder_dummy_data(
ERROR 04-01 11:50:48 [core.py:377]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 11:50:48 [core.py:377]   File "/home/mgoin/code/vllm/vllm/multimodal/registry.py", line 470, in get_decoder_dummy_data
ERROR 04-01 11:50:48 [core.py:377]     dummy_data = profiler.get_decoder_dummy_data(seq_len, mm_counts)
ERROR 04-01 11:50:48 [core.py:377]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 11:50:48 [core.py:377]   File "/home/mgoin/code/vllm/vllm/multimodal/profiling.py", line 224, in get_decoder_dummy_data
ERROR 04-01 11:50:48 [core.py:377]     ) = self.get_and_validate_mm_inputs(seq_len, mm_counts)
ERROR 04-01 11:50:48 [core.py:377]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 11:50:48 [core.py:377]   File "/home/mgoin/code/vllm/vllm/multimodal/profiling.py", line 191, in get_and_validate_mm_inputs
ERROR 04-01 11:50:48 [core.py:377]     raise AssertionError(
ERROR 04-01 11:50:48 [core.py:377] AssertionError: The processed dummy data has a total of {'image': 4096} placeholder tokens, which is not the expected {'image': 4160} tokens.
ERROR 04-01 11:50:48 [core.py:377] 
CRITICAL 04-01 11:50:48 [core_client.py:343] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
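
One way to read the placeholder-count mismatches above (an interpretation, not something stated in the tracebacks): the expected counts look like the full placeholder length including the per-row break/end tokens, while the processed counts cover only the embedding positions, consistent with the new is_embed accounting. A quick back-of-the-envelope check:

```python
# Assumed 64x64 patch grid for the Pixtral HF case:
rows = cols = 64
embed_positions = rows * cols        # 4096 -> the "processed" count in the error
with_row_breaks = rows * (cols + 1)  # 4160 -> the "expected" count in the error

# The Mistral Small numbers fit the same pattern with a 55x55 grid:
assert 55 * 55 == 3025 and 55 * 56 == 3080
```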

Roger Wang added 2 commits on April 3, 2025
ywang96 (Member) left a comment

Looks like increasing the audio fetch timeout indeed fixes the test, so I assume it's probably just a cold start issue?

Anyways LGTM :shipit:

@ywang96 ywang96 enabled auto-merge (squash) April 4, 2025 06:21
ywang96 (Member) commented on Apr 4, 2025

pytest -v -s -x models/decoder_only/vision_language/test_pixtral.py::test_chat is failing on CI in the extended multimodal test so I'm currently looking into fixing it.

@ywang96 ywang96 removed the ready label (ONLY add when PR is ready to merge/full CI is needed) Apr 4, 2025
@ywang96 ywang96 merged commit f5722a5 into vllm-project:main Apr 4, 2025
47 checks passed
@github-project-automation github-project-automation bot moved this from In Progress to Done in Multi-modality Core Apr 4, 2025
ywang96 added a commit that referenced this pull request Apr 4, 2025
mergify bot commented on Apr 4, 2025

⚠️ The sha of the head commit of this PR conflicts with #16076. Mergify cannot evaluate rules on this PR. ⚠️

@DarkLight1337 DarkLight1337 deleted the v1-is-embed branch April 5, 2025 10:37
Alex4210987 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Apr 5, 2025
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
nishith-fujitsu pushed a commit to nishith-fujitsu/vllm that referenced this pull request Apr 9, 2025