[Bug]: Prefix caching doesn't work for LlavaOneVision #11371

Closed

sleepwalker2017 opened this issue Dec 20, 2024 · 13 comments
Labels
bug (Something isn't working), stale (Over 90 days of inactivity)

Comments

@sleepwalker2017
Contributor

Your current environment

The generated dummy input is a video, but the preprocessor tries to get an image from the dict, so it crashes.
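
Roughly, the mismatch looks like this (an illustrative sketch with made-up names, not the actual vLLM code):

```python
# Illustrative sketch only -- not the actual vLLM code paths.
import numpy as np

# The dummy multimodal input generated for profiling carries a "video" entry...
dummy_mm_data = {"video": np.zeros((8, 336, 336, 3), dtype=np.uint8)}

# ...but the preprocessor looks the dict up with the "image" key, so it fails
# before doing any real work.
pixel_values = dummy_mm_data["image"]  # KeyError: 'image'
```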

After I work around this, the code still fails to run.

It fails with this traceback:

  File "/data/test/prefix-caching-vlm/vllm/vllm/v1/engine/core.py", line 264, in run_engine_core
    engine_core.run_busy_loop()
  File "/data/test/prefix-caching-vlm/vllm/vllm/v1/engine/core.py", line 302, in run_busy_loop
    outputs = self.step()
  File "/data/test/prefix-caching-vlm/vllm/vllm/v1/engine/core.py", line 125, in step
    output = self.model_executor.execute_model(scheduler_output)
  File "/data/test/prefix-caching-vlm/vllm/vllm/v1/executor/uniproc_executor.py", line 72, in execute_model
    output = self.worker.execute_model(scheduler_output)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/data/test/prefix-caching-vlm/vllm/vllm/v1/worker/gpu_worker.py", line 203, in execute_model                                                                                                              output = self.model_runner.execute_model(scheduler_output)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/data/test/prefix-caching-vlm/vllm/vllm/v1/worker/gpu_model_runner.py", line 472, in execute_model
    encoder_outputs = self._gather_encoder_outputs(scheduler_output)
  File "/data/test/prefix-caching-vlm/vllm/vllm/v1/worker/gpu_model_runner.py", line 456, in _gather_encoder_outputs
    assert req_id in self.encoder_cache

Model Input Dumps

model=/data/models/llava-onevision-qwen2-7b-ov-hf/
VLLM_USE_V1=1 VLLM_ENABLE_V1_MULTIPROCESSING=1 python3 mmmu_bench.py --model $model --num-prompts 500  --image-hit-rate 0.3

The mmmu_bench.py comes from here:
#11187

🐛 Describe the bug

  File "/data/test/prefix-caching-vlm/vllm/vllm/v1/engine/core.py", line 264, in run_engine_core
    engine_core.run_busy_loop()
  File "/data/test/prefix-caching-vlm/vllm/vllm/v1/engine/core.py", line 302, in run_busy_loop
    outputs = self.step()
  File "/data/test/prefix-caching-vlm/vllm/vllm/v1/engine/core.py", line 125, in step
    output = self.model_executor.execute_model(scheduler_output)
  File "/data/test/prefix-caching-vlm/vllm/vllm/v1/executor/uniproc_executor.py", line 72, in execute_model
    output = self.worker.execute_model(scheduler_output)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/data/test/prefix-caching-vlm/vllm/vllm/v1/worker/gpu_worker.py", line 203, in execute_model                                                                                                              output = self.model_runner.execute_model(scheduler_output)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/data/test/prefix-caching-vlm/vllm/vllm/v1/worker/gpu_model_runner.py", line 472, in execute_model
    encoder_outputs = self._gather_encoder_outputs(scheduler_output)
  File "/data/test/prefix-caching-vlm/vllm/vllm/v1/worker/gpu_model_runner.py", line 456, in _gather_encoder_outputs
    assert req_id in self.encoder_cache

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
sleepwalker2017 added the bug label on Dec 20, 2024
@DarkLight1337
Member

DarkLight1337 commented Dec 20, 2024

Most multi-modal models don't support V1 yet. You can check the Supported Models page for more details (there is a V1 column for multi-modal models).

@sleepwalker2017
Contributor Author

> Most multi-modal models don't support V1 yet. You can check the Supported Models page for more details (there is a V1 column for multi-modal models).

Thank you! Is there a link for that? I didn't find documents about the V1 Engine.

@DarkLight1337
Member

> Most multi-modal models don't support V1 yet. You can check the Supported Models page for more details (there is a V1 column for multi-modal models).
>
> Thank you! Is there a link for that? I didn't find documents about the V1 Engine.

It is pinned in the list of issues.

@DarkLight1337
Member

cc @ywang96 perhaps we should add a link to the V1 column header?

@sleepwalker2017
Contributor Author

> Most multi-modal models don't support V1 yet. You can check the Supported Models page for more details (there is a V1 column for multi-modal models).
>
> Thank you! Is there a link for that? I didn't find documents about the V1 Engine.
>
> It is pinned in the list of issues.

I see this one: #8779, but I didn't find any examples of its usage.

It seems the V1 engine is not used in quite the same way as the old one.

@DarkLight1337
Member

DarkLight1337 commented Dec 23, 2024

> Most multi-modal models don't support V1 yet. You can check the Supported Models page for more details (there is a V1 column for multi-modal models).
>
> Thank you! Is there a link for that? I didn't find documents about the V1 Engine.
>
> It is pinned in the list of issues.
>
> I see this one: #8779, but I didn't find any examples of its usage.
>
> It seems the V1 engine is not used in quite the same way as the old one.

It is still in development, which is why we don't have user-facing docs about it yet. For now, you can enable it by setting the environment variable VLLM_USE_V1=1.
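
A minimal sketch of what that looks like with the offline LLM API (the model name here is just a placeholder; pick one that the V1 column marks as supported):

```python
import os

# Opt in to the experimental V1 engine before constructing the engine.
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

# Placeholder model -- substitute a model that is marked as V1-supported.
llm = LLM(model="<path-or-name-of-a-V1-supported-model>")
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```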

@ywang96
Member

ywang96 commented Dec 23, 2024

@sleepwalker2017 V1 is only available for experimental use, and not all multimodal models are supported on V1 yet. You can check our latest documentation here https://docs.vllm.ai/en/latest/models/supported_models.html#id3 (the V1 column) to see which models are supported.

@sleepwalker2017
Contributor Author

> https://docs.vllm.ai/en/latest/models/supported_models.html#id3

Thank you for the clear explanation!

@sleepwalker2017
Contributor Author

> Most multi-modal models don't support V1 yet. You can check the Supported Models page for more details (there is a V1 column for multi-modal models).

Hi, I have another question: what is needed to add support for a multi-modal model in the V1 engine, given that it's already supported by the old engine?

@ywang96
Member

ywang96 commented Dec 26, 2024

> Most multi-modal models don't support V1 yet. You can check the Supported Models page for more details (there is a V1 column for multi-modal models).
>
> Hi, I have another question: what is needed to add support for a multi-modal model in the V1 engine, given that it's already supported by the old engine?

@sleepwalker2017 There are a few key changes you'll need to make:

  1. The model's input processor needs to return a PlaceholderRange for each image that tracks exactly where its placeholder tokens start in the prompt and how many there are.
  2. The output of get_multimodal_embeddings needs to be
    • either a tuple of flattened image embeddings (2D tensors of shape [feature_size, hidden_size]), one per image, or
    • a batched 3D tensor if feature_size is constant across images (see the sketch after this list).

Feel free to take a look at #10699 to see the changes needed. Also, for now only the image modality is supported on V1.
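
To make those two points concrete, here is a rough sketch using plain PyTorch and a stand-in dataclass (the names and sizes are illustrative, not the actual vLLM V1 interfaces):

```python
# Rough sketch only -- stand-in types, not the real vLLM V1 interfaces.
from dataclasses import dataclass

import torch

# 1. Something like vLLM's PlaceholderRange: for every image, the input
#    processor reports where its placeholder tokens start in the prompt
#    and how many there are.
@dataclass
class PlaceholderRangeSketch:
    offset: int  # index of the first placeholder token in the prompt
    length: int  # number of placeholder tokens for this image

placeholders = [
    PlaceholderRangeSketch(offset=5, length=576),
    PlaceholderRangeSketch(offset=700, length=1024),
]

hidden_size = 4096

# 2a. Per-image feature sizes may differ: return a tuple of flattened
#     2-D embeddings, one [feature_size_i, hidden_size] tensor per image.
embeddings_per_image = (
    torch.randn(576, hidden_size),
    torch.randn(1024, hidden_size),
)

# 2b. If feature_size is constant across images, a single batched 3-D
#     tensor of shape [num_images, feature_size, hidden_size] also works.
embeddings_batched = torch.randn(2, 576, hidden_size)
```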

@sleepwalker2017
Contributor Author

> changes

Thank you! I'll check that!


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions bot added the stale label on Mar 27, 2025

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions bot closed this as not planned on Apr 26, 2025
DarkLight1337 reopened this on Apr 26, 2025