Support embedding models in V1 with a dedicated model_runner #18015
base: main
Conversation
Encoder-only models can also benefit from the prefix caching that is enabled by the KV cache.
This is only passing mypy, it hasn't been tested yet.
... and disable CUDA graphs for these models.
Refactor the GPU model runner into a base model runner, a model runner for sampling, and another for pooling.
There are some merge conflicts because of the KV cache group PR that was reverted, but it seems it will be added again once the maintainer fixes the bugs that were found after it was merged into main. So I'm going to wait a little before trying to resolve the conflicts.
The KV cache group PR was redone, so I've fixed the merge conflicts.
"""Tensors for pooling.""" | ||
|
||
prompt_lens: torch.Tensor | ||
prompt_token_ids: Optional[torch.Tensor] |
Any benefit to using torch.Tensor instead of list[int]? Same for prompt_lens.
For prompt_lens it helps that the array is already in tensor format to do things like torch.cumsum(prompt_lens, dim=0). For prompt_token_ids there is no strong reason, but it allows us to reuse the same _make_prompt_token_ids_tensor() function as in the non-pooling case.
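As a side note, here is a minimal illustration of why having prompt_lens already as a tensor is convenient for pooling. The lengths and hidden size below are made up for the example; this is not code from the PR:

```python
import torch

# Hypothetical prompt lengths for a batch of 3 pooling requests.
prompt_lens = torch.tensor([5, 3, 4])

# Cumulative sums give each prompt's end offset within the flattened
# hidden-states tensor, which is what a pooler needs to slice out
# per-request token ranges (e.g. the last token of each prompt).
offsets = torch.cumsum(prompt_lens, dim=0)        # tensor([ 5,  8, 12])
last_token_indices = offsets - 1                  # tensor([ 4,  7, 11])

hidden_states = torch.randn(int(offsets[-1]), 8)  # [total_tokens, hidden]
last_hidden = hidden_states[last_token_indices]   # [num_requests, hidden]
print(last_hidden.shape)                          # torch.Size([3, 8])
```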
return PoolingMetadata(
    prompt_lens=torch.from_numpy(
        self.num_prompt_tokens[:self.num_reqs]).to(self.device),
Same q here - why convert to tensor?
See answer above.
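For readers less familiar with the pattern in the snippet above, here is a small standalone sketch of slicing a preallocated NumPy buffer and converting it with torch.from_numpy. The buffer name and sizes are invented for illustration:

```python
import numpy as np
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical preallocated CPU-side buffer sized for the maximum batch.
max_num_reqs = 256
num_prompt_tokens = np.zeros(max_num_reqs, dtype=np.int64)

num_reqs = 3
num_prompt_tokens[:num_reqs] = [5, 3, 4]

# torch.from_numpy shares memory with the NumPy slice on the CPU side;
# .to(device) then copies the data to the target device as a tensor.
prompt_lens = torch.from_numpy(num_prompt_tokens[:num_reqs]).to(device)
print(prompt_lens)  # tensor([5, 3, 4]) on the chosen device
```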
last_token_id = request.output_token_ids[-1]
if (not sampling_params.ignore_eos
        and last_token_id == request.eos_token_id):

if request.pooling_params and pooler_output is not None:
Is it possible for pooler_output to be None here?
Yes, during chunked prefill.
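To spell the point out, here is a simplified, hypothetical sketch (not the PR's actual code) of how request finishing can be gated on the pooler output, so that intermediate chunked-prefill steps where pooler_output is None simply leave the request unfinished:

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class PoolingRequestState:
    """Hypothetical stand-in for a request that carries pooling params."""
    request_id: str
    pooling_params: Optional[object] = None
    output: Optional[torch.Tensor] = None
    finished: bool = False


def update_pooling_requests(
        requests: list[PoolingRequestState],
        pooler_outputs: list[Optional[torch.Tensor]]) -> None:
    """Mark a pooling request finished only once its pooler output exists.

    With chunked prefill, execute_model runs several times per prompt and
    the intermediate steps produce no pooler output (None), so those
    requests stay unfinished until their final chunk has been processed.
    """
    for request, pooler_output in zip(requests, pooler_outputs):
        if request.pooling_params and pooler_output is not None:
            request.output = pooler_output
            request.finished = True
```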
@22quinn has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
This pull request has merge conflicts that must be resolved before it can be merged.
This pull request has merge conflicts that must be resolved before it can be merged.
This is an alternative to #16188. In that PR, I implemented embedding model support in the same model runner as the decoder models. This had the advantage that the code changes were fairly minimal. The other advantage, in my opinion, is that a single model runner implementation is less likely to become stale, as new features and bug fixes only need to be applied to one code base. However, there were concerns about the performance implications and code complexity of a single implementation that tries to handle all cases.
In this PR I started by reverting all changes to the GPUModelRunner and created a GPUPoolingModelRunner, basically by deleting everything that was related to sampling. In this state it was already passing the embedding model unit tests, but there was still a lot of duplicated or unnecessary code.

Now I'm finished with the refactoring. Basically there is now a GPUBaseModelRunner that contains the common code, and the GPUModelRunner and the GPUPoolingModelRunner implement the missing pieces. A rough sketch of this split is included after the list below.

There were a few issues that @22quinn spent some time thinking about:
- […] intfloat/e5-mistral-7b-instruct that we use in the unit tests.
- […] execute_model call. However, with chunked prefill execute_model is called several times, and the same logic that is used in the sampling models applies.

cc: @mgoin, @WoosukKwon, @DarkLight1337
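To make the split described above concrete, here is a rough sketch of the runner hierarchy as I read the description. The method names, signatures, and bodies are illustrative assumptions, not the PR's actual interfaces:

```python
from abc import ABC, abstractmethod
from typing import Optional

import torch


class GPUBaseModelRunner(ABC):
    """Common code shared by the sampling and pooling runners (sketch)."""

    def __init__(self, model: torch.nn.Module, device: torch.device):
        self.model = model
        self.device = device

    def _run_model(self, input_ids: torch.Tensor,
                   positions: torch.Tensor) -> torch.Tensor:
        # The shared forward pass producing hidden states; input batching
        # and KV-cache handling would also live in the base class.
        return self.model(input_ids=input_ids, positions=positions)

    @abstractmethod
    def execute_model(self, input_ids: torch.Tensor,
                      positions: torch.Tensor):
        ...


class GPUModelRunner(GPUBaseModelRunner):
    """Runner for generative models: adds sampling on top of the base."""

    def execute_model(self, input_ids, positions) -> torch.Tensor:
        hidden_states = self._run_model(input_ids, positions)
        # Placeholder for logits computation and sampling.
        return hidden_states.argmax(dim=-1)


class GPUPoolingModelRunner(GPUBaseModelRunner):
    """Runner for embedding models: pools hidden states, never samples."""

    def execute_model(self, input_ids, positions) -> Optional[torch.Tensor]:
        hidden_states = self._run_model(input_ids, positions)
        # Placeholder pooling, e.g. mean over tokens; the real runner may
        # return None for intermediate chunked-prefill steps.
        return hidden_states.mean(dim=0, keepdim=True)
```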