Added the option of returning hidden states #15434


Closed

Conversation


@Settheworldonfireiii Settheworldonfireiii commented Mar 25, 2025

Added the option of returning hidden states in response to a flag in SamplingParams.

Files that were modified: engine/llm_engine.py, outputs.py, v1/outputs.py, v1/worker/gpu_model_runner.py, worker/model_runner.py, sampling_params.py, v1/engine/output_processor.py, v1/core/sched/scheduler.py, sequence.py, v1/serial_utils.py, v1/engine/__init__.py

I would be grateful if someone could run the entrypoint tests.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of it by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Mar 25, 2025
@DarkLight1337
Member

cc @maxdebayser how does this fit with your implementation of #12249 ?

@mergify mergify bot added the frontend label Mar 26, 2025
@Settheworldonfireiii Settheworldonfireiii changed the title Added the option of returning hidden states when doing generate() Added the option of returning hidden states Mar 26, 2025
@Settheworldonfireiii
Author

Settheworldonfireiii commented Mar 26, 2025

cc @maxdebayser how does this fit with your implementation of #12249 ?

I think my PR largely duplicates #12249, but it is much narrower in scope: it outputs only the last step's hidden states from the last attention block, which are already extracted inside ModelRunner/GPUModelRunner, and thus it avoids the slowdown concerns voiced by some contributors in the discussion you mentioned. My intention was to build it in a minimally invasive way.

Looking further ahead, I was thinking of a larger update: with the rise of mechanistic interpretability, some users might want hidden states from specific layers, say, "I want the 5th attention block's hidden states" or "I want attention blocks 4-9 for steps 3-9" (passed as some sort of dictionary).

Whether to pass it as an argument to LLM() or as a SamplingParam to generate() is another question. Both seem fine to me, and I would like to hear your opinion.

The argument for passing it into generate() is that some users may want custom hidden states in one call to generate() and not in another call on the same instance.

However, implementing these custom hidden states might be more complicated: it would likely involve adding or modifying output processor classes, or even modifying the model implementations themselves, and thus, as @youkaichao noted, it might slow down inference.

So the alternative approach is to pass a flag/output processor at the instance level, and then specify in each inference request which hidden states the user wants. It would be a separate instance, very useful for mechinterp people and potentially slower, but as an optional feature, users who don't want any hidden states would not experience any slowdown.

I can implement it either way. I can give more exact dates once we clarify whether this line of work is needed and which of the approaches above I should implement.

The feature implemented in this PR is just a small enhancement that does not affect speed and does not offer much, but I can build more on top of it.

Also, I would like to mention that despite being labeled v1, it implements the hidden-state output feature for both V0 and V1.
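
For concreteness, here is a minimal sketch contrasting the two surfaces discussed above. The SamplingParams flag mirrors what this PR adds; the constructor-level argument in option B is purely hypothetical and only illustrates the per-instance alternative, and the model name is just an example:

    from vllm import LLM, SamplingParams

    # Option A (this PR): per-request opt-in via a flag on SamplingParams.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    params = SamplingParams(max_tokens=32, return_hidden_states=True)
    outputs = llm.generate(["Hello, world"], sampling_params=params)
    print(outputs[0].hidden_states)

    # Option B (hypothetical): per-instance opt-in at engine construction, with the
    # per-request spec only selecting *which* hidden states to return. Neither the
    # argument name nor its semantics exist in vLLM today; shown only for contrast.
    # llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", return_hidden_states=True)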

@DarkLight1337
Member

@robertgshaw2-redhat is V1 runner stable enough to add new features like this to it? Otherwise I'm thinking of limiting this PR to V0 as a POC before extending it to V1.

@DarkLight1337
Member

Also @maxdebayser any updates on your progress? Let's coordinate this together.

@Settheworldonfireiii
Author

Settheworldonfireiii commented Mar 26, 2025

@DarkLight1337 also, for V0, I keep running into an error in my entrypoint tests, which I think is not related to my PR but rather to these lines:

    seq_group = engine._add_processed_request(
        request_id_i,
        params=params,
        **kwargs,
    )  # type: ignore

in the vllm/sequence.py file (https://github.com/vllm-project/vllm/blob/main/vllm/sequence.py), lines 1428-1439.

When I change request_id_i to request_id, the test that fails on Buildkite passes locally, but I want to double-check, since it was not me who introduced request_id_i and I do not fully understand the reason behind it.

@DarkLight1337
Member

Can you merge in the latest changes on main? Perhaps it got fixed recently

Outdated review thread on an added file:
@@ -0,0 +1,2139 @@
# SPDX-License-Identifier: Apache-2.0
Member


I think you added this file by mistake


Thank you!

I am a first-time contributor, so I wonder what my next steps are.

It passed the Buildkite tests and DCO, and now I have also fixed the potential reason for the mypy failure, even though it seemed strange that mypy failed with the following:

Error: vllm/engine/llm_engine.py:1119: error: "list[Any]" has no attribute "hidden_states" [attr-defined]

and my line 1119 in this file is:

output = [outputs_by_sequence_group[0][i]]

For the pre-commit checks, there are also some errors:

 Error: vllm/worker/model_runner.py:1719:81: E501 Line too long (85 > 80)
Error: vllm/worker/model_runner.py:1720:81: E501 Line too long (82 > 80)
Error: vllm/worker/model_runner.py:1723:81: E501 Line too long (81 > 80)

These were introduced by other contributors.

Member


Please install pre-commit by following the instructions here: https://docs.vllm.ai/en/latest/contributing/overview.html

Then you can run pre-commit run --all-files to check and fix the errors

Member


You can ignore the errors that are not from your PR


Apparently, some of them were accidentally introduced by me, but they are fixed now and I expect the checks will pass.


You can ignore the errors that are not from your PR

But now I guess everything is fixed; it passed all the tests.

@maxdebayser
Contributor

@DarkLight1337 , it seems to me that returning the raw hidden states could be a special case of the hidden states processor. Currently I'm still tinkering with V1, because I don't know how long V0 will be supported, but if you prefer we could restrict the scope to V0 first to make this feature available sooner.

@Settheworldonfireiii
Author

@DarkLight1337 , it seems to me that returning the raw hidden states could be a special case of the hidden states processor. Currently I'm still tinkering with V1, because I don't know how long V0 will be supported, but if you prefer we could restrict the scope to V0 first to make this feature available sooner.

I think in #12249 the main dilemma was whether to enable it per instance or per task.

Another question is whether returning custom hidden states, e.g. {steps: KV-cache filling + generation of the first 3 tokens, layers: embedding and the first 3 attention layers}, can be efficiently implemented without altering the model implementations themselves: looking at vllm/model_executor/models/, most model implementations do not seem to accumulate or save per-layer hidden states; they only return the final output of the model execution.

@maxdebayser
Contributor

I think in #12249 the main dilemma was whether to enable it per instance or per task.

Returning all the hidden layers should be something that is enabled per instance, in my opinion. One of the use cases of vLLM is to provide generic inference services in multi-tenant cloud environments. I think product managers would be reluctant to enable features by default whereby a few users could degrade performance for many other requests. Since this is for a very specific use case, it should be an opt-in feature.

Another question is whether returning custom hidden states, e.g. {steps: KV-cache filling + generation of the first 3 tokens, layers: embedding and the first 3 attention layers}, can be efficiently implemented without altering the model implementations themselves: looking at vllm/model_executor/models/, most model implementations do not seem to accumulate or save per-layer hidden states; they only return the final output of the model execution.

I think in PyTorch there are hooks to inspect and trace the execution of models, so you could get the hidden states without changing the model code, as long as you can identify the correct layers.
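
For illustration, a minimal sketch of the hook-based approach on a plain torch.nn module (the toy model below is a stand-in; the same mechanism applies to the model objects built in vllm/model_executor/models/, provided the right submodules can be located):

    import torch
    import torch.nn as nn

    # Toy stand-in for a model; the same mechanism works on any nn.Module.
    model = nn.Sequential(
        nn.Linear(16, 32),
        nn.ReLU(),
        nn.Linear(32, 16),
    )

    captured = {}

    def make_hook(name):
        # A forward hook receives (module, inputs, output) after the module runs,
        # so intermediate activations can be recorded without touching model code.
        def hook(module, inputs, output):
            captured[name] = output.detach()
        return hook

    # Register hooks only on the layers of interest.
    handles = [layer.register_forward_hook(make_hook(f"layer_{i}"))
               for i, layer in enumerate(model) if isinstance(layer, nn.Linear)]

    with torch.no_grad():
        model(torch.randn(2, 16))

    for handle in handles:
        handle.remove()

    print({name: tensor.shape for name, tensor in captured.items()})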

@Settheworldonfireiii
Author

Returning all the hidden layers should be something that is enabled per instance, in my opinion. One of the use cases of vLLM is to provide generic inference services in multi-tenant cloud environments. I think product managers would be reluctant to enable features by default whereby a few users could degrade performance for many other requests. Since this is for a very specific use case, it should be an opt-in feature.

Agree.

I think in pytorch there are hooks to inspect and trace the execution of models, so you could get the hidden states without changing the model code as long as you can identify the correct layers.

That is one way of implementing it. Hooks can slow things down significantly, but if there is no other way I will probably go with them. I need to investigate and think about it more.

@mergify mergify bot added the tpu Related to Google TPUs label Mar 27, 2025
@Settheworldonfireiii Settheworldonfireiii force-pushed the main branch 4 times, most recently from 9f7c09c to 11d839a Compare March 27, 2025 03:12
Collaborator

@WoosukKwon WoosukKwon left a comment


Thanks for the PR. I think this PR needs a wider discussion. I'd like to hold it off.

@Settheworldonfireiii
Author

Settheworldonfireiii commented Apr 2, 2025

PTAL

@DarkLight1337
I am working on it. As an alternative, maybe we could merge this branch?

0b50102

it passed all the tests

@NishanthVAnand

Thanks for the PR. I think this PR needs a wider discussion. I'd like to hold it off.

@WoosukKwon could you please share more details on what you mean by a wider discussion? Like many mechanistic interpretability researchers, I have been looking forward to this feature for some time.

@Settheworldonfireiii
Author

Settheworldonfireiii commented Apr 2, 2025

Thanks for the PR. I think this PR needs a wider discussion. I'd like to hold it off.

@WoosukKwon which changes would you like to see?

I could probably make significant changes starting next week.

@Settheworldonfireiii
Author

When this commit is reviewed and merged, what does the HF transformers code below translate to in order to get the embeddings? And how much speedup can we expect from vLLM?

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# device, text_obs and layer are defined elsewhere in the surrounding code
local_dir = "/network/weights/llama.var/llama_3.1/Meta-Llama-3.1-8B-Instruct/"
llm_pretrained = AutoModelForCausalLM.from_pretrained(local_dir, torch_dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained(local_dir)

batch_tokens = tokenizer(text_obs, return_tensors="pt", add_special_tokens=False, padding=True, truncation=True).to(llm_pretrained.device)
with torch.no_grad():
    hidden_states = llm_pretrained(**batch_tokens, output_hidden_states=True)["hidden_states"][layer]

@NishanthVAnand this particular PR will only allow outputting the last attention block's hidden states, because of the potential slowdown concerns.

@maxdebayser and I have separately been working on more flexible and more informative ways of outputting custom hidden states from any layer at any particular step, but I have other projects and also need to finish this branch, so it will be a while.

@Settheworldonfireiii
Author

Settheworldonfireiii commented Apr 2, 2025

@DarkLight1337 it passed all tests now.

So I think we have two branches ready to merge for V0: this one and that one, unless @WoosukKwon specifies which changes he would like to see, so that I can implement them before the merge.

Signed-off-by: Settheworldonfireiii <[email protected]>
Signed-off-by: Settheworldonfireiii <[email protected]>
@Settheworldonfireiii
Author

Settheworldonfireiii commented Apr 2, 2025

@NishanthVAnand as for the code you provided, with the current functionality I implemented, it is going to be something like:

  model = LLM(
      model_name,
      # additional arguments like tensor_parallel_size, etc.
  )
  sampling_params = SamplingParams(
      # some other arguments, like max_tokens, min_tokens, stop_token_ids, etc.
      return_hidden_states=True,
  )
  o = model.generate(
      prompt,
      sampling_params=sampling_params,
  )
  # to print out the hidden states
  print(o[0].hidden_states)

In V1, supposedly it is going to be something like:

  model = LLM(
      model_name,
      # additional arguments like tensor_parallel_size, etc.
  )
  sampling_params = SamplingParams(
      # some other arguments, like max_tokens, min_tokens, stop_token_ids, etc.
      hidden_states_to_return={
          "layers": {0: 3},
          "sublayers": {"attention", "fc1"},
          "steps": {"prefill", "first", "14"},
      },
  )
  # meaning you will return only the output of the first 3 attention blocks' MHA and the output of the first
  # fully connected layer within these attention blocks, and only the KV-cache prefill and the first 14 generation
  # steps
  o = model.generate(
      prompt,
      sampling_params=sampling_params,
  )
  # to print out the hidden states
  print(o[0].hidden_states)

@DarkLight1337
Member

This looks good to me now - @WoosukKwon could you explain the concerns you have regarding this PR?

@TachyonGun

This functionality would be very useful for folks trying to do mechinterp research. Using vLLM to retrieve hidden states would be very helpful for collecting datasets of model activations for training sparse autoencoders, or for other kinds of analyses.

@Settheworldonfireiii
Author

@DarkLight1337 @WoosukKwon any updates/recommendations with respect to changes?

@sastpg

sastpg commented Apr 9, 2025

I find this feature useful!

Does generation_output.hidden_states in the following HF transformers code produce the same output as o[0].hidden_states?

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    AutoConfig,
)
# MODEL_REPO, args, config_kwargs, input_ids, device, generation_config and
# max_output_token are defined elsewhere in the surrounding code
config = AutoConfig.from_pretrained(MODEL_REPO + args.model_name, **config_kwargs)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_REPO + args.model_name,
    config=config,
    torch_dtype=torch.float32,
    device_map='auto',
    trust_remote_code=True
)
generation_output = model.generate(
    input_ids=input_ids.to(device),
    generation_config=generation_config,
    max_new_tokens=max_output_token,
    output_attentions=True,
    output_hidden_states=True,
    output_scores=True,
    return_dict_in_generate=True,  # needed so that .hidden_states is available on the output
    do_sample=False,
)

print(generation_output.hidden_states)     # output_len x layer_num x sampling_num x beam_search x hidden_dim

Thank you very much! I'm looking forward to this new feature.
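
For reference, with return_dict_in_generate=True, HF's generate() exposes hidden_states as a tuple indexed by generation step, each element itself a tuple indexed by layer, so the closest analogue of a single "last step, last layer" tensor would be something like the sketch below (shapes are illustrative):

    # generation_output.hidden_states: tuple over generation steps,
    # each a tuple over layers, each a tensor of shape
    # [batch * num_beams, step_seq_len, hidden_dim]
    # (step_seq_len is the prompt length for the first step and 1 afterwards).
    last_step_all_layers = generation_output.hidden_states[-1]
    last_step_last_layer = last_step_all_layers[-1]
    print(last_step_last_layer.shape)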

@mergify mergify bot added tpu Related to Google TPUs and removed tpu Related to Google TPUs labels Apr 9, 2025
Comment on lines +1718 to +1729
# overrides self.return_hidden_states that was
# assigned during initialization
# the rationale is giving users the option
# to receive hidden states or not
# from the same model w/o re-init it
if (model_input.sampling_metadata is not None
        and hasattr(model_input.sampling_metadata, 'seq_groups')
        and model_input.sampling_metadata.seq_groups is not None):
    self.return_hidden_states = (
        model_input.sampling_metadata.seq_groups[0].sampling_params.
        return_hidden_states)

Member


This will not work. You are making return_hidden_states a request-level parameter, which means you can get a mixed batch where some requests require token output while others require hidden-state output. This cannot be handled well.

I don't think we will accept this feature. If you really want to use it, you can just change the code in vllm/worker/model_runner.py to write the tensor output to a file, and then read the file directly.

A similar ask is to get attention masks from vLLM, which we will not accept either.

We might accept this as a tutorial, saying: this feature will not be supported in vLLM, but if you want to have it, here is how you can modify vLLM's code to achieve it, just for your own usage.
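
For anyone who wants to follow that suggestion, a rough sketch of such a local patch is below. The helper name, the dump directory, and the assumption that a local hidden-states tensor is available at that point in the model runner are all illustrative, not actual vLLM APIs:

    import os
    import torch

    def _dump_hidden_states(hidden_states: torch.Tensor, step: int,
                            out_dir: str = "/tmp/vllm_hidden_states") -> None:
        # Write one step's hidden states to disk for offline analysis.
        os.makedirs(out_dir, exist_ok=True)
        # Detach and move to CPU so the saved file does not pin GPU memory.
        torch.save(hidden_states.detach().cpu(),
                   os.path.join(out_dir, f"step_{step:06d}.pt"))

    # Inside a locally modified vllm/worker/model_runner.py, after the forward
    # pass produces the hidden-states tensor, you would call something like:
    #     _dump_hidden_states(hidden_states, step=current_step)
    #
    # and later, outside vLLM, read the dumps back with:
    #     states = torch.load("/tmp/vllm_hidden_states/step_000000.pt")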

@simon-mo
Collaborator

simon-mo commented Apr 19, 2025

@Settheworldonfireiii, thank you very much for submitting the PR and keeping it up to date, and thank you to everyone who chimed in with comments.

The vLLM maintainers have decided not to accept this PR at the moment. vLLM is designed for inference performance where the output is tokens, not tensors. In vLLM's architecture, the client process (where you instantiate the LLM class or the OpenAI API server) is separate from the worker processes running on GPU. Sending large hidden-state or attention tensors from GPU to the host CPU process significantly slows down processing and complicates the data structures. We cannot think of a performant way to support this in vLLM. Adding this as an optional field also means we would need to support it performantly going forward, hence we decided not to support it.

To achieve what you want to do, we believe using Transformers is already a great option. Transformers has wide model support, as well as support for a variety of quantization methods.
