Added the option of returning hidden states #15434


Closed

Conversation


@Settheworldonfireiii Settheworldonfireiii commented Mar 25, 2025

Added the option of returning hidden states in response to a flag in SamplingParams.

Files that were modified: engine/llm_engine.py, outputs.py, v1/outputs.py, v1/worker/gpu_model_runner.py, worker/model_runner.py, sampling_params.py, v1/engine/output_processor.py, v1/core/sched/scheduler.py, sequence.py, v1/serial_utils.py, v1/engine/__init__.py

I would be grateful if someone could run the entrypoint tests.


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of it by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Mar 25, 2025
@DarkLight1337
Member

cc @maxdebayser how does this fit with your implementation of #12249 ?

@mergify mergify bot added the frontend label Mar 26, 2025
@Settheworldonfireiii Settheworldonfireiii changed the title Added the option of returning hidden states when doing generate() Added the option of returning hidden states Mar 26, 2025
@Settheworldonfireiii
Author

Settheworldonfireiii commented Mar 26, 2025

cc @maxdebayser how does this fit with your implementation of #12249 ?

I think my PR largely duplicates #12249, but it is much narrower in scope: it outputs only the last step's hidden states from the last attention block, which are already extracted inside ModelRunner/GPUModelRunner, and thus it avoids the slowdown concerns voiced by some contributors in the discussion you mentioned. My intention was to build it in a minimally invasive way.

Looking further ahead, I was thinking of a larger update: with the rise of mechanistic interpretability, some users might want hidden states from specific layers, say, "I want the 5th attention block's hidden states" or "I want attention blocks 4-9 for steps 3-9" (passed as some sort of dictionary).

Whether to pass it as an argument to LLM() or as a SamplingParam to generate() is another question. Both seem fine to me, and I would like to hear your opinion.

The argument for passing it into generate() is that some users may want custom hidden states in one call to generate() and not in another call on the same instance.

However, implementing these custom hidden states might be more complicated: it would likely involve adding or modifying output processor classes, or even modifying the model implementations themselves, and thus, as @youkaichao noted, it might slow down inference.

So the alternative approach is to pass a flag/output processor at the instance level, and then specify in each inference request which hidden states the user wants. It would be a separate instance, very useful for mechinterp people and potentially slower, but as an optional feature, users who don't want any hidden states would not experience any slowdown.

I can implement it either way. I can give more exact dates once we clarify whether this line of work is needed and which of the approaches above I should implement.

The feature implemented in this PR is just a small enhancement that does not affect speed and does not offer much, but I can build more on top of it.

Also, I would like to mention that despite being labeled v1, it implements the hidden-state output feature for both V0 and V1.
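
For concreteness, here is a minimal sketch contrasting the two surfaces discussed above. The SamplingParams flag mirrors what this PR adds; the constructor-level argument in option B is purely hypothetical and only illustrates the per-instance alternative, and the model name is just an example:

    from vllm import LLM, SamplingParams

    # Option A (this PR): per-request opt-in via a flag on SamplingParams.
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    params = SamplingParams(max_tokens=32, return_hidden_states=True)
    outputs = llm.generate(["Hello, world"], sampling_params=params)
    print(outputs[0].hidden_states)

    # Option B (hypothetical): per-instance opt-in at engine construction, with the
    # per-request spec only selecting *which* hidden states to return. Neither the
    # argument name nor its semantics exist in vLLM today; shown only for contrast.
    # llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", return_hidden_states=True)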

@DarkLight1337
Member

@robertgshaw2-redhat is V1 runner stable enough to add new features like this to it? Otherwise I'm thinking of limiting this PR to V0 as a POC before extending it to V1.

@DarkLight1337
Member

Also @maxdebayser any updates on your progress? Let's coordinate this together.

@Settheworldonfireiii
Author

Settheworldonfireiii commented Mar 26, 2025

@DarkLight1337 also, for V0, I keep running into an error in my entrypoint tests, which I think is not related to my PR but rather to these lines:

    seq_group = engine._add_processed_request(
        request_id_i,
        params=params,
        **kwargs,
    )  # type: ignore

in the vllm/sequence.py file (https://github.com/vllm-project/vllm/blob/main/vllm/sequence.py), lines 1428-1439.

When I change request_id_i to request_id, the test that fails on Buildkite passes locally, but I want to double-check, since it was not me who introduced request_id_i and I do not fully understand the reason behind it.

@DarkLight1337
Member

Can you merge in the latest changes on main? Perhaps it got fixed recently

Outdated review thread on an added file:
@@ -0,0 +1,2139 @@
# SPDX-License-Identifier: Apache-2.0
Member


I think you added this file by mistake


Thank you!

I am a first-time contributor, so I wonder what my next steps are.

It passed the Buildkite tests and DCO, and now I have also fixed the potential reason for the mypy failure, even though it seemed strange that mypy failed with the following:

Error: vllm/engine/llm_engine.py:1119: error: "list[Any]" has no attribute "hidden_states" [attr-defined]

and my line 1119 in this file is:

output = [outputs_by_sequence_group[0][i]]

For the pre-commit checks, there are also some errors:

 Error: vllm/worker/model_runner.py:1719:81: E501 Line too long (85 > 80)
Error: vllm/worker/model_runner.py:1720:81: E501 Line too long (82 > 80)
Error: vllm/worker/model_runner.py:1723:81: E501 Line too long (81 > 80)

These were introduced by other contributors.

Member


Please install pre-commit by following the instructions here: https://docs.vllm.ai/en/latest/contributing/overview.html

Then you can run pre-commit run --all-files to check and fix the errors

Member


You can ignore the errors that are not from your PR


Apparently, some of them were accidentally introduced by me, but they are fixed now and I expect the checks will pass.


You can ignore the errors that are not from your PR

But now I guess everything is fixed; it passed all the tests.

@maxdebayser
Contributor

@DarkLight1337 , it seems to me that returning the raw hidden states could be a special case of the hidden states processor. Currently I'm still tinkering with V1, because I don't know how long V0 will be supported, but if you prefer we could restrict the scope to V0 first to make this feature available sooner.

@Settheworldonfireiii
Author

@DarkLight1337 , it seems to me that returning the raw hidden states could be a special case of the hidden states processor. Currently I'm still tinkering with V1, because I don't know how long V0 will be supported, but if you prefer we could restrict the scope to V0 first to make this feature available sooner.

I think in #12249 the main dilemma was whether to enable it per instance or per task.

Another question is whether returning custom hidden states, e.g. {steps: KV-cache filling + generation of the first 3 tokens, layers: embedding and the first 3 attention layers}, can be efficiently implemented without altering the model implementations themselves: looking at vllm/model_executor/models/, most model implementations do not seem to accumulate or save per-layer hidden states; they only return the final output of the model execution.

@maxdebayser
Contributor

I think in #12249 the main dilemma was whether to enable it per instance or per task.

Returning all the hidden layers should be something that is enabled per instance, in my opinion. One of the use cases of vLLM is to provide generic inference services in multi-tenant cloud environments. I think product managers would be reluctant to enable features by default whereby a few users could degrade performance for many other requests. Since this is for a very specific use case, it should be an opt-in feature.

Another question is whether returning custom hidden states, e.g. {steps: KV-cache filling + generation of the first 3 tokens, layers: embedding and the first 3 attention layers}, can be efficiently implemented without altering the model implementations themselves: looking at vllm/model_executor/models/, most model implementations do not seem to accumulate or save per-layer hidden states; they only return the final output of the model execution.

I think in PyTorch there are hooks to inspect and trace the execution of models, so you could get the hidden states without changing the model code, as long as you can identify the correct layers.
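
For illustration, a minimal sketch of the hook-based approach on a plain torch.nn module (the toy model below is a stand-in; the same mechanism applies to the model objects built in vllm/model_executor/models/, provided the right submodules can be located):

    import torch
    import torch.nn as nn

    # Toy stand-in for a model; the same mechanism works on any nn.Module.
    model = nn.Sequential(
        nn.Linear(16, 32),
        nn.ReLU(),
        nn.Linear(32, 16),
    )

    captured = {}

    def make_hook(name):
        # A forward hook receives (module, inputs, output) after the module runs,
        # so intermediate activations can be recorded without touching model code.
        def hook(module, inputs, output):
            captured[name] = output.detach()
        return hook

    # Register hooks only on the layers of interest.
    handles = [layer.register_forward_hook(make_hook(f"layer_{i}"))
               for i, layer in enumerate(model) if isinstance(layer, nn.Linear)]

    with torch.no_grad():
        model(torch.randn(2, 16))

    for handle in handles:
        handle.remove()

    print({name: tensor.shape for name, tensor in captured.items()})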

@Settheworldonfireiii
Author

Returning all the hidden layers should be something that is enabled per instance, in my opinion. One of the use cases of vLLM is to provide generic inference services in multi-tenant cloud environments. I think product managers would be reluctant to enable features by default whereby a few users could degrade performance for many other requests. Since this is for a very specific use case, it should be an opt-in feature.

Agree.

I think in pytorch there are hooks to inspect and trace the execution of models, so you could get the hidden states without changing the model code as long as you can identify the correct layers.

That is one way of implementing it. Hooks can slow things down significantly, but if there is no other way I will probably go with them. I need to investigate and think about it more.

@mergify mergify bot added the tpu Related to Google TPUs label Mar 27, 2025
@Settheworldonfireiii Settheworldonfireiii force-pushed the main branch 4 times, most recently from 9f7c09c to 11d839a Compare March 27, 2025 03:12
Collaborator

@WoosukKwon WoosukKwon left a comment


Thanks for the PR. I think this PR needs a wider discussion. I'd like to hold it off.

@Settheworldonfireiii
Author

Settheworldonfireiii commented Apr 2, 2025

PTAL

@DarkLight1337
I am working on it. As an alternative, maybe we could merge this branch?

0b50102

it passed all the tests

@NishanthVAnand

Thanks for the PR. I think this PR needs a wider discussion. I'd like to hold it off.

@WoosukKwon could you please share more details on what you mean by a wider discussion? Like many mechanistic interpretability researchers, I have been looking forward to this feature for some time.

@Settheworldonfireiii
Author

Settheworldonfireiii commented Apr 2, 2025

Thanks for the PR. I think this PR needs a wider discussion. I'd like to hold it off.

@WoosukKwon which changes would you like to see?

I could probably make significant changes starting next week.

@Settheworldonfireiii
Author

When this commit is reviewed and merged, what does the HF transformers code below translate to in order to get the embeddings? And how much speedup can we expect from vLLM?

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# device, text_obs and layer are defined elsewhere in the surrounding code
local_dir = "/network/weights/llama.var/llama_3.1/Meta-Llama-3.1-8B-Instruct/"
llm_pretrained = AutoModelForCausalLM.from_pretrained(local_dir, torch_dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained(local_dir)

batch_tokens = tokenizer(text_obs, return_tensors="pt", add_special_tokens=False, padding=True, truncation=True).to(llm_pretrained.device)
with torch.no_grad():
    hidden_states = llm_pretrained(**batch_tokens, output_hidden_states=True)["hidden_states"][layer]

@NishanthVAnand this particular PR will only allow outputting the last attention block's hidden states, because of the potential slowdown concerns.

@maxdebayser and I have separately been working on more flexible and more informative ways of outputting custom hidden states from any layer at any particular step, but I have other projects and also need to finish this branch, so it will be a while.

@Settheworldonfireiii
Author

Settheworldonfireiii commented Apr 2, 2025

@DarkLight1337 it passed all tests now.

So I think we have two branches ready to merge for V0: this one and that one, unless @WoosukKwon specifies which changes he would like to see, so that I can implement them before the merge.

Signed-off-by: Settheworldonfireiii <[email protected]>
Signed-off-by: Settheworldonfireiii <[email protected]>
@Settheworldonfireiii
Author

Settheworldonfireiii commented Apr 2, 2025

@NishanthVAnand as for the code you provided, with the current functionality I implemented, it is going to be something like:

  model = LLM(
      model_name,
      # additional arguments like tensor_parallel_size, etc.
  )
  sampling_params = SamplingParams(
      # some other arguments, like max_tokens, min_tokens, stop_token_ids, etc.
      return_hidden_states=True,
  )
  o = model.generate(
      prompt,
      sampling_params=sampling_params,
  )
  # to print out the hidden states
  print(o[0].hidden_states)

In V1, supposedly it is going to be something like:

  model = LLM(
      model_name,
      # additional arguments like tensor_parallel_size, etc.
  )
  sampling_params = SamplingParams(
      # some other arguments, like max_tokens, min_tokens, stop_token_ids, etc.
      hidden_states_to_return={
          "layers": {0: 3},
          "sublayers": {"attention", "fc1"},
          "steps": {"prefill", "first", "14"},
      },
  )
  # meaning you will return only the output of the first 3 attention blocks' MHA and the output of the first
  # fully connected layer within these attention blocks, and only the KV-cache prefill and the first 14 generation
  # steps
  o = model.generate(
      prompt,
      sampling_params=sampling_params,
  )
  # to print out the hidden states
  print(o[0].hidden_states)

@DarkLight1337
Member

This looks good to me now - @WoosukKwon could you explain the concerns you have regarding this PR?

@TachyonGun

This functionality would be very useful for folks trying to do mechinterp research. Using vLLM to retrieve hidden states would be very helpful for collecting datasets of model activations for training sparse autoencoders, or for other kinds of analyses.

@Settheworldonfireiii
Author

@DarkLight1337 @WoosukKwon any updates/recommendations with respect to changes?

@sastpg

sastpg commented Apr 9, 2025

I find this feature useful!

Does generation_output.hidden_states in the following HF transformers code produce the same output as o[0].hidden_states?

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    AutoConfig,
)
# MODEL_REPO, args, config_kwargs, input_ids, device, generation_config and
# max_output_token are defined elsewhere in the surrounding code
config = AutoConfig.from_pretrained(MODEL_REPO + args.model_name, **config_kwargs)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_REPO + args.model_name,
    config=config,
    torch_dtype=torch.float32,
    device_map='auto',
    trust_remote_code=True
)
generation_output = model.generate(
    input_ids=input_ids.to(device),
    generation_config=generation_config,
    max_new_tokens=max_output_token,
    output_attentions=True,
    output_hidden_states=True,
    output_scores=True,
    return_dict_in_generate=True,  # needed so that .hidden_states is available on the output
    do_sample=False,
)

print(generation_output.hidden_states)     # output_len x layer_num x sampling_num x beam_search x hidden_dim

Thank you very much! I'm looking forward to this new feature.
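
For reference, with return_dict_in_generate=True, HF's generate() exposes hidden_states as a tuple indexed by generation step, each element itself a tuple indexed by layer, so the closest analogue of a single "last step, last layer" tensor would be something like the sketch below (shapes are illustrative):

    # generation_output.hidden_states: tuple over generation steps,
    # each a tuple over layers, each a tensor of shape
    # [batch * num_beams, step_seq_len, hidden_dim]
    # (step_seq_len is the prompt length for the first step and 1 afterwards).
    last_step_all_layers = generation_output.hidden_states[-1]
    last_step_last_layer = last_step_all_layers[-1]
    print(last_step_last_layer.shape)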

@mergify mergify bot added tpu Related to Google TPUs and removed tpu Related to Google TPUs labels Apr 9, 2025
Comment on lines +1718 to +1729
# overrides self.return_hidden_states that was
# assigned during initialization
# the rationale is giving users the option
# to receive hidden states or not
# from the same model w/o re-init it
if (model_input.sampling_metadata is not None
        and hasattr(model_input.sampling_metadata, 'seq_groups')
        and model_input.sampling_metadata.seq_groups is not None):
    self.return_hidden_states = (
        model_input.sampling_metadata.seq_groups[0].sampling_params.
        return_hidden_states)

Member


This will not work. You are making return_hidden_states a request-level parameter, which means you can get a mixed batch where some requests require token output while others require hidden-state output. This cannot be handled well.

I don't think we will accept this feature. If you really want to use it, you can just change the code in vllm/worker/model_runner.py to write the tensor output to a file, and then read the file directly.

A similar ask is to get attention masks from vLLM, which we will not accept either.

We might accept this as a tutorial, saying: this feature will not be supported in vLLM, but if you want to have it, here is how you can modify vLLM's code to achieve it, just for your own usage.
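
For anyone who wants to follow that suggestion, a rough sketch of such a local patch is below. The helper name, the dump directory, and the assumption that a local hidden-states tensor is available at that point in the model runner are all illustrative, not actual vLLM APIs:

    import os
    import torch

    def _dump_hidden_states(hidden_states: torch.Tensor, step: int,
                            out_dir: str = "/tmp/vllm_hidden_states") -> None:
        # Write one step's hidden states to disk for offline analysis.
        os.makedirs(out_dir, exist_ok=True)
        # Detach and move to CPU so the saved file does not pin GPU memory.
        torch.save(hidden_states.detach().cpu(),
                   os.path.join(out_dir, f"step_{step:06d}.pt"))

    # Inside a locally modified vllm/worker/model_runner.py, after the forward
    # pass produces the hidden-states tensor, you would call something like:
    #     _dump_hidden_states(hidden_states, step=current_step)
    #
    # and later, outside vLLM, read the dumps back with:
    #     states = torch.load("/tmp/vllm_hidden_states/step_000000.pt")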

@simon-mo
Collaborator

simon-mo commented Apr 19, 2025

@Settheworldonfireiii, thank you very much for submitting the PR and keeping it up to date, and thank you to everyone who chimed in with comments.

The vLLM maintainers have decided not to accept this PR at the moment. vLLM is designed for inference performance where the output is tokens, not tensors. In vLLM's architecture, the client process (where you instantiate the LLM class or the OpenAI API server) is separate from the worker processes running on GPU. Sending large hidden-state or attention tensors from GPU to the host CPU process significantly slows down processing and complicates the data structures. We cannot think of a performant way to support this in vLLM. Adding this as an optional field also means we would need to support it performantly going forward, hence we decided not to support it.

To achieve what you want to do, we believe using Transformers is already a great option. Transformers has wide model support, as well as support for a variety of quantization methods.
