Added the option of returning hidden states #15434
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger a full CI run by default; instead, only a small and essential subset of CI tests runs to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
cc @maxdebayser how does this fit with your implementation of #12249?
I think that my PR largely duplicates #12249, but it is much narrower in scope: it outputs only the last step's last attention block's hidden states, which are already extracted inside ModelRunner/GPUModelRunner, and thus avoids the slowdown concerns voiced by some contributors in the discussion you mentioned. My intention was to build it in a minimally invasive way.

Looking ahead, I was thinking of a larger update: since the rise of mechinterp, some users might want to get custom layers' hidden states, say, "I want the 5th attention block's hidden states", or "I want attention blocks 4-9's hidden states for steps 3-9" (passed as some sort of dictionary, as sketched below).

Now, whether to pass it as an argument into LLM(), or as a SamplingParam into generate(), is another question. Both seem fine to me, and I would like to hear your opinion. The argument for passing it into generate() is that some users may want custom hidden states in one call to generate() and not in another call on the same instance. However, implementing these custom hidden states might be more complicated, and would likely involve adding or modifying output processor classes, or even modifying the model implementation itself, and thus, as @youkaichao noted, it might slow down inference. So, the alternative approach is to pass some flag/output processor at the instance level, and then specify in each inference request which hidden states the user wants. It would be a separate opt-in mode, very useful for mechinterp people and potentially slower, but since it is optional, users who don't want any hidden states would not experience any slowdown.

I can implement it both ways. I can give more exact dates of when I can do it once we clarify whether this line of work is needed and which of the ways mentioned above I should implement. The feature implemented in this PR is just a small enhancement that does not affect speed and does not offer much, but I can build more. Also, I would like to mention that despite being annotated as v1, it implements the output hidden states feature for both v0.x and v1.
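A purely illustrative sketch of the two API shapes discussed above; none of these parameter names exist in vLLM today, they are only meant to make the design question concrete:

```python
from vllm import LLM, SamplingParams

# Option A: per-instance flag, fixed when the engine is constructed
llm = LLM(model="facebook/opt-125m", return_hidden_states=True)  # hypothetical kwarg

# Option B: per-request flag, so one generate() call can ask for hidden
# states while another call on the same instance does not
params = SamplingParams(
    max_tokens=8,
    # hypothetical request-level spec: "attention blocks 4-9 for steps 3-9"
    hidden_states_spec={"layers": range(4, 10), "steps": range(3, 10)},
)
outputs = llm.generate(["Hello, my name is"], params)
```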
@robertgshaw2-redhat is the V1 runner stable enough to add new features like this to it? Otherwise I'm thinking of limiting this PR to V0 as a POC before extending it to V1.
Also @maxdebayser, any updates on your progress? Let's coordinate this together.
@DarkLight1337 also, for v0, I keep running into an error in my entrypoint tests which I think is not related to my PR but rather to these lines in https://github.com/vllm-project/vllm/blob/main/vllm/sequence.py, lines 1428-1439. When I change
Can you merge in the latest changes on main? Perhaps it got fixed recently.
@@ -0,0 +1,2139 @@
# SPDX-License-Identifier: Apache-2.0
I think you added this file by mistake
thank you!
I am a first-time contributor, so I wonder what my next actions are?
It passed the Buildkite tests and DCO, and now I have also fixed the potential reason for the mypy failure (even though it seemed strange that it failed mypy with the following:
Error: vllm/engine/llm_engine.py:1119: error: "list[Any]" has no attribute "hidden_states" [attr-defined]
while my line 1119 in this file is:
output = [outputs_by_sequence_group[0][i]]
).
For the pre-commit tests, there are also some errors:
Error: vllm/worker/model_runner.py:1719:81: E501 Line too long (85 > 80)
Error: vllm/worker/model_runner.py:1720:81: E501 Line too long (82 > 80)
Error: vllm/worker/model_runner.py:1723:81: E501 Line too long (81 > 80)
These were introduced by other contributors.
Please install pre-commit as described here: https://docs.vllm.ai/en/latest/contributing/overview.html
Then you can run pre-commit run --all-files to check and fix the errors.
You can ignore the errors that are not from your PR
Apparently, some of them were accidentally introduced by me... but now it is fixed and I expect it will pass.
But now I guess everything is fixed; it passed all the tests.
@DarkLight1337, it seems to me that returning the raw hidden states could be a special case of the hidden states processor. Currently I'm still tinkering with V1, because I don't know how long V0 will be supported, but if you prefer we could restrict the scope to V0 first to make this feature available sooner.
I think in #12249 the main dilemma was whether to create it per instance or per task. Another question is whether returning custom hidden states, e.g. {steps: "kv-cache filling + generation of the first 3 characters", layers: "embedding and the first 3 attention layers"}, can be efficiently implemented without altering the model implementation itself, because by looking at
Returning all the hidden layers should be something that is enabled per instance, in my opinion. One of the use cases of vLLM is to provide generic inference services in multi-tenant cloud environments, and I think product managers would be reluctant to enable features by default whereby a few users could degrade the performance for many other requests. Since this is for a very specific use case, it should be an opt-in feature.
I think in PyTorch there are hooks to inspect and trace the execution of models, so you could get the hidden states without changing the model code, as long as you can identify the correct layers (see the sketch below).
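A minimal sketch of that hook-based approach, shown on a toy model rather than a vLLM one; which layer to hook and how to handle its output depend on the actual model implementation:

```python
import torch
import torch.nn as nn

captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        # detach so the autograd graph is not kept alive
        captured[name] = output.detach()
    return hook

# toy stand-in for a real model; in vLLM you would look the layer up by name
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
handle = model[1].register_forward_hook(make_hook("relu_out"))

with torch.no_grad():
    model(torch.randn(4, 16))

print(captured["relu_out"].shape)  # torch.Size([4, 32])
handle.remove()  # remove the hook once you are done capturing
```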
Agree.
That is one way of implementing it. Hooks can slow things down significantly, but if there is no other way I will probably go with them. I need to investigate/think about it more.
Thanks for the PR. I think this PR needs a wider discussion. I'd like to hold it off.
@DarkLight1337 it passed all the tests.
@WoosukKwon could you please share more details on what you mean by a wider discussion? Like many mechanistic interpretability researchers, I have been looking forward to this feature for some time now.
@WoosukKwon which changes would you like to see? I could probably change it significantly starting from next week.
@NishanthVAnand this particular merge will only allow outputting the last attention block's hidden states, because of the potential slowdown concerns; @maxdebayser and I were separately working on more flexible and more informative ways of outputting custom hidden states of any layer at any particular step, but I have other projects and also need to finish this branch, so it will be a while..
@DarkLight1337 it passed all tests now, so I think we have 2 branches ready to merge for v0 (this one and that one), unless @WoosukKwon specifies which changes he would like to see, so that I can implement them before merging.
@NishanthVAnand as for the code you provided, with the current functionality I implemented, it is going to be something like the sketch below; in v1 it is supposedly going to look much the same.
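A hedged reconstruction of that usage: the return_hidden_states flag on SamplingParams is what this PR adds, while the exact shape of the output attribute is illustrative:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=8, return_hidden_states=True)

outputs = llm.generate(["Hello, my name is"], params)
# hidden states of the last attention block at the last step, attached to
# the request output (attribute name following this PR's outputs.py changes)
hidden = outputs[0].hidden_states
```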
This looks good to me now - @WoosukKwon could you explain the concerns you have regarding this PR?
This functionality would be very useful for folks trying to do mechinterp research. Using vLLM to retrieve hidden states would be very helpful for collecting datasets of model activations for training sparse autoencoders, or for other kinds of analyses.
@DarkLight1337 @WoosukKwon any updates/recommendations w.r.t. the changes?
I find this feature useful to me! Does the following HF transformers code correspond to what this PR returns?
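For reference, the usual transformers way to get hidden states, which this question presumably refers to (the model name is purely illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

inputs = tok("Hello, my name is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# tuple of (num_layers + 1) tensors, each of shape [batch, seq_len, hidden]
last_hidden = out.hidden_states[-1]
```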
Thank you very much! I'm looking forward to this new feature.
```python
# overrides self.return_hidden_states that was
# assigned during initialization;
# the rationale is giving users the option
# to receive hidden states or not
# from the same model w/o re-init it
if (model_input.sampling_metadata is not None
        and hasattr(model_input.sampling_metadata, 'seq_groups')
        and model_input.sampling_metadata.seq_groups is not None):
    self.return_hidden_states = (
        model_input.sampling_metadata.seq_groups[0].sampling_params.
        return_hidden_states)
```
This will not work. You are putting return_hidden_states as a request-level parameter, which means you can get a mixed batch: some requests require token output, while some require hidden-state output. This cannot be handled well.
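To make the concern concrete, a hypothetical call where the two requests below could be scheduled in the same batch, yet the runner-level flag is taken from seq_groups[0] alone, so the second request's preference is silently ignored:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(
    ["prompt a", "prompt b"],
    [SamplingParams(return_hidden_states=True),    # flag from this PR
     SamplingParams(return_hidden_states=False)],  # ignored: runner uses seq_groups[0]
)
```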
I don't think we will accept this feature. If you really want to use it, you can just change the code in vllm/worker/model_runner.py to write the tensor output to a file, and then read the file directly.
A similar ask is to get attention masks from vLLM, which we will not accept either.
We might accept them as a tutorial, saying: this feature will not be supported in vLLM, but if you want to have it, here is how you can modify vLLM's code to achieve it, just for your own usage.
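A minimal sketch of that local modification; the variable name hidden_or_intermediate_states follows the v0 model runner and may differ in your checkout:

```python
import torch

def dump_hidden_states(tensor: torch.Tensor, path: str) -> None:
    # move off the GPU before saving so the file can be loaded anywhere
    torch.save(tensor.detach().cpu(), path)

# inside ModelRunner.execute_model in vllm/worker/model_runner.py, right
# after the forward pass:
#   dump_hidden_states(hidden_or_intermediate_states, "/tmp/hidden_states.pt")
# then read the file back offline with torch.load("/tmp/hidden_states.pt")
```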
@Settheworldonfireiii, thank you very much for submitting the PR and keeping it up to date, and thank you to everyone for chiming in with comments. The vLLM maintainers decided not to accept this PR at the moment. vLLM is designed for inference performance where the output is tokens, not tensors. In vLLM's architecture, the client process (where you instantiate the LLM class) is separate from the worker processes where the model, and hence its hidden states, actually live, so returning those tensors to the client does not fit the design.
To achieve what you want to do, we believe using Transformers is already a great option. Transformers has a wide range of model support, as well as support for a variety of quantization methods.
Added the option of returning hidden states in response to a flag in SamplingParams.
Files that were modified: engine/llm_engine.py, outputs.py, v1/outputs.py, v1/worker/gpu_model_runner.py, worker/model_runner.py, sampling_params.py, v1/engine/output_processor.py, v1/core/sched/scheduler.py, sequence.py, v1/serial_utils.py, v1/engine/__init__.py.
I will be grateful if someone runs the entrypoint tests.