[V1] Support cross-layer KV sharing #18212
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
# and value[:num_actual_tokens] because the reshape_and_cache_flash
# op uses the slot_mapping's shape to determine the number of
# actual tokens.
torch.ops._C_cache_ops.reshape_and_cache_flash(
Shall we add this for the Triton backend as well?
yes, seems like I missed some backends. will update
vllm/v1/core/kv_cache_utils.py
Outdated
@@ -675,14 +698,17 @@ def unify_hybrid_kv_cache_specs(kv_cache_spec: dict[str, KVCacheSpec]):
     if has_full_attention and has_sliding_window:
         for layer_name, spec in kv_cache_spec.items():
             if isinstance(spec, SlidingWindowSpec):
-                kv_cache_spec[layer_name] = FullAttentionSpec(
+                updated_spec = FullAttentionSpec(
                     block_size=spec.block_size,
                     num_kv_heads=spec.num_kv_heads,
                     head_size=spec.head_size,
                     dtype=spec.dtype,
                     use_mla=spec.use_mla,
                     sliding_window=spec.sliding_window,
Assign kv_sharing_target_layer_idx in __init__ directly?
This is because I marked the field kv_sharing_target_layer_idx with init=False. I did this because I wanted to set a default value (None) for this field, but I cannot do that without init=False because the field is added to the base parent dataclass (KVCacheSpec). If I leave init=True, it errors with "non-default argument follows default argument". See this Stack Overflow thread for more details on the nuances here.

TL;DR: the alternatives are 1) decompose and refactor the inheritance hierarchy of FullAttentionSpec, SlidingWindowSpec, AttentionSpec, and KVCacheSpec so we can have default fields in parent dataclasses, or 2) not enforce None as the default, which means updating every construction of an attention spec dataclass with an explicit kv_sharing_target_layer_idx=None. Comparatively, having to set kv_sharing_target_layer_idx after init seemed fine, since it is only used in a couple of places and it is not a user-facing dataclass. But I'm happy to consider other options folks feel are better.
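For illustration, here is a minimal, self-contained sketch of the dataclass constraint described above; the fields are simplified and the values at the bottom are made up:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class KVCacheSpec:
    block_size: int
    # Giving this field a plain default (= None) would make every non-default
    # field added by subclasses raise:
    #   TypeError: non-default argument 'num_kv_heads' follows default argument
    # Excluding it from __init__ sidesteps that, at the cost of assigning the
    # value after construction.
    kv_sharing_target_layer_idx: Optional[int] = field(default=None, init=False)

@dataclass
class AttentionSpec(KVCacheSpec):
    num_kv_heads: int
    head_size: int

@dataclass
class FullAttentionSpec(AttentionSpec):
    sliding_window: Optional[int] = None

spec = FullAttentionSpec(block_size=16, num_kv_heads=8, head_size=128)
spec.kv_sharing_target_layer_idx = 2  # assigned after init, as described above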
vllm/v1/kv_cache_interface.py
Outdated
@@ -100,6 +109,10 @@ def type_id(self) -> str:
        return f"full_attention_{self.block_size}_{self.page_size_bytes}"

    def max_memory_usage_bytes(self, vllm_config: VllmConfig) -> int:
        if self.kv_sharing_target_layer_idx is not None:
Extract this to the parent AttentionSpec and make it a property?
kv_sharing_target_layer_idx is part of KVCacheSpec, which is the parent of AttentionSpec. Are you saying we can make self.kv_sharing_target_layer_idx is not None itself a derived property of AttentionSpec?
This pull request has merge conflicts that must be resolved before it can be merged.
The entrypoints test failure is unrelated and is also failing on trunk (see https://buildkite.com/vllm/fastcheck/builds/24385).
@heheda12345 could you take a look?
Sorry for my late review. Some points that I want to discuss:

1. The name "KV sharing". Do you think "reuse" is a better name? I want to discuss this more because in [v1] Hybrid Memory Allocator #17996 I need to let multiple layers share the same memory pool but with different block_ids, and I think we need to distinguish between the "sharing" in this PR and in that PR. From my understanding, "reuse" is more accurate here because the layers are not equal: the first layer updates the KV cache and the following layers just reuse it. But I'm open to discussion; we need to agree on a name and keep it consistent in this PR.
2. A model with KV sharing should use less memory per block than another model with the same model config but without KV sharing. Where do you implement this logic now?
3. Is KV sharing compatible with KV connectors now?
4. I think we can make KV sharing more implicit. Basically, it should be possible to avoid changing code inside v1/core and kv_cache_interface.py: kv_cache_manager and kv_cache_utils don't need to know about KV sharing, and can run as if the layers with KV sharing do not exist. To mimic that, we can return only the layers with kv_sharing_target_layer_idx is None in GPUModelRunner.get_kv_cache_spec (sketched right after this list).
5. I prefer kv_sharing_target_layer_name over kv_sharing_target_layer_idx, as it has no ambiguity. For example, in BART we will have both decoder.layers.1.self_attn and decoder.layers.1.encoder_attn, and both have layer index 1.
6. Add a check that we only support KV sharing in V1.
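A rough sketch of what point 4 suggests; the helper names here (attn_layers, build_kv_cache_spec) are illustrative assumptions, not the PR's actual code:

def get_kv_cache_spec(attn_layers: dict[str, "Attention"]) -> dict[str, "KVCacheSpec"]:
    """Report KV cache specs only for layers that own their KV cache.

    Layers with a kv_sharing_target_layer_idx reuse another layer's cache,
    so the KV cache manager never needs to see them.
    """
    specs: dict[str, "KVCacheSpec"] = {}
    for layer_name, attn_module in attn_layers.items():
        if attn_module.kv_sharing_target_layer_idx is not None:
            # This layer reads an earlier layer's KV cache; allocate nothing.
            continue
        specs[layer_name] = attn_module.build_kv_cache_spec()
    return specs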
vllm/attention/layer.py
Outdated
self.kv_sharing_target_layer_idx = kv_sharing_target_layer_idx
if kv_sharing_target_layer_idx is not None:
    extra_impl_args['kv_sharing_target_layer_idx'] = (
        kv_sharing_target_layer_idx)
What about passing kv_sharing_target_layer_idx explicitly instead of treating it as part of extra_impl_args?
I did this because vllm/attention/layer.py seems to be the entrypoint for attention backends in both V0 and V1. I didn't want to support this in V0, so I just added it to the constructors of attention backends in V1.
I think we should not use magic code for V0 compatibility. We don't need to support KV sharing in V0; adding a check that kv_sharing_target_layer_idx is None to all attention backends in V0 is enough.
# if reusing KV cache from earlier layer, don't update KV cache
if self.kv_sharing_target_layer_idx is None:
    # Reshape the input keys and values and store them in the cache.
Suggested change:

if self.kv_sharing_target_layer_idx is None:
    # Reshape the input keys and values and store them in the cache.
    # Skip this if reusing KV cache from an earlier layer.
Same for other backends.
Please unify "reusing" & "sharing".
vllm/v1/worker/gpu_model_runner.py
Outdated
    str(target_layer_idx),
)

error_msg = textwrap.dedent(
I think it is more natural to check this in Attention.__init__ in layers.py, as it is a constraint from the definition of this argument rather than a constraint of our implementation of gpu_model_runner.
I agree with you, I'll make this change
tests/v1/core/test_kv_cache_utils.py
Outdated
"kv_sharing_factor", "available_mem_gb"), [ | ||
("Qwen/Qwen1.5-7B", 16385, 16384, 0, 8), | ||
("Qwen/Qwen1.5-7B", 16383, 16383, 0, 8), | ||
("Qwen/Qwen1.5-7B", 16383, 16383, 2, 4), |
Can we use a separate test for test_estimate_max_model_len with KV sharing?
This pull request has merge conflicts that must be resolved before it can be merged.
@heheda12345 thanks for taking a look. To answer your questions:

I didn't quite understand why it would be "less memory per block". I think we'll just have fewer physical KV blocks being used? Here is where the core memory savings come from: we don't allocate a KV cache for a layer if it has a target layer for KV sharing. I might be missing some other implementation details here; let's chat offline?

Not at the moment, I believe.

I explored this design, but I remember the complexity was just offloaded to a later stage, since we needed to handle KV allocation for layers without a KV cache spec anyway. But I think the APIs around KV cache groups have changed considerably since then; let me take a look again.
1. Sure, let's use "sharing". Please unify the concept in this PR.
2. We have less physical memory per KV block, thus we can increase num_gpu_blocks. Where is this logic?
3. What is the blocker for making it compatible with the KV connector?
4. Re: "as we needed to handle KV allocation for layers without a KV cache spec" — I think it may be possible to add a function in initialize_kv_cache to handle all the logic. Basically, that function needs to:
   - point the Attention.kv_cache to the target layer, as in vllm/vllm/v1/worker/gpu_model_runner.py, line 2004 in b0d8b59: kv_caches[layer_name] = kv_caches[target_layer_name]
   - add the shared layer to the KV cache group of its target layer, to help this loop in vllm/vllm/v1/worker/gpu_model_runner.py, line 620 in b0d8b59: for kv_cache_group_id, kv_cache_group_spec in enumerate(
   But I'm not sure whether I missed any complexity.

5 & 6: SG!

BTW, most of the merge conflicts come from a temporary revert (#18459). I think we can just work on the current branch now without rebasing.
@heheda12345 updated the PR with your feedback, could you take a look? Overview of changes:

I think the design is overall cleaner than before, thanks for the feedback. To answer your question: I looked into this a bit more. I think the current design is actually compatible with the KV connector, but we need to keep
LGTM in general. I left some small comments.
For KV connector, can you at least try https://github.com/vllm-project/vllm/tree/main/examples/offline_inference/disaggregated-prefill-v1 with a local model?
assert kv_sharing_target_layer_name is None, NotImplementedError(
    "KV sharing is not supported in V0.")
Suggested change:

assert kv_sharing_target_layer_name is None, "KV sharing is not supported in V0."
The current code is confusing because of the mix of AssertionError & NotImplementedError. Same for other attention backends.
Updated the code to unify around NotImplementedError instead of asserts.
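For reference, a minimal sketch of the unified pattern inside a V0 backend's __init__ (the exact message wording is illustrative):

if kv_sharing_target_layer_name is not None:
    raise NotImplementedError("KV sharing is not supported in V0.")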
vllm/attention/layer.py
Outdated
comp_str = ("is equal to" if current_layer_idx
            == target_layer_idx else "comes after")
Can you simplify the logic here? I think we don't need to distinguish between "is equal to" and "comes after".
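One possible simplification, assuming the check only needs to reject targets that do not come strictly before the current layer (variable names follow the snippet above; the message wording is made up):

if target_layer_idx >= current_layer_idx:
    raise ValueError(
        f"KV sharing target layer index {target_layer_idx} must come before "
        f"the current layer index {current_layer_idx}.")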
vllm/v1/attention/backends/utils.py
Outdated
f"KV sharing target layer for {layer_name} not valid. " | ||
f"{kv_target_layer_name} is not a Attention layer in the model.") |
f"KV sharing target layer for {layer_name} not valid. " | |
f"{kv_target_layer_name} is not a Attention layer in the model.") | |
f"KV sharing target layer for {layer_name} is not valid. " | |
f"{kv_target_layer_name} is not an Attention layer in the model.") |
vllm/v1/core/kv_cache_utils.py
Outdated
@@ -621,6 +621,8 @@ def _get_kv_cache_config_uniform_type(vllm_config: VllmConfig,
    assert len(page_sizes) == 1
    page_size = page_sizes.pop()

    # with cross-layer KV sharing, len(kv_cache_spec) may be less than no. of
    # attention layers, in which case more KV cache blocks can be allocated
People reading this part of the code may not be aware of KV sharing. I think it's better to explain the implementation of KV sharing here.
if (kv_tgt_layer :=
        attn_module.kv_sharing_target_layer_name) is not None:
    validate_kv_target_layer(layer_name, kv_tgt_layer, layers)
    # KV cache is shared with earlier layer, don't create a spec
    self.shared_kv_cache_layers[layer_name] = kv_tgt_layer
    continue
vllm/v1/worker/gpu_model_runner.py
Outdated
# with KV sharing, some layers can share KV caches with earlier layers
for layer_name, target_layer_name in self.shared_kv_cache_layers.items():
    kv_caches[layer_name] = kv_caches[target_layer_name]
    group_idx = layer_to_kv_cache_group_idx[target_layer_name]
    # attention metadata is assigned for each layer in layer_names
    kv_cache_config.kv_cache_groups[group_idx].layer_names.append(
        layer_name)
Can you wrap all logic in a utility function?
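A sketch of what such a utility could look like; the name follows the initialize_kv_cache_for_kv_sharing suggestion later in this thread, while the exact signature and import paths are assumptions:

import torch

from vllm.v1.kv_cache_interface import KVCacheGroupSpec


def initialize_kv_cache_for_kv_sharing(
    shared_kv_cache_layers: dict[str, str],
    kv_caches: dict[str, torch.Tensor],
    kv_cache_groups: list[KVCacheGroupSpec],
    layer_to_kv_cache_group_idx: dict[str, int],
) -> None:
    """Set up layers that share the KV cache of an earlier layer.

    shared_kv_cache_layers maps a layer that reuses a KV cache to the earlier
    layer that owns it. Each reusing layer is pointed at the owning layer's
    tensor and appended to that layer's KV cache group so attention metadata
    is still built for it.
    """
    for layer_name, target_layer_name in shared_kv_cache_layers.items():
        kv_caches[layer_name] = kv_caches[target_layer_name]
        group_idx = layer_to_kv_cache_group_idx[target_layer_name]
        kv_cache_groups[group_idx].layer_names.append(layer_name)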
vllm/v1/worker/gpu_model_runner.py
Outdated
validate_kv_target_layer(layer_name, kv_tgt_layer, layers)
# KV cache is shared with earlier layer, don't create a spec
1. Can you explain how KV sharing is implemented here?
2. Can you merge the validation here with the checks in attention/layer.py? If the validation code is too long, I suggest creating a utility function there.
For 2), the reason I added the validation in the model runner is that the attention layer currently doesn't have info about the rest of the layers in the model (so we cannot validate that the target has the same attention type, or that it exists at all).

We could pass it as an additional arg to the layer, but a) I'm not sure it makes sense for each layer to be aware of the others, and b) the user would have to pass in two args (layer info and target layer name) for KV sharing to work. Does that make sense?
You can access the target layer with the static forward context. Is that enough?

Line 152 in 1661a9c:
compilation_config.static_forward_context[prefix] = self
This pull request has merge conflicts that must be resolved before it can be merged.
Updated to address the comments.

Tried it and it still works.
Great job! Really appreciate the detailed tests and input verification. I think this PR is good except for some very small items.
@@ -2061,6 +2065,13 @@ def initialize_kv_cache(self, kv_cache_config: KVCacheConfig) -> None:
            # KV cache specs.
            raise ValueError("Unknown KV cache spec type.")

        add_shared_kv_layers(
Suggested change:

# Setup `kv_cache_config` and `kv_caches` for models with cross-layer KV sharing
if self.shared_kv_cache_layers:
    initialize_kv_cache_for_kv_sharing(
I prefer to add a branch so that people can skip reading the function when not working with such models.
Also, can you put lines 2018 and 2059 into this utility function?
    kv_caches: dict[str, torch.Tensor],
    kv_cache_groups: list[KVCacheGroupSpec],
    layer_to_kv_cache_group_idx: dict[str, int],
) -> None:
Can you add a docstring to this function?
@@ -275,6 +275,8 @@ def __init__(
            pin_memory=self.pin_memory)
        self.seq_lens_np = self.seq_lens_cpu.numpy()

        self.shared_kv_cache_layers: dict[str, str] = {}
Can you add a comment for the "direction" of layer sharing here?
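For instance, the comment could read something like this (the layer names in the example are hypothetical):

# Maps a layer that reuses another layer's KV cache to the earlier layer that
# owns it, e.g. {"model.layers.20.attn": "model.layers.10.attn"} means layer 20
# reads layer 10's KV cache and never writes one of its own.
self.shared_kv_cache_layers: dict[str, str] = {}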
Motivation
Some models like Tencent-Hunyuan-Large (#10043) and Hymba-1.5B-Base (#10783) use cross-layer KV sharing (e.g. Cross-Layer Attention). This PR adds the ability for KV caches to be shared between attention layers.
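For illustration, a model might opt in roughly like this; the keyword argument reflects the kv_sharing_target_layer_name API discussed above, while the surrounding constructor arguments and layer names are simplified placeholders:

from vllm.attention import Attention

# Layer 10 owns its KV cache as usual.
attn_10 = Attention(
    num_heads=32,
    head_size=128,
    scale=128 ** -0.5,
    prefix="model.layers.10.self_attn.attn",
)

# Layer 20 names layer 10 as its KV sharing target: it reads layer 10's KV
# cache during attention and skips writing a cache of its own.
attn_20 = Attention(
    num_heads=32,
    head_size=128,
    scale=128 ** -0.5,
    prefix="model.layers.20.self_attn.attn",
    kv_sharing_target_layer_name="model.layers.10.self_attn.attn",
)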
Testing
Sanity Check
As a sanity check that the implementation is working, I made all layers after the 18th layer in Qwen/Qwen3-8B (36 layers total) share the KV cache of layer 18 and printed out the id() of the KV cache used in the attention forward pass:
As expected, layers 19 to 36 are reusing the KV cache allocated by layer 18.
Unit Tests
All newly added unit tests pass:
Evals
Checked the GSM8K score before and after this PR on Qwen/Qwen3-8B:
Before PR:
After PR:
also cc: @heheda12345