
[V1] Support cross-layer KV sharing #18212


Open
wants to merge 3 commits into base: main

Conversation

sarckk
Collaborator

@sarckk sarckk commented May 15, 2025

Motivation

Some models like Tencent-Hunyuan-Large (#10043) and Hymba-1.5B-Base (#10783) use cross-layer KV sharing (e.g. Cross-Layer Attention). This PR adds the ability for KV caches to be shared between attention layers.


Testing

Sanity Check

As a sanity check that the implementation is working, I made all layers after the 18th layer in Qwen/Qwen3-8B (36 layers total) reuse the 18th layer's KV cache, and printed out the id() of the KV cache used in attention forward:

model.layers.0.self_attn.attn => 139678446053136
model.layers.1.self_attn.attn => 139678446059136
…
model.layers.15.self_attn.attn => 139678446045456
model.layers.16.self_attn.attn => 139678446055056
model.layers.17.self_attn.attn => 139678446050736
model.layers.18.self_attn.attn => 139678446050736
model.layers.19.self_attn.attn => 139678446050736
…
model.layers.32.self_attn.attn => 139678446050736
model.layers.33.self_attn.attn => 139678446050736
model.layers.34.self_attn.attn => 139678446050736
model.layers.35.self_attn.attn => 139678446050736 

As expected, layers 19 to 36 are re-using the KV cache allocated by layer 18.
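For intuition, the sharing pattern above boils down to several layer names pointing at the same tensor object. A minimal standalone sketch (toy shapes, not vLLM code) that reproduces the id() pattern:

```python
import torch

num_layers = 36
# Toy KV cache shape: (2, num_blocks, block_size, num_kv_heads, head_size)
kv_caches: dict[str, torch.Tensor] = {}
for i in range(num_layers):
    name = f"model.layers.{i}.self_attn.attn"
    if i <= 17:
        # Layers 1-18 (indices 0-17) each allocate their own KV cache.
        kv_caches[name] = torch.zeros(2, 4, 16, 8, 128)
    else:
        # Layers 19-36 (indices 18-35) reuse the cache allocated by layer 18.
        kv_caches[name] = kv_caches["model.layers.17.self_attn.attn"]

for name, cache in kv_caches.items():
    print(name, "=>", id(cache))  # shared layers print the same id
```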

Unit Tests

All newly added unit tests pass:

pytest tests/v1/worker/test_gpu_model_runner.py -k "test_init_kv_cache"

Evals

I checked the gsm8k score before and after this PR on Qwen/Qwen3-8B:

lm_eval --model vllm --tasks gsm8k --model_args pretrained=Qwen/Qwen3-8B,tensor_parallel_size=1 --batch_size auto

Before PR:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8795|±  |0.0090|
|     |       |strict-match    |     5|exact_match|↑  |0.8734|±  |0.0092|

After PR:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8802|±  |0.0089|
|     |       |strict-match    |     5|exact_match|↑  |0.8734|±  |0.0092|

also cc: @heheda12345


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of it by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 and tpu (Related to Google TPUs) labels May 15, 2025
@sarckk sarckk marked this pull request as ready for review May 15, 2025 17:31
# and value[:num_actual_tokens] because the reshape_and_cache_flash
# op uses the slot_mapping's shape to determine the number of
# actual tokens.
torch.ops._C_cache_ops.reshape_and_cache_flash(
Collaborator

Shall we add this for the Triton backend as well?

Collaborator Author

yes, seems like I missed some backends. will update

@@ -675,14 +698,17 @@ def unify_hybrid_kv_cache_specs(kv_cache_spec: dict[str, KVCacheSpec]):
if has_full_attention and has_sliding_window:
for layer_name, spec in kv_cache_spec.items():
if isinstance(spec, SlidingWindowSpec):
kv_cache_spec[layer_name] = FullAttentionSpec(
updated_spec = FullAttentionSpec(
block_size=spec.block_size,
num_kv_heads=spec.num_kv_heads,
head_size=spec.head_size,
dtype=spec.dtype,
use_mla=spec.use_mla,
sliding_window=spec.sliding_window,
Collaborator

assign kv_sharing_target_layer_idx in init directly?

Collaborator Author

This is because I marked the field kv_sharing_target_layer_idx with init=False. I did this because I wanted to give this field a default value (None), but I cannot do that without init=False, since the field is added to the base parent dataclass (KVCacheSpec). With init=True, it errors with "non-default argument follows default argument". See this Stack Overflow thread for more details on the nuances here.

TL;DR: the alternatives are 1) to decompose and refactor the inheritance hierarchy of FullAttentionSpec, SlidingWindowSpec, AttentionSpec and KVCacheSpec so that we can have default fields in parent dataclasses, or 2) to not enforce None as the default, meaning we need to update all constructions of attention spec dataclasses with an explicit kv_sharing_target_layer_idx=None. Comparatively, having to set kv_sharing_target_layer_idx after init seemed fine, since it is only used in a couple of places and it is not a user-facing dataclass. But happy to consider other options folks feel are better.
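For context, here is a standalone sketch of the dataclass constraint described above (the classes are simplified stand-ins for the real specs):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class KVCacheSpec:
    block_size: int
    # init=False keeps this field out of __init__, so it can default to None
    # without forcing subclass fields declared after it to also have defaults.
    kv_sharing_target_layer_idx: Optional[int] = field(default=None, init=False)


@dataclass
class AttentionSpec(KVCacheSpec):
    # With init=True on the parent field above, this non-default field would
    # raise "non-default argument follows default argument".
    num_kv_heads: int


spec = AttentionSpec(block_size=16, num_kv_heads=8)
spec.kv_sharing_target_layer_idx = 2  # set after init, as done in this PR
```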

@@ -100,6 +109,10 @@ def type_id(self) -> str:
return f"full_attention_{self.block_size}_{self.page_size_bytes}"

def max_memory_usage_bytes(self, vllm_config: VllmConfig) -> int:
if self.kv_sharing_target_layer_idx is not None:
Collaborator

Extract to the parent AttentionSpec and make this a property?

Collaborator Author

kv_sharing_target_layer_idx is part of KVCacheSpec, which is the parent of AttentionSpec. Are you saying we can make self.kv_sharing_target_layer_idx is not None itself a derived property of AttentionSpec?

@heheda12345 heheda12345 self-requested a review May 16, 2025 01:51

mergify bot commented May 18, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @sarckk.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@sarckk
Collaborator Author

sarckk commented May 20, 2025

entrypoints test failure is unrelated and failing on trunk (see https://buildkite.com/vllm/fastcheck/builds/24385)

@sarckk
Collaborator Author

sarckk commented May 20, 2025

@heheda12345 could you take a look?

Collaborator

@heheda12345 heheda12345 left a comment

Sorry for my late review. Some points that I want to discuss:

  1. The name "KV sharing". Do you think "reuse" is a better name? I want to discuss this more because in [v1] Hybrid Memory Allocator #17996, I need to let multiple layers share the same memory pool but with different block_ids, and I think we need to distinguish between the "sharing" in this PR and that PR. From my understanding, "reuse" is more accurate here because the layers are not equal: the first layer updates the KV cache and the following layers just reuse the first layer's cache. But open to discussion. We need to agree on a name and keep it consistent in this PR.
  2. One model with KV sharing should use less memory per block than another model with the same model config but without KV sharing. Where do you implement this logic now?
  3. Is KV sharing compatible with KV connectors now?
  4. I think we can make KV sharing more implicit. Basically, I think it is possible to avoid changing code inside v1/core & kv_cache_interface.py. kv_cache_manager & kv_cache_utils don't need to know about KV sharing; they can run as if the layers that share another layer's KV cache do not exist. To mimic this, we can only return layers whose kv_sharing_target_layer_idx is None in GPUModelRunner.get_kv_cache_spec.
  5. I prefer kv_sharing_target_layer_name over kv_sharing_target_layer_idx as it has no ambiguity. For example, in bart, we have both decoder.layers.1.self_attn and decoder.layers.1.encoder_attn; both have layer index 1.
  6. Add a check that KV sharing is only supported in V1.

Comment on lines 106 to 109
self.kv_sharing_target_layer_idx = kv_sharing_target_layer_idx
if kv_sharing_target_layer_idx is not None:
    extra_impl_args['kv_sharing_target_layer_idx'] = (
        kv_sharing_target_layer_idx)
Collaborator

What about passing kv_sharing_target_layer_idx explicitly instead of treating it as one of the extra_impl_args?

Collaborator Author

I did this because vllm/attention/layer.py seems to be the entrypoint for attention backends in both V0 and V1. I didn't want to support this in V0 so I just added this to the constructor of attention backends in V1.

Collaborator

I think we should not use magic code for V0 compatibility. We don't need to support KV sharing in V0. Adding a check that kv_sharing_target_layer_idx is None to all V0 attention backends is enough.

Comment on lines 575 to 576
# if reusing KV cache from earlier layer, don't update KV cache
if self.kv_sharing_target_layer_idx is None:
    # Reshape the input keys and values and store them in the cache.
Collaborator

Suggested change
- # if reusing KV cache from earlier layer, don't update KV cache
- if self.kv_sharing_target_layer_idx is None:
-     # Reshape the input keys and values and store them in the cache.
+ if self.kv_sharing_target_layer_idx is None:
+     # Reshape the input keys and values and store them in the cache.
+     # Skip this if reusing KV cache from an earlier layer.

Same for other backends.

Collaborator

Pls unify reusing & sharing

str(target_layer_idx),
)

error_msg = textwrap.dedent(
Collaborator

I think it is more natural to check this in Attention.__init__ in layers.py, as it is a constraint from the definition of this argument rather than a constraint of our gpu_model_runner implementation.

Collaborator Author

I agree with you, I'll make this change

"kv_sharing_factor", "available_mem_gb"), [
("Qwen/Qwen1.5-7B", 16385, 16384, 0, 8),
("Qwen/Qwen1.5-7B", 16383, 16383, 0, 8),
("Qwen/Qwen1.5-7B", 16383, 16383, 2, 4),
Collaborator

Can we use a separate test for test_estimate_max_model_len with KV sharing?


mergify bot commented May 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @sarckk.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 21, 2025
@sarckk
Collaborator Author

sarckk commented May 22, 2025

@heheda12345 thanks for taking a look. To answer your questions:

  1. I prefer 'KV sharing' simply because it seems to be the academic term for this kind of thing (e.g. see https://arxiv.org/abs/2410.14442), whereas 'KV reuse' seems to be used to refer to something else (e.g. prefix caching, https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/).

  2. > One model with kv sharing should use less memory per block than another model with the same model config but without kv sharing.

     I didn't quite understand why it would be "less memory per block". I think we'll just have fewer physical KV blocks being used? Here is where the core memory savings come from: we skip allocation when a layer has a KV sharing target. I might be missing some other implementation details here, let's chat offline?

  3. > Is KV sharing compatible with kv connectors now?

     Not at the moment, I believe.

  4. > To mimic it, we can only return layers with kv_sharing_target_layer_idx is None

     I explored this design, but I remember the complexity was just offloaded to a later stage, since we needed to handle KV allocation for layers without a KV cache spec anyway. The APIs around KV cache groups have changed considerably since then, though, so let me take a look again.

  5. Yes, this is a good point. Some models explicitly keep track of the FQN for each layer, so it shouldn't be difficult. I'll make this change.

  6. Yes, I will add this check.

Collaborator

@heheda12345 heheda12345 left a comment

  1. Sure. Let's use "sharing". Please unify the concept in this PR.
  2. We have less physical memory per KV block, thus we can increase num_gpu_blocks. Where is this logic?
  3. What is the blocker for making it compatible with the KV connector?
  4. "as we needed to handle KV allocation for layers without a KV cache spec" — I think it may be possible to add a function in initialize_kv_cache to handle all the logic. Basically, that function needs:
    1. to point Attention.kv_cache to the target layer, like
      kv_caches[layer_name] = kv_caches[target_layer_name]
    2. to add the shared layer to the KV cache group of its target layer, to help this loop:
      for kv_cache_group_id, kv_cache_group_spec in enumerate(

But I'm not sure whether I'm missing any complexity.
5 & 6: SG!
BTW, most of the merge conflicts come from a temporary revert (#18459). I think we can just work on the current branch now without rebasing.


@sarckk
Collaborator Author

sarckk commented May 28, 2025

@heheda12345 updated the PR with your feedback. could you take a look?

overview of changes:

  • removed all references to "reuse" and unified on the term "sharing"
  • standardized on the layer FQN (kv_sharing_target_layer_name) instead of the layer index (kv_sharing_target_layer_idx) to avoid ambiguity
  • as suggested, made KV sharing implicit by only returning layers without a KV sharing target in the get_kv_cache_spec method of the model runner (GPU and TPU); see the sketch after this list
  • with this design, the logic of "We have less physical memory per KV block, thus we can increase num_gpu_blocks" is handled implicitly below; added comments to explain why more GPU blocks can be allocated with cross-layer KV sharing
    # with cross-layer KV sharing, len(kv_cache_spec) may be less than no. of
    # attention layers, in which case more KV cache blocks can be allocated
    num_blocks = int(available_memory // page_size // len(kv_cache_spec))
  • Added target layer validation and V1-only support check at the Attention layer level.
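A toy sketch of the "implicit" approach described above (ToyAttention, the placeholder spec string, and the free function are illustrative stand-ins, not the PR's code; only kv_sharing_target_layer_name, shared_kv_cache_layers, and get_kv_cache_spec are real names from this PR):

```python
from typing import Optional


class ToyAttention:
    def __init__(self, kv_sharing_target_layer_name: Optional[str] = None):
        self.kv_sharing_target_layer_name = kv_sharing_target_layer_name


def get_kv_cache_spec(
    attn_layers: dict[str, ToyAttention],
    shared_kv_cache_layers: dict[str, str],
) -> dict[str, str]:
    """Return a 'spec' only for layers that own a KV cache; layers that share
    another layer's cache are recorded in shared_kv_cache_layers and skipped,
    so the KV cache manager can run as if they do not exist."""
    kv_cache_spec: dict[str, str] = {}
    for layer_name, attn in attn_layers.items():
        target = attn.kv_sharing_target_layer_name
        if target is not None:
            shared_kv_cache_layers[layer_name] = target
            continue
        kv_cache_spec[layer_name] = "full_attention_spec"  # placeholder spec
    return kv_cache_spec


layers = {
    "model.layers.0.self_attn.attn": ToyAttention(),
    "model.layers.1.self_attn.attn": ToyAttention(
        kv_sharing_target_layer_name="model.layers.0.self_attn.attn"),
}
shared: dict[str, str] = {}
print(get_kv_cache_spec(layers, shared))  # only layer 0 gets a spec
print(shared)                             # layer 1 -> layer 0
```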

I think the design is overall cleaner than before, thanks for the feedback. To answer your question,

What is the blocker for making it compatible with KV connector?

I looked into this a bit more. I think the current design is actually compatible with KV connectors, but we need to keep the maybe_save_kv_layer_to_connector calls in each attention forward, even for layers that don't allocate a KV cache, because the KV connector isn't aware of shared caches. For example, the simple SharedStorageConnector, which reads from and writes to disk, saves the KV cache for each layer in a separate safetensors file, so we need to save each KV layer. LMCache's connector also seemed OK with handling kv_caches that share memory pointers. I'm not sure whether other connectors are compatible with cross-layer KV sharing, though. For now I think we can allow them to be used together, but let me know if there are any concerns.

Collaborator

@heheda12345 heheda12345 left a comment

LGTM in general. I left some small comments.
For KV connector, can you at least try https://github.com/vllm-project/vllm/tree/main/examples/offline_inference/disaggregated-prefill-v1 with a local model?

Comment on lines 310 to 311
assert kv_sharing_target_layer_name is None, NotImplementedError(
    "KV sharing is not supported in V0.")
Collaborator

Suggested change
- assert kv_sharing_target_layer_name is None, NotImplementedError(
-     "KV sharing is not supported in V0.")
+ assert kv_sharing_target_layer_name is None, "KV sharing is not supported in V0."

The current code is confusing because of the mix of AssertionError & NotImplementedError. Same for other attention backends.

Collaborator Author

Updated the code to unify on NotImplementedError instead of asserts.
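Presumably the unified form in the V0 backends is now along these lines (a sketch with a hypothetical helper name, not the exact code):

```python
from typing import Optional


def check_no_kv_sharing(kv_sharing_target_layer_name: Optional[str]) -> None:
    # V0 attention backends do not support cross-layer KV sharing.
    if kv_sharing_target_layer_name is not None:
        raise NotImplementedError("KV sharing is not supported in V0.")
```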

Comment on lines 72 to 73
comp_str = ("is equal to" if current_layer_idx
            == target_layer_idx else "comes after")
Collaborator

Can you simplify the logic here? I think we don't need to distinguish between "is equal to" and "comes after".

Comment on lines 24 to 25
f"KV sharing target layer for {layer_name} not valid. "
f"{kv_target_layer_name} is not a Attention layer in the model.")
Collaborator

Suggested change
- f"KV sharing target layer for {layer_name} not valid. "
- f"{kv_target_layer_name} is not a Attention layer in the model.")
+ f"KV sharing target layer for {layer_name} is not valid. "
+ f"{kv_target_layer_name} is not an Attention layer in the model.")

@@ -621,6 +621,8 @@ def _get_kv_cache_config_uniform_type(vllm_config: VllmConfig,
assert len(page_sizes) == 1
page_size = page_sizes.pop()

# with cross-layer KV sharing, len(kv_cache_spec) may be less than no. of
# attention layers, in which case more KV cache blocks can be allocated
Collaborator

People reading this part of the code may not be aware of KV sharing. I think it's better to explain the implementation of KV sharing here.

if (kv_tgt_layer :=
        attn_module.kv_sharing_target_layer_name) is not None:
    validate_kv_target_layer(layer_name, kv_tgt_layer, layers)
    # KV cache is shared with earlier layer, don't create a spec
    self.shared_kv_cache_layers[layer_name] = kv_tgt_layer
    continue

Comment on lines 2034 to 2041
# with KV sharing, some layers can share KV caches with earlier layers
for layer_name, target_layer_name in self.shared_kv_cache_layers.items():
    kv_caches[layer_name] = kv_caches[target_layer_name]
    group_idx = layer_to_kv_cache_group_idx[target_layer_name]
    # attention metadata is assigned for each layer in layer_names
    kv_cache_config.kv_cache_groups[group_idx].layer_names.append(
        layer_name)
Collaborator

Can you wrap all logic in a utility function?

Comment on lines 2073 to 2074
validate_kv_target_layer(layer_name, kv_tgt_layer, layers)
# KV cache is shared with earlier layer, don't create a spec
Collaborator

  1. Can you explain how kv sharing is implemented here?
  2. Can you merge the validation here with that in attention/layer.py? If the validation code is too long, I suggest creating a utility function there.

Collaborator Author

For 2), the reason I added the validation in the model runner is that the attention layer currently doesn't have info about the rest of the layers in the model (so we cannot validate that the target layer exists or has the same attention type).

We could pass it as an additional arg to the layer, but a) I'm not sure it makes sense for each layer to be aware of the others, and b) the user would have to pass in two args (layer info and target layer name) for KV sharing to work. Does that make sense?

Collaborator

You can access the target layer via the static forward context. Is that enough?

compilation_config.static_forward_context[prefix] = self
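A toy illustration of that idea (the registry and class here are stand-ins for vLLM's static_forward_context and Attention, not the actual implementation):

```python
from typing import Optional

# Stand-in for compilation_config.static_forward_context
static_forward_context: dict[str, "ToyAttention"] = {}


class ToyAttention:
    def __init__(self, prefix: str,
                 kv_sharing_target_layer_name: Optional[str] = None):
        if kv_sharing_target_layer_name is not None:
            # Target layers are constructed earlier, so they are already
            # registered and can be validated here instead of in the model runner.
            target = static_forward_context.get(kv_sharing_target_layer_name)
            if target is None:
                raise ValueError(
                    f"KV sharing target {kv_sharing_target_layer_name!r} "
                    f"is not a known Attention layer (for {prefix!r}).")
        self.kv_sharing_target_layer_name = kv_sharing_target_layer_name
        static_forward_context[prefix] = self


first = ToyAttention("model.layers.0.self_attn.attn")
second = ToyAttention("model.layers.1.self_attn.attn",
                      kv_sharing_target_layer_name="model.layers.0.self_attn.attn")
```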


mergify bot commented May 29, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @sarckk.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 29, 2025
sarckk added 3 commits May 29, 2025 16:41
Signed-off-by: Yong Hoon Shin <[email protected]>
Signed-off-by: Yong Hoon Shin <[email protected]>
@sarckk
Collaborator Author

sarckk commented May 30, 2025

updated to address comments.

For KV connector, can you at least try https://github.com/vllm-project/vllm/tree/main/examples/offline_inference/disaggregated-prefill-v1 with a local model?

tried and it still works

Collaborator

@heheda12345 heheda12345 left a comment

Great job! Really appreciate the detailed tests and input verification. I think this PR is good except some very small items.

@@ -2061,6 +2065,13 @@ def initialize_kv_cache(self, kv_cache_config: KVCacheConfig) -> None:
# KV cache specs.
raise ValueError("Unknown KV cache spec type.")

add_shared_kv_layers(
Collaborator

@heheda12345 heheda12345 May 30, 2025

Suggested change
add_shared_kv_layers(
# Setup `kv_cache_config` and `kv_caches` for models with cross-layer KV sharing
if self.shared_kv_cache_layers:
initialize_kv_cache_for_kv_sharing(

I prefer to add a branch so that people can skip reading the function when not considering such models.
And can you put line 2018 and 2059 into this utility function?

kv_caches: dict[str, torch.Tensor],
kv_cache_groups: list[KVCacheGroupSpec],
layer_to_kv_cache_group_idx: dict[str, int],
) -> None:
Collaborator

Can you add a docstring to this function?
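One possible shape of that utility, assembled from the snippets above with a docstring added (the argument order and the exact wording are assumptions, not the final code):

```python
import torch


def initialize_kv_cache_for_kv_sharing(
    shared_kv_cache_layers: dict[str, str],
    kv_caches: dict[str, torch.Tensor],
    kv_cache_groups: list,  # list[KVCacheGroupSpec] in vLLM
    layer_to_kv_cache_group_idx: dict[str, int],
) -> None:
    """Set up layers that share the KV cache of an earlier layer.

    Args:
        shared_kv_cache_layers: maps a layer name to the name of the earlier
            layer whose KV cache it reuses.
        kv_caches: layer name -> allocated KV cache tensor.
        kv_cache_groups: KV cache groups; each group lists its layer names.
        layer_to_kv_cache_group_idx: layer name -> index of its KV cache group.
    """
    for layer_name, target_layer_name in shared_kv_cache_layers.items():
        # Reuse the tensor allocated for the target layer.
        kv_caches[layer_name] = kv_caches[target_layer_name]
        # Add the sharing layer to its target's KV cache group so that
        # attention metadata is also built for it.
        group_idx = layer_to_kv_cache_group_idx[target_layer_name]
        kv_cache_groups[group_idx].layer_names.append(layer_name)
```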

@@ -275,6 +275,8 @@ def __init__(
pin_memory=self.pin_memory)
self.seq_lens_np = self.seq_lens_cpu.numpy()

self.shared_kv_cache_layers: dict[str, str] = {}
Collaborator

Can you add a comment for the "direction" of layer sharing here?

Labels: tpu (Related to Google TPUs), v1