[RFC]: Hybrid Memory Allocator #11382
Comments
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Still working on this RFC actively. Leave a comment here to keep it open.
Great RFC! Is there a timeline for this? I'm going to be looking into some KV Cache compression techniques in the next month, which would require part 2 (flexibility in page sizes).
Not sure about the timeline for LCM pages. For the first milestone, we plan to tune block_size to align the page sizes of different types, which should be finished in the next few weeks. What KV cache compression techniques are you planning to cover? Would tuning block_size work for them?
Looking at AQUA KV, where KV caches are compressed to 2 bits. We could likely just reuse existing pages upon compression. An additional challenge will be extra workspace (for temporary decompression), which could grow very large for worst-case batching of multiple long-context requests.
Motivation.
In addition to standard self-attention-only models, we now have more and more hybrid models with more than one type of layer, for example:
In these models, the KV cache size is no longer the same for every token. However, vLLM can only allocate the same KV cache size for all tokens, as shown in the figure below (mllama with BLOCK_SIZE=1). The resulting memory waste can be 79.6% in mllama, 25% in Gemma-2, and 56.25% in Ministral.

For Mamba layers, vLLM has a special MambaCacheManager in model_executor/models/mamba_cache.py, which is not compatible with prefix caching. We want a new memory manager to:
We can support them mainly in two milestones:
Milestone 1: per-layer memory allocation, where every layer has the same KV cache size per token
In this milestone, we assume that all layers have the same kv_hidden_size but a different number of tokens depending on the layer type (e.g., encoder, sliding window). Then every layer can use the same page size, which greatly simplifies the design of the memory allocator. This assumption covers almost all models in current vLLM (except Jamba).
We set the page size to [BLOCK_SIZE, kv_hidden_size] (current vLLM uses [num_layer, BLOCK_SIZE, kv_hidden_size]). For each request, we call the memory allocator num_layer times to get one block table per layer. Each layer can then have a different number of slots and a different slot mapping depending on its layer type, as sketched after the figure below.

The software architecture will be as follows, with the memory manager for each layer being one of [SelfAttentionManager, SlidingWindowManager, MambaManager, CrossAttentionManager, …]; a rough interface sketch follows the figure:

To make the per-layer memory allocation faster, we can group the layers to satisfy the following properties, so that layers inside each group share the same block table.
For example, mllama 11B has 8 cross-attention layers & 32 self-attention layers. They will be grouped into 5 groups: (cross-attn $\times$ 8) (self-attn $\times$ 8) (self-attn $\times$ 8) (self-attn $\times$ 8) (self-attn $\times$ 8), and each group has one memory manager, as sketched below.
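A small sketch of the grouping step, assuming layers of the same type are chunked into equal-sized groups so that each group shares one block table (the exact layer layout of mllama is simplified here):

```python
# Hypothetical grouping: collect layer IDs by type, then split them into
# fixed-size groups; each group gets one memory manager and one block table.
from collections import defaultdict

def group_layers(layer_types, group_size):
    by_type = defaultdict(list)
    for layer_id, layer_type in enumerate(layer_types):
        by_type[layer_type].append(layer_id)
    groups = []
    for layer_type, layer_ids in by_type.items():
        for start in range(0, len(layer_ids), group_size):
            groups.append((layer_type, layer_ids[start:start + group_size]))
    return groups

# mllama 11B: 32 self-attention + 8 cross-attention layers -> 5 groups of 8.
layer_types = ["self_attn"] * 32 + ["cross_attn"] * 8   # interleaving simplified
groups = group_layers(layer_types, group_size=8)
print([(t, len(ids)) for t, ids in groups])
# [('self_attn', 8), ('self_attn', 8), ('self_attn', 8), ('self_attn', 8), ('cross_attn', 8)]
```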
We can still use LRU to manage the cached blocks by putting the free blocks of all layers in the same queue and evicting the LRU block across all layers.
Moreover, we will have the freedom to customize the eviction strategy of different layer types:
For a request [sys1 sys2 prompt1 prompt2 prompt3], we can evict sys1, prompt1, and prompt2, and only cache sys2 (the end of the system prompt) and prompt3 (the end of the request, for multi-turn conversation).
The customized eviction strategies can be implemented within the memory manager of each type by assigning the LRU time tag carefully (or by inserting blocks into the FreeKVCacheBlockQueue in a careful order), as sketched below.
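A toy sketch of the shared free queue with biased LRU tags; the queue structure and the tag offset are assumptions for illustration, not the actual FreeKVCacheBlockQueue implementation:

```python
# Hypothetical shared free queue: all layer types return freed blocks here,
# and a manager can bias eviction order via the time tag it assigns.
import heapq
import itertools

_clock = itertools.count()            # monotonically increasing "time"
free_queue = []                       # min-heap of (lru_tag, block_id)

def free_block(block_id, keep_longer=False):
    """Return a block to the shared queue. Blocks we prefer to keep cached
    (e.g. the last block of a system prompt) get a later tag, so they are
    evicted after ordinary blocks."""
    tag = next(_clock) + (10**6 if keep_longer else 0)
    heapq.heappush(free_queue, (tag, block_id))

def evict_block():
    """Evict the least-recently-used block across all layers."""
    _, block_id = heapq.heappop(free_queue)
    return block_id
```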
Milestone 2: allow allocation of pages with different sizes via LCM pages
In this step, we want to build a more general memory allocator to remove the same kv_hidden_size assumption.
Some use cases are listed here:
The above use cases introduce different page sizes, so we need a new allocator to handle them.
The basic idea is to introduce a two-level page table:
For example, for mllama with 2 cross-attention layers (whose KV cache holds image tokens) and 3 self-attention layers (whose KV cache holds text tokens), with kv_hidden_size 128, we have two page sizes (2*128=256 and 3*128=384) and the following memory layout:

Preliminary results
I’ve implemented a prototype on v0 and got the following results on H100.
The speedup comes from both less fragmentation and better prefix caching.
Proposed Change.
The allocator will be implemented on v1, with the following steps:
Feedback Period.
One week.
CC List.
@comaniac @WoosukKwon @zhuohan123 @simon
Any Other Things.
No response