
Sharing a common KV cache/prompt across parallel batches without VRAM duplication - possible? #12165

Answered by ggerganov
cmp-nct asked this question in Q&A

This functionality has been available since the very beginning of batching support - see the llama-batched, llama-batched-bench, and llama-parallel examples. It works by simply assigning the tokens in the batch to multiple sequences.
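For context, a minimal sketch of what that assignment looks like with the llama.cpp C API (llama_batch_init, llama_decode, llama_batch_free from llama.h); the names `prompt_tokens` and `n_parallel` are placeholders for this illustration, not identifiers from the examples above:

```cpp
#include "llama.h"

#include <vector>

// Evaluate a common prompt once while sharing its KV cache cells across
// n_parallel sequences. Sketch only: ctx is an initialized llama_context and
// prompt_tokens is the already-tokenized shared prompt.
static void decode_shared_prompt(llama_context * ctx,
                                 const std::vector<llama_token> & prompt_tokens,
                                 int n_parallel) {
    const int n_prompt = (int) prompt_tokens.size();

    // one batch slot per prompt token; each slot can carry up to n_parallel sequence ids
    llama_batch batch = llama_batch_init(n_prompt, 0, n_parallel);

    for (int i = 0; i < n_prompt; ++i) {
        batch.token   [i] = prompt_tokens[i];
        batch.pos     [i] = i;
        batch.n_seq_id[i] = n_parallel;
        for (int s = 0; s < n_parallel; ++s) {
            batch.seq_id[i][s] = s; // the same token belongs to all sequences
        }
        batch.logits[i] = false;
    }
    batch.logits[n_prompt - 1] = true; // only the last prompt token needs logits
    batch.n_tokens = n_prompt;

    // the prompt is computed once; its KV cache cells are tagged with all
    // sequence ids, so the prompt KV data is not duplicated per sequence
    if (llama_decode(ctx, batch) != 0) {
        // handle decode failure
    }

    llama_batch_free(batch);
}
```

After the shared prompt is decoded this way, each sequence can be continued independently by submitting its new tokens with a single sequence id per batch entry.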

The llama-server does not use it because it is very unlikely that two clients would have a common prompt at the same time. To compensate, it has a prompt cache and cache reuse (--cache-reuse) functionality.

Answer selected by cmp-nct