
Sharing a common KV cache/prompt across parallel batches without VRAM duplication - possible? #12165

Answered by ggerganov
cmp-nct asked this question in Q&A

This functionality has been available since the very beginning of batching support - see the llama-batched, llama-batched-bench, and llama-parallel examples. It works by simply assigning the tokens in the batch to multiple sequences.
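For context, a minimal sketch of what that assignment looks like with the llama.cpp C API (llama_batch_init, llama_decode, llama_batch_free from llama.h); the names `prompt_tokens` and `n_parallel` are placeholders for this illustration, not identifiers from the examples above:

```cpp
#include "llama.h"

#include <vector>

// Evaluate a common prompt once while sharing its KV cache cells across
// n_parallel sequences. Sketch only: ctx is an initialized llama_context and
// prompt_tokens is the already-tokenized shared prompt.
static void decode_shared_prompt(llama_context * ctx,
                                 const std::vector<llama_token> & prompt_tokens,
                                 int n_parallel) {
    const int n_prompt = (int) prompt_tokens.size();

    // one batch slot per prompt token; each slot can carry up to n_parallel sequence ids
    llama_batch batch = llama_batch_init(n_prompt, 0, n_parallel);

    for (int i = 0; i < n_prompt; ++i) {
        batch.token   [i] = prompt_tokens[i];
        batch.pos     [i] = i;
        batch.n_seq_id[i] = n_parallel;
        for (int s = 0; s < n_parallel; ++s) {
            batch.seq_id[i][s] = s; // the same token belongs to all sequences
        }
        batch.logits[i] = false;
    }
    batch.logits[n_prompt - 1] = true; // only the last prompt token needs logits
    batch.n_tokens = n_prompt;

    // the prompt is computed once; its KV cache cells are tagged with all
    // sequence ids, so the prompt KV data is not duplicated per sequence
    if (llama_decode(ctx, batch) != 0) {
        // handle decode failure
    }

    llama_batch_free(batch);
}
```

After the shared prompt is decoded this way, each sequence can be continued independently by submitting its new tokens with a single sequence id per batch entry.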

The llama-server does not use it because it is very unlikely that two clients would have a common prompt at the same time. To compensate, it has a prompt cache and cache reuse (--cache-reuse) functionality.

Answer selected by cmp-nct