About a year ago llama.cpp gained the improved batching API, which included functions for copying the KV cache between sequences. Wouldn't it be possible to process the shared context from one shared KV sequence and keep the "private" KV tokens in their own separate memory slots?
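(For reference, the copy functions mentioned above work roughly as in the minimal sketch below; the function name shown, `llama_kv_cache_seq_cp`, follows older `llama.h` revisions and has been renamed in newer trees, and `ctx` / `n_parallel` are assumed to be set up elsewhere.)

```cpp
// Sketch: evaluate the shared prompt once into sequence 0, then mirror its
// KV cache entries into the other sequences so they all start from the same prefix.
for (llama_seq_id s = 1; s < n_parallel; ++s) {
    // p0 = -1, p1 = -1 means "copy the whole cached range" from seq 0 to seq s
    llama_kv_cache_seq_cp(ctx, 0, s, -1, -1);
}
```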
This functionality has been available since the very beginning of the batching support - see the `llama-batched`, `llama-batched-bench` and `llama-parallel` examples. It works by simply assigning the tokens in the batch to multiple sequences.

The `llama-server` does not use it, because it's very unlikely that 2 clients would have a common prompt at the same time. But to compensate, it has a prompt cache and cache reuse (`--cache-reuse`) functionality.
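(As an illustration of the multi-sequence assignment described above, here is a minimal sketch modelled loosely on the `llama-batched` example. The variable names `prompt_tokens`, `n_prompt` and `ctx` are assumptions, and the batch fields reflect the `llama.h` batch struct, which may differ slightly between versions.)

```cpp
const int n_parallel = 4;                              // number of parallel "clients"
llama_batch batch = llama_batch_init(n_prompt, 0, n_parallel);

// Shared prompt: each token is added once but tagged with every sequence id,
// so its KV cache entry is computed once and shared by all sequences.
for (int i = 0; i < n_prompt; ++i) {
    batch.token   [batch.n_tokens] = prompt_tokens[i];
    batch.pos     [batch.n_tokens] = i;
    batch.n_seq_id[batch.n_tokens] = n_parallel;
    for (int s = 0; s < n_parallel; ++s) {
        batch.seq_id[batch.n_tokens][s] = s;
    }
    // request logits only for the last prompt token
    batch.logits  [batch.n_tokens] = (i == n_prompt - 1);
    batch.n_tokens++;
}

if (llama_decode(ctx, batch) != 0) {
    // handle failure (e.g. not enough KV cache space)
}

// From here on, each client appends its "private" tokens with a single
// seq_id, so only the continuations occupy separate KV cache slots.
```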