server : reuse context chunks #9866
Conversation
Does this work similarly to Koboldcpp's context shift?
Yes, it's the same idea as proposed in #5793. I've been experimenting today with context reuse for code completion and the results seem promising.
Btw @ggerganov, I remember a while ago there was a discussion on storing token IDs in the KV cache. I'm wondering if it's complicated to add an API like …
We should extend the API to support that. Maybe …
I have a small question regarding the illustration in the description: AFAIU we only skip the …
It's skipped mainly to simplify the batch construction: with the current implementation, we stop reusing chunks at the first token that cannot be reused. This way, when we create the …
The alternative that you suggest is if we reused the …
There is no longer the concept of … I'm very interested in trying this approach and seeing if it is viable, but the extra complexity at this point would be too much. Maybe in the future.
What are the downsides of sticking to …? Thanks.
Not sure if there are downsides yet - needs testing. Intuitively, reusing very small chunks might not be a good idea since they can have different meanings.
Call me ignorant, but from my understanding of this feature we can cache parts of prompts. Meaning that in a prompt of 20k tokens, we can take part of this processed prompt and reuse it later outside of its original context. Which means that this piece of data can be used individually, outside of its neighbors/context. Then why is it that the farther we are into prompt processing, the slower it goes, if prompt parts can be processed individually? Right now I'm toying with Qwen 7B 1M and a context size of TWO MILLION tokens. Obviously, around 100k tokens it starts getting slow as hell, but if I individually processed each document alone (not slow) and later shoved in the whole 2M chunk that would be cached, would it be faster than just processing it as it is right now?
ref #5793
Overview
Using a positive `--cache-reuse` argument with `llama-server` will attempt to reuse KV chunks with size equal or larger than the specified value. The KV cache of reused chunks will be shifted (see `llama_kv_cache_seq_add()`) in the respective position and processing for these tokens will be skipped. Only chunks without control/special tokens will be reused.

Here is an illustration. Upon submitting `prompt 1` for processing, after `prompt 0` has been processed and cached:

- `--cache-reuse 0`: only the `aaaaa` prefix will be reused
- `--cache-reuse 1`: the entire `aaaaaccccccceeeeeeffhhhhhhh` will be reused
- `--cache-reuse 3`: only the `aaaaaccccccceeeeee` part will be reused

The cache reuse will be done only for requests with `"cache_prompt": true`.

Example
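A minimal sketch of how this could be exercised, assuming a local model at `models/model.gguf` and an arbitrary chunk size of 256 tokens (both are placeholders chosen for illustration, not values from this PR):

```sh
# start the server with cache reuse enabled: matching KV chunks of 256 or more
# tokens from a previously cached prompt are shifted into place instead of
# being re-processed
./llama-server -m models/model.gguf --cache-reuse 256

# opt the request into prompt caching with "cache_prompt": true
curl http://localhost:8080/completion -d '{
    "prompt": "...",
    "n_predict": 64,
    "cache_prompt": true
  }'
```

A follow-up request whose prompt shares chunks of at least 256 tokens with the cached prompt should then skip processing for those chunks, with the reused KV cache shifted to the new positions via `llama_kv_cache_seq_add()`.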