server : reuse context chunks #9866
Conversation
Does this work similarly to Koboldcpp's context shift?
Yes, it's the same idea as proposed in #5793. I've been experimenting today with context reuse for code completion and the results seem promising.
Btw @ggerganov, I remember a while ago there was a discussion on storing token IDs in the KV cache. I'm wondering if it's complicated to add an API like …
We should extend the API to support that. Maybe …
I have a small question regarding the illustration in the description: AFAIU we only skip the …
It's skipped mainly to simplify the batch construction: with the current implementation, we stop reusing chunks at the first token that cannot be reused. This way, when we create the …
The alternative that you suggest is if we reused the …
There is no longer the concept of … I'm very interested in trying this approach and seeing if it is viable, but the extra complexity at this point would be too much. Maybe in the future.
What are the downsides of sticking to …? Thanks.
Not sure if there are downsides yet - needs testing. Intuitively, reusing very small chunks might not be a good idea since they can have different meanings.
Call me ignorant, but from my understanding of this feature we can cache parts of prompts. Meaning that in a prompt of 20k tokens, we can take part of this processed prompt and reuse it later outside of its original context. Which means that this piece of data can be used individually, outside of its neighbors/context. Then why is it that the farther we are into prompt processing, the slower it goes, if prompt parts can be processed individually? Right now I'm toying with Qwen 7B 1M and a context size of TWO MILLION tokens. Obviously, around 100k tokens it starts getting slow as hell, but if I individually processed each document alone (not slow) and later shoved in the whole 2M chunk that would be cached, would it be faster than just processing it as it is right now?
ref #5793
Overview
Using a positive `--cache-reuse` argument with `llama-server` will attempt to reuse KV chunks with size equal or larger than the specified value. The KV cache of reused chunks will be shifted (see `llama_kv_cache_seq_add()`) in the respective position and processing for these tokens will be skipped. Only chunks without control/special tokens will be reused.

Here is an illustration. Upon submitting `prompt 1` for processing, after `prompt 0` has been processed and cached:

- `--cache-reuse 0`: only the `aaaaa` prefix will be reused
- `--cache-reuse 1`: the entire `aaaaaccccccceeeeeeffhhhhhhh` will be reused
- `--cache-reuse 3`: only the `aaaaaccccccceeeeee` part will be reused

The cache reuse will be done only for requests with `"cache_prompt": true`.

Example
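A minimal sketch of how this could be exercised, assuming a local model at `models/model.gguf` and an arbitrary chunk size of 256 tokens (both are placeholders chosen for illustration, not values from this PR):

```sh
# start the server with cache reuse enabled: matching KV chunks of 256 or more
# tokens from a previously cached prompt are shifted into place instead of
# being re-processed
./llama-server -m models/model.gguf --cache-reuse 256

# opt the request into prompt caching with "cache_prompt": true
curl http://localhost:8080/completion -d '{
    "prompt": "...",
    "n_predict": 64,
    "cache_prompt": true
  }'
```

A follow-up request whose prompt shares chunks of at least 256 tokens with the cached prompt should then skip processing for those chunks, with the reused KV cache shifted to the new positions via `llama_kv_cache_seq_add()`.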