
Broken generation with specific ngl values #3820


Closed
staviq opened this issue Oct 27, 2023 · 9 comments

Labels: bug (Something isn't working), generation quality (Quality of model output), Nvidia GPU (Issues specific to Nvidia GPUs)

Comments

@staviq
Contributor

staviq commented Oct 27, 2023

While playing with implementing compression for copy/save state, I found a bug which turned out to be reproducible on current main (41aee4d).

It seems to be model-independent, and no parameters other than -ngl seem to make a difference either.

The first symptom shows up in save-load-state, main, and server when -ngl is set to exactly N-1; the generated output looks like this:

 Hello there!###############################
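
For reference, a reproduction along these lines triggers it (model path and prompt are placeholders; with the model from the perplexity check below, N = 29, so N-1 = 28):

./main -m models/model.gguf -ngl 28 -p "Hello there!" -n 32
./save-load-state -m models/model.gguf -ngl 28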

The second symptom was found by accident while fiddling with save-load-state for the purpose of implementing compression. If -ngl is N or bigger (all layers offloaded), the problem above seems to disappear. However, not only does save-load-state fail because the generated text differs between the two runs, but also, after some tokens have been sampled, llama_copy_state_data outputs a mostly empty array. I only noticed this because I tried to dump the state after generation and suddenly started getting a 99% compression ratio on that array, since it turned out to be mostly zeroes.
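
For reference, the dump/check was roughly along these lines (a minimal sketch, not the exact code; it assumes a llama_context * ctx that has already generated some tokens):

#include <cstdio>
#include <cstdint>
#include <vector>
#include "llama.h"

// Serialize the context state and report how much of it is zero bytes.
static void dump_state_zero_fraction(llama_context * ctx) {
    const size_t n_state = llama_get_state_size(ctx);
    std::vector<uint8_t> buf(n_state);
    const size_t n_copied = llama_copy_state_data(ctx, buf.data());

    size_t n_zero = 0;
    for (size_t i = 0; i < n_copied; ++i) {
        if (buf[i] == 0) {
            n_zero++;
        }
    }
    if (n_copied > 0) {
        fprintf(stderr, "state: %zu of %zu bytes copied, %.1f%% zeroes\n",
                n_copied, n_state, 100.0 * (double) n_zero / (double) n_copied);
    }
}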

All -ngl values from 0 to N-2 work properly.

I have no way of testing on AMD, so I do not know if it's Nvidia-specific.

main.output.txt
main.log

As a sanity check, here are results for -ngl from 0 to N with the same model and parameters (except -ngl):

out.txt

Edit: Interestingly enough, perplexity looks fine?

-ngl N-2 (27/29)
[1]5.2069,[2]5.1932,[3]5.1802,[4]5.2837,[5]5.2742,[6]5.0776,
Final estimate: PPL = 5.0776 +/- 0.25768
-ngl N-1 (28/29)
[1]5.2069,[2]5.1932,[3]5.1802,[4]5.2837,[5]5.2742,[6]5.0776,
Final estimate: PPL = 5.0776 +/- 0.25768
-ngl N (29/29)
[1]5.2077,[2]5.1813,[3]5.1687,[4]5.2820,[5]5.2682,[6]5.0756,
Final estimate: PPL = 5.0756 +/- 0.25766
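
Those numbers came from runs roughly like this (model and test file are placeholders; --chunks limits the run to the 6 chunks shown):

./perplexity -m models/model.gguf -f wiki.test.raw --chunks 6 -ngl 27
./perplexity -m models/model.gguf -f wiki.test.raw --chunks 6 -ngl 28
./perplexity -m models/model.gguf -f wiki.test.raw --chunks 6 -ngl 29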
@staviq staviq added the generation quality and Nvidia GPU labels Oct 27, 2023
@WeirdConstructor
Contributor

WeirdConstructor commented Nov 7, 2023

This sounds like the same problem I encountered in #2422.
The problem comes from the KV cache being improperly stored/loaded. The last two "layers" offloaded to the GPU are the V and K caches. If you offload only the V cache, you get garbage output, and if you offload both, you get problems with saving and loading state.

Edit: I wanted to add that this only happens when offloading all layers to the GPU with CUBLAS, because CLBLAST does not offload the kv cache, as far as I know.

@staviq staviq added the bug label Nov 7, 2023
@Galunid Galunid mentioned this issue Nov 7, 2023
@staviq
Contributor Author

staviq commented Nov 7, 2023

@WeirdConstructor
That pretty much coincides with my intuition, I'm just not familiar enough with that part of the code.

Some kind of explicit back-sync function would pretty much solve it, as it only needs to happen on state save/load.

The K/V split would probably have to be solved by rounding n_gpu_layers so the KV cache is always either fully offloaded or not offloaded at all; see the rough sketch below.

OpenCL didn't have this problem, but suspiciously it reported exactly 2 fewer "offloadable" layers, which would now make sense.
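
As a rough illustration of that rounding idea (a hypothetical helper, not existing llama.cpp code; it assumes the V and K caches count as the two extra "layers" above n_layer, as described above):

// Hypothetical sketch: avoid the split state where only the V cache is offloaded.
static int round_n_gpu_layers_for_kv(int n_gpu_layers, int n_layer) {
    if (n_gpu_layers == n_layer + 1) {
        // only the V cache would end up on the GPU: round down so neither
        // cache is offloaded (alternatively, return n_layer + 2 to offload both)
        return n_layer;
    }
    return n_gpu_layers;
}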

@WeirdConstructor
Contributor

WeirdConstructor commented Nov 7, 2023

@staviq
I dove into the code a bit to see if I could somehow hack it into my utility. In ggml-cuda there are primitives to read back tensor data, but they don't seem to be used anywhere (yet).

My primary use case is swapping prompts back and forth for evaluation in my prompt_runner branch, which I use for various benchmark purposes ( https://github.com/WeirdConstructor/llama.cpp/tree/prompt_runner/examples/prompt_runner ). It runs multiple prompts from a JSON file through inference and stores the results afterwards.

As the prompts are often very similar, it would help a lot if I could reuse the kv cache from previous decoding steps.
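
The pattern would be roughly this (a sketch against the llama.h state API, assuming a ctx that has already evaluated the shared prompt prefix; this is exactly the path that currently breaks with offloading):

#include <cstdint>
#include <vector>
#include "llama.h"

// Snapshot the full context state (including the kv cache) after the shared
// prompt prefix has been evaluated ...
static std::vector<uint8_t> snapshot_state(llama_context * ctx) {
    std::vector<uint8_t> buf(llama_get_state_size(ctx));
    buf.resize(llama_copy_state_data(ctx, buf.data()));
    return buf;
}

// ... and restore it before each prompt variant, instead of re-decoding the prefix.
static void restore_state(llama_context * ctx, std::vector<uint8_t> & buf) {
    llama_set_state_data(ctx, buf.data());
}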

I was confused by the llama_set_state_data() code in llama.cpp, which allocates a temporary graph with "tensor_3d" and does some update steps instead of just loading the kv cache data directly into the right kv cache tensors, so I think I am missing too much of the internals to mess around with them.

Also, neither ggml_backend_tensor_copy() nor ggml_backend_tensor_set() / ggml_backend_tensor_get() seem to be used anywhere in the llama.cpp project, which made me suspicious, and I did not dare to dabble with an unused API (without example usage).
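
For what it's worth, a read-back/write with that API would presumably look something like this (a minimal sketch under the assumption that the kv cache tensors actually live in a backend buffer; I have not verified this against the current CUDA offloading path):

#include <cstdint>
#include <vector>
#include "ggml.h"
#include "ggml-backend.h"

// Copy a (possibly GPU-resident) tensor into host memory ...
static std::vector<uint8_t> tensor_to_host(const struct ggml_tensor * t) {
    std::vector<uint8_t> host(ggml_nbytes(t));
    ggml_backend_tensor_get(t, host.data(), 0, host.size());
    return host;
}

// ... and push host data back into it.
static void host_to_tensor(struct ggml_tensor * t, const std::vector<uint8_t> & host) {
    ggml_backend_tensor_set(t, host.data(), 0, host.size());
}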

@cebtenzzre
Collaborator

cebtenzzre commented Nov 7, 2023

cc @slaren

edit: This (the -ngl N-1 issue with all examples) is a regression somewhere between b1198 (ebc9608) and b1398 (4e82b2e).
edit 2: It's db3abcc

db3abcc114d5d1790ba814aa1a80ac673d4ccc3e is the first bad commit
commit db3abcc114d5d1790ba814aa1a80ac673d4ccc3e
Author: Georgi Gerganov <[email protected]>
Date:   Sun Oct 8 20:19:14 2023 +0300

    sync : ggml (ggml-backend) (#3548)
    
    * sync : ggml (ggml-backend)
    
    ggml-ci
    
    * zig : add ggml-backend to the build

 CMakeLists.txt       |   2 +
 Makefile             |   7 +-
 build.zig            |  15 +-
 ggml-alloc.c         | 169 ++++++----------
 ggml-alloc.h         |  16 +-
 ggml-backend.c       | 385 ++++++++++++++++++++++++++++++++++++
 ggml-backend.h       | 143 ++++++++++++++
 ggml-cuda.cu         | 535 +++++++++++++++++++++++++++++++++++++++++++--------
 ggml-cuda.h          |   4 +
 ggml-metal.h         |  19 +-
 ggml-metal.m         | 137 +++++++++++++
 ggml.c               |  37 +---
 ggml.h               |  16 +-
 llama.cpp            |  44 ++---
 scripts/sync-ggml.sh |  24 +--
 15 files changed, 1285 insertions(+), 268 deletions(-)
 create mode 100644 ggml-backend.c
 create mode 100644 ggml-backend.h
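
For anyone retracing this, the bisect session was roughly as follows (revisions taken from the range above; the test at each step is building and running main with -ngl N-1 and checking for the ### output):

git bisect start
git bisect bad  4e82b2e   # b1398
git bisect good ebc9608   # b1198
# build, run main with -ngl N-1, then mark the result:
git bisect good           # or: git bisect bad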

@WeirdConstructor
Contributor

> edit: This is a regression somewhere between b1198 (ebc9608) and b1398 (4e82b2e).
> edit 2: It's db3abcc

@cebtenzzre #2422 is older (July) than those commits (September/October).

@cebtenzzre
Collaborator

Since the -ngl N-1 issue was fixed, do we still need both this issue and #2422?

@staviq
Contributor Author

staviq commented Nov 9, 2023

@cebtenzzre

Something is still broken: save-load-state produces garbage in the second run for both N and N-1 now (875fb42).
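
Reproduced with save-load-state along these lines (model path is a placeholder; this model has 35 offloadable layers):

./save-load-state -m models/model.gguf -ngl 34
./save-load-state -m models/model.gguf -ngl 35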

N-1

llm_load_tensors: offloaded 34/35 layers to GPU
llm_load_tensors: VRAM used: 7205.84 MB
...................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: VRAM kv self = 32.00 MB
llama_new_context_with_model: kv self size  =   64.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 79.63 MB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MB
llama_new_context_with_model: total VRAM used: 7310.84 MB (model: 7205.84 MB, context: 105.00 MB)

first run: The quick brown fox jumps over the lazy dog. This sentence contains all 26 letters of

llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: VRAM kv self = 32.00 MB
llama_new_context_with_model: kv self size  =   64.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 79.63 MB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MB
llama_new_context_with_model: total VRAM used: 7310.84 MB (model: 7205.84 MB, context: 105.00 MB)

second run: The quick brown fox jihads defined as non- from absolute to the right around.


main : error : the 2 generations are different

N

llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 7205.84 MB
...................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 64.00 MB
llama_new_context_with_model: kv self size  =   64.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 79.63 MB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MB
llama_new_context_with_model: total VRAM used: 7342.84 MB (model: 7205.84 MB, context: 137.00 MB)

first run: The quick brown fox jumps over the lazy dog.

The first Friday of every month is

llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 64.00 MB
llama_new_context_with_model: kv self size  =   64.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 79.63 MB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MB
llama_new_context_with_model: total VRAM used: 7342.84 MB (model: 7205.84 MB, context: 137.00 MB)

second run: The quick brown fox j barriers ep joe/ person_maiden/ canonical/ github.

main : error : the 2 generations are different

@WeirdConstructor
Contributor

Yes, it's still a problem, like I posted over in the PR: #3982 (comment).

save-load-state in combination with GPU offloading has been broken for a long time.

@staviq
Contributor Author

staviq commented Nov 9, 2023

Ok, I went through #2422 and it seems to describe the second part of this issue pretty well.

So I think this can be marked as a duplicate of #2422 and closed, as they are at this point redundant.

Duplicate of #2422

@staviq staviq closed this as completed Nov 9, 2023