
Broken generation with specific ngl values #3820


Closed
staviq opened this issue Oct 27, 2023 · 9 comments

Labels: bug (Something isn't working), generation quality (Quality of model output), Nvidia GPU (Issues specific to Nvidia GPUs)

Comments

@staviq
Contributor

staviq commented Oct 27, 2023

While playing with implementing compression for copy/save state, I found a bug which turned out to be reproducible on current main (41aee4d).

It seems to be model-independent, and no parameters other than -ngl seem to make a difference either.

The first symptom shows up in save-load-state, main, and server when -ngl is set to exactly N-1; the generated output looks like this:

 Hello there!###############################
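
For reference, a reproduction along these lines triggers it (model path and prompt are placeholders; with the model from the perplexity check below, N = 29, so N-1 = 28):

./main -m models/model.gguf -ngl 28 -p "Hello there!" -n 32
./save-load-state -m models/model.gguf -ngl 28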

The second symptom was found by accident while fiddling with save-load-state for the purpose of implementing compression. If -ngl is N or bigger (all layers offloaded), the problem above seems to disappear. However, not only does save-load-state fail because the generated text differs between the two runs, but also, after some tokens have been sampled, llama_copy_state_data outputs a mostly empty array. I only noticed this because I tried to dump the state after generation and suddenly started getting a 99% compression ratio on that array, since it turned out to be mostly zeroes.
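
For reference, the dump/check was roughly along these lines (a minimal sketch, not the exact code; it assumes a llama_context * ctx that has already generated some tokens):

#include <cstdio>
#include <cstdint>
#include <vector>
#include "llama.h"

// Serialize the context state and report how much of it is zero bytes.
static void dump_state_zero_fraction(llama_context * ctx) {
    const size_t n_state = llama_get_state_size(ctx);
    std::vector<uint8_t> buf(n_state);
    const size_t n_copied = llama_copy_state_data(ctx, buf.data());

    size_t n_zero = 0;
    for (size_t i = 0; i < n_copied; ++i) {
        if (buf[i] == 0) {
            n_zero++;
        }
    }
    if (n_copied > 0) {
        fprintf(stderr, "state: %zu of %zu bytes copied, %.1f%% zeroes\n",
                n_copied, n_state, 100.0 * (double) n_zero / (double) n_copied);
    }
}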

All -ngl values from 0 to N-2 work properly.

I have no way of testing on AMD, so I do not know if it's Nvidia-specific.

main.output.txt
main.log

As a sanity check, here are results for -ngl from 0 to N with the same model and parameters (except -ngl):

out.txt

Edit: Interestingly enough, perplexity looks fine?

-ngl N-2 (27/29)
[1]5.2069,[2]5.1932,[3]5.1802,[4]5.2837,[5]5.2742,[6]5.0776,
Final estimate: PPL = 5.0776 +/- 0.25768
-ngl N-1 (28/29)
[1]5.2069,[2]5.1932,[3]5.1802,[4]5.2837,[5]5.2742,[6]5.0776,
Final estimate: PPL = 5.0776 +/- 0.25768
-ngl N (29/29)
[1]5.2077,[2]5.1813,[3]5.1687,[4]5.2820,[5]5.2682,[6]5.0756,
Final estimate: PPL = 5.0756 +/- 0.25766
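
Those numbers came from runs roughly like this (model and test file are placeholders; --chunks limits the run to the 6 chunks shown):

./perplexity -m models/model.gguf -f wiki.test.raw --chunks 6 -ngl 27
./perplexity -m models/model.gguf -f wiki.test.raw --chunks 6 -ngl 28
./perplexity -m models/model.gguf -f wiki.test.raw --chunks 6 -ngl 29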
@staviq staviq added the generation quality and Nvidia GPU labels Oct 27, 2023
@WeirdConstructor
Contributor

WeirdConstructor commented Nov 7, 2023

This sounds like the same problem I encountered in #2422.
The problem comes from the KV cache being improperly stored/loaded. The last two "layers" offloaded to the GPU are the V and K caches. If you offload only the V cache, you get garbage output, and if you offload both, you get problems with saving and loading state.

Edit: I wanted to add that this only happens when offloading all layers to the GPU with CUBLAS, because CLBLAST does not offload the kv cache, as far as I know.

@staviq staviq added the bug label Nov 7, 2023
@Galunid Galunid mentioned this issue Nov 7, 2023
@staviq
Contributor Author

staviq commented Nov 7, 2023

@WeirdConstructor
That pretty much coincides with my intuition, I'm just not familiar enough with that part of the code.

Some kind of explicit back-sync function would pretty much solve it, as it only needs to happen on state save/load.

The K/V split would probably have to be solved by rounding n_gpu_layers so the KV cache is always either fully offloaded or not offloaded at all; see the rough sketch below.

OpenCL didn't have this problem, but suspiciously it reported exactly 2 fewer "offloadable" layers, which would now make sense.
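
As a rough illustration of that rounding idea (a hypothetical helper, not existing llama.cpp code; it assumes the V and K caches count as the two extra "layers" above n_layer, as described above):

// Hypothetical sketch: avoid the split state where only the V cache is offloaded.
static int round_n_gpu_layers_for_kv(int n_gpu_layers, int n_layer) {
    if (n_gpu_layers == n_layer + 1) {
        // only the V cache would end up on the GPU: round down so neither
        // cache is offloaded (alternatively, return n_layer + 2 to offload both)
        return n_layer;
    }
    return n_gpu_layers;
}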

@WeirdConstructor
Contributor

WeirdConstructor commented Nov 7, 2023

@staviq
I dove into the code a bit to see if I could somehow hack it into my utility. In ggml-cuda there are primitives to read back tensor data, but they don't seem to be used anywhere (yet).

My primary use case is swapping prompts back and forth for evaluation in my prompt_runner branch, which I use for various benchmark purposes ( https://github.com/WeirdConstructor/llama.cpp/tree/prompt_runner/examples/prompt_runner ). It runs multiple prompts from a JSON file through inference and stores the results afterwards.

As the prompts are often very similar, it would help a lot if I could reuse the kv cache from previous decoding steps.
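
The pattern would be roughly this (a sketch against the llama.h state API, assuming a ctx that has already evaluated the shared prompt prefix; this is exactly the path that currently breaks with offloading):

#include <cstdint>
#include <vector>
#include "llama.h"

// Snapshot the full context state (including the kv cache) after the shared
// prompt prefix has been evaluated ...
static std::vector<uint8_t> snapshot_state(llama_context * ctx) {
    std::vector<uint8_t> buf(llama_get_state_size(ctx));
    buf.resize(llama_copy_state_data(ctx, buf.data()));
    return buf;
}

// ... and restore it before each prompt variant, instead of re-decoding the prefix.
static void restore_state(llama_context * ctx, std::vector<uint8_t> & buf) {
    llama_set_state_data(ctx, buf.data());
}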

I was confused by the llama_set_state_data() code in llama.cpp, which allocates a temporary graph with "tensor_3d" and does some update steps instead of just loading the kv cache data directly into the right kv cache tensors, so I think I am missing too much of the internals to mess around with them.

Also, neither ggml_backend_tensor_copy() nor ggml_backend_tensor_set() / ggml_backend_tensor_get() seem to be used anywhere in the llama.cpp project, which made me suspicious, and I did not dare to dabble with an unused API (without example usage).
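
For what it's worth, a read-back/write with that API would presumably look something like this (a minimal sketch under the assumption that the kv cache tensors actually live in a backend buffer; I have not verified this against the current CUDA offloading path):

#include <cstdint>
#include <vector>
#include "ggml.h"
#include "ggml-backend.h"

// Copy a (possibly GPU-resident) tensor into host memory ...
static std::vector<uint8_t> tensor_to_host(const struct ggml_tensor * t) {
    std::vector<uint8_t> host(ggml_nbytes(t));
    ggml_backend_tensor_get(t, host.data(), 0, host.size());
    return host;
}

// ... and push host data back into it.
static void host_to_tensor(struct ggml_tensor * t, const std::vector<uint8_t> & host) {
    ggml_backend_tensor_set(t, host.data(), 0, host.size());
}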

@cebtenzzre
Collaborator

cebtenzzre commented Nov 7, 2023

cc @slaren

edit: This (the -ngl N-1 issue with all examples) is a regression somewhere between b1198 (ebc9608) and b1398 (4e82b2e).
edit 2: It's db3abcc

db3abcc114d5d1790ba814aa1a80ac673d4ccc3e is the first bad commit
commit db3abcc114d5d1790ba814aa1a80ac673d4ccc3e
Author: Georgi Gerganov <[email protected]>
Date:   Sun Oct 8 20:19:14 2023 +0300

    sync : ggml (ggml-backend) (#3548)
    
    * sync : ggml (ggml-backend)
    
    ggml-ci
    
    * zig : add ggml-backend to the build

 CMakeLists.txt       |   2 +
 Makefile             |   7 +-
 build.zig            |  15 +-
 ggml-alloc.c         | 169 ++++++----------
 ggml-alloc.h         |  16 +-
 ggml-backend.c       | 385 ++++++++++++++++++++++++++++++++++++
 ggml-backend.h       | 143 ++++++++++++++
 ggml-cuda.cu         | 535 +++++++++++++++++++++++++++++++++++++++++++--------
 ggml-cuda.h          |   4 +
 ggml-metal.h         |  19 +-
 ggml-metal.m         | 137 +++++++++++++
 ggml.c               |  37 +---
 ggml.h               |  16 +-
 llama.cpp            |  44 ++---
 scripts/sync-ggml.sh |  24 +--
 15 files changed, 1285 insertions(+), 268 deletions(-)
 create mode 100644 ggml-backend.c
 create mode 100644 ggml-backend.h
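
For anyone retracing this, the bisect session was roughly as follows (revisions taken from the range above; the test at each step is building and running main with -ngl N-1 and checking for the ### output):

git bisect start
git bisect bad  4e82b2e   # b1398
git bisect good ebc9608   # b1198
# build, run main with -ngl N-1, then mark the result:
git bisect good           # or: git bisect bad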

@WeirdConstructor
Contributor

> edit: This is a regression somewhere between b1198 (ebc9608) and b1398 (4e82b2e).
> edit 2: It's db3abcc

@cebtenzzre #2422 is older (July) than those commits (September/October).

@cebtenzzre
Collaborator

Since the -ngl N-1 issue was fixed, do we still need both this issue and #2422?

@staviq
Contributor Author

staviq commented Nov 9, 2023

@cebtenzzre

Something is still broken: save-load-state produces garbage in the second run for both N and N-1 now (875fb42).
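
Reproduced with save-load-state along these lines (model path is a placeholder; this model has 35 offloadable layers):

./save-load-state -m models/model.gguf -ngl 34
./save-load-state -m models/model.gguf -ngl 35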

N-1

llm_load_tensors: offloaded 34/35 layers to GPU
llm_load_tensors: VRAM used: 7205.84 MB
...................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: VRAM kv self = 32.00 MB
llama_new_context_with_model: kv self size  =   64.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 79.63 MB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MB
llama_new_context_with_model: total VRAM used: 7310.84 MB (model: 7205.84 MB, context: 105.00 MB)

first run: The quick brown fox jumps over the lazy dog. This sentence contains all 26 letters of

llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: VRAM kv self = 32.00 MB
llama_new_context_with_model: kv self size  =   64.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 79.63 MB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MB
llama_new_context_with_model: total VRAM used: 7310.84 MB (model: 7205.84 MB, context: 105.00 MB)

second run: The quick brown fox jihads defined as non- from absolute to the right around.


main : error : the 2 generations are different

N

llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 7205.84 MB
...................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 64.00 MB
llama_new_context_with_model: kv self size  =   64.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 79.63 MB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MB
llama_new_context_with_model: total VRAM used: 7342.84 MB (model: 7205.84 MB, context: 137.00 MB)

first run: The quick brown fox jumps over the lazy dog.

The first Friday of every month is

llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 64.00 MB
llama_new_context_with_model: kv self size  =   64.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 79.63 MB
llama_new_context_with_model: VRAM scratch buffer: 73.00 MB
llama_new_context_with_model: total VRAM used: 7342.84 MB (model: 7205.84 MB, context: 137.00 MB)

second run: The quick brown fox j barriers ep joe/ person_maiden/ canonical/ github.

main : error : the 2 generations are different

@WeirdConstructor
Contributor

Yes, it's still a problem, like I posted over in the PR: #3982 (comment).

save-load-state in combination with GPU offloading has been broken for a long time.

@staviq
Contributor Author

staviq commented Nov 9, 2023

Ok, I went through #2422 and it seems to describe the second part of this issue pretty well.

So I think this can be marked as a duplicate of #2422 and closed, as they are at this point redundant.

Duplicate of #2422

@staviq staviq closed this as completed Nov 9, 2023