Broken generation with specific ngl values #3820
Comments
This sounds like the same problem I encountered in #2422. Edit: I wanted to add that this only happens when offloading all layers to the GPU with CUBLAS, because CLBLAST does not offload the KV cache as far as I know.
@WeirdConstructor Some kind of explicit back-sync function would pretty much solve it, and it only needs to happen on state save/load. The k/v split would probably have to be solved by rounding up n_gpu_layers so the k/v cache is always either offloaded or not offloaded. OpenCL didn't have this problem, but suspiciously it said there are exactly 2 fewer layers "offloadable", which would now make sense.
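For illustration, a minimal sketch of that rounding idea. The helper name and the "2 extra offloadable layers for the KV cache" constant are assumptions for this sketch, not llama.cpp code:

```cpp
// Hypothetical sketch: clamp the requested -ngl so the KV cache is either
// fully offloaded or not offloaded at all, never split.
// Assumes the KV cache counts as 2 extra "offloadable" layers on top of the
// model's repeating layers, matching the "exactly 2 fewer layers" observation.
static int round_up_n_gpu_layers(int requested, int n_layer) {
    const int kv_extra = 2;        // assumed: one extra layer each for K and V
    if (requested < n_layer) {
        return requested;          // KV cache stays on the host
    }
    return n_layer + kv_extra;     // round up so the KV cache is fully offloaded
}
```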
@staviq My primary use case is swapping prompts back and forth for evaluation in my prompt_runner branch, which I use for various benchmark purposes ( https://github.com/WeirdConstructor/llama.cpp/tree/prompt_runner/examples/prompt_runner ). It runs multiple prompts from a JSON file through inference and stores the results afterwards. As the prompts are often very similar, it would help a lot if I could reuse the kv cache from previous decoding steps. I was confused by the llama_set_state_data() code in the [...]. Also neither [...]
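As context, a minimal sketch of the state API this use case relies on (llama_get_state_size, llama_copy_state_data, llama_set_state_data); the helper names are just for illustration and `ctx` is assumed to be an already-initialized context with the shared prompt prefix decoded:

```cpp
#include "llama.h"

#include <cstdint>
#include <vector>

// Snapshot the context after decoding a shared prompt prefix, then restore it
// before each similar prompt so the KV cache does not have to be recomputed.
static std::vector<uint8_t> snapshot_state(llama_context * ctx) {
    std::vector<uint8_t> buf(llama_get_state_size(ctx));
    llama_copy_state_data(ctx, buf.data()); // serializes RNG, logits, embedding and KV cache
    return buf;
}

static void restore_state(llama_context * ctx, std::vector<uint8_t> & buf) {
    llama_set_state_data(ctx, buf.data()); // roll the context back to the snapshot
}
```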
cc @slaren. Edit: This (the `-ngl` N-1 issue with all examples) is a regression somewhere between b1198 (ebc9608) and b1398 (4e82b2e).
@cebtenzzre #2422 is older (July) than those commits (September/October).
Since the `-ngl` N-1 issue was fixed, do we still need both this issue and #2422?
Something is still broken, for both N-1 and N.
Yes, it's still a problem, like I posted over in the PR: #3982 (comment). save-load-state in combination with GPU offloading has been broken for a long time.
While playing with implementing compression for copy/save state, I found a bug which turned out to be reproducible in current `main` (41aee4d). It seems to be model independent, and no parameters other than `-ngl` seem to make a difference either.

The first symptom happens for `save-load-state`, `main` and `server` when `-ngl` equal to exactly N-1 is specified; basically this happens (generated output):

The second symptom was found by accident, when fiddling with `save-load-state` for the purpose of implementing compression. Basically, if `-ngl` is N or bigger (all layers loaded), the problem above seems to disappear, however:
Not only does `save-load-state` fail because the generated text is different for both runs, but also, after some tokens were sampled, `llama_copy_state_data` outputs a mostly empty array, which I only noticed because I tried to dump the state post generation and suddenly started to get a 99% compression ratio on that array, because it turned out to be mostly zeroes.

All `-ngl` values between 0 and (N-2) work properly.

I have no way of testing on AMD, so I do not know if it's Nvidia specific.
main.output.txt
main.log
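For reference, a small sketch of the kind of check that surfaces this: dump the state after generation with llama_copy_state_data and report how much of it is zero. The function name is made up for this sketch; the API calls are the real llama.cpp state functions:

```cpp
#include "llama.h"

#include <cstdint>
#include <cstdio>
#include <vector>

// Dump the context state after generation and report the fraction of zero
// bytes; with full offload (-ngl >= N) the buffer came back almost entirely
// zeroes, which is what produced the ~99% compression ratio mentioned above.
static void report_state_zeroes(llama_context * ctx) {
    std::vector<uint8_t> state(llama_get_state_size(ctx));
    const size_t written = llama_copy_state_data(ctx, state.data());
    if (written == 0) {
        return;
    }

    size_t zeroes = 0;
    for (size_t i = 0; i < written; i++) {
        if (state[i] == 0) {
            zeroes++;
        }
    }
    printf("state: %zu bytes, %.1f%% zero\n", written, 100.0 * zeroes / (double) written);
}
```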
As a sanity check, here are results for `-ngl` from 0 to N with the same model and parameters (except `-ngl`): out.txt
Edit: Interestingly enough, perplexity looks fine?