Bug: K cache without FA goes NaN on Llama 3.1 #10011
Comments
Since you say "anymore", can you do a git bisect and identify the commit that introduced the issue?
@JohannesGaessler See my comment here
@JohannesGaessler: I'm spared a headache here. @ikawrakow: Thanks!
If so, does compiling with
Regarding numerical precision in the KQ matrix multiplication: I didn't actually check this but I very much suspect the problem to be the reduced numerical range of FP16 rather than the reduced precision vs. FP32. The MMQ kernels do the dot product using integers, upcast the FP16 scales of the blocks to FP32, then do the scaling and accumulation at FP32 precision. That should not result in any precision issues vs. regular matrix multiplications, so I expect any problems to just come from regular bugs.
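For readers less familiar with how MMQ accumulates, here is a minimal host-side sketch of the pattern described above; the `BlockQ8` struct and `dot_q8` function are illustrative stand-ins, not the actual ggml block types or kernels.

```cpp
#include <cstdint>
#include <cstddef>

// Illustrative stand-in for a q8_0-style block: 32 int8 quants plus a scale.
// In the real format the scale is stored as FP16 and upcast to FP32 before use.
struct BlockQ8 {
    float  d;       // block scale (already upcast to FP32 here)
    int8_t qs[32];  // quantized values
};

// Dot product of two quantized rows: the per-block dot product is exact integer
// arithmetic; only the scaling and accumulation happen in FP32.
float dot_q8(const BlockQ8 * a, const BlockQ8 * b, size_t nblocks) {
    float acc = 0.0f;                 // FP32 accumulator
    for (size_t ib = 0; ib < nblocks; ++ib) {
        int32_t isum = 0;             // exact: no rounding inside the block
        for (int i = 0; i < 32; ++i) {
            isum += (int32_t) a[ib].qs[i] * (int32_t) b[ib].qs[i];
        }
        acc += a[ib].d * b[ib].d * (float) isum;  // scale + accumulate in FP32
    }
    return acc;
}
```

The point being made: nothing in this pattern is more prone to overflow or rounding than an ordinary FP32 matrix multiplication, so NaNs would point at a plain bug rather than numerics.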
I can reproduce the issue. It is caused by the use of MMQ but since MMVQ works correctly (and essentially does the same operations) it should not be an issue with numerics.
Speaking of which, looking at the
But then it gets called from the
i.e., with a reference to
When you supply a
That would be true if it were passed by value, but this function takes a const reference. In normal (non-CUDA) C++ this would be a compile-time error. What happens here? The compiler silently replaces pass-by-reference with pass-by-value and implicitly converts the argument? This does not sound right.
Temporaries can be bound to const references; this is standard C++ behavior.
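A small standalone example of that rule; the `Dim` type and `print_dim` function are made up for illustration, not taken from the CUDA code being discussed.

```cpp
#include <cstdio>

struct Dim {
    int n;
    Dim(int n_) : n(n_) {}  // implicit conversion from int is allowed
};

// Takes a const reference, like the kernel parameter being discussed.
static void print_dim(const Dim & d) {
    std::printf("n = %d\n", d.n);
}

int main() {
    int k = 64;
    // The compiler materializes a temporary Dim(64) and binds the const
    // reference to it; with a non-const Dim& parameter this would not compile.
    print_dim(k);
    return 0;
}
```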
You can check the resulting PTX code in NSight Compute. This is a snippet from the relevant section:
As you can see, it has a
I think the problem has to do with the KV cache dimensions. Because of how slow conditional statements are on GPUs, out-of-bounds reads in the
The KV cache is initialized to zero in
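To illustrate the interplay between padding and zero-initialization mentioned in the last two comments, here is a simplified sketch; the tile size, buffer layout, and function names are hypothetical, not the actual ggml-cuda code.

```cpp
#include <cstddef>
#include <vector>

constexpr size_t TILE = 64;  // hypothetical tile size a kernel processes per step

// Allocate the cache rounded up to a whole number of tiles and zero-filled,
// so loads that run past the logical end read zeros instead of garbage.
std::vector<float> make_padded_cache(size_t n_logical) {
    size_t n_padded = ((n_logical + TILE - 1) / TILE) * TILE;
    return std::vector<float>(n_padded, 0.0f);
}

// Because the tail is zero, the loop needs no per-element bounds check
// (branches are expensive on GPUs); the padded elements contribute nothing.
float sum_tilewise(const std::vector<float> & cache) {
    float acc = 0.0f;
    for (size_t i = 0; i < cache.size(); i += TILE) {
        for (size_t j = 0; j < TILE; ++j) {
            acc += cache[i + j];
        }
    }
    return acc;
}
```

If the padding is missing, not zeroed, or read with the wrong strides, these unguarded reads pick up garbage, which is one way NaNs can appear.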
I misinterpreted the results from Gemma 1. The reason why the results are correct is not the head size but the fact that that particular model only has a single head. As a result, the K cache data layout does not change when dimensions 1 and 2 are permuted. I extended
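A tiny self-contained illustration of why a single head hides the layout bug; the extents and offset formula here are simplified, not the actual ggml K-cache layout.

```cpp
#include <cassert>
#include <cstddef>

// Offset into a contiguous 3D tensor with extents (n0, n1, n2), fastest-varying in n0.
size_t offset(size_t i0, size_t i1, size_t i2, size_t n0, size_t n1) {
    return i0 + n0 * (i1 + n1 * i2);
}

int main() {
    const size_t head_dim = 128, n_head = 1, n_kv = 16;
    for (size_t t = 0; t < n_kv; ++t) {
        for (size_t d = 0; d < head_dim; ++d) {
            // Layout (head_dim, n_head, n_kv) vs. the permuted (head_dim, n_kv, n_head):
            // with n_head == 1 every element lands at the same offset either way,
            // so a kernel that ignores the permutation still reads the right data.
            size_t a = offset(d, 0, t, head_dim, n_head);
            size_t b = offset(d, t, 0, head_dim, n_kv);
            assert(a == b);
        }
    }
    return 0;
}
```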
Should be fixed by #10021. This issue seems to have been on master for a long time and apparently no one noticed; I myself thought I had tested K cache quantization but I obviously must have done something wrong.
No, that was before I ever touched the MMQ code and I vaguely remember that there was another issue with K cache quantization beforehand.
There still seems to be an issue with the combination of a q8_0 model with
Should be fixed by #10032 (for real this time).
@JohannesGaessler, congrats and thank you! Using #10015, #10021, and #10032 together, the quantized K cache without FA now works again in every mode, i.e. K f16, q8_0, q5_1, q5_0, q4_1, q4_0, and even iq4_nl (which also works for the PPL test in -FA KV mode iq4_nl/iq4_nl). As a side note, it might be pertinent to also enable K iq4_nl for generation, and why not in FA mode as well. @ikawrakow did the work on IK_Llama, and all users would benefit from having this option over the q4_0 K non-FA / KV FA cache, whose perplexity is a whole 1% higher at the same bitrate. I merged it into my KoboldCPP fork with sheer enthusiasm.
If I remember correctly, at the time that I added CUDA/FA support for quantized KV caches iq4_nl was not an option. I will revisit FA in the next few months and add support for iq4_nl while I'm at it (unless I forget).
Okay. Thanks again, Johannes, for the fast fixes.
What happened?
With the quantized K cache without FA, q4_0, q4_1, q5_0, q5_1, and q8_0 do not work anymore and go NaN instead.
Tested on Llama 3.1 8b Q5_K.
Name and Version
llama-cli, b3962 (written as 4104 due to my edits on quant strategies)
What operating system are you seeing the problem on?
Windows
Relevant log output