Bug: K cache without FA goes NaN on Llama 3.1. #10011

Closed
Nexesenex opened this issue Oct 23, 2024 · 22 comments · Fixed by #10021 or #10032
Labels
bug-unconfirmed, high severity (used to report high severity bugs in llama.cpp where a malfunction hinders an important workflow)

Comments

@Nexesenex
Contributor

Nexesenex commented Oct 23, 2024

What happened?

With a quantized K cache and FA disabled, q4_0, q4_1, q5_0, q5_1, and q8_0 no longer work and produce NaN instead.

Tested on Llama 3.1 8b Q5_K.

Name and Version

llama-cli, b3962 (reported as build 4104 due to my edits to the quantization strategies)

What operating system are you seeing the problem on?

Windows

Relevant log output

Q:\LLAMA_CUDA_123>llama-perplexity -m D:\text-generation-webui\models\Meta_Llama_3.1_8b_it-f16-iMat-Q5_K_S_q4_v6.gguf -f wiki.test.raw --parallel 1 -ngl 150 -b 1024 -ts 40,0 --no-mmap -ctk q8_0 -c 512 --chunks 211
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 4104 (aa180f4cb) with MSVC 19.38.33141.0 for
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23306 MiB free
llama_load_model_from_file: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23306 MiB free
llama_model_loader: loaded meta data with 31 key-value pairs and 292 tensors from D:\text-generation-webui\models\Meta_Llama_3.1_8b_it-f16-iMat-Q5_K_S_q4_v6.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta_Llama_3.1_8b_it
llama_model_loader: - kv   3:                         general.size_label str              = 8.0B
llama_model_loader: - kv   4:                            general.license str              = llama3.1
llama_model_loader: - kv   5:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   6:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   7:                          llama.block_count u32              = 32
llama_model_loader: - kv   8:                       llama.context_length u32              = 131072
llama_model_loader: - kv   9:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  10:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  11:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  12:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  13:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  14:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  15:                          general.file_type u32              = 16
llama_model_loader: - kv  16:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  17:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  18:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  19:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  20:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  21:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  22:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  23:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - kv  27:                      quantize.imatrix.file str              = Q:\iMatrix\Meta_Llama_3.1_8b_it-f16.i...
llama_model_loader: - kv  28:                   quantize.imatrix.dataset str              = groups_merged-enhancedV3_FR_SRB_HR.txt
llama_model_loader: - kv  29:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  30:              quantize.imatrix.chunks_count i32              = 145
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q4_K:   32 tensors
llama_model_loader: - type q5_K:  161 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q5_K - Small
llm_load_print_meta: model params     = 8.030 B
llm_load_print_meta: model size       =   5542174976.00 Bytes (5.521 BPW)
llm_load_print_meta: model size       =      5542174.98 KB    (5.521 BPW)
llm_load_print_meta: model size       =      5412280.25 KiB   (5.521 BPW)
llm_load_print_meta: model size       =         5542.17 MB    (5.521 BPW)
llm_load_print_meta: model size       =         5285.43 MiB   (5.521 BPW)
llm_load_print_meta: model size       =            5.54 GB    (5.521 BPW)
llm_load_print_meta: model size       =            5.16 GiB   (5.521 BPW)
llm_load_print_meta: repeating layers =   4750065920.00 Bytes (5.445 BPW)
llm_load_print_meta: repeating layers =      4750065.92 KB    (5.445 BPW)
llm_load_print_meta: repeating layers =      4638736.25 KiB   (5.445 BPW)
llm_load_print_meta: repeating layers =         4750.07 MB    (5.445 BPW)
llm_load_print_meta: repeating layers =         4530.02 MiB   (5.445 BPW)
llm_load_print_meta: repeating layers =            4.75 GB    (5.445 BPW)
llm_load_print_meta: repeating layers =            4.42 GiB   (5.445 BPW)
, 6.980 B parameters)
llm_load_print_meta: general.name     = Meta_Llama_3.1_8b_it
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =   344.44 MiB
llm_load_tensors:      CUDA0 buffer size =  4941.00 MiB
........................................................................................
llama_new_context_with_model: n_ctx      = 1024
llama_new_context_with_model: n_batch    = 1024
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =    98.00 MiB
llama_new_context_with_model: KV self size  =   98.00 MiB, K (q8_0):   34.00 MiB, V (f16):   64.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.98 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model:      CUDA0 compute buffer size =   258.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    10.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 271.345 ms
perplexity: calculating perplexity over 211 chunks, n_ctx=512, batch_size=1024, n_seq=2
perplexity: 0.89 seconds per pass - ETA 1.57 minutes
[1]-nan,[2]-nan,[3]-nan,[4]-nan,[5]-nan,[6]-nan,[7]-nan,[8]-nan,[9]-nan,[10]-nan,[11]-nan,[12]-nan,[13]-nan,[14]-nan,[15]-nan,[16]-nan,[17]-nan,[18]-nan,[19]-nan,[20]-nan,[21]-nan,[22]-nan,[23]-nan,[24]-nan,[25]-nan,[26]-nan,[27]-nan,[28]-nan,[29]-nan,[30]-nan,[31]-nan,[32]-nan,[33]-nan,[34]-nan,[35]-nan,[36]-nan,[37]-nan,[38]-nan,[39]
@Nexesenex added the bug-unconfirmed and high severity labels on Oct 23, 2024
@JohannesGaessler
Collaborator

Since you say "anymore", can you do a git bisect and identify the commit that introduced the issue?

@ikawrakow
Contributor

@JohannesGaessler See my comment here

@Nexesenex
Contributor Author

@JohannesGaessler: I'm spared a headache, then. @ikawrakow: Thanks!

@JohannesGaessler
Collaborator

If so, does compiling with GGML_CUDA_FORCE_CUBLAS fix the issue?

@JohannesGaessler
Collaborator

JohannesGaessler commented Oct 23, 2024

Regarding numerical precision in the KQ matrix multiplication: I didn't actually check this, but I very much suspect the problem would be the reduced numerical range of FP16 rather than the reduced precision vs. FP32. The MMQ kernels do the dot product using integers, upcast the FP16 block scales to FP32, then do the scaling and accumulation at FP32 precision. That should not result in any precision issues vs. regular matrix multiplications, so I expect any problems to come from ordinary bugs.
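
For illustration, a minimal CUDA sketch of the accumulation pattern described above (this is not the actual MMQ kernel; the function name and block layout are invented):

// Toy version of the scheme: the dot product runs on packed int8 data via
// __dp4a, the FP16 block scales are upcast to FP32, and the scaling and
// accumulation are done in FP32.
#include <cuda_fp16.h>

__device__ float block_dot_q8_sketch(const int * v, const int * u, half d_v, half d_u) {
    int sumi = 0;
#pragma unroll
    for (int i = 0; i < 8; ++i) {
        sumi = __dp4a(v[i], u[i], sumi);   // 4-way int8 dot product, int32 accumulator
    }
    // upcast the FP16 scales, then scale and accumulate at FP32 precision
    return __half2float(d_v) * __half2float(d_u) * (float) sumi;
}

The integer accumulation is exact and the scales are handled in FP32 before the final multiply, which is consistent with the suspicion that the NaNs come from an ordinary bug rather than from numerics.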

@JohannesGaessler
Collaborator

I can reproduce the issue. It is caused by the use of MMQ but since MMVQ works correctly (and essentially does the same operations) it should not be an issue with numerics.

@ikawrakow
Contributor

...but since MMVQ works correctly

Speaking of which, looking at the vec_dot_q4_0_q8_1_impl function I notice that it takes a const reference to a float as the Q4_0 scale:

template <int vdr> static __device__ __forceinline__ float vec_dot_q4_0_q8_1_impl(
    const int * v, const int * u, const float & d4, const half2 & ds8) {

But then it gets called from the vec_dot_q4_0_q8_1 kernel like this:

 return vec_dot_q4_0_q8_1_impl<VDR_Q4_0_Q8_1_MMVQ>(v, u, bq4_0->d, bq8_1->ds);

i.e., with a reference to ggml_half. How is this supposed to work? Is the nvcc compiler auto-fixing such bugs?

@JohannesGaessler
Collaborator

When you supply a half value to a float argument in CUDA it is automatically cast to float.

@ikawrakow
Contributor

When you supply a half value to a float argument in CUDA it is automatically cast to float.

That would be true if it were passed by value, but this function takes a const reference. In normal (non-CUDA) C++ this would be a compile-time error. What happens here? Does the compiler silently replace pass-by-reference with pass-by-value and implicitly convert the argument? That does not sound right.

@slaren
Member

slaren commented Oct 23, 2024

Temporaries can be bound to const references; this is standard C++ behavior.
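
As a host-side illustration (plain C++, no CUDA types; the short is just a stand-in for ggml_half):

#include <cstdio>

static float scale(const float & d) {   // takes a const reference to float
    return 2.0f * d;
}

int main() {
    short h = 3;                         // stand-in for a half / ggml_half value
    // h is implicitly converted to a float temporary, and that temporary
    // binds to the const float & parameter for the duration of the call
    printf("%f\n", scale(h));            // prints 6.000000
    return 0;
}

In the CUDA case, __half provides a conversion to float (unless conversions are disabled at compile time), so bq4_0->d likewise yields a float temporary that binds to the const float & parameter.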

@JohannesGaessler
Collaborator

You can check the resulting PTX code in NSight Compute. This is a snippet from the relevant section:

sub.s32 	%r49, %r47, %r8;
ld.global.nc.u16 	%rs4, [%rd18+6];
ld.global.nc.u16 	%rs5, [%rd18+8];
mov.b32 	%r50, {%rs4, %rs5};
mul.wide.s32 	%rd24, %r49, 4;
add.s64 	%rd25, %rd21, %rd24;
ld.global.nc.u32 	%r37, [%rd25];
ld.global.nc.u32 	%r41, [%rd23+20];
ld.global.nc.u16 	%rs1, [%rd16];
{  cvt.f32.f16 %f11, %rs1;}
and.b32  	%r28, %r48, 252645135;
shr.u32 	%r51, %r48, 4;
and.b32  	%r32, %r51, 252645135;
mov.u32 	%r30, 0;
dp4a.s32.s32 %r27, %r28, %r29, %r30;

As you can see it has a cvt.f32.f16 instruction.

@JohannesGaessler
Collaborator

JohannesGaessler commented Oct 23, 2024

I think the problem has to do with the KV cache dimensions. Because of how slow conditional statements are on GPUs, out-of-bounds reads in the ne00/ne10 dimension are handled by padding ne10 with zeros. The GPU then reads in and processes slightly more data, but that is faster than checking the matrix dimensions. For this to work, the last row of src0 also needs to be padded with zeros. I think this is not being done for the K cache, so if any of the data that is read in encodes a non-finite scale, the result as a whole becomes NaN. The ne00 dimension for the KQ calculation is the head size, which is 128 for LLaMA. Gemma 1, with a head size of 256, seems to work correctly when quantizing the K cache.
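
A toy illustration of this failure mode (not llama.cpp code; the values are made up): if the padded tail of the K data contains a garbage scale that decodes to inf or NaN, the whole accumulator is poisoned even though the zero-padded side contributes an integer dot product of zero.

#include <cmath>
#include <cstdio>
#include <limits>

int main() {
    const float acc_valid = 1.0f;   // partial sum accumulated from the valid blocks
    const int   sumi_pad  = 0;      // integer dot product over the zero-padded tail

    // scale read from an uninitialized padded block, decoding to inf
    const float d_garbage = std::numeric_limits<float>::infinity();
    const float acc_bad   = acc_valid + d_garbage * (float) sumi_pad;   // inf * 0 = NaN

    // scale read from a zero-initialized padded block
    const float d_zeroed  = 0.0f;
    const float acc_ok    = acc_valid + d_zeroed * (float) sumi_pad;    // stays finite

    printf("garbage padding: %f (isnan = %d)\n", acc_bad, (int) std::isnan(acc_bad));
    printf("zeroed padding : %f\n", acc_ok);
    return 0;
}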

@slaren
Member

slaren commented Oct 23, 2024

The KV cache is initialized to zero in llama_kv_cache_init (precisely to avoid NaNs from the padding); shouldn't that avoid this problem?

@Nexesenex changed the title from "Bug: Bug: K cache without FA goes Nan on Llama 3.1." to "Bug: K cache without FA goes Nan on Llama 3.1." on Oct 23, 2024
@JohannesGaessler
Collaborator

I misinterpreted the results from Gemma 1. The reason why the results are correct is not the head size but the fact that that particular model only has a single head. As a result the K cache data layout does not change when dimensions 1 and 2 are permuted. I extended test-backend-ops to also test for non-contiguous input matrices and those tests fail for CUDA. I'll fix it and make a PR.
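
The single-head observation can be checked with a small host-side sketch (the dimension sizes here are illustrative, not llama.cpp's actual cache layout): permuting dimensions 1 and 2 of a contiguous tensor only changes the memory layout when both of those dimensions are larger than 1.

#include <cstdio>

// flat offset into a contiguous buffer with dims [d0][d1][d2], d0 fastest
static int flat_offset(int i0, int i1, int i2, int d0, int d1) {
    return i0 + d0 * (i1 + d1 * i2);
}

// true if a permuted *view* (swap dims 1 and 2, data left in place) maps every
// element to the same offset as a contiguous tensor with the swapped dims
static bool permute_1_2_keeps_layout(int d0, int d1, int d2) {
    for (int i2 = 0; i2 < d2; ++i2)
        for (int i1 = 0; i1 < d1; ++i1)
            for (int i0 = 0; i0 < d0; ++i0)
                if (flat_offset(i0, i1, i2, d0, d1) != flat_offset(i0, i2, i1, d0, d2))
                    return false;
    return true;
}

int main() {
    // e.g. (head_dim, n_heads, n_tokens); only the single-head case is a no-op
    printf("1 head : %d\n", (int) permute_1_2_keeps_layout(128, 1, 512));
    printf("8 heads: %d\n", (int) permute_1_2_keeps_layout(128, 8, 512));
    return 0;
}

A size-1 dimension makes the permutation invisible in memory, which is how Gemma 1's single head masked the bug in the non-contiguous path.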

@JohannesGaessler
Collaborator

JohannesGaessler commented Oct 23, 2024

Should be fixed by #10021 . This issue seems to have been on master for a long time and apparently no one noticed; I myself thought I had tested K cache quantization but I obviously must have done something wrong.

@Dampfinchen

Should be fixed by #10021 . This issue seems to have been on master for a long time and apparently no one noticed; I myself thought I had tested K cache quantization but I obviously must have done something wrong.

Might be related to what I was posting here: #8853

@JohannesGaessler
Collaborator

No, that was before I ever touched the MMQ code and I vaguely remember that there was another issue with K cache quantization beforehand.

@JohannesGaessler
Collaborator

JohannesGaessler commented Oct 24, 2024

There still seems to be an issue with the combination of a q8_0 model with -ctk q8_0 and no FA; see ikawrakow/ik_llama.cpp#103 (comment).

@JohannesGaessler
Collaborator

Should be fixed by #10032 (for real this time).

@Nexesenex
Contributor Author

@JohannesGaessler, congrats and thank you!

Using #10015, #10021, and #10032 together, the quantized K cache without FA now works again in every mode: K f16, q8_0, q5_1, q5_0, q4_1, q4_0, and even iq4_nl (which also works for the PPL test in no-FA KV mode iq4_nl/iq4_nl).

As a side note, it might be worthwhile to enable iq4_nl for the K cache during generation as well, and why not in FA mode too. @ikawrakow did the work in IK_Llama, and all users would benefit from having this option over the q4_0 K cache (non-FA) / q4_0 KV cache (FA), whose perplexity is a full 1% higher at the same bitrate. I merged it into my KoboldCPP fork with sheer enthusiasm.

@JohannesGaessler
Collaborator

If I remember correctly, iq4_nl was not an option at the time I added CUDA/FA support for quantized KV caches. I will revisit FA in the next few months and add support for iq4_nl while I'm at it (unless I forget).

@Nexesenex
Contributor Author

Okay. Thanks again, Johannes, for the fast fixes.
