Bug: K cache without FA goes NaN on Llama 3.1. #10011

Closed
Nexesenex opened this issue Oct 23, 2024 · 22 comments · Fixed by #10021 or #10032
Labels
bug-unconfirmed, high severity (used to report high severity bugs in llama.cpp where a malfunction hinders an important workflow)

Comments

@Nexesenex
Contributor

Nexesenex commented Oct 23, 2024

What happened?

With a quantized K cache and FA disabled, q4_0, q4_1, q5_0, q5_1, and q8_0 no longer work and produce NaN instead.

Tested on Llama 3.1 8b Q5_K.

Name and Version

llama-cli, b3962 (reported as build 4104 due to my edits to the quantization strategies)

What operating system are you seeing the problem on?

Windows

Relevant log output

Q:\LLAMA_CUDA_123>llama-perplexity -m D:\text-generation-webui\models\Meta_Llama_3.1_8b_it-f16-iMat-Q5_K_S_q4_v6.gguf -f wiki.test.raw --parallel 1 -ngl 150 -b 1024 -ts 40,0 --no-mmap -ctk q8_0 -c 512 --chunks 211
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
build: 4104 (aa180f4cb) with MSVC 19.38.33141.0 for
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23306 MiB free
llama_load_model_from_file: using device CUDA1 (NVIDIA GeForce RTX 3090) - 23306 MiB free
llama_model_loader: loaded meta data with 31 key-value pairs and 292 tensors from D:\text-generation-webui\models\Meta_Llama_3.1_8b_it-f16-iMat-Q5_K_S_q4_v6.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta_Llama_3.1_8b_it
llama_model_loader: - kv   3:                         general.size_label str              = 8.0B
llama_model_loader: - kv   4:                            general.license str              = llama3.1
llama_model_loader: - kv   5:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   6:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   7:                          llama.block_count u32              = 32
llama_model_loader: - kv   8:                       llama.context_length u32              = 131072
llama_model_loader: - kv   9:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  10:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  11:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  12:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  13:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  14:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  15:                          general.file_type u32              = 16
llama_model_loader: - kv  16:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  17:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  18:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  19:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  20:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  21:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  22:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  23:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  26:               general.quantization_version u32              = 2
llama_model_loader: - kv  27:                      quantize.imatrix.file str              = Q:\iMatrix\Meta_Llama_3.1_8b_it-f16.i...
llama_model_loader: - kv  28:                   quantize.imatrix.dataset str              = groups_merged-enhancedV3_FR_SRB_HR.txt
llama_model_loader: - kv  29:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  30:              quantize.imatrix.chunks_count i32              = 145
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q4_K:   32 tensors
llama_model_loader: - type q5_K:  161 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q5_K - Small
llm_load_print_meta: model params     = 8.030 B
llm_load_print_meta: model size       =   5542174976.00 Bytes (5.521 BPW)
llm_load_print_meta: model size       =      5542174.98 KB    (5.521 BPW)
llm_load_print_meta: model size       =      5412280.25 KiB   (5.521 BPW)
llm_load_print_meta: model size       =         5542.17 MB    (5.521 BPW)
llm_load_print_meta: model size       =         5285.43 MiB   (5.521 BPW)
llm_load_print_meta: model size       =            5.54 GB    (5.521 BPW)
llm_load_print_meta: model size       =            5.16 GiB   (5.521 BPW)
llm_load_print_meta: repeating layers =   4750065920.00 Bytes (5.445 BPW)
llm_load_print_meta: repeating layers =      4750065.92 KB    (5.445 BPW)
llm_load_print_meta: repeating layers =      4638736.25 KiB   (5.445 BPW)
llm_load_print_meta: repeating layers =         4750.07 MB    (5.445 BPW)
llm_load_print_meta: repeating layers =         4530.02 MiB   (5.445 BPW)
llm_load_print_meta: repeating layers =            4.75 GB    (5.445 BPW)
llm_load_print_meta: repeating layers =            4.42 GiB   (5.445 BPW)
, 6.980 B parameters)
llm_load_print_meta: general.name     = Meta_Llama_3.1_8b_it
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:  CUDA_Host buffer size =   344.44 MiB
llm_load_tensors:      CUDA0 buffer size =  4941.00 MiB
........................................................................................
llama_new_context_with_model: n_ctx      = 1024
llama_new_context_with_model: n_batch    = 1024
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =    98.00 MiB
llama_new_context_with_model: KV self size  =   98.00 MiB, K (q8_0):   34.00 MiB, V (f16):   64.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.98 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model:      CUDA0 compute buffer size =   258.50 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    10.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
perplexity: tokenizing the input ..
perplexity: tokenization took 271.345 ms
perplexity: calculating perplexity over 211 chunks, n_ctx=512, batch_size=1024, n_seq=2
perplexity: 0.89 seconds per pass - ETA 1.57 minutes
[1]-nan,[2]-nan,[3]-nan,[4]-nan,[5]-nan,[6]-nan,[7]-nan,[8]-nan,[9]-nan,[10]-nan,[11]-nan,[12]-nan,[13]-nan,[14]-nan,[15]-nan,[16]-nan,[17]-nan,[18]-nan,[19]-nan,[20]-nan,[21]-nan,[22]-nan,[23]-nan,[24]-nan,[25]-nan,[26]-nan,[27]-nan,[28]-nan,[29]-nan,[30]-nan,[31]-nan,[32]-nan,[33]-nan,[34]-nan,[35]-nan,[36]-nan,[37]-nan,[38]-nan,[39]
@Nexesenex added the bug-unconfirmed and high severity labels on Oct 23, 2024
@JohannesGaessler
Collaborator

Since you say "anymore", can you do a git bisect and identify the commit that introduced the issue?

@ikawrakow
Contributor

@JohannesGaessler See my comment here

@Nexesenex
Contributor Author

@JohannesGaessler: I'm spared a headache, then. @ikawrakow: Thanks!

@JohannesGaessler
Collaborator

If so, does compiling with GGML_CUDA_FORCE_CUBLAS fix the issue?

@JohannesGaessler
Collaborator

JohannesGaessler commented Oct 23, 2024

Regarding numerical precision in the KQ matrix multiplication: I didn't actually check this, but I very much suspect the problem would be the reduced numerical range of FP16 rather than the reduced precision vs. FP32. The MMQ kernels do the dot product using integers, upcast the FP16 block scales to FP32, then do the scaling and accumulation at FP32 precision. That should not result in any precision issues vs. regular matrix multiplications, so I expect any problems to come from ordinary bugs.
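
For illustration, a minimal CUDA sketch of the accumulation pattern described above (this is not the actual MMQ kernel; the function name and block layout are invented):

// Toy version of the scheme: the dot product runs on packed int8 data via
// __dp4a, the FP16 block scales are upcast to FP32, and the scaling and
// accumulation are done in FP32.
#include <cuda_fp16.h>

__device__ float block_dot_q8_sketch(const int * v, const int * u, half d_v, half d_u) {
    int sumi = 0;
#pragma unroll
    for (int i = 0; i < 8; ++i) {
        sumi = __dp4a(v[i], u[i], sumi);   // 4-way int8 dot product, int32 accumulator
    }
    // upcast the FP16 scales, then scale and accumulate at FP32 precision
    return __half2float(d_v) * __half2float(d_u) * (float) sumi;
}

The integer accumulation is exact and the scales are handled in FP32 before the final multiply, which is consistent with the suspicion that the NaNs come from an ordinary bug rather than from numerics.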

@JohannesGaessler
Collaborator

I can reproduce the issue. It is caused by the use of MMQ but since MMVQ works correctly (and essentially does the same operations) it should not be an issue with numerics.

@ikawrakow
Contributor

...but since MMVQ works correctly

Speaking of which, looking at the vec_dot_q4_0_q8_1_impl function I notice that it takes a const reference to a float as the Q4_0 scale:

template <int vdr> static __device__ __forceinline__ float vec_dot_q4_0_q8_1_impl(
    const int * v, const int * u, const float & d4, const half2 & ds8) {

But then it gets called from the vec_dot_q4_0_q8_1 kernel like this:

 return vec_dot_q4_0_q8_1_impl<VDR_Q4_0_Q8_1_MMVQ>(v, u, bq4_0->d, bq8_1->ds);

i.e., with a reference to ggml_half. How is this supposed to work? Is the nvcc compiler auto-fixing such bugs?

@JohannesGaessler
Collaborator

When you supply a half value to a float argument in CUDA it is automatically cast to float.

@ikawrakow
Contributor

When you supply a half value to a float argument in CUDA it is automatically cast to float.

That would be true if it were passed by value, but this function takes a const reference. In normal (non-CUDA) C++ this would be a compile-time error. What happens here? Does the compiler silently replace pass-by-reference with pass-by-value and implicitly convert the argument? That does not sound right.

@slaren
Member

slaren commented Oct 23, 2024

Temporaries can be bound to const references; this is standard C++ behavior.
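
As a host-side illustration (plain C++, no CUDA types; the short is just a stand-in for ggml_half):

#include <cstdio>

static float scale(const float & d) {   // takes a const reference to float
    return 2.0f * d;
}

int main() {
    short h = 3;                         // stand-in for a half / ggml_half value
    // h is implicitly converted to a float temporary, and that temporary
    // binds to the const float & parameter for the duration of the call
    printf("%f\n", scale(h));            // prints 6.000000
    return 0;
}

In the CUDA case, __half provides a conversion to float (unless conversions are disabled at compile time), so bq4_0->d likewise yields a float temporary that binds to the const float & parameter.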

@JohannesGaessler
Collaborator

You can check the resulting PTX code in NSight Compute. This is a snippet from the relevant section:

sub.s32 	%r49, %r47, %r8;
ld.global.nc.u16 	%rs4, [%rd18+6];
ld.global.nc.u16 	%rs5, [%rd18+8];
mov.b32 	%r50, {%rs4, %rs5};
mul.wide.s32 	%rd24, %r49, 4;
add.s64 	%rd25, %rd21, %rd24;
ld.global.nc.u32 	%r37, [%rd25];
ld.global.nc.u32 	%r41, [%rd23+20];
ld.global.nc.u16 	%rs1, [%rd16];
{  cvt.f32.f16 %f11, %rs1;}
and.b32  	%r28, %r48, 252645135;
shr.u32 	%r51, %r48, 4;
and.b32  	%r32, %r51, 252645135;
mov.u32 	%r30, 0;
dp4a.s32.s32 %r27, %r28, %r29, %r30;

As you can see it has a cvt.f32.f16 instruction.

@JohannesGaessler
Collaborator

JohannesGaessler commented Oct 23, 2024

I think the problem has to do with the KV cache dimensions. Because of how slow conditional statements are on GPUs, out-of-bounds reads in the ne00/ne10 dimension are handled by padding ne10 with zeros. The GPU then reads in and processes slightly more data, but that is faster than checking the matrix dimensions. For this to work, the last row of src0 also needs to be padded with zeros. I think this is not being done for the K cache, so if any of the data that is read in encodes a non-finite scale, the result as a whole becomes NaN. The ne00 dimension for the KQ calculation is the head size, which is 128 for LLaMA. Gemma 1, with a head size of 256, seems to work correctly when quantizing the K cache.
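
A toy illustration of this failure mode (not llama.cpp code; the values are made up): if the padded tail of the K data contains a garbage scale that decodes to inf or NaN, the whole accumulator is poisoned even though the zero-padded side contributes an integer dot product of zero.

#include <cmath>
#include <cstdio>
#include <limits>

int main() {
    const float acc_valid = 1.0f;   // partial sum accumulated from the valid blocks
    const int   sumi_pad  = 0;      // integer dot product over the zero-padded tail

    // scale read from an uninitialized padded block, decoding to inf
    const float d_garbage = std::numeric_limits<float>::infinity();
    const float acc_bad   = acc_valid + d_garbage * (float) sumi_pad;   // inf * 0 = NaN

    // scale read from a zero-initialized padded block
    const float d_zeroed  = 0.0f;
    const float acc_ok    = acc_valid + d_zeroed * (float) sumi_pad;    // stays finite

    printf("garbage padding: %f (isnan = %d)\n", acc_bad, (int) std::isnan(acc_bad));
    printf("zeroed padding : %f\n", acc_ok);
    return 0;
}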

@slaren
Member

slaren commented Oct 23, 2024

The KV cache is initialized to zero in llama_kv_cache_init (precisely to avoid NaNs from the padding); shouldn't that avoid this problem?

@Nexesenex changed the title from "Bug: Bug: K cache without FA goes Nan on Llama 3.1." to "Bug: K cache without FA goes Nan on Llama 3.1." on Oct 23, 2024
@JohannesGaessler
Collaborator

I misinterpreted the results from Gemma 1. The reason why the results are correct is not the head size but the fact that that particular model only has a single head. As a result the K cache data layout does not change when dimensions 1 and 2 are permuted. I extended test-backend-ops to also test for non-contiguous input matrices and those tests fail for CUDA. I'll fix it and make a PR.
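
The single-head observation can be checked with a small host-side sketch (the dimension sizes here are illustrative, not llama.cpp's actual cache layout): permuting dimensions 1 and 2 of a contiguous tensor only changes the memory layout when both of those dimensions are larger than 1.

#include <cstdio>

// flat offset into a contiguous buffer with dims [d0][d1][d2], d0 fastest
static int flat_offset(int i0, int i1, int i2, int d0, int d1) {
    return i0 + d0 * (i1 + d1 * i2);
}

// true if a permuted *view* (swap dims 1 and 2, data left in place) maps every
// element to the same offset as a contiguous tensor with the swapped dims
static bool permute_1_2_keeps_layout(int d0, int d1, int d2) {
    for (int i2 = 0; i2 < d2; ++i2)
        for (int i1 = 0; i1 < d1; ++i1)
            for (int i0 = 0; i0 < d0; ++i0)
                if (flat_offset(i0, i1, i2, d0, d1) != flat_offset(i0, i2, i1, d0, d2))
                    return false;
    return true;
}

int main() {
    // e.g. (head_dim, n_heads, n_tokens); only the single-head case is a no-op
    printf("1 head : %d\n", (int) permute_1_2_keeps_layout(128, 1, 512));
    printf("8 heads: %d\n", (int) permute_1_2_keeps_layout(128, 8, 512));
    return 0;
}

A size-1 dimension makes the permutation invisible in memory, which is how Gemma 1's single head masked the bug in the non-contiguous path.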

@JohannesGaessler
Collaborator

JohannesGaessler commented Oct 23, 2024

Should be fixed by #10021 . This issue seems to have been on master for a long time and apparently no one noticed; I myself thought I had tested K cache quantization but I obviously must have done something wrong.

@Dampfinchen

Should be fixed by #10021 . This issue seems to have been on master for a long time and apparently no one noticed; I myself thought I had tested K cache quantization but I obviously must have done something wrong.

Might be related to what I was posting here: #8853

@JohannesGaessler
Collaborator

No, that was before I ever touched the MMQ code and I vaguely remember that there was another issue with K cache quantization beforehand.

@JohannesGaessler
Collaborator

JohannesGaessler commented Oct 24, 2024

There still seems to be an issue with the combination of a q8_0 model with -ctk q8_0 and no FA; see ikawrakow/ik_llama.cpp#103 (comment).

@JohannesGaessler
Collaborator

Should be fixed by #10032 (for real this time).

@Nexesenex
Contributor Author

@JohannesGaessler, congrats and thank you!

Using #10015, #10021, and #10032 together, the quantized K cache without FA now works again in every mode: K f16, q8_0, q5_1, q5_0, q4_1, q4_0, and even iq4_nl (which also works for the PPL test in no-FA KV mode iq4_nl/iq4_nl).

As a side note, it might be worthwhile to enable iq4_nl for the K cache during generation as well, and why not in FA mode too. @ikawrakow did the work in IK_Llama, and all users would benefit from having this option over the q4_0 K cache (non-FA) / q4_0 KV cache (FA), whose perplexity is a full 1% higher at the same bitrate. I merged it into my KoboldCPP fork with sheer enthusiasm.

@JohannesGaessler
Collaborator

If I remember correctly, iq4_nl was not an option at the time I added CUDA/FA support for quantized KV caches. I will revisit FA in the next few months and add support for iq4_nl while I'm at it (unless I forget).

@Nexesenex
Contributor Author

Okay. Thanks again, Johannes, for the fast fixes.
