
llama : fix FA when KV cache is not used (i.e. embeddings) #12825


Merged
ggerganov merged 3 commits into master from gg/embd-fix-fa on Apr 8, 2025

Conversation

ggerganov
Member

fix #12815

When computing the attention with no KV cache in use (e.g. for embedding models with non-causal attention), the FA branch did not cast the K and V tensors, so they remained in F32 format. We now cast them to F16 to avoid having to add full FA support for F32 types.
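For illustration, here is a minimal sketch of the idea in the graph-building code — not the exact PR diff; names such as use_kv_cache, ctx0, q, k, v, kq_mask and kq_scale are placeholders — assuming ggml's public ggml_cast() and ggml_flash_attn_ext() API:

```c
// Sketch: when no KV cache is used, K and V come straight from the current
// batch and are still F32, so cast them to F16 before building the FA op.
if (!use_kv_cache) {
    if (k->type == GGML_TYPE_F32) {
        k = ggml_cast(ctx0, k, GGML_TYPE_F16);
    }
    if (v->type == GGML_TYPE_F32) {
        v = ggml_cast(ctx0, v, GGML_TYPE_F16);
    }
}

// kq_scale is typically 1/sqrt(head_dim); max_bias and logit_softcap left at 0.0f
struct ggml_tensor * cur = ggml_flash_attn_ext(ctx0, q, k, v, kq_mask, kq_scale, 0.0f, 0.0f);
```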

Also:

  • CPU FA now supports an F32 V-type (not used by default, but supported)
  • Add a server test that exercises embeddings with FA enabled

TODO:

We still allocate the KV cache for embedding models, even though it is not used by the computation graph. This should be fixed by refactoring llama-context to allow kv_self to not be constructed at all. Doing so would save a lot of VRAM and would improve embedding performance, because we would no longer search for KV cache slots for each batch. Might try to fix this in #12799.

@ggerganov ggerganov requested a review from ngxson as a code owner April 8, 2025 10:52
@github-actions github-actions bot added the examples, python (python script changes), server, ggml (changes relating to the ggml tensor library for machine learning), and Apple Metal (https://en.wikipedia.org/wiki/Metal_(API)) labels Apr 8, 2025
@ggerganov ggerganov merged commit a19b5ce into master Apr 8, 2025
62 checks passed
@ggerganov ggerganov deleted the gg/embd-fix-fa branch April 8, 2025 16:54
tastelikefeet added a commit to tastelikefeet/llama.cpp that referenced this pull request Apr 10, 2025
* master: (123 commits)
  cuda : add f32 to bf16 copy op (ggml-org#12806)
  llava: improve clip_ctx destructor to not memleak load_image_size (ggml-org#12834)
  llama : fix FA when KV cache is not used (i.e. embeddings) (ggml-org#12825)
  server : fix thread.join() on exit (ggml-org#12831)
  llava: add more helper functions to check projector types in clip context (ggml-org#12824)
  arg : Including limits file on AIX (ggml-org#12822)
  server : webui : Improve Chat Input with Auto-Sizing Textarea (ggml-org#12785)
  Revert "sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor" (ggml-org#12812)
  gguf-py : support lazy tensor splitting (ggml-org#12809)
  llama : Support llama 4 text-only (ggml-org#12791)
  opencl: better identify Adreno GPU (ggml-org#12760)
  hellaswag: display estimated score confidence interval (ggml-org#12797)
  cuda : fix HIP and MUSA BF16 (#0)
  sync : ggml
  ggml : simplify Arm fp16 CPU logic (ggml/1177)
  CUDA: don't convert BF16 weights to FP32 (ggml/1174)
  cpu: move all the operators into a separate c++ file (except mul_mat) (ggml/1167)
  sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor (ggml-org#12734)
  ci : no curl on ggml-ci (ggml-org#12796)
  cmake : enable curl by default (ggml-org#12761)
  ...

# Conflicts:
#	common/arg.cpp
#	common/common.cpp
#	common/common.h
colout pushed a commit to colout/llama.cpp that referenced this pull request Apr 21, 2025
…12825)

* ggml : FA supports F32 V

* graph : cast KV to F16 when the KV cache is not used

ggml-ci

* server : add test that exercises embeddings with FA enabled

ggml-ci
Successfully merging this pull request may close these issues.

Eval bug: GGML_ASSERT(q_to_vec_dot && "fattn: unsupported K-type") failed with Vulkan