
llama : fix FA when KV cache is not used (i.e. embeddings) #12825


Merged
ggerganov merged 3 commits into master from gg/embd-fix-fa on Apr 8, 2025

Conversation

ggerganov
Member

fix #12815

When computing the attention with no KV cache in use (e.g. for embedding models with non-causal attention), the FA branch did not cast the K and V tensors, so they remained in F32 format. We now cast them to F16 to avoid having to add full FA support for F32 types.
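For illustration, here is a minimal sketch of the idea in the graph-building code — not the exact PR diff; names such as use_kv_cache, ctx0, q, k, v, kq_mask and kq_scale are placeholders — assuming ggml's public ggml_cast() and ggml_flash_attn_ext() API:

```c
// Sketch: when no KV cache is used, K and V come straight from the current
// batch and are still F32, so cast them to F16 before building the FA op.
if (!use_kv_cache) {
    if (k->type == GGML_TYPE_F32) {
        k = ggml_cast(ctx0, k, GGML_TYPE_F16);
    }
    if (v->type == GGML_TYPE_F32) {
        v = ggml_cast(ctx0, v, GGML_TYPE_F16);
    }
}

// kq_scale is typically 1/sqrt(head_dim); max_bias and logit_softcap left at 0.0f
struct ggml_tensor * cur = ggml_flash_attn_ext(ctx0, q, k, v, kq_mask, kq_scale, 0.0f, 0.0f);
```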

Also:

  • CPU FA now supports an F32 V-type (not used by default, but supported)
  • Add a server test that exercises embeddings with FA enabled

TODO:

We still allocate the KV cache for embedding models, even though it is not used by the computation graph. This should be fixed by refactoring llama-context to allow kv_self to not be constructed at all. Doing so would save a lot of VRAM and would improve embedding performance, because we would no longer search for KV cache slots for each batch. Might try to fix this in #12799.

@ggerganov ggerganov requested a review from ngxson as a code owner April 8, 2025 10:52
@github-actions github-actions bot added the examples, python (python script changes), server, ggml (changes relating to the ggml tensor library for machine learning), and Apple Metal (https://en.wikipedia.org/wiki/Metal_(API)) labels Apr 8, 2025
@ggerganov ggerganov merged commit a19b5ce into master Apr 8, 2025
62 checks passed
@ggerganov ggerganov deleted the gg/embd-fix-fa branch April 8, 2025 16:54
tastelikefeet added a commit to tastelikefeet/llama.cpp that referenced this pull request Apr 10, 2025
* master: (123 commits)
  cuda : add f32 to bf16 copy op (ggml-org#12806)
  llava: improve clip_ctx destructor to not memleak load_image_size (ggml-org#12834)
  llama : fix FA when KV cache is not used (i.e. embeddings) (ggml-org#12825)
  server : fix thread.join() on exit (ggml-org#12831)
  llava: add more helper functions to check projector types in clip context (ggml-org#12824)
  arg : Including limits file on AIX (ggml-org#12822)
  server : webui : Improve Chat Input with Auto-Sizing Textarea (ggml-org#12785)
  Revert "sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor" (ggml-org#12812)
  gguf-py : support lazy tensor splitting (ggml-org#12809)
  llama : Support llama 4 text-only (ggml-org#12791)
  opencl: better identify Adreno GPU (ggml-org#12760)
  hellaswag: display estimated score confidence interval (ggml-org#12797)
  cuda : fix HIP and MUSA BF16 (#0)
  sync : ggml
  ggml : simplify Arm fp16 CPU logic (ggml/1177)
  CUDA: don't convert BF16 weights to FP32 (ggml/1174)
  cpu: move all the operators into a separate c++ file (except mul_mat) (ggml/1167)
  sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor (ggml-org#12734)
  ci : no curl on ggml-ci (ggml-org#12796)
  cmake : enable curl by default (ggml-org#12761)
  ...

# Conflicts:
#	common/arg.cpp
#	common/common.cpp
#	common/common.h
colout pushed a commit to colout/llama.cpp that referenced this pull request Apr 21, 2025
…12825)

* ggml : FA supports F32 V

* graph : cast KV to F16 when the KV cache is not used

ggml-ci

* server : add test that exercises embeddings with FA enabled

ggml-ci
Successfully merging this pull request may close these issues.

Eval bug: GGML_ASSERT(q_to_vec_dot && "fattn: unsupported K-type") failed with Vulkan