llama : fix FA when KV cache is not used (i.e. embeddings) #12825
fix #12815
When computing the attention and the KV cache is not used (e.g. for embedding models with non-causal attention), the FA branch did not cast the K and V tensors, so they remained in F32 format. We now cast them to F16 to avoid adding full FA support for F32 types.
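A minimal sketch of the idea in ggml terms: if the K/V tensors feeding flash attention are still F32 (the no-KV-cache path), cast them to F16 before calling `ggml_flash_attn_ext`. The `build_attn_fa` wrapper and the tensor names are illustrative, not the actual llama.cpp code:

```cpp
#include "ggml.h"

// Sketch: without a KV cache, k_cur/v_cur come straight from the graph in F32.
// The FA kernels expect F16 K/V, so cast them before building the FA node.
static struct ggml_tensor * build_attn_fa(
        struct ggml_context * ctx,
        struct ggml_tensor  * q,      // query tensor
        struct ggml_tensor  * k_cur,  // F32 when not read from the KV cache
        struct ggml_tensor  * v_cur,  // F32 when not read from the KV cache
        struct ggml_tensor  * mask,   // attention mask
        float                 kq_scale) {
    if (k_cur->type == GGML_TYPE_F32) {
        k_cur = ggml_cast(ctx, k_cur, GGML_TYPE_F16);
    }
    if (v_cur->type == GGML_TYPE_F32) {
        v_cur = ggml_cast(ctx, v_cur, GGML_TYPE_F16);
    }

    // flash-attention path; max_bias and logit_softcap disabled (0.0f)
    return ggml_flash_attn_ext(ctx, q, k_cur, v_cur, mask, kq_scale, 0.0f, 0.0f);
}
```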
Also:
TODO:
We still allocate the KV cache for embedding models, even though it is not used by the computation graph. This should be fixed by refactoring `llama-context` to allow `kv_self` to not be constructed. This would save a lot of VRAM and improve embedding performance, because we would no longer search for KV cache slots for each batch. Might try to fix this in #12799. A rough sketch of that direction is shown below.
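For reference, a rough C++ sketch of what making `kv_self` optional could look like; `llama_context_sketch` and `llama_kv_cache_sketch` are hypothetical names, not the real `llama-context` types:

```cpp
#include <memory>

// Hypothetical stand-in for the real KV cache type.
struct llama_kv_cache_sketch { /* buffers, cells, ... */ };

struct llama_context_sketch {
    // Constructed only for models that actually need a KV cache; embedding-only
    // contexts leave it null, saving the cache VRAM and skipping the
    // per-batch KV slot search entirely.
    std::unique_ptr<llama_kv_cache_sketch> kv_self;

    bool has_kv() const { return kv_self != nullptr; }
};
```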