cuda : add f32 to bf16 copy op #12806
Conversation
Did you look at any IK code to figure out the implementation, or did you just see that the repository has the feature and then work out the implementation completely on your own?
I quite likely looked at it at some point, as I went through all the PRs, but since I now knew this might be an issue I decided to do this by simply trying.
What is obviously missing here, BTW, is an F16 to BF16 copy, but I don't know whether or when that is ever triggered?
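(For what it's worth, an F16 to BF16 copy would presumably widen through F32, since CUDA offers no direct half-to-bfloat16 intrinsic that I know of. A hypothetical sketch; the kernel name and contiguous layout are my assumptions, not code from this PR:)

```cuda
#include <cuda_fp16.h>
#include <cuda_bf16.h>
#include <cstdint>

// Hypothetical F16 -> BF16 copy: widen each half to float, then narrow
// to bfloat16. Non-contiguous strides are omitted for brevity.
__global__ void cpy_f16_to_bf16(const __half * src, __nv_bfloat16 * dst, int64_t n) {
    const int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = __float2bfloat16(__half2float(src[i]));
    }
}
```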
Have you noticed any benefit to using a BF16 KV cache?
Co-authored-by: Johannes Gäßler <[email protected]>
I didn't run any benchmarks, but it should theoretically improve PPL compared to F16, at least on certain models. Speed-wise it can swing both ways depending on hardware, I guess.
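(Background, not from the thread: BF16 keeps F32's sign bit and all 8 exponent bits and truncates the mantissa to 7 bits, so it covers F32's full dynamic range, while F16 has only 5 exponent bits; that range headroom is the usual reason BF16 can help PPL. A quick host-side illustration of the layout:)

```cuda
#include <cstdint>
#include <cstring>

// Round-toward-zero F32 -> BF16 on the host: BF16 is simply the top 16
// bits of an F32 (sign + 8 exponent bits + 7 mantissa bits), which is why
// it preserves F32's exponent range. (Hardware conversion rounds instead.)
static uint16_t f32_to_bf16_trunc(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // reinterpret the float's raw bits
    return (uint16_t)(bits >> 16);
}
```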
* master: (123 commits)
  cuda : add f32 to bf16 copy op (ggml-org#12806)
  llava: improve clip_ctx destructor to not memleak load_image_size (ggml-org#12834)
  llama : fix FA when KV cache is not used (i.e. embeddings) (ggml-org#12825)
  server : fix thread.join() on exit (ggml-org#12831)
  llava: add more helper functions to check projector types in clip context (ggml-org#12824)
  arg : Including limits file on AIX (ggml-org#12822)
  server : webui : Improve Chat Input with Auto-Sizing Textarea (ggml-org#12785)
  Revert "sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor" (ggml-org#12812)
  gguf-py : support lazy tensor splitting (ggml-org#12809)
  llama : Support llama 4 text-only (ggml-org#12791)
  opencl: better identify Adreno GPU (ggml-org#12760)
  hellaswag: display estimated score confidence interval (ggml-org#12797)
  cuda : fix HIP and MUSA BF16 (#0)
  sync : ggml
  ggml : simplify Arm fp16 CPU logic (ggml/1177)
  CUDA: don't convert BF16 weights to FP32 (ggml/1174)
  cpu: move all the operators into a separate c++ file (except mul_mat) (ggml/1167)
  sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor (ggml-org#12734)
  ci : no curl on ggml-ci (ggml-org#12796)
  cmake : enable curl by default (ggml-org#12761)
  ...

# Conflicts:
#	common/arg.cpp
#	common/common.cpp
#	common/common.h
This allows BF16 KV-cache on CUDA.
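As a rough illustration of what such a copy op boils down to (a minimal sketch only; the actual PR extends ggml's existing CUDA cpy kernels, and the kernel name, launch shape, and contiguous-only layout here are my assumptions):

```cuda
#include <cuda_bf16.h>
#include <cstdint>

// Minimal element-wise F32 -> BF16 copy for a contiguous tensor.
// The real ggml op also handles non-contiguous strides, omitted here.
__global__ void cpy_f32_to_bf16(const float * src, __nv_bfloat16 * dst, int64_t n) {
    const int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // __float2bfloat16 converts with round-to-nearest-even.
        dst[i] = __float2bfloat16(src[i]);
    }
}

// Launch with one thread per element, e.g.:
// cpy_f32_to_bf16<<<(n + 255) / 256, 256>>>(src, dst, n);
```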
Full disclosure: I noticed this in the ik_llama.cpp repo, but this is not upstreamed code; it was a simple feature to add. Originally submitted as ggml-org/ggml#1182 but moved here to avoid sync conflicts.