cuda : add f32 to bf16 copy op #12806


Merged 2 commits into ggml-org:master on Apr 8, 2025
Conversation

@CISC (Collaborator) commented Apr 7, 2025

This allows BF16 KV-cache on CUDA.

Full disclosure: I noticed this in the ik_llama.cpp repo, but this is not an upstreamed port of their code; it was a simple feature to add.

Originally submitted as ggml-org/ggml#1182 but moved here to avoid sync conflicts.

@CISC requested a review from @JohannesGaessler on April 7, 2025 20:21
@github-actions bot added the Nvidia GPU and ggml labels on Apr 7, 2025
@JohannesGaessler (Collaborator)

Did you look at any IK code to figure out how to do the implementation, or did you just see that the repository has the feature and then figure out the implementation completely on your own?

@CISC (Collaborator, Author) commented Apr 8, 2025

> Did you look at any IK code to figure out how to do the implementation, or did you just see that the repository has the feature and then figure out the implementation completely on your own?

I quite likely looked at it at some point, as I went through all the PRs, but since I now knew this might be an issue I decided to do this by simply trying `-ctk bf16 -ctv bf16`, which threw an OP_CPY error; it turned out to be rather simple from there.

@CISC (Collaborator, Author) commented Apr 8, 2025

What is obviously missing here, BTW, is an F16 to BF16 copy, but I don't know what, if anything, ever triggers that.

@jukofyork (Collaborator)

Have you noticed any benefit to using a BF16 KV-cache? I wanted to try this before but ran into the same copy error, so I'm interested to try it now and see what difference it makes on CUDA.

@CISC (Collaborator, Author) commented Apr 8, 2025

> Have you noticed any benefit to using a BF16 KV-cache? I wanted to try this before but ran into the same copy error, so I'm interested to try it now and see what difference it makes on CUDA.

I didn't run any benchmarks, but it should theoretically improve PPL compared to F16, at least on certain models. Speed-wise it can swing both ways depending on hardware, I guess.

@CISC merged commit 7538246 into ggml-org:master on Apr 8, 2025
51 checks passed
@CISC deleted the cuda-bf16-kv-cache branch on April 8, 2025 21:21
tastelikefeet added a commit to tastelikefeet/llama.cpp that referenced this pull request Apr 10, 2025
* master: (123 commits)
  cuda : add f32 to bf16 copy op (ggml-org#12806)
  llava: improve clip_ctx destructor to not memleak load_image_size (ggml-org#12834)
  llama : fix FA when KV cache is not used (i.e. embeddings) (ggml-org#12825)
  server : fix thread.join() on exit (ggml-org#12831)
  llava: add more helper functions to check projector types in clip context (ggml-org#12824)
  arg : Including limits file on AIX (ggml-org#12822)
  server : webui : Improve Chat Input with Auto-Sizing Textarea (ggml-org#12785)
  Revert "sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor" (ggml-org#12812)
  gguf-py : support lazy tensor splitting (ggml-org#12809)
  llama : Support llama 4 text-only (ggml-org#12791)
  opencl: better identify Adreno GPU (ggml-org#12760)
  hellaswag: display estimated score confidence interval (ggml-org#12797)
  cuda : fix HIP and MUSA BF16 (#0)
  sync : ggml
  ggml : simplify Arm fp16 CPU logic (ggml/1177)
  CUDA: don't convert BF16 weights to FP32 (ggml/1174)
  cpu: move all the operators into a separate c++ file (except mul_mat) (ggml/1167)
  sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor (ggml-org#12734)
  ci : no curl on ggml-ci (ggml-org#12796)
  cmake : enable curl by default (ggml-org#12761)
  ...

# Conflicts:
#	common/arg.cpp
#	common/common.cpp
#	common/common.h
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Apr 11, 2025
This allows BF16 KV-cache on CUDA.
colout pushed a commit to colout/llama.cpp that referenced this pull request Apr 21, 2025
This allows BF16 KV-cache on CUDA.