cuda : add f32 to bf16 copy op #12806
Conversation
Did you look at any IK code to figure out the implementation, or did you just see that the repository has the feature and then work out the implementation completely on your own?
I quite likely looked at it at some point, as I went through all the PRs, but since I now knew this might be an issue I decided to do this by simply trying.
What is obviously missing here, BTW, is an F16 to BF16 copy, but I don't know whether or when that is ever triggered?
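(For what it's worth, an F16 to BF16 copy would presumably widen through F32, since CUDA offers no direct half-to-bfloat16 intrinsic that I know of. A hypothetical sketch; the kernel name and contiguous layout are my assumptions, not code from this PR:)

```cuda
#include <cuda_fp16.h>
#include <cuda_bf16.h>
#include <cstdint>

// Hypothetical F16 -> BF16 copy: widen each half to float, then narrow
// to bfloat16. Non-contiguous strides are omitted for brevity.
__global__ void cpy_f16_to_bf16(const __half * src, __nv_bfloat16 * dst, int64_t n) {
    const int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = __float2bfloat16(__half2float(src[i]));
    }
}
```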
Have you noticed any benefit to using a BF16 KV cache?
Co-authored-by: Johannes Gäßler <[email protected]>
I didn't run any benchmarks, but it should theoretically improve PPL compared to F16, at least on certain models. Speed-wise it can swing both ways depending on hardware, I guess.
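(Background, not from the thread: BF16 keeps F32's sign bit and all 8 exponent bits and truncates the mantissa to 7 bits, so it covers F32's full dynamic range, while F16 has only 5 exponent bits; that range headroom is the usual reason BF16 can help PPL. A quick host-side illustration of the layout:)

```cuda
#include <cstdint>
#include <cstring>

// Round-toward-zero F32 -> BF16 on the host: BF16 is simply the top 16
// bits of an F32 (sign + 8 exponent bits + 7 mantissa bits), which is why
// it preserves F32's exponent range. (Hardware conversion rounds instead.)
static uint16_t f32_to_bf16_trunc(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // reinterpret the float's raw bits
    return (uint16_t)(bits >> 16);
}
```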
* master: (123 commits)
  cuda : add f32 to bf16 copy op (ggml-org#12806)
  llava: improve clip_ctx destructor to not memleak load_image_size (ggml-org#12834)
  llama : fix FA when KV cache is not used (i.e. embeddings) (ggml-org#12825)
  server : fix thread.join() on exit (ggml-org#12831)
  llava: add more helper functions to check projector types in clip context (ggml-org#12824)
  arg : Including limits file on AIX (ggml-org#12822)
  server : webui : Improve Chat Input with Auto-Sizing Textarea (ggml-org#12785)
  Revert "sycl:remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor" (ggml-org#12812)
  gguf-py : support lazy tensor splitting (ggml-org#12809)
  llama : Support llama 4 text-only (ggml-org#12791)
  opencl: better identify Adreno GPU (ggml-org#12760)
  hellaswag: display estimated score confidence interval (ggml-org#12797)
  cuda : fix HIP and MUSA BF16 (#0)
  sync : ggml
  ggml : simplify Arm fp16 CPU logic (ggml/1177)
  CUDA: don't convert BF16 weights to FP32 (ggml/1174)
  cpu: move all the operators into a separate c++ file (except mul_mat) (ggml/1167)
  sycl: remove redundant memcopy in function ggml_backend_sycl_buffer_set_tensor (ggml-org#12734)
  ci : no curl on ggml-ci (ggml-org#12796)
  cmake : enable curl by default (ggml-org#12761)
  ...

# Conflicts:
#	common/arg.cpp
#	common/common.cpp
#	common/common.h
This allows BF16 KV-cache on CUDA.
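As a rough illustration of what such a copy op boils down to (a minimal sketch only; the actual PR extends ggml's existing CUDA cpy kernels, and the kernel name, launch shape, and contiguous-only layout here are my assumptions):

```cuda
#include <cuda_bf16.h>
#include <cstdint>

// Minimal element-wise F32 -> BF16 copy for a contiguous tensor.
// The real ggml op also handles non-contiguous strides, omitted here.
__global__ void cpy_f32_to_bf16(const float * src, __nv_bfloat16 * dst, int64_t n) {
    const int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // __float2bfloat16 converts with round-to-nearest-even.
        dst[i] = __float2bfloat16(src[i]);
    }
}

// Launch with one thread per element, e.g.:
// cpy_f32_to_bf16<<<(n + 255) / 256, 256>>>(src, dst, n);
```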
Full disclosure: I noticed this in the ik_llama.cpp repo, but this is not upstreamed code; it was a simple feature to add. Originally submitted as ggml-org/ggml#1182 but moved here to avoid sync conflicts.