
CUDA: fix Pascal FA, deq. KV to FP16 for batch > 8 #7681


Merged

Conversation

JohannesGaessler (Collaborator) commented on Jun 1, 2024

The check against `__CUDA_ARCH__` for the quantized KV cache is incorrect (`>` instead of `>=`); this PR fixes it.
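For readers unfamiliar with the issue, here is a minimal sketch of that kind of off-by-one guard; the macro name and threshold are hypothetical placeholders, not the actual identifiers in `ggml-cuda/fattn.cu`:

```cpp
// Hypothetical sketch only: MIN_CC_QUANTIZED_KV stands in for whichever compute
// capability constant the real check compares against (Pascal is CC 6.x, i.e. 600/610).
#define MIN_CC_QUANTIZED_KV 600 // example value, not taken from the actual source

__device__ __forceinline__ bool quantized_kv_supported() {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= MIN_CC_QUANTIZED_KV
    // '>=' includes the minimum architecture; the previous '>' excluded exactly
    // that architecture, so GPUs at the minimum compute capability missed this path.
    return true;
#else
    return false;
#endif
}
```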

It also implements dequantization of the KV cache to FP16 for large batch sizes in order to make use of the kernels optimized for large batches. I didn't include this in #7527 because I thought it would cause issues in some cases, but it seems to be a straight upgrade:

| GPU | Model | Microbatch size | Test | t/s vec dot | t/s dequantize | Speedup | VRAM vec dot [MiB] | VRAM dequantize [MiB] | Difference [MiB] |
|---|---|---|---|---|---|---|---|---|---|
| RTX 4090 | llama 8B Q4_0 | 16 | pp4096 | 690.25 | 1262.91 | 1.83 | 4698 | 4714 | 16 |
| RTX 4090 | llama 8B Q4_0 | 32 | pp4096 | 1019.16 | 1487.62 | 1.46 | 4706 | 4722 | 16 |
| RTX 4090 | llama 8B Q4_0 | 64 | pp4096 | 1364.91 | 1756.76 | 1.29 | 4848 | 4848 | 0 |
| RTX 4090 | llama 8B Q4_0 | 128 | pp4096 | 2153.01 | 3354.63 | 1.56 | 4882 | 4882 | 0 |
| RTX 4090 | llama 8B Q4_0 | 256 | pp4096 | 2835.49 | 5488.44 | 1.94 | 4950 | 4950 | 0 |
| RTX 4090 | llama 8B Q4_0 | 512 | pp4096 | 3159.38 | 7359.91 | 2.33 | 5088 | 5088 | 0 |
| RTX 4090 | llama 8B Q4_0 | 1024 | pp4096 | 3183.10 | 8578.46 | 2.70 | 5364 | 5364 | 0 |
| P40 | Mistral 7b Q4_0 | 16 | pp4096 | 100.54 | 176.49 | 1.76 | 4146 | 4162 | 16 |
| P40 | Mistral 7b Q4_0 | 32 | pp4096 | 149.40 | 333.25 | 2.23 | 4148 | 4166 | 18 |
| P40 | Mistral 7b Q4_0 | 64 | pp4096 | 208.42 | 527.25 | 2.53 | 4154 | 4168 | 14 |
| P40 | Mistral 7b Q4_0 | 128 | pp4096 | 234.56 | 639.55 | 2.73 | 4164 | 4178 | 14 |
| P40 | Mistral 7b Q4_0 | 256 | pp4096 | 243.69 | 725.21 | 2.98 | 4188 | 4200 | 12 |
| P40 | Mistral 7b Q4_0 | 512 | pp4096 | 239.61 | 766.93 | 3.20 | 4234 | 4242 | 8 |
| P40 | Mistral 7b Q4_0 | 1024 | pp4096 | 223.96 | 760.20 | 3.39 | 4326 | 4326 | 0 |

Even at a batch size of 16, dequantizing the KV cache is already faster, and the increase in VRAM usage is negligible.

Edit: the numbers above are for `-ctk q4_0 -ctv q4_0`.
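To illustrate what dequantizing the KV cache to FP16 means in practice, here is a standalone sketch of a Q4_0 → FP16 dequantization kernel. It follows ggml's Q4_0 block layout (32 4-bit quants plus one FP16 scale per block), but the names and launch configuration are mine, not the code added by this PR:

```cpp
#include <cuda_fp16.h>
#include <cstdint>

#define QK4_0 32

// Mirrors the Q4_0 block layout: one FP16 scale, then 32 4-bit quants packed two per byte.
struct block_q4_0 {
    __half  d;
    uint8_t qs[QK4_0 / 2];
};

// One thread per element: expand the quantized KV cache into a temporary FP16 buffer
// so that the FP16 kernels optimized for large batch sizes can be used.
__global__ void dequantize_q4_0_to_f16(const block_q4_0 * __restrict__ x,
                                       __half * __restrict__ y, const int64_t n) {
    const int64_t i = (int64_t) blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= n) {
        return;
    }

    const int64_t ib = i / QK4_0; // block index
    const int     iq = i % QK4_0; // position within the block

    const float   d  = __half2float(x[ib].d);
    const uint8_t q8 = x[ib].qs[iq % (QK4_0/2)];
    const int     q  = iq < QK4_0/2 ? (q8 & 0x0F) : (q8 >> 4); // low/high nibble

    y[i] = __float2half((q - 8)*d); // Q4_0: value = (quant - 8) * scale
}

// Example host-side launch: dequantize_q4_0_to_f16<<<(n + 255)/256, 256>>>(x, y, n);
```

The temporary FP16 copies account for the small VRAM differences in the table above; for large enough batches the cost of this extra pass is more than recovered by the faster FP16 kernels.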

github-actions bot added the Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jun 1, 2024
slaren (Member) commented on Jun 1, 2024

```
ggml-cuda/fattn.cu(301): warning #177-D: variable "K" was declared but never referenced
      const ggml_tensor * K = dst->src[1];
                          ^
ggml-cuda/fattn.cu(302): warning #177-D: variable "V" was declared but never referenced
      const ggml_tensor * V = dst->src[2];
```
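(For context, warning #177-D only flags a local that is declared but never read; the usual remedy is to delete the declaration or explicitly discard it, roughly as in this hypothetical sketch, which is not the actual fattn.cu change:)

```cpp
// Standalone illustration of warning #177-D and two common remedies (hypothetical code).
static int example(const int ** src) {
    const int * K = src[1]; // "declared but never referenced" -> warning #177-D
    (void) K;               // remedy 1: explicitly discard it to silence the warning
    // remedy 2: simply delete the declaration if the value is genuinely unneeded
    return 0;
}
```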

JohannesGaessler merged commit 750f60c into ggml-org:master on Jun 1, 2024
58 of 66 checks passed