
CUDA: fix Pascal FA, deq. KV to FP16 for batch > 8 #7681


Merged

Conversation

JohannesGaessler (Collaborator) commented on Jun 1, 2024

The check against `__CUDA_ARCH__` for the quantized KV cache is incorrect (`>` instead of `>=`); this PR fixes it.
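For readers unfamiliar with the issue, here is a minimal sketch of that kind of off-by-one guard; the macro name and threshold are hypothetical placeholders, not the actual identifiers in `ggml-cuda/fattn.cu`:

```cpp
// Hypothetical sketch only: MIN_CC_QUANTIZED_KV stands in for whichever compute
// capability constant the real check compares against (Pascal is CC 6.x, i.e. 600/610).
#define MIN_CC_QUANTIZED_KV 600 // example value, not taken from the actual source

__device__ __forceinline__ bool quantized_kv_supported() {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= MIN_CC_QUANTIZED_KV
    // '>=' includes the minimum architecture; the previous '>' excluded exactly
    // that architecture, so GPUs at the minimum compute capability missed this path.
    return true;
#else
    return false;
#endif
}
```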

It also implements dequantization of the KV cache to FP16 for large batch sizes in order to make use of the kernels optimized for large batches. I didn't include this in #7527 because I thought it would cause issues in some cases, but it seems to be a straight upgrade:

| GPU | Model | Microbatch size | Test | t/s vec dot | t/s dequantize | Speedup | VRAM vec dot [MiB] | VRAM dequantize [MiB] | Difference [MiB] |
|---|---|---|---|---|---|---|---|---|---|
| RTX 4090 | llama 8B Q4_0 | 16 | pp4096 | 690.25 | 1262.91 | 1.83 | 4698 | 4714 | 16 |
| RTX 4090 | llama 8B Q4_0 | 32 | pp4096 | 1019.16 | 1487.62 | 1.46 | 4706 | 4722 | 16 |
| RTX 4090 | llama 8B Q4_0 | 64 | pp4096 | 1364.91 | 1756.76 | 1.29 | 4848 | 4848 | 0 |
| RTX 4090 | llama 8B Q4_0 | 128 | pp4096 | 2153.01 | 3354.63 | 1.56 | 4882 | 4882 | 0 |
| RTX 4090 | llama 8B Q4_0 | 256 | pp4096 | 2835.49 | 5488.44 | 1.94 | 4950 | 4950 | 0 |
| RTX 4090 | llama 8B Q4_0 | 512 | pp4096 | 3159.38 | 7359.91 | 2.33 | 5088 | 5088 | 0 |
| RTX 4090 | llama 8B Q4_0 | 1024 | pp4096 | 3183.10 | 8578.46 | 2.70 | 5364 | 5364 | 0 |
| P40 | Mistral 7b Q4_0 | 16 | pp4096 | 100.54 | 176.49 | 1.76 | 4146 | 4162 | 16 |
| P40 | Mistral 7b Q4_0 | 32 | pp4096 | 149.40 | 333.25 | 2.23 | 4148 | 4166 | 18 |
| P40 | Mistral 7b Q4_0 | 64 | pp4096 | 208.42 | 527.25 | 2.53 | 4154 | 4168 | 14 |
| P40 | Mistral 7b Q4_0 | 128 | pp4096 | 234.56 | 639.55 | 2.73 | 4164 | 4178 | 14 |
| P40 | Mistral 7b Q4_0 | 256 | pp4096 | 243.69 | 725.21 | 2.98 | 4188 | 4200 | 12 |
| P40 | Mistral 7b Q4_0 | 512 | pp4096 | 239.61 | 766.93 | 3.20 | 4234 | 4242 | 8 |
| P40 | Mistral 7b Q4_0 | 1024 | pp4096 | 223.96 | 760.20 | 3.39 | 4326 | 4326 | 0 |

Even at a batch size of 16, dequantizing the KV cache is already faster, and the increase in VRAM usage is negligible.

Edit: the numbers above are for `-ctk q4_0 -ctv q4_0`.
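To illustrate what dequantizing the KV cache to FP16 means in practice, here is a standalone sketch of a Q4_0 → FP16 dequantization kernel. It follows ggml's Q4_0 block layout (32 4-bit quants plus one FP16 scale per block), but the names and launch configuration are mine, not the code added by this PR:

```cpp
#include <cuda_fp16.h>
#include <cstdint>

#define QK4_0 32

// Mirrors the Q4_0 block layout: one FP16 scale, then 32 4-bit quants packed two per byte.
struct block_q4_0 {
    __half  d;
    uint8_t qs[QK4_0 / 2];
};

// One thread per element: expand the quantized KV cache into a temporary FP16 buffer
// so that the FP16 kernels optimized for large batch sizes can be used.
__global__ void dequantize_q4_0_to_f16(const block_q4_0 * __restrict__ x,
                                       __half * __restrict__ y, const int64_t n) {
    const int64_t i = (int64_t) blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= n) {
        return;
    }

    const int64_t ib = i / QK4_0; // block index
    const int     iq = i % QK4_0; // position within the block

    const float   d  = __half2float(x[ib].d);
    const uint8_t q8 = x[ib].qs[iq % (QK4_0/2)];
    const int     q  = iq < QK4_0/2 ? (q8 & 0x0F) : (q8 >> 4); // low/high nibble

    y[i] = __float2half((q - 8)*d); // Q4_0: value = (quant - 8) * scale
}

// Example host-side launch: dequantize_q4_0_to_f16<<<(n + 255)/256, 256>>>(x, y, n);
```

The temporary FP16 copies account for the small VRAM differences in the table above; for large enough batches the cost of this extra pass is more than recovered by the faster FP16 kernels.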

github-actions bot added the Nvidia GPU (issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jun 1, 2024
slaren (Member) commented on Jun 1, 2024

```
ggml-cuda/fattn.cu(301): warning #177-D: variable "K" was declared but never referenced
      const ggml_tensor * K = dst->src[1];
                          ^
ggml-cuda/fattn.cu(302): warning #177-D: variable "V" was declared but never referenced
      const ggml_tensor * V = dst->src[2];
```
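(For context, warning #177-D only flags a local that is declared but never read; the usual remedy is to delete the declaration or explicitly discard it, roughly as in this hypothetical sketch, which is not the actual fattn.cu change:)

```cpp
// Standalone illustration of warning #177-D and two common remedies (hypothetical code).
static int example(const int ** src) {
    const int * K = src[1]; // "declared but never referenced" -> warning #177-D
    (void) K;               // remedy 1: explicitly discard it to silence the warning
    // remedy 2: simply delete the declaration if the value is genuinely unneeded
    return 0;
}
```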

JohannesGaessler merged commit 750f60c into ggml-org:master on Jun 1, 2024
58 of 66 checks passed