metal: implement flash attention kernel for quantized KV cache #9735

FanShupei · 2024-10-04T07:54:39Z

This PR is mainly for discussion. The strategy and code quality is far from being merged.

To support quantized KV cache. I write a new FA kernel similar to kernel_flash_attn_ext_vec_f16 and add dequantization support. Since 'kernel_flash_attn_ext_vec_f16' use vce4 extensively thus forces D is at least 128. I write a new version using only scalars, then D is only required to be multiple of 32.

I only implement ctk = ctv = q8_0 as a proof of concept .The code is generic and support of other formats could be added easily.

measurement (before this PR)

model	size	params	backend	ngl	fa	test	t/s
phi3 3B Q4_K - Medium	2.23 GiB	3.82 B	Metal	99	1	pp64	269.44 ± 0.22
phi3 3B Q4_K - Medium	2.23 GiB	3.82 B	Metal	99	1	pp128	276.87 ± 0.08
phi3 3B Q4_K - Medium	2.23 GiB	3.82 B	Metal	99	1	pp512	280.58 ± 0.06
phi3 3B Q4_K - Medium	2.23 GiB	3.82 B	Metal	99	1	pp2048	264.88 ± 0.02
phi3 3B Q4_K - Medium	2.23 GiB	3.82 B	Metal	99	1	tg128	30.97 ± 0.03

measurement (after this PR)

For measurement purpose, this PR forces all FA code path uses the new 'kernel_flash_attn_ext_scalar_f16'.

I observe that prefill slows down severely in long input case (131tok/s vs 265tok/s when pp2048).

model	size	params	backend	ngl	fa	test	t/s
phi3 3B Q4_K - Medium	2.23 GiB	3.82 B	Metal	99	1	pp64	227.37 ± 1.04
phi3 3B Q4_K - Medium	2.23 GiB	3.82 B	Metal	99	1	pp128	233.13 ± 0.11
phi3 3B Q4_K - Medium	2.23 GiB	3.82 B	Metal	99	1	pp512	198.81 ± 0.09
phi3 3B Q4_K - Medium	2.23 GiB	3.82 B	Metal	99	1	pp2048	131.52 ± 0.03
phi3 3B Q4_K - Medium	2.23 GiB	3.82 B	Metal	99	1	tg128	31.00 ± 0.03

model	size	params	backend	ngl	type_k	type_v	fa	test	t/s
phi3 3B Q4_K - Medium	2.23 GiB	3.82 B	Metal	99	q8_0	q8_0	1	pp64	226.74 ± 0.95
phi3 3B Q4_K - Medium	2.23 GiB	3.82 B	Metal	99	q8_0	q8_0	1	pp128	231.97 ± 0.21
phi3 3B Q4_K - Medium	2.23 GiB	3.82 B	Metal	99	q8_0	q8_0	1	pp512	193.76 ± 0.05
phi3 3B Q4_K - Medium	2.23 GiB	3.82 B	Metal	99	q8_0	q8_0	1	pp2048	126.06 ± 0.01
phi3 3B Q4_K - Medium	2.23 GiB	3.82 B	Metal	99	q8_0	q8_0	1	tg128	30.84 ± 0.01

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

FanShupei added 2 commits October 3, 2024 23:02

[metal-kernel] add flash_attn_ext_scalar_f16 implementation

9e62e7e

[metal] (HACK!!!) force use kernel_flash_attn_ext_scalar_f16 in FA

d436f5b

FanShupei mentioned this pull request Oct 4, 2024

Feature Request: [metal] implement FA kernels for quantized KV cache #9736

Closed

4 tasks

ggerganov mentioned this pull request Nov 4, 2024

metal : add quantized FA support #10149

Merged

2 tasks

FanShupei closed this Nov 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

metal: implement flash attention kernel for quantized KV cache #9735

metal: implement flash attention kernel for quantized KV cache #9735

FanShupei commented Oct 4, 2024

metal: implement flash attention kernel for quantized KV cache #9735

metal: implement flash attention kernel for quantized KV cache #9735

Conversation

FanShupei commented Oct 4, 2024

measurement (before this PR)

measurement (after this PR)