ggml: use _mm[512/256]_dpbusd[_avx]_epi32 to directly accumulate into the result register #12773


Merged — 2 commits merged into ggml-org:master on Apr 14, 2025

Conversation

@SongXiaoXi (Contributor):

  • Replaced the traditional accumulate-to-zero pattern with direct accumulation into the output register (see the sketch below)
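
For illustration, a minimal sketch of the change using the 256-bit form (hypothetical helper names, not the PR's exact code; `_mm256_dpbusd_epi32` requires AVX512VNNI + AVX512VL):

```c
#include <immintrin.h>

// Before: vpdpbusd starts from a zeroed register, so every step also
// needs a separate vpaddd to fold the partial sums into the accumulator.
static inline __m256i dot_acc_before(__m256i acc, __m256i a, __m256i b) {
    const __m256i partial = _mm256_dpbusd_epi32(_mm256_setzero_si256(), a, b);
    return _mm256_add_epi32(acc, partial);
}

// After: vpdpbusd accumulates directly into the live result register,
// eliminating the zeroing and the extra add on every iteration.
static inline __m256i dot_acc_after(__m256i acc, __m256i a, __m256i b) {
    return _mm256_dpbusd_epi32(acc, a, b);
}
```

This works because `vpdpbusd` multiplies unsigned 8-bit values in `a` by signed 8-bit values in `b`, sums each group of four products, and adds the result to the corresponding 32-bit lane of its first operand, so the accumulator can be threaded straight through the loop.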

Benchmark (q4_0 model):

Platform: AMD Ryzen 9 9950X (Zen 5, AVX512VNNI)
Before:

Final estimate: PPL = 8.0754 +/- 0.05511

llama_perf_context_print:        load time =    1175.91 ms
llama_perf_context_print: prompt eval time = 1339282.64 ms / 288768 tokens (    4.64 ms per token,   215.61 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time = 1341975.38 ms / 288769 tokens

After:

Final estimate: PPL = 8.0754 +/- 0.05511

llama_perf_context_print:        load time =    1181.06 ms
llama_perf_context_print: prompt eval time = 1184860.36 ms / 288768 tokens (    4.10 ms per token,   243.71 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time = 1187563.38 ms / 288769 tokens

This demonstrates a real-world speedup of roughly 12% in prompt evaluation time (215.6 → 243.7 tokens per second, about 13% higher throughput), with no impact on accuracy: the perplexity is identical.

Please let me know if:

  • Any naming conventions need to be adjusted
  • Additional tests are desired

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Apr 5, 2025
@ggerganov ggerganov requested a review from Srihari-mcw April 6, 2025 08:31
@SongXiaoXi (Contributor, Author):

Hi @Srihari-mcw! Let me know if there’s anything I can do to help move this forward. Thanks again!

@Srihari-mcw (Collaborator):

Will try to test this and get back. Thanks

@Srihari-mcw (Collaborator) commented Apr 10, 2025:

We tested the latest changes with llama-bench on an AMD Granite Ridge 9600X and observed good performance gains.

GCC, Linux:

Q4_0_8_8 model:

| model         | size     | params | backend | threads | test   | t/s           | speedup |
| ------------- | -------- | ------ | ------- | ------- | ------ | ------------- | ------- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     | 6       | pp 512 | 103.90 ± 0.15 |         |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     | 6       | pp 512 | 125.90 ± 0.33 | 21.17%  |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     | 6       | tg 128 | 15.07 ± 0.00  |         |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     | 6       | tg 128 | 15.09 ± 0.00  | 0.13%   |

GCC Version = 12.3

The machine supports the following flags by default:

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

The PR looks good from our side. Thanks
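
As background on the bracketed intrinsic names in the title: `_mm256_dpbusd_epi32` is the AVX512VNNI + AVX512VL form, while `_mm256_dpbusd_avx_epi32` is the VEX-encoded AVX-VNNI form (the `AVX_VNNI` flag above). A minimal compile-time dispatch sketch (hypothetical helper name, not the PR's actual code):

```c
#include <immintrin.h>

// Pick the 256-bit dpbusd variant that the build targets, falling back
// to the classic maddubs/madd pattern when no VNNI extension is available.
static inline __m256i mul_sum_us8_acc(__m256i acc, __m256i ax, __m256i sy) {
#if defined(__AVX512VNNI__) && defined(__AVX512VL__)
    return _mm256_dpbusd_epi32(acc, ax, sy);
#elif defined(__AVXVNNI__)
    return _mm256_dpbusd_avx_epi32(acc, ax, sy);
#else
    // Widen u8*s8 pairs to saturated 16-bit sums, then to 32-bit lanes.
    const __m256i dot = _mm256_maddubs_epi16(ax, sy);
    return _mm256_add_epi32(acc, _mm256_madd_epi16(dot, _mm256_set1_epi16(1)));
#endif
}
```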

@SongXiaoXi
Copy link
Contributor Author

Hey, just a gentle ping. Let me know if there's anything blocking the merge! 👀

@ggerganov ggerganov merged commit e959d32 into ggml-org:master Apr 14, 2025
47 of 51 checks passed
colout pushed a commit to colout/llama.cpp that referenced this pull request Apr 21, 2025
ggml: use _mm[512/256]_dpbusd[_avx]_epi32 to directly accumulate into the result register (ggml-org#12773)

* ggml: use _mm[512/256]_dpbusd[_avx]_epi32 to directly accumulate into the result register

* simplifies the codebase by removing redundant functions