ggml: use _mm[512/256]_dpbusd[_avx]_epi32 to directly accumulate into the result register #12773


Merged — 2 commits merged into ggml-org:master on Apr 14, 2025

Conversation

@SongXiaoXi (Contributor):

  • Replaced the traditional accumulate-to-zero pattern with direct accumulation into the output register (see the sketch below)
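
For illustration, a minimal sketch of the change using the 256-bit form (hypothetical helper names, not the PR's exact code; `_mm256_dpbusd_epi32` requires AVX512VNNI + AVX512VL):

```c
#include <immintrin.h>

// Before: vpdpbusd starts from a zeroed register, so every step also
// needs a separate vpaddd to fold the partial sums into the accumulator.
static inline __m256i dot_acc_before(__m256i acc, __m256i a, __m256i b) {
    const __m256i partial = _mm256_dpbusd_epi32(_mm256_setzero_si256(), a, b);
    return _mm256_add_epi32(acc, partial);
}

// After: vpdpbusd accumulates directly into the live result register,
// eliminating the zeroing and the extra add on every iteration.
static inline __m256i dot_acc_after(__m256i acc, __m256i a, __m256i b) {
    return _mm256_dpbusd_epi32(acc, a, b);
}
```

This works because `vpdpbusd` multiplies unsigned 8-bit values in `a` by signed 8-bit values in `b`, sums each group of four products, and adds the result to the corresponding 32-bit lane of its first operand, so the accumulator can be threaded straight through the loop.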

Benchmark (q4_0 model):

Platform: AMD Ryzen 9 9950X (Zen 5, AVX512VNNI)
Before:

Final estimate: PPL = 8.0754 +/- 0.05511

llama_perf_context_print:        load time =    1175.91 ms
llama_perf_context_print: prompt eval time = 1339282.64 ms / 288768 tokens (    4.64 ms per token,   215.61 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time = 1341975.38 ms / 288769 tokens

After:

Final estimate: PPL = 8.0754 +/- 0.05511

llama_perf_context_print:        load time =    1181.06 ms
llama_perf_context_print: prompt eval time = 1184860.36 ms / 288768 tokens (    4.10 ms per token,   243.71 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time = 1187563.38 ms / 288769 tokens

This demonstrates a real-world speedup of roughly 12% in prompt evaluation time (215.6 → 243.7 tokens per second, about 13% higher throughput), with no impact on accuracy: the perplexity is identical.

Please let me know if:

  • Any naming conventions need to be adjusted
  • Additional tests are desired

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Apr 5, 2025
@ggerganov ggerganov requested a review from Srihari-mcw April 6, 2025 08:31
@SongXiaoXi (Contributor, Author):

Hi @Srihari-mcw! Let me know if there’s anything I can do to help move this forward. Thanks again!

@Srihari-mcw (Collaborator):

Will try to test this and get back. Thanks

@Srihari-mcw (Collaborator) commented Apr 10, 2025:

We tested the latest changes with llama-bench on an AMD Granite Ridge 9600X and observed good performance gains.

GCC, Linux:

Q4_0_8_8 model:

| model         | size     | params | backend | threads | test   | t/s           | speedup |
| ------------- | -------- | ------ | ------- | ------- | ------ | ------------- | ------- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     | 6       | pp 512 | 103.90 ± 0.15 |         |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     | 6       | pp 512 | 125.90 ± 0.33 | 21.17%  |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     | 6       | tg 128 | 15.07 ± 0.00  |         |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     | 6       | tg 128 | 15.09 ± 0.00  | 0.13%   |

GCC Version = 12.3

The machine supports the following flags by default:

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

The PR looks good from our side. Thanks
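
As background on the bracketed intrinsic names in the title: `_mm256_dpbusd_epi32` is the AVX512VNNI + AVX512VL form, while `_mm256_dpbusd_avx_epi32` is the VEX-encoded AVX-VNNI form (the `AVX_VNNI` flag above). A minimal compile-time dispatch sketch (hypothetical helper name, not the PR's actual code):

```c
#include <immintrin.h>

// Pick the 256-bit dpbusd variant that the build targets, falling back
// to the classic maddubs/madd pattern when no VNNI extension is available.
static inline __m256i mul_sum_us8_acc(__m256i acc, __m256i ax, __m256i sy) {
#if defined(__AVX512VNNI__) && defined(__AVX512VL__)
    return _mm256_dpbusd_epi32(acc, ax, sy);
#elif defined(__AVXVNNI__)
    return _mm256_dpbusd_avx_epi32(acc, ax, sy);
#else
    // Widen u8*s8 pairs to saturated 16-bit sums, then to 32-bit lanes.
    const __m256i dot = _mm256_maddubs_epi16(ax, sy);
    return _mm256_add_epi32(acc, _mm256_madd_epi16(dot, _mm256_set1_epi16(1)));
#endif
}
```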

@SongXiaoXi
Copy link
Contributor Author

Hey, just a gentle ping. Let me know if there's anything blocking the merge! 👀

@ggerganov ggerganov merged commit e959d32 into ggml-org:master Apr 14, 2025
47 of 51 checks passed
colout pushed a commit to colout/llama.cpp that referenced this pull request Apr 21, 2025
ggml: use _mm[512/256]_dpbusd[_avx]_epi32 to directly accumulate into the result register (ggml-org#12773)

* ggml: use _mm[512/256]_dpbusd[_avx]_epi32 to directly accumulate into the result register

* simplifies the codebase by removing redundant functions