ggml: use _mm[512/256]_dpbusd[_avx]_epi32 to directly accumulate into the result register #12773
Conversation
Hi @Srihari-mcw! Let me know if there’s anything I can do to help move this forward. Thanks again!
… the result register
Will try to test this and get back. Thanks.
We tested the latest changes with llama-bench on an AMD Granite Ridge 9600X and observed good performance gains.
Setup: GCC 12.3 on Linux, Q4_0_8_8 model.
The machine supports the following flags by default:
system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
The PR looks good from our side. Thanks.
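As a rough illustration of how flags such as AVX_VNNI and AVX512_VNNI relate to the intrinsics named in the PR title, the sketch below checks the compile-time macros that GCC and Clang define when the matching -m options (or -march=native on a capable CPU) are enabled. The helper name vnni_path is hypothetical and is not part of ggml; the mapping in the comments reflects the documented requirements of the intrinsics, not the PR's actual dispatch logic.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not ggml code): reports which VNNI instruction family
 * the current compiler flags make available to this translation unit. */
static const char *vnni_path(void) {
#if defined(__AVX512VNNI__) && defined(__AVX512VL__)
    /* _mm512_dpbusd_epi32 and _mm256_dpbusd_epi32 are usable */
    return "AVX512_VNNI";
#elif defined(__AVXVNNI__)
    /* _mm256_dpbusd_avx_epi32 (VEX-encoded form) is usable */
    return "AVX_VNNI";
#else
    /* no dpbusd instruction available at compile time */
    return "none";
#endif
}
```

Built without any -m flags this returns "none"; which branch is taken depends entirely on the options passed to the compiler.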
Hey, just a gentle ping. Let me know if there's anything blocking the merge! 👀
… the result register (ggml-org#12773)
* ggml: use _mm[512/256]_dpbusd[_avx]_epi32 to directly accumulate into the result register
* simplifies the codebase by removing redundant functions
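To make the idea of accumulating directly into the result register concrete, here is a portable scalar sketch of the dpbusd semantics: each 32-bit lane adds four unsigned-byte by signed-byte products to the incoming accumulator. All function names below are hypothetical and the code is not taken from the PR; it only models, under that assumption, why passing the running accumulator as the source operand gives the same result as computing into a zeroed temporary and adding afterwards.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 8 /* eight 32-bit lanes, as in one 256-bit register */

/* Scalar model of a single dpbusd step (hypothetical helper, not ggml code):
 * each 32-bit lane adds four u8*s8 products to the incoming accumulator. */
static void dpbusd_ref(int32_t dst[LANES], const int32_t src[LANES],
                       const uint8_t a[4 * LANES], const int8_t b[4 * LANES]) {
    for (int i = 0; i < LANES; i++) {
        int32_t sum = src[i];
        for (int k = 0; k < 4; k++) {
            sum += (int32_t)a[4 * i + k] * (int32_t)b[4 * i + k];
        }
        dst[i] = sum;
    }
}

/* Two-step style: dot product into a zeroed temporary, then a separate add. */
static void accumulate_two_step(int32_t acc[LANES], const uint8_t *a, const int8_t *b) {
    int32_t zero[LANES] = {0};
    int32_t tmp[LANES];
    dpbusd_ref(tmp, zero, a, b);
    for (int i = 0; i < LANES; i++) acc[i] += tmp[i];
}

/* Direct style: the running accumulator is passed as the src operand. */
static void accumulate_direct(int32_t acc[LANES], const uint8_t *a, const int8_t *b) {
    dpbusd_ref(acc, acc, a, b);
}

/* Returns 1 when both styles give identical lanes on small sample data. */
static int accumulate_styles_match(void) {
    uint8_t a[4 * LANES];
    int8_t b[4 * LANES];
    int32_t acc1[LANES], acc2[LANES];
    for (int i = 0; i < 4 * LANES; i++) {
        a[i] = (uint8_t)(i * 7 + 3);
        b[i] = (int8_t)(i % 16 - 8);
    }
    for (int i = 0; i < LANES; i++) acc1[i] = acc2[i] = 100 * i;
    for (int step = 0; step < 3; step++) {
        accumulate_two_step(acc1, a, b);
        accumulate_direct(acc2, a, b);
    }
    return memcmp(acc1, acc2, sizeof acc1) == 0;
}
```

In this model the direct style simply drops the temporary and the extra per-lane add; on real hardware the corresponding saving would be the separate vector add instruction per step.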
Benchmark: q4_0 model:
platform: AMD Ryzen 9 9950X (Zen 5) (AVX512VNNI)
Before:
After:
This demonstrates a ~12% real-world speedup in prompt evaluation performance, with no impact on accuracy.
Please let me know if: