Vulkan Improvements #5835
Conversation
* Improve dequant shaders, add fast q4_0 dequant
* Optimize dmmv non-kquants for GCN
* Remove unnecessary SPIR-V shader duplication
* Fix q4_0 dequant dispatch sizes
* Fix backend free bug
* Optimize dequant shaders for q4_1, q5_0, q5_1 and q8_0
* Add unary and binary op shader templates
* Fix Vulkan check results
* Enable non-contiguous support for simple ops
* Add argsort
* Basic q4_0 mmq shader and unit test
* Speed up q4_0 dequant code, enable mmq for q4_0
* Rework matmul pipeline selection
* Add soft_max alibi support
* Add q4_1, q5_0, q5_1 and q8_0 dequant mat mat mul shaders
* Add environment variable GGML_VK_FORCE_MAX_ALLOCATION_SIZE to limit max buffer size
* Rename GGML_VULKAN_DISABLE_F16 to GGML_VK_DISABLE_F16 for consistency
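Since several of these commits touch q4_0 dequantization and the new mmq path, here's a minimal CPU-side sketch of what those shaders decode, following ggml's block_q4_0 layout (32 weights per block: one fp16 scale d plus 16 bytes of packed nibbles, with w = (q - 8) * d). The fp16 helper below is a simplified stand-in, not ggml's converter:

```cpp
// Reference (CPU) dequantization of one q4_0 block, mirroring ggml's
// block_q4_0 layout: an fp16 scale plus 32 4-bit quants packed two per byte.
#include <cstdint>
#include <cstring>

struct block_q4_0 {
    uint16_t d;       // scale, stored as raw fp16 bits
    uint8_t  qs[16];  // 32 x 4-bit quants, two per byte
};

// Simplified fp16 -> fp32 conversion (normal numbers only; a full version
// also handles subnormals, inf and NaN).
static float fp16_to_fp32(uint16_t h) {
    const uint32_t sign = (uint32_t)(h & 0x8000) << 16;
    const uint32_t exp  = (h >> 10) & 0x1F;
    const uint32_t man  = h & 0x3FF;
    const uint32_t u    = exp == 0 ? sign // treat zero/subnormal as signed zero
                                   : sign | ((exp + 112) << 23) | (man << 13);
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Dequantize one block into 32 floats: low nibbles fill out[0..15],
// high nibbles fill out[16..31], each as (q - 8) * d.
void dequantize_q4_0(const block_q4_0 * b, float * out) {
    const float d = fp16_to_fp32(b->d);
    for (int i = 0; i < 16; ++i) {
        out[i]      = ((b->qs[i] & 0x0F) - 8) * d;
        out[i + 16] = ((b->qs[i] >> 4)   - 8) * d;
    }
}
```

A mmq unit test can use exactly this kind of reference: dequantize on the CPU, multiply in f32, and compare the shader's output against it within a tolerance.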
Q4_0 benchmarks on gfx1030/radv
@0cc4m what about IQ quants?
Sure, but they're quite a bit of work to implement. I'll get to them eventually, but MoE in particular takes precedence.
Here are some benchmarks:
Vulkan0: AMD Radeon Pro VII (RADV VEGA20) | uma: 0 | fp16: 1 | warp size: 64
Vulkan0: AMD Radeon RX 6800 XT (RADV NAVI21) | uma: 0 | fp16: 1 | warp size: 64
(ROCm failed to run on that PC for whatever reason)
Vulkan0: Intel(R) Arc(tm) A770 Graphics (DG2) | uma: 0 | fp16: 1 | warp size: 32
Vulkan0: NVIDIA GeForce RTX 3090 | uma: 0 | fp16: 1 | warp size: 32
Here are my numbers. I've only tested the quants known to be much slower, plus Q4_K_S, which was already known to be faster than ROCm. Very nice speed improvements across the board.
Vulkan0: AMD Radeon RX 6700 XT (RADV NAVI22) | uma: 0 | fp16: 1 | warp size: 64
Couldn't resist testing the commonly recommended Q4_K_M and Q5_K_M, so I edited my table. Q4_K_* tg is epic, Q5_K_M tg is excellent.
Here are some benchmarks on an AMD Radeon RX 5700 XT. The results are quite impressive, since prompt processing is now faster than ROCm on q4_0 (ROCm is not particularly optimized for this card, which may explain its less impressive numbers). Model: llama 2.
Vulkan0: AMD Radeon RX 5700 XT (RADV NAVI10) | uma: 0 | fp16: 1 | warp size: 64
For the remaining quants, the improvements mostly affect token generation, while prompt processing shows small but consistent gains over the master branch.
@ggerganov @slaren Can one of you approve the minimal change to llama.cpp? The flake8 linting issue isn't coming from my changes.
This probably wasn't the focus, but FYI: things are still pretty broken on Adreno. Finding a compute queue is fixed, but dequant_q4k and dequant_q5k still make Adreno choke (unknown error) when creating the pipeline, and beyond that, the backend still dies with DeviceLost on submit() if more than a few layers are offloaded to the GPU. That still happens even when the new env var restricts the max allocation.
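For context on that env var: here's a minimal sketch of how GGML_VK_FORCE_MAX_ALLOCATION_SIZE can cap allocation sizes, assuming the variable holds a byte count; effective_max_allocation is a hypothetical helper for illustration, not the actual ggml-vulkan code.

```cpp
// Sketch: clamp the device's reported allocation limit to the value of
// GGML_VK_FORCE_MAX_ALLOCATION_SIZE (assumed to be a byte count) when set.
#include <algorithm>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: returns the device limit, lowered to the env var's
// value if the variable is present and parses to a positive number.
uint64_t effective_max_allocation(uint64_t device_limit) {
    if (const char * env = std::getenv("GGML_VK_FORCE_MAX_ALLOCATION_SIZE")) {
        const uint64_t forced = std::strtoull(env, nullptr, 10);
        if (forced > 0) {
            return std::min(device_limit, forced);
        }
    }
    return device_limit;
}
```

Tensors larger than the effective limit would then need to be split across multiple buffers, so lowering it changes allocation sizes rather than total memory use.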
Here's a batch of Vulkan improvements; see the commit list above.
I'll add some benchmarks of my GPUs later. Let me know what you think.