
vulkan: use aligned loads for flash attention mask #12853


Merged
merged 1 commit into ggml-org:master on Apr 12, 2025

Conversation

jeffbolznv (Collaborator)

Rewrite the stride logic for the mask tensor in the FA shader to force the stride to be aligned, which allows using more efficient loads.
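For intuition, here is a minimal sketch of the general idea (the names and values below are illustrative assumptions, not the PR's actual shader or host code): rounding the mask's row stride up to a multiple of the vector load width guarantees that every row starts on an aligned boundary, which is what makes the wider loads legal.

```cpp
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of `align` (align must be > 0).
constexpr std::size_t align_up(std::size_t n, std::size_t align) {
    return (n + align - 1) / align * align;
}

int main() {
    const std::size_t n_kv   = 4096; // hypothetical mask row length, in fp16 elements
    const std::size_t load_w = 8;    // 8 fp16 values = one 16-byte vector load
    const std::size_t stride = align_up(n_kv, load_w);

    // With the stride forced to a multiple of the load width, every row of
    // the mask starts on an aligned boundary, so the shader can read it with
    // wide aligned loads instead of per-element loads.
    assert(stride % load_w == 0);
    return 0;
}
```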

before:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m C:\models\meta-llama-3-8b-instruct.Q4_K_M.gguf -p 4096,8192,16384 -n 0 -fa 1 --repetitions 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |        pp4096 |       3340.34 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |        pp8192 |       3075.69 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |       pp16384 |       2599.02 ± 0.00 |

after:

Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench.exe -m C:\models\meta-llama-3-8b-instruct.Q4_K_M.gguf -p 4096,8192,16384 -n 0 -fa 1 --repetitions 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4070 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
| model                          |       size |     params | backend    | ngl | fa |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |        pp4096 |       3378.30 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |        pp8192 |       3093.57 ± 0.00 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | Vulkan     |  99 |  1 |       pp16384 |       2681.14 ± 0.00 |
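From these numbers, the aligned loads give roughly a 1% speedup at pp4096 and pp8192, and about 3% at pp16384, on this RTX 4070.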

@jeffbolznv requested a review from @0cc4m on April 9, 2025
@github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on April 9, 2025
@0cc4m merged commit a483757 into ggml-org:master on April 12, 2025
47 checks passed