-
I have an Intel iGPU that gives a nice performance boost when processing the prompt or an image. However, the iGPU can only use less than half of the available system memory, which results in frequent out-of-memory issues, usually manifesting as Vulkan device-lost errors. Setting the GGML_VK_PREFER_HOST_MEMORY=1 environment variable seems to let the iGPU access the whole system memory, and it helps enormously. As far as I can tell with llama-bench, there is no performance drawback either. This is nice, but I would like to understand why it happens. Maybe whatever this option does should be the default behaviour for iGPUs? Some system info is below:
The output of vulkaninfo is attached. Tagging @wbruna because he added this option to llama.cpp.
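For reference, this is how I compared the two configurations (the model path is a placeholder, not a specific file from my setup):

```shell
# Baseline: default allocation logic (dedicated VRAM first)
./llama-bench -m ./model.gguf

# With the option: host shared memory first
GGML_VK_PREFER_HOST_MEMORY=1 ./llama-bench -m ./model.gguf
```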
-
The PR has a bit more info about the option: #11592 . But in a nutshell: for some inference operations, back-and-forth data transfers between dedicated VRAM (faster) and host shared (slower) memory may end up much slower than just using host shared memory for everything. That env var just reverses the default logic "allocate dedicated VRAM, with host shared as fallback" to "allocate host shared memory, with dedicated as fallback".

It's not the default because systems that could comfortably allocate everything on dedicated VRAM would take a big performance hit, while potentially under-utilizing VRAM and leaving less general memory available for other applications. The allocation heuristics could probably be improved, but it's far from a trivial problem because it would need to take into account both available device memory and model characteristics (in my specific case, I had a performance hit on 9b models, while both 8b and 11b worked fine).

Anyway, it's strange that an allocation failure would cause a driver failure instead of simply hitting the fallback path. Does it seem related to total memory usage (model and context size, number of offloaded layers)?
-
Thanks @wbruna, this is a very helpful explanation! For iGPUs with UMA, where host and graphics memory are shared, does it still matter where the memory is allocated? For discrete GPUs that require explicit copies between host and device, I can understand that the location of the memory is very important, but with UMA I would have expected it not to matter. Am I missing an important detail here?