-
I have an Intel iGPU that gives a nice performance boost when processing the prompt or an image. However, the iGPU can only use less than half of the available system memory, which results in frequent out-of-memory issues, usually manifesting as Vulkan device-lost errors. Setting the GGML_VK_PREFER_HOST_MEMORY=1 environment variable seems to let the iGPU access the whole system memory, and it helps enormously. As far as I can tell with llama-bench, there is no performance drawback either. This is nice, but I would like to understand why it happens. Maybe whatever this option does should be the default behaviour for iGPUs? Some system info is below:
The output of vulkaninfo is attached. Tagging @wbruna because he added this option to llama.cpp.
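For reference, this is how I compared the two configurations (the model path is a placeholder, not a specific file from my setup):

```shell
# Baseline: default allocation logic (dedicated VRAM first)
./llama-bench -m ./model.gguf

# With the option: host shared memory first
GGML_VK_PREFER_HOST_MEMORY=1 ./llama-bench -m ./model.gguf
```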
-
The PR has a bit more info about the option: #11592 . But in a nutshell: for some inference operations, back-and-forth data transfers between dedicated VRAM (faster) and host shared (slower) memory may end up much slower than just using host shared memory for everything. That env var just reverses the default logic "allocate dedicated VRAM, with host shared as fallback" to "allocate host shared memory, with dedicated as fallback".

It's not the default because systems that could comfortably allocate everything on dedicated VRAM would take a big performance hit, while potentially under-utilizing VRAM and leaving less general memory available for other applications. The allocation heuristics could probably be improved, but it's far from a trivial problem because it would need to take into account both available device memory and model characteristics (in my specific case, I had a performance hit on 9b models, while both 8b and 11b worked fine).

Anyway, it's strange that an allocation failure would cause a driver failure instead of simply hitting the fallback path. Does it seem related to total memory usage (model and context size, number of offloaded layers)?
-
Thanks @wbruna, this is a very helpful explanation! For iGPUs with UMA, where host and graphics memory are shared, does it still matter where the memory is allocated? For discrete GPUs that require explicit copies between host and device, I can understand that the location of the memory is very important, but with UMA I would have expected it not to matter. Am I missing an important detail here?