Hybrid CPU and GPU inference? #12126
Unanswered
billblake2018 asked this question in Q&A
Replies: 1 comment
-
This might be of interest: #4543. It's possible to efficiently distribute work (aka shard the work) between the CPU and GPU and achieve tensor parallelization between the two, i.e. have the two work at the same time, but it's experimental. There's more info here: https://github.com/SJTU-IPADS/PowerInfer
-
I'm using b4762, but I can upgrade at will. On my hardware, according to llama-bench, prompt processing is faster if I use the CPU with 8 threads, but token generation is faster if I use the GPU (Intel Graphics via Vulkan) with 1 thread. Moreover, token generation speed decreases the more layers I offload to the GPU. Is there a reasonable way to make llama.cpp distribute its work so as to get the best of both worlds, or must I stick with either the CPU or the GPU?
In case it is of interest: with the CPU and 8 threads, pp is 21.65 t/s and tg is 6.73 t/s. With the GPU, 1 thread, and 0 layers offloaded, it is 17.97 and 3.82. When I offload all layers to the GPU, it is 24.31 and 2.61. So using the GPU really kills token generation speed while at best providing only a modest improvement in prompt processing speed. This is with a Llama 3.2 3B Q8 model on an HP ProBook with 32 GB of RAM running FreeBSD 14.1.
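One way to look for a middle ground is to sweep partial offload and thread counts with llama-bench, which accepts comma-separated value lists and benchmarks every combination in one run. A minimal sketch, assuming a Vulkan build; the model filename and the particular `-ngl`/`-t` values are placeholders, not anything from the original post:

```sh
# Sketch: benchmark every combination of GPU layer offload count (-ngl)
# and CPU thread count (-t) in a single llama-bench run.
# The model path and the specific value lists below are placeholders.
./llama-bench \
  -m ./llama-3.2-3b-q8.gguf \
  -ngl 0,4,8,12,16,20,24,28 \
  -t 1,2,4,8 \
  -p 512 \
  -n 128
# Pick the (-ngl, -t) pair with the best pp/tg trade-off from the output
# table, then pass the same values to llama-cli or llama-server via
# --n-gpu-layers and --threads for normal use.
```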