Hybrid CPU and GPU inference? #12126
Unanswered
billblake2018 asked this question in Q&A
Replies: 1 comment
-
This might be of interest: #4543. It's possible to efficiently distribute work (aka shard the work) between the CPU and GPU and achieve tensor parallelization between the two, i.e. have the two work at the same time, but it's experimental. There's more info here: https://github.com/SJTU-IPADS/PowerInfer
-
I'm using b4762, but I can upgrade at will. On my hardware, according to llama-bench, prompt processing is faster if I use the CPU with 8 threads, but token generation is faster if I use the GPU (Intel Graphics via Vulkan) with 1 thread. Moreover, token generation speed decreases the more layers I offload to the GPU. Is there a reasonable way to make llama.cpp distribute its work so as to get the best of both worlds, or must I stick with either the CPU or the GPU?
In case it is of interest: with the CPU and 8 threads, pp is 21.65 t/s and tg is 6.73 t/s. With the GPU, 1 thread, and 0 layers offloaded, it is 17.97 and 3.82. When I offload all layers to the GPU, it is 24.31 and 2.61. So using the GPU really kills token generation speed while at best providing only a modest improvement in prompt processing speed. This is with a Llama 3.2 3B Q8 model on an HP ProBook with 32 GB of RAM running FreeBSD 14.1.
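One way to look for a middle ground is to sweep partial offload and thread counts with llama-bench, which accepts comma-separated value lists and benchmarks every combination in one run. A minimal sketch, assuming a Vulkan build; the model filename and the particular `-ngl`/`-t` values are placeholders, not anything from the original post:

```sh
# Sketch: benchmark every combination of GPU layer offload count (-ngl)
# and CPU thread count (-t) in a single llama-bench run.
# The model path and the specific value lists below are placeholders.
./llama-bench \
  -m ./llama-3.2-3b-q8.gguf \
  -ngl 0,4,8,12,16,20,24,28 \
  -t 1,2,4,8 \
  -p 512 \
  -n 128
# Pick the (-ngl, -t) pair with the best pp/tg trade-off from the output
# table, then pass the same values to llama-cli or llama-server via
# --n-gpu-layers and --threads for normal use.
```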