ggml-cuda: Adding support for unified memory #8035
Conversation
Based on what testing methodology? On master I get this performance using
I OOM with 77 layers. With the mode added by this PR I get:
The generation speed can be ~10% faster, but the prompt processing speed is ~20x worse, so I don't think this would ever be worth using. And if you offload too many layers, the thrashing kills the generation performance too.
Thanks for testing the PR. With UNIFIED MEMORY (PR enabled)
With offloading (master)
My system:
GPU: RTX 4060 16GB VRAM
CPU: Intel 8700K with 6 physical cores (12 virtual)
Motherboard: Asus PRIME Z390-A
Additional info: I also previously tested the same change with two RTX 3060 12GB in parallel, with a total of 24GB of VRAM.
Possible explanation:
EDIT: more benchmarks with pp512. With unified memory:
master:
EDIT2: I just noticed that you tested the PR without offloading the full model to the GPU. The idea of unified memory is letting the GPU manage the transfer between RAM and VRAM. More benchmarks for comparison. Unified memory enabled:
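To make the idea concrete, here is a minimal, self-contained CUDA sketch (illustrative only, not llama.cpp code) of what cudaMallocManaged provides: one pointer usable from both host and device, with the driver paging data between system RAM and VRAM on demand, so the allocation can exceed the physical VRAM size.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Scale every element of x by f; a plain kernel, nothing llama.cpp-specific.
__global__ void scale(float * x, size_t n, float f) {
    size_t i = blockIdx.x * (size_t) blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] *= f;
    }
}

int main() {
    const size_t n = 1ull << 28; // 1 GiB of floats; on a small GPU this can exceed free VRAM
    float * data = nullptr;

    // Managed allocation: valid on host and device; the driver migrates pages
    // between RAM and VRAM as they are touched, so oversubscription is allowed.
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) {
        printf("cudaMallocManaged failed\n");
        return 1;
    }

    for (size_t i = 0; i < n; i++) {
        data[i] = 1.0f; // written on the host
    }

    scale<<<(unsigned) ((n + 255) / 256), 256>>>(data, n, 2.0f); // touched on the device
    cudaDeviceSynchronize();

    printf("data[0] = %.1f\n", data[0]); // migrated back to the host on access
    cudaFree(data);
    return 0;
}
```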
Testing your PR on Windows 11, LCPP b3197 + your PR merged and the env variable GGML_CUDA_ENABLE_UNIFIED_MEMORY set to 1, I get this:
It loads with GPU0 or GPU1, but not with both.
I think that this setting is managed directly by the NVIDIA CUDA driver on Windows. Could you try changing the settings referenced on this page? https://nvidia.custhelp.com/app/answers/detail/a_id/5490/~/system-memory-fallback-for-stable-diffusion Anyway, I am going to install Windows on my computer and test this myself in the next few days.
I did that, and when system memory fallback is disabled, llama.cpp runs properly in bi-GPU. When it is re-enabled, the problem occurs again. It indeed acts basically like the env variable.
Fingers crossed, because I'd love to access the shared memory via CUDA on dual GPUs.
RTX 4090, Epyc 7742, 8x 3200 MHz RAM:
P40, Xeon E5-2683 v4, 4x 2133 MHz RAM:
For both hardware configurations I am not able to get better performance than with master. However, if there are people for whom this does provide better performance, I would still merge this PR since it's a very simple and self-contained change that is not the default. But please also add some brief documentation to the README that explains how to enable this option.
RTX 3090, Ryzen 5950X, 2x 3200 MHz RAM:
I just noticed, are you testing with FORCE_MMQ?
Thanks for all your testing. I reran the benchmark with FORCE_MMQ and I'm getting the same results. My guess is that in my specific hardware configuration the actual computation on the CPU is so slow that streaming the data to the GPU is actually faster. Another difference I can see in our setups is that I'm using IQ quants and you are using Q2_K. Could this be the culprit? I added the README changes to the PR. Here are the benchmarks with FORCE_MMQ:
I procured an IQ2_M model. The best performance I could get on master with RTX 4090, Epyc 7742, and 8x 3200 MHz RAM is:
With this PR I get this performance:
Peak prompt processing performance is higher. Token generation continues to improve, but that doesn't OOM anyway and is the same speed as on master. You said you wanted to test Windows. Do you still plan to do that?
First, thanks again for benchmarking this PR so extensively. From my understanding of your results, you hit a sweet spot at ngl 75 where both PP and TG were higher than on master. Then another sweet spot at ngl 81 where PP is much lower but TG went from the 17.34 of master to 31.09, which is a speedup of 1.79x. The PR probably benefits IQ quants more than others. My understanding of the NVIDIA docs is that the Windows CUDA driver implements the same functionality as this PR, so the PR wouldn't make any difference on Windows. I think a Windows test is not warranted. Of course I'm open to other opinions. I could possibly test this on Windows this weekend.
I'm only asking because you said:
If you are not going to do that, then I think we should move to finalizing the PR.
OK. So I think that the PR is ready.
Is the total memory used between model + context + compute buffer actually larger than the capacity of the GPU in the cases where it seems to improve performance? This model is 22.46 GiB, so it should fit on a 24GB GPU with very low context.
Yes. The total memory used must be slightly larger than the total memory of the GPU. If it's smaller, then there is no improvement.
I tried
The total is 23490.41, which is definitely lower than 24GB.
The reason I ask is because the only way I can see this working at all is if there is memory fragmentation that prevents the model buffer from being allocated in the first place. Can you test 1e6e363? This change will allocate one buffer per tensor and should avoid any fragmentation issues.
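To illustrate what fragmentation means here (a rough sketch, not the actual ggml backend code): a single large cudaMalloc needs one contiguous free region, while one allocation per tensor can fit into the smaller free regions left between existing allocations. The sizes below are hypothetical.

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const size_t total      = 20ull * 1024 * 1024 * 1024; // hypothetical ~20 GiB model buffer
    const size_t block_size = 256ull * 1024 * 1024;       // hypothetical per-tensor size

    // One contiguous allocation for the whole buffer: can fail on a fragmented
    // heap even if the same amount of free VRAM exists as scattered regions.
    void * big = nullptr;
    if (cudaMalloc(&big, total) != cudaSuccess) {
        printf("single %zu-byte allocation failed\n", total);
    } else {
        cudaFree(big);
    }

    // One allocation per tensor: many smaller blocks instead of one contiguous one.
    std::vector<void *> blocks;
    size_t allocated = 0;
    while (allocated < total) {
        void * p = nullptr;
        if (cudaMalloc(&p, block_size) != cudaSuccess) {
            printf("per-tensor allocations failed after %zu bytes\n", allocated);
            break;
        }
        blocks.push_back(p);
        allocated += block_size;
    }
    for (void * p : blocks) {
        cudaFree(p);
    }
    return 0;
}
```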
I don't have the 24GB GPU anymore since I replaced the graphics cards in my PC. Now I have an RTX 4060 16GB + RTX 3060 12GB.
llama-cli
Here is the launch command:
Here is the relevant portion of the log from llama-cli:
llama-server
Server launch command:
Server error log:
Here is the output of nvidia-smi at idle (I have an X11 server running on GPU1, the 12GB one):
I didn't include the logs from both branches because I see no differences.
I'm trying to compile the PR because I keep getting OOMed (running 2x Vega 64 8GB... yeah, I know...). ggml/src/ggml-cuda.cu:563:15: error: use of undeclared identifier 'cudaMallocManaged'; did you mean 'hipMallocManaged'?
The latest llama.cpp master should already support unified memory on AMD devices.
At the correct location, per the 4th commit of the PR.
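For context on that commit ("Fixed compilation error when using hipblas"), the usual approach in the CUDA/HIP code path is to redirect CUDA runtime symbols to their HIP equivalents when building for AMD GPUs. A rough sketch of that kind of mapping (the exact header and macro list in ggml may differ):

```cpp
// Sketch only: when building via hipBLAS/ROCm, CUDA runtime calls are mapped to
// their HIP counterparts, so code calling cudaMallocManaged compiles to
// hipMallocManaged on AMD devices.
#if defined(GGML_USE_HIPBLAS)
#include <hip/hip_runtime.h>
#define cudaMallocManaged(ptr, size) hipMallocManaged(ptr, size)
#else
#include <cuda_runtime.h>
#endif
```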
Sorry for the long radio silence. Since the changes in this PR are comparatively simple I think it's fine to merge as long as the feature is documented and as long as said documentation is not misleading. If the documentation simply says that the option exists and that it enables swapping VRAM to RAM on Linux then I think that would be acceptable. I would only be comfortable with positive performance claims if we can reach consensus regarding the conditions under which the option helps with performance.
I removed the line in the documentation about the performance improvement in that specific case. I prefer not to make any performance claims since I don't have enough data. The PR should be ready for merging.
Co-authored-by: Johannes Gäßler <[email protected]>
* Adding support for unified memory
* adding again the documentation about unified memory
* refactoring: Moved the unified memory code in the correct location.
* Fixed compilation error when using hipblas
* cleaning up the documentation
* Updating the documentation
Co-authored-by: Johannes Gäßler <[email protected]>
* adding one more case where the PR should not be enabled
---------
Co-authored-by: matteo serva <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>
This adds an environment variable for launching llama.cpp with unified memory on CUDA.
This is useful when the model barely fits in VRAM and inference causes OOM errors.
In that case, token generation with unified memory is much faster than partially offloading the model to CPU RAM.
Example: llama-3-70b-IQ2_XS on 24GB of VRAM.
Launch command:
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./llama-server
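For reference, a minimal sketch of how such an environment variable can gate the allocation path; the helper name below is hypothetical, and the real logic lives in ggml/src/ggml-cuda.cu:

```cpp
#include <cuda_runtime.h>
#include <cstdlib>

// Hypothetical helper (illustrative, not the exact ggml-cuda code): use managed
// memory when GGML_CUDA_ENABLE_UNIFIED_MEMORY is set, plain device memory otherwise.
static cudaError_t alloc_device_buffer(void ** ptr, size_t size) {
    if (std::getenv("GGML_CUDA_ENABLE_UNIFIED_MEMORY") != nullptr) {
        // Managed allocation: the driver migrates pages between system RAM and
        // VRAM on demand, so the buffer may exceed the physical VRAM size.
        return cudaMallocManaged(ptr, size);
    }
    // Default path: a plain VRAM allocation that fails with OOM when VRAM runs out.
    return cudaMalloc(ptr, size);
}
```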