Help with docker command to run llama3.1 8b Q6 quantised model with llama.cpp to utilise GPU to the full potential #12307
Unanswered · vishnuthegeek asked this question in Q&A
I have a GPU machine, Standard_NC24ads_A100_v4, with 80 GB of GPU accelerator memory, and I need help running a Llama 3.1 8B Q6-quantised model in a way that utilises the full potential of the GPU machine and extracts the best performance out of it. Could someone review the command below and suggest changes to best utilise my GPU, please?

Current docker start command:
sudo docker run --gpus all -d --name project_name --network llama-net \
  -p 8080:8080 -v /llama/models/llama3:/models --restart unless-stopped \
  local/llama.cpp:server-cuda \
  -m /models/Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf \
  --port 8080 --host 0.0.0.0 \
  --n-gpu-layers 70 -c 16240 --parallel 5 -n 4096 -t 80 \
  --batch-size 4096 --ubatch-size 1024
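For illustration, here is a sketch of how a command like this is often adjusted for this class of machine; the numbers are assumptions to benchmark against your own workload, not verified settings. Three things stand out in the original: Llama 3.1 8B has only 33 offloadable layers, so --n-gpu-layers 70 already places the entire model (roughly 7 GB at Q6_K) on the 80 GB GPU; -c is the total context shared across --parallel slots, so 16240 with 5 slots leaves only about 3248 tokens per request; and -t 80 oversubscribes the VM's 24 vCPUs, which matters little anyway once every layer runs on the GPU.

# Sketch only — assumed values, tune for your workload:
#   --n-gpu-layers 99 : more than the model's 33 layers, i.e. offload everything
#   -c 40960          : larger total KV cache (8192 tokens per slot with --parallel 5)
#   -t 8              : modest CPU thread count for a 24-vCPU VM with full GPU offload
#   --flash-attn      : enable flash attention, if your llama.cpp build supports it
sudo docker run --gpus all -d --name project_name --network llama-net \
  -p 8080:8080 -v /llama/models/llama3:/models --restart unless-stopped \
  local/llama.cpp:server-cuda \
  -m /models/Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf \
  --port 8080 --host 0.0.0.0 \
  --n-gpu-layers 99 -c 40960 --parallel 5 -n 4096 -t 8 \
  --batch-size 4096 --ubatch-size 1024 \
  --flash-attn

Once the container is up, throughput can be sanity-checked against the server's OpenAI-compatible endpoint on the published port:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hello"}],"max_tokens":64}'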
Replies: 1 comment

Any help would be much appreciated.