Help with docker command to run llama3.1 8b Q6 quantised model with llama.cpp to utilise GPU to the full potential #12307
Unanswered · vishnuthegeek asked this question in Q&A
I have a GPU machine, Standard_NC24ads_A100_v4, with 80 GB of GPU accelerator memory, and I need help running a Llama 3.1 8B Q6-quantised model in a way that utilises the full potential of the GPU machine and extracts the best performance out of it. Could someone review the command below and suggest changes to best utilise my GPU, please?

Current docker start command:
sudo docker run --gpus all -d --name project_name --network llama-net \
  -p 8080:8080 -v /llama/models/llama3:/models --restart unless-stopped \
  local/llama.cpp:server-cuda \
  -m /models/Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf \
  --port 8080 --host 0.0.0.0 \
  --n-gpu-layers 70 -c 16240 --parallel 5 -n 4096 -t 80 \
  --batch-size 4096 --ubatch-size 1024
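For illustration, here is a sketch of how a command like this is often adjusted for this class of machine; the numbers are assumptions to benchmark against your own workload, not verified settings. Three things stand out in the original: Llama 3.1 8B has only 33 offloadable layers, so --n-gpu-layers 70 already places the entire model (roughly 7 GB at Q6_K) on the 80 GB GPU; -c is the total context shared across --parallel slots, so 16240 with 5 slots leaves only about 3248 tokens per request; and -t 80 oversubscribes the VM's 24 vCPUs, which matters little anyway once every layer runs on the GPU.

# Sketch only — assumed values, tune for your workload:
#   --n-gpu-layers 99 : more than the model's 33 layers, i.e. offload everything
#   -c 40960          : larger total KV cache (8192 tokens per slot with --parallel 5)
#   -t 8              : modest CPU thread count for a 24-vCPU VM with full GPU offload
#   --flash-attn      : enable flash attention, if your llama.cpp build supports it
sudo docker run --gpus all -d --name project_name --network llama-net \
  -p 8080:8080 -v /llama/models/llama3:/models --restart unless-stopped \
  local/llama.cpp:server-cuda \
  -m /models/Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf \
  --port 8080 --host 0.0.0.0 \
  --n-gpu-layers 99 -c 40960 --parallel 5 -n 4096 -t 8 \
  --batch-size 4096 --ubatch-size 1024 \
  --flash-attn

Once the container is up, throughput can be sanity-checked against the server's OpenAI-compatible endpoint on the published port:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Say hello"}],"max_tokens":64}'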
Replies: 1 comment

Any help would be much appreciated.