[Usage]: Use GGUF model with docker when hf repo has multiple quant versions #8570
Comments
If you download the right .gguf file from Hugging Face, you can do it like so:
But you would have to download the .gguf model first and mount a volume containing that model into the container.
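For example, a minimal sketch of that approach (image tag, paths, and the .gguf filename are placeholders, not the exact command from the original comment):

```bash
# Mount a local directory containing the downloaded .gguf file into the container
# and point --model at the file path as seen inside the container.
docker run --runtime nvidia --gpus all \
    -v /path/to/models:/models \
    -p 8000:8000 --ipc=host \
    vllm/vllm-openai:latest \
    --model /models/model-Q4_K_M.gguf
```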
@N0ciple If you are hitting a missing chat template issue, you can try passing the chat template yourself.
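One flag vLLM provides for this is --chat-template; a hedged sketch (the template path is a placeholder, and this may not be the exact flag the reply had in mind):

```bash
# Supply a chat template explicitly when the GGUF model/tokenizer does not ship one.
docker run --runtime nvidia --gpus all \
    -v /path/to/models:/models \
    -p 8000:8000 --ipc=host \
    vllm/vllm-openai:latest \
    --model /models/model-Q4_K_M.gguf \
    --chat-template /models/chat_template.jinja
```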
Thank you @Isotr0py, I was able to make it run with your help! For anybody stumbling upon this issue, here is how I run a GGUF model:
You would have to download the .gguf file that you want and store it in the directory you mount into the container. The .gguf files are in the "Files" tab of a model page (in my case, I downloaded the files from this page: https://huggingface.co/bartowski/Codestral-22B-v0.1-GGUF/tree/main). But to second @mahenning, it would be nice to be able to specify just the repo and the quantisation.
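For instance, a single quant file can be fetched with the Hugging Face CLI; the exact filename below is an assumption about the repo's naming scheme:

```bash
# Download one specific quant file from the repo into a local models directory.
huggingface-cli download bartowski/Codestral-22B-v0.1-GGUF \
    Codestral-22B-v0.1-Q4_K_M.gguf \
    --local-dir /path/to/models
```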
Thanks for the above, but any thoughts on how to do models that are in multiple parts? E.g. https://huggingface.co/bartowski/Mistral-Large-Instruct-2407-GGUF/tree/main/Mistral-Large-Instruct-2407-Q4_K_S
@searstream Currently, vLLM doesn't support loading multi-file GGUF models. I'm afraid you might need to merge them with the gguf-split tool first...
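A rough sketch of such a merge with llama.cpp's gguf-split (binary name, build layout, and paths are assumptions about a current llama.cpp checkout; the shard filenames follow the pattern used later in this thread):

```bash
# Build llama.cpp to obtain the gguf-split tool (named llama-gguf-split in recent builds).
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Merge a sharded GGUF by pointing --merge at the first shard;
# the remaining shards are located from the split naming scheme.
./build/bin/llama-gguf-split --merge \
    /path/to/models/Mistral-Large-Instruct-2407-Q4_K_S-00001-of-00003.gguf \
    /path/to/models/Mistral-Large-Instruct-2407-Q4_K_S-merged.gguf
```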
I made it work, here's how:

1. Download the model files

I did it with a Python script I found, but maybe you can just download the files directly:

```python
from huggingface_hub import hf_hub_download

repo_id = "second-state/Mistral-Large-Instruct-2407-GGUF"
filenames = [
    "Mistral-Large-Instruct-2407-Q4_K_S-00001-of-00003.gguf",
    "Mistral-Large-Instruct-2407-Q4_K_S-00002-of-00003.gguf",
    "Mistral-Large-Instruct-2407-Q4_K_S-00003-of-00003.gguf",
]
for filename in filenames:
    hf_hub_download(repo_id, filename=filename, local_dir="path/to/models/")
```

2. Clone the llama.cpp repo to use gguf-split and merge the parts into a single .gguf (roughly as in the sketch under the previous comment), then point --model at the merged file.
Update: I posted the solution in my follow-up comment above.
Your current environment
I skipped the collect_env step as I use the latest docker container v0.6.1.post2 of vllm.
How would you like to use vllm
I want to use a GGUF variant of the Mistral Large Instruct 2407 model with vllm inside a docker container. I followed the docs for setting up a container.
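For context, the docs' Docker example looks roughly like this (the model name here is the docs' placeholder, not what was actually tried below):

```bash
# Standard vLLM OpenAI-compatible server in Docker, per the deployment docs.
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=<your_token>" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model mistralai/Mistral-7B-v0.1
```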
The repos listed under the quantized category of the model are all GGUF, each with multiple different quant versions inside them. Only 2 of the repos have a config.json (this and this). How can I tell vllm which quantized version of a repo I want to use?
Info: I use an A100 80GB.
What I tried:
Result:
Then I tried one of the repos that have a config.json:
Result:
Info: No other process was running on the GPU; its memory was empty beforehand.
So it seems at least that vllm tries to load something. But how can I specify which quantized version I want to load, e.g. the q4_K_S variant? I tried giving a link (--model https://huggingface.co/bartowski/Mistral-Large-Instruct-2407-GGUF/tree/main/Mistral-Large-Instruct-2407-Q4_K_M), but it seems --model only accepts the HF repo/model format.
Before submitting a new issue...