[Usage]: Use GGUF model with docker when hf repo has multiple quant versions #8570


Closed · 1 task done
mahenning opened this issue Sep 18, 2024 · 6 comments · Fixed by #8618
Labels
usage How to use vllm

Comments

@mahenning

mahenning commented Sep 18, 2024

Update: I posted the solution below in my next comment.

Your current environment

I skipped the collect_env step since I'm using the latest vLLM Docker container, v0.6.1.post2.

How would you like to use vllm

I want to use a GGUF variant of the Mistral Large Instruct 2407 model with vllm inside a Docker container. I followed the docs for setting up a container.
The repos listed under the quantized category of the model are all GGUF, each containing multiple quant versions. Only 2 of the repos have a config.json (this and this). How can I tell vllm which quantized version of a repo I want to use?
Info: I use an A100 80GB.

What I tried:

docker run --gpus all --name vllm -v /mnt/disk1/hf_models:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=<my_token>" -p 8080:8000 --ipc=host vllm/vllm-openai:latest \
    --model bartowski/Mistral-Large-Instruct-2407-GGUF --tokenizer mistralai/Mistral-Large-Instruct-2407 --gpu-memory-utilization 0.98

Result:

ValueError: No supported config format found in bartowski/Mistral-Large-Instruct-2407-GGUF

Then I tried one of the repos that have a config.json:

docker run --gpus all --name vllm -v /mnt/disk1/hf_models:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=<my_token>" -p 8080:8000 --ipc=host vllm/vllm-openai:latest \
    --model second-state/Mistral-Large-Instruct-2407-GGUF --tokenizer mistralai/Mistral-Large-Instruct-2407 --gpu-memory-utilization 0.98

Result:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB. GPU 0 has a total capacity of 79.15 GiB of which 213.69 MiB is free. Process 981263 has 78.93 GiB memory in use. [...]

Info: no other process was running on the GPU; its memory was empty beforehand.

So it seems that vllm at least tries to load something. But how can I specify which quantized version I want to load, e.g. the Q4_K_S variant? I tried giving a link (--model https://huggingface.co/bartowski/Mistral-Large-Instruct-2407-GGUF/tree/main/Mistral-Large-Instruct-2407-Q4_K_M), but it seems --model only accepts the HF repo/model format.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@mahenning mahenning added the usage How to use vllm label Sep 18, 2024
@N0ciple

N0ciple commented Sep 18, 2024

If you download the right .gguf file from Hugging Face, you can do it like so:

docker run --runtime nvidia --gpus=all \
    -v /path/to/your/dot/gguf/models/folder:/models \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model /models/codestral-22b-v0.1.Q4_K.gguf 

But you would have to download the .gguf model first and mount a volume containing said model into the container.
This way it works-ish. In my case, it is complaining that it is missing a chat template, but that is another issue...
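
For reference, a minimal sketch of fetching one specific quant file with huggingface_hub; the repo and filename below follow the Codestral example above and are only placeholders, so use whatever filename the repo's "Files" tab actually lists:

from huggingface_hub import hf_hub_download

# Download a single quant variant into the folder you later mount into the container.
# repo_id and filename are examples; replace them with the file you actually want.
hf_hub_download(
    repo_id="bartowski/Codestral-22B-v0.1-GGUF",
    filename="codestral-22b-v0.1.Q4_K.gguf",
    local_dir="/path/to/your/dot/gguf/models/folder",
)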

@Isotr0py
Collaborator

@N0ciple If you are hitting the missing chat template issue, you can try passing --tokenizer mistralai/Codestral-22B-v0.1 to use the tokenizer from the source model. The missing chat template from the GGUF file is a bug in transformers v4.44, which will be fixed in the upcoming v4.45.

@N0ciple

N0ciple commented Sep 18, 2024

> @N0ciple If you are hitting the missing chat template issue, you can try passing --tokenizer mistralai/Codestral-22B-v0.1 to use the tokenizer from the source model. The missing chat template from the GGUF file is a bug in transformers v4.44, which will be fixed in the upcoming v4.45.

Thank you @Isotr0py, I was able to make it run with your help!

For anybody stumbling upon this issue, here is how I run a GGUF model:

docker run --runtime nvidia --gpus=all \
    -v /path/to/your/dot/gguf/models:/models \
    --env "HUGGING_FACE_HUB_TOKEN=<your-hf-token>" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model /models/codestral-22b-v0.1.Q4_K.gguf \
    --tokenizer mistralai/Codestral-22B-v0.1

You would have to download the .gguf file that you want and store it in /path/to/your/dot/gguf/models, change the model (here /models/codestral-22b-v0.1.Q4_K.gguf) to the one you downloaded, and update the tokenizer (here mistralai/Codestral-22B-v0.1) to the one from the base model, as pointed out by @Isotr0py.

The .gguf files are in the "Files" tab of a model page (in my case, I downloaded the files from this page: https://huggingface.co/bartowski/Codestral-22B-v0.1-GGUF/tree/main).

But to second @mahenning, it would be nice to be able to specify just the repo and the quantisation (Q8_0, Q6_K, Q5_K_M, etc.) as a command-line argument!
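
Until something like that exists, here is a minimal client-side sketch (plain huggingface_hub, not a vllm feature) of resolving a repo plus a quant tag to the matching .gguf file(s) before downloading; the repo and tag are only example values:

from huggingface_hub import list_repo_files

repo_id = "bartowski/Codestral-22B-v0.1-GGUF"  # example repo
quant = "Q4_K"                                 # e.g. Q8_0, Q6_K, Q5_K_M, ...

# Loose substring match: "Q4_K" also catches Q4_K_M / Q4_K_S, so inspect the result.
matches = [f for f in list_repo_files(repo_id) if f.endswith(".gguf") and quant in f]
print(matches)  # download the one(s) you want with hf_hub_download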

@searstream

Thanks for the above, but any thoughts on how to handle models that come in multiple parts? E.g. https://huggingface.co/bartowski/Mistral-Large-Instruct-2407-GGUF/tree/main/Mistral-Large-Instruct-2407-Q4_K_S

@Isotr0py
Collaborator

@searstream Currently, vllm doesn't support loading multi-file GGUF models. I'm afraid you might need to merge them with the gguf-split tool first...

@mahenning
Author

mahenning commented Sep 19, 2024

I made it work, here's how:

1. Download the model files

I did it with a Python script I found, but you may also be able to download the files directly.

from huggingface_hub import hf_hub_download

repo_id = "second-state/Mistral-Large-Instruct-2407-GGUF"
filenames = [
    "Mistral-Large-Instruct-2407-Q4_K_S-00001-of-00003.gguf",
    "Mistral-Large-Instruct-2407-Q4_K_S-00002-of-00003.gguf",
    "Mistral-Large-Instruct-2407-Q4_K_S-00003-of-00003.gguf"]

# Download each shard of the Q4_K_S quant into the local models folder.
for filename in filenames:
    hf_hub_download(repo_id, filename=filename, local_dir="path/to/models/")

2. Clone the llama.cpp repo to use gguf-split for merging the shards

I ran everything on Ubuntu 22.04

  • navigate to the folder where you want to clone the repo
  • git clone https://github.com/ggerganov/llama.cpp.git
  • cd llama.cpp
  • (install cmake if you don't have it: sudo apt install cmake)
  • cmake -B build && cmake --build build --config Release (from the build docs)
  • Use the compiled gguf-split tool to merge the model shards into a single file:
    ./build/bin/llama-gguf-split --merge path/to/models/model-00001-of-0000x.gguf path/to/models/model.gguf
    (the first argument is the first model shard, the second is the output model name)
    There should now be a model.gguf file in the model path to use.
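
If you prefer to drive the merge from the same Python session as the download script, here is a rough equivalent of the command above; the paths follow step 1 and are assumptions, so adjust them to your setup:

import subprocess

# Merge the three shards downloaded in step 1 into a single .gguf file
# using the llama-gguf-split binary built above.
subprocess.run(
    [
        "./build/bin/llama-gguf-split", "--merge",
        "path/to/models/Mistral-Large-Instruct-2407-Q4_K_S-00001-of-00003.gguf",
        "path/to/models/Mistral-Large-Instruct-2407-Q4_K_S.gguf",
    ],
    check=True,
)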

3. Start the vllm docker container with the model path as a volume

docker run --gpus all --name vllm -v path/to/models:/root/.cache/huggingface \
    -p 8080:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model /root/.cache/huggingface/Mistral-Large-Instruct-2407-Q4_K_S.gguf \
    --tokenizer mistralai/Mistral-Large-Instruct-2407 \
    --gpu-memory-utilization 0.98 \
    --tokenizer_mode "mistral"

(In this example I used the Mistral-Large-Instruct-2407 model I downloaded earlier.)
I also had to use --max-model-len 35000 to fit it on an A100 GPU. You can set --gpu-memory-utilization to whatever you want/need.
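
Once the container is up, a quick sanity check against the OpenAI-compatible API can look roughly like this (a sketch; by default the served model name matches the --model value, i.e. the .gguf path above, unless it is overridden with --served-model-name):

from openai import OpenAI

# Host port 8080 is mapped to the container's port 8000 in the docker run above.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="/root/.cache/huggingface/Mistral-Large-Instruct-2407-Q4_K_S.gguf",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)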
