[Usage]: Use GGUF model with docker when hf repo has multiple quant versions #8570


Closed · 1 task done
mahenning opened this issue Sep 18, 2024 · 6 comments · Fixed by #8618
Labels
usage How to use vllm

Comments

@mahenning

mahenning commented Sep 18, 2024

Update: I posted the solution below in my next comment.

Your current environment

I skipped the collect_env step since I'm using the latest vLLM Docker container, v0.6.1.post2.

How would you like to use vllm

I want to use a GGUF variant of the Mistral Large Instruct 2407 model with vllm inside a Docker container. I followed the docs for setting up a container.
The repos listed under the quantized category of the model are all GGUF, each containing multiple quant versions. Only 2 of the repos have a config.json (this and this). How can I tell vllm which quantized version of a repo I want to use?
Info: I use an A100 80GB.

What I tried:

docker run --gpus all --name vllm -v /mnt/disk1/hf_models:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=<my_token>" -p 8080:8000 --ipc=host vllm/vllm-openai:latest \
    --model bartowski/Mistral-Large-Instruct-2407-GGUF --tokenizer mistralai/Mistral-Large-Instruct-2407 --gpu-memory-utilization 0.98

Result:

ValueError: No supported config format found in bartowski/Mistral-Large-Instruct-2407-GGUF

Then I tried one of the repos that have a config.json:

docker run --gpus all --name vllm -v /mnt/disk1/hf_models:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=<my_token>" -p 8080:8000 --ipc=host vllm/vllm-openai:latest \
    --model second-state/Mistral-Large-Instruct-2407-GGUF --tokenizer mistralai/Mistral-Large-Instruct-2407 --gpu-memory-utilization 0.98

Result:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB. GPU 0 has a total capacity of 79.15 GiB of which 213.69 MiB is free. Process 981263 has 78.93 GiB memory in use. [...]

Info: no other process was running on the GPU; its memory was empty beforehand.

So it seems that vllm at least tries to load something. But how can I specify which quantized version I want to load, e.g. the Q4_K_S variant? I tried giving a link (--model https://huggingface.co/bartowski/Mistral-Large-Instruct-2407-GGUF/tree/main/Mistral-Large-Instruct-2407-Q4_K_M), but it seems --model only accepts the HF repo/model format.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@mahenning mahenning added the usage How to use vllm label Sep 18, 2024
@N0ciple

N0ciple commented Sep 18, 2024

If you download the right .gguf file from Hugging Face, you can do it like so:

docker run --runtime nvidia --gpus=all \
    -v /path/to/your/dot/gguf/models/folder:/models \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model /models/codestral-22b-v0.1.Q4_K.gguf 

But you would have to download the .gguf model first and mount a volume containing said model into the container.
This way it works-ish. In my case, it is complaining that it is missing a chat template, but that is another issue...
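
For reference, a minimal sketch of fetching one specific quant file with huggingface_hub; the repo and filename below follow the Codestral example above and are only placeholders, so use whatever filename the repo's "Files" tab actually lists:

from huggingface_hub import hf_hub_download

# Download a single quant variant into the folder you later mount into the container.
# repo_id and filename are examples; replace them with the file you actually want.
hf_hub_download(
    repo_id="bartowski/Codestral-22B-v0.1-GGUF",
    filename="codestral-22b-v0.1.Q4_K.gguf",
    local_dir="/path/to/your/dot/gguf/models/folder",
)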

@Isotr0py
Collaborator

@N0ciple If you are hitting the missing chat template issue, you can try passing --tokenizer mistralai/Codestral-22B-v0.1 to use the tokenizer from the source model. The missing chat template from the GGUF file is a bug in transformers v4.44, which will be fixed in the upcoming v4.45.

@N0ciple

N0ciple commented Sep 18, 2024

> @N0ciple If you are hitting the missing chat template issue, you can try passing --tokenizer mistralai/Codestral-22B-v0.1 to use the tokenizer from the source model. The missing chat template from the GGUF file is a bug in transformers v4.44, which will be fixed in the upcoming v4.45.

Thank you @Isotr0py, I was able to make it run with your help!

For anybody stumbling upon this issue, here is how I run a GGUF model:

docker run --runtime nvidia --gpus=all \
    -v /path/to/your/dot/gguf/models:/models \
    --env "HUGGING_FACE_HUB_TOKEN=<your-hf-token>" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model /models/codestral-22b-v0.1.Q4_K.gguf \
    --tokenizer mistralai/Codestral-22B-v0.1

You would have to download the .gguf file that you want and store it in /path/to/your/dot/gguf/models, change the model (here /models/codestral-22b-v0.1.Q4_K.gguf) to the one you downloaded, and update the tokenizer (here mistralai/Codestral-22B-v0.1) to the one from the base model, as pointed out by @Isotr0py.

The .gguf files are in the "Files" tab of a model page (in my case, I downloaded the files from this page: https://huggingface.co/bartowski/Codestral-22B-v0.1-GGUF/tree/main).

But to second @mahenning, it would be nice to be able to specify just the repo and the quantisation (Q8_0, Q6_K, Q5_K_M, etc.) as a command-line argument!
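
Until something like that exists, here is a minimal client-side sketch (plain huggingface_hub, not a vllm feature) of resolving a repo plus a quant tag to the matching .gguf file(s) before downloading; the repo and tag are only example values:

from huggingface_hub import list_repo_files

repo_id = "bartowski/Codestral-22B-v0.1-GGUF"  # example repo
quant = "Q4_K"                                 # e.g. Q8_0, Q6_K, Q5_K_M, ...

# Loose substring match: "Q4_K" also catches Q4_K_M / Q4_K_S, so inspect the result.
matches = [f for f in list_repo_files(repo_id) if f.endswith(".gguf") and quant in f]
print(matches)  # download the one(s) you want with hf_hub_download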

@searstream

Thanks for the above, but any thoughts on how to handle models that come in multiple parts? E.g. https://huggingface.co/bartowski/Mistral-Large-Instruct-2407-GGUF/tree/main/Mistral-Large-Instruct-2407-Q4_K_S

@Isotr0py
Collaborator

@searstream Currently, vllm doesn't support loading multi-file GGUF models. I'm afraid you might need to merge them with the gguf-split tool first...

@mahenning
Author

mahenning commented Sep 19, 2024

I made it work, here's how:

1. Download the model files

I did it with a Python script I found, but you may also be able to download the files directly.

from huggingface_hub import hf_hub_download

repo_id = "second-state/Mistral-Large-Instruct-2407-GGUF"
filenames = [
    "Mistral-Large-Instruct-2407-Q4_K_S-00001-of-00003.gguf",
    "Mistral-Large-Instruct-2407-Q4_K_S-00002-of-00003.gguf",
    "Mistral-Large-Instruct-2407-Q4_K_S-00003-of-00003.gguf"]

# Download each shard of the Q4_K_S quant into the local models folder.
for filename in filenames:
    hf_hub_download(repo_id, filename=filename, local_dir="path/to/models/")

2. Clone the llama.cpp repo to use gguf-split for merging the shards

I ran everything on Ubuntu 22.04

  • navigate to the folder where you want to clone the repo
  • git clone https://github.com/ggerganov/llama.cpp.git
  • cd llama.cpp
  • (install cmake if you don't have it: sudo apt install cmake)
  • cmake -B build && cmake --build build --config Release (from the build docs)
  • Use the compiled gguf-split tool to merge the model shards into a single file:
    ./build/bin/llama-gguf-split --merge path/to/models/model-00001-of-0000x.gguf path/to/models/model.gguf
    (the first argument is the first model shard, the second is the output model name)
    There should now be a model.gguf file in the model path to use.
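
If you prefer to drive the merge from the same Python session as the download script, here is a rough equivalent of the command above; the paths follow step 1 and are assumptions, so adjust them to your setup:

import subprocess

# Merge the three shards downloaded in step 1 into a single .gguf file
# using the llama-gguf-split binary built above.
subprocess.run(
    [
        "./build/bin/llama-gguf-split", "--merge",
        "path/to/models/Mistral-Large-Instruct-2407-Q4_K_S-00001-of-00003.gguf",
        "path/to/models/Mistral-Large-Instruct-2407-Q4_K_S.gguf",
    ],
    check=True,
)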

3. Start the vllm docker container with the model path as a volume

docker run --gpus all --name vllm -v path/to/models:/root/.cache/huggingface \
    -p 8080:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model /root/.cache/huggingface/Mistral-Large-Instruct-2407-Q4_K_S.gguf \
    --tokenizer mistralai/Mistral-Large-Instruct-2407 \
    --gpu-memory-utilization 0.98 \
    --tokenizer_mode "mistral"

(In this example I used the Mistral-Large-Instruct-2407 model I downloaded earlier.)
I also had to use --max-model-len 35000 to fit it on an A100 GPU. You can set --gpu-memory-utilization to whatever you want/need.
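
Once the container is up, a quick sanity check against the OpenAI-compatible API can look roughly like this (a sketch; by default the served model name matches the --model value, i.e. the .gguf path above, unless it is overridden with --served-model-name):

from openai import OpenAI

# Host port 8080 is mapped to the container's port 8000 in the docker run above.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="/root/.cache/huggingface/Mistral-Large-Instruct-2407-Q4_K_S.gguf",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)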
