
Commit 3e3faab

Isotr0py authored and kwang1012 committed
[Doc] Add documentation for GGUF quantization (vllm-project#8618)
1 parent f2bc45f commit 3e3faab

2 files changed, +74 −0 lines changed


docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -107,6 +107,7 @@ Documentation
     quantization/supported_hardware
     quantization/auto_awq
     quantization/bnb
+    quantization/gguf
     quantization/int8
     quantization/fp8
     quantization/fp8_e5m2_kvcache

docs/source/quantization/gguf.rst

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
.. _gguf:

GGUF
==================

.. warning::

   Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment; it might be incompatible with other features. Currently, you can use GGUF as a way to reduce the memory footprint. If you encounter any issues, please report them to the vLLM team.

.. warning::

   Currently, vLLM only supports loading single-file GGUF models. If you have a multi-file GGUF model, you can use the `gguf-split <https://github.com/ggerganov/llama.cpp/pull/6135>`_ tool to merge it into a single-file model.
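The snippet below is only a sketch of driving that merge step from Python: the ``gguf-split`` binary (built from llama.cpp), its presence on your ``PATH``, and the shard file names are all assumptions you will need to adapt; see the linked llama.cpp PR for the tool's authoritative usage.

.. code-block:: python

   import subprocess

   # Hypothetical shard names for illustration; replace with your own split files.
   first_shard = "./tinyllama-1.1b-chat-v1.0-00001-of-00003.gguf"
   merged_output = "./tinyllama-1.1b-chat-v1.0.gguf"

   # ``--merge`` takes the first shard and an output path (per the llama.cpp PR
   # linked above); this assumes ``gguf-split`` is available on PATH.
   subprocess.run(["gguf-split", "--merge", first_shard, merged_output], check=True)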

To run a GGUF model with vLLM, you can download and use the local GGUF model from `TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF <https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF>`_ with the following command:

.. code-block:: console

   $ wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
   $ # We recommend using the tokenizer from the base model to avoid slow and potentially buggy tokenizer conversion.
   $ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
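Once the server is running, you can send requests to it like any other OpenAI-compatible vLLM deployment. A minimal sketch using the ``openai`` Python client, assuming the default port ``8000`` and that the served model name is the file path passed to ``vllm serve``:

.. code-block:: python

   from openai import OpenAI

   # The OpenAI-compatible server listens on port 8000 by default; the API key
   # can be any placeholder string when no key has been configured.
   client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

   response = client.chat.completions.create(
       # Assumption: the served model name defaults to the path given to ``vllm serve``.
       model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
       messages=[{"role": "user", "content": "Hello"}],
   )
   print(response.choices[0].message.content)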

You can also add ``--tensor-parallel-size 2`` to enable tensor-parallel inference on 2 GPUs:

.. code-block:: console

   $ # We recommend using the tokenizer from the base model to avoid slow and potentially buggy tokenizer conversion.
   $ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2

.. warning::

   We recommend using the tokenizer from the base model instead of the GGUF model, because tokenizer conversion from GGUF is time-consuming and unstable, especially for models with a large vocabulary.

You can also use the GGUF model directly through the LLM entrypoint:

.. code-block:: python

   from vllm import LLM, SamplingParams

   # In this script, we demonstrate how to pass input to the chat method:
   conversation = [
       {
           "role": "system",
           "content": "You are a helpful assistant"
       },
       {
           "role": "user",
           "content": "Hello"
       },
       {
           "role": "assistant",
           "content": "Hello! How can I assist you today?"
       },
       {
           "role": "user",
           "content": "Write an essay about the importance of higher education.",
       },
   ]

   # Create a sampling params object.
   sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

   # Create an LLM.
   llm = LLM(model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
             tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
   # Generate texts from the conversation. The output is a list of RequestOutput
   # objects that contain the prompt, generated text, and other information.
   outputs = llm.chat(conversation, sampling_params)

   # Print the outputs.
   for output in outputs:
       prompt = output.prompt
       generated_text = output.outputs[0].text
       print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
