[Doc] Add documentation for GGUF quantization #8618
Merged
Commits
069fdde  initialize gguf doc (Isotr0py)
ddc21c3  add gguf script (Isotr0py)
a14b9b6  fix gguf docs (Isotr0py)
6060f98  fix some typos (Isotr0py)
2db010f  reorder index (Isotr0py)
6c5a21a  polish (Isotr0py)
d268ef9  fix wrong cmd (Isotr0py)
d3c562e  show the TP support (Isotr0py)
1171ff9  fix typo (Isotr0py)
@@ -0,0 +1,66 @@
.. _gguf:

GGUF
==================

.. warning::

   Please note that GGUF support in vLLM is highly experimental and under-optimized at the moment. Currently, you can use GGUF as a way to reduce the memory footprint. If you encounter any issues, please report them to the vLLM team.

.. warning::

   Currently, vLLM only supports loading single-file GGUF models. If you have a multi-file GGUF model, you can use the `gguf-split <https://github.com/ggerganov/llama.cpp/pull/6135>`_ tool to merge it into a single-file model, as sketched below.

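For example, a merge with ``gguf-split`` could look like the sketch below. The split file names here are hypothetical, and the exact flags may vary between llama.cpp versions, so please check ``gguf-split --help`` on your build:

.. code-block:: console

   $ # Hypothetical split file names; pass the first shard and the desired output path.
   $ ./gguf-split --merge ./tinyllama-1.1b-chat-v1.0.Q4_K_M-00001-of-00003.gguf ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
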
To run a GGUF model with vLLM, you can download a GGUF checkpoint from `TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF <https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF>`_ and serve it locally with the following commands:

.. code-block:: console

   $ wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
   $ # We recommend using the tokenizer from the base model to avoid slow and potentially buggy tokenizer conversion.
   $ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0

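As an aside (not part of the original instructions), the same file can also be fetched from Python with ``huggingface_hub``, which may be convenient in scripted setups; the repository and file names below are the same ones used above:

.. code-block:: python

   from huggingface_hub import hf_hub_download

   # Download the GGUF file from the Hub and return its local path.
   local_path = hf_hub_download(
       repo_id="TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF",
       filename="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
   )
   print(local_path)
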
.. warning::

   We recommend using the tokenizer from the base model instead of the GGUF model, because tokenizer conversion from GGUF is time-consuming and unstable, especially for models with a large vocabulary.

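Once the server is up, it exposes the usual OpenAI-compatible endpoints. As a quick sanity check, a request could look like the sketch below; it assumes the default port 8000 and that the served model name defaults to the path passed to ``vllm serve`` (you can confirm the exact name via ``GET /v1/models``):

.. code-block:: console

   $ curl http://localhost:8000/v1/chat/completions \
       -H "Content-Type: application/json" \
       -d '{"model": "./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf", "messages": [{"role": "user", "content": "Hello"}]}'
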
You can also use the GGUF model directly through the LLM entrypoint:

.. code-block:: python

   from vllm import LLM, SamplingParams

   # In this script, we demonstrate how to pass input to the chat method:
   conversation = [
       {
           "role": "system",
           "content": "You are a helpful assistant"
       },
       {
           "role": "user",
           "content": "Hello"
       },
       {
           "role": "assistant",
           "content": "Hello! How can I assist you today?"
       },
       {
           "role": "user",
           "content": "Write an essay about the importance of higher education.",
       },
   ]

   # Create a sampling params object.
   sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

   # Create an LLM.
   llm = LLM(model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
             tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
   # Generate a response for the conversation. The output is a list of
   # RequestOutput objects that contain the prompt, generated text, and
   # other information.
   outputs = llm.chat(conversation, sampling_params)

   # Print the outputs.
   for output in outputs:
       prompt = output.prompt
       generated_text = output.outputs[0].text
       print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
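A later commit in this pull request ("show the TP support") indicates that GGUF checkpoints can also be loaded with tensor parallelism. As a minimal sketch, assuming two GPUs are available and reusing the same model and tokenizer as above:

.. code-block:: python

   from vllm import LLM, SamplingParams

   # Shard the GGUF weights across 2 GPUs with tensor parallelism.
   # tensor_parallel_size=2 is an assumption based on having two GPUs available.
   llm = LLM(model="./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",
             tokenizer="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
             tensor_parallel_size=2)

   outputs = llm.generate("Hello, my name is",
                          SamplingParams(temperature=0.8, top_p=0.95))
   print(outputs[0].outputs[0].text)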