
Commit ce45c0d

Fix docs, update command
Signed-off-by: Rafael Vasquez <[email protected]>
1 parent 6f638e9 commit ce45c0d


43 files changed: +346 -332 lines changed

docs/README.md

Lines changed: 1 addition & 0 deletions
@@ -16,4 +16,5 @@ make html
 ```bash
 python -m http.server -d build/html/
 ```
+
 Launch your browser and open localhost:8000.

docs/source/api/multimodal/index.md

Lines changed: 0 additions & 1 deletion
@@ -13,7 +13,6 @@ via the `multi_modal_data` field in {class}`vllm.inputs.PromptType`.
 
 Looking to add your own multi-modal model? Please follow the instructions listed [here](#enabling-multimodal-inputs).
 
-
 ## Module Contents
 
 ```{eval-rst}

docs/source/api/params.md

Lines changed: 0 additions & 1 deletion
@@ -19,4 +19,3 @@ Optional parameters for vLLM APIs.
 .. autoclass:: vllm.PoolingParams
     :members:
 ```
-

docs/source/community/sponsors.md

Lines changed: 2 additions & 0 deletions
@@ -6,13 +6,15 @@ vLLM is a community project. Our compute resources for development and testing a
 <!-- Note: Please keep these consistent with README.md. -->
 
 Cash Donations:
+
 - a16z
 - Dropbox
 - Sequoia Capital
 - Skywork AI
 - ZhenFund
 
 Compute Resources:
+
 - AMD
 - Anyscale
 - AWS

docs/source/contributing/overview.md

Lines changed: 0 additions & 2 deletions
@@ -37,8 +37,6 @@ pytest tests/
 Currently, the repository is not fully checked by `mypy`.
 ```
 
-# Contribution Guidelines
-
 ## Issues
 
 If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.

docs/source/deployment/docker.md

Lines changed: 2 additions & 2 deletions
@@ -28,8 +28,8 @@ memory to share data between processes under the hood, particularly for tensor p
 You can build and run vLLM from source via the provided <gh-file:Dockerfile>. To build vLLM:
 
 ```console
-$ # optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
-$ DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai
+# optionally specifies: --build-arg max_jobs=8 --build-arg nvcc_threads=2
+DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai
 ```
 
 ```{note}

docs/source/deployment/frameworks/cerebrium.md

Lines changed: 5 additions & 5 deletions
@@ -13,14 +13,14 @@ vLLM can be run on a cloud based GPU machine with [Cerebrium](https://www.cerebr
 To install the Cerebrium client, run:
 
 ```console
-$ pip install cerebrium
-$ cerebrium login
+pip install cerebrium
+cerebrium login
 ```
 
 Next, create your Cerebrium project, run:
 
 ```console
-$ cerebrium init vllm-project
+cerebrium init vllm-project
 ```
 
 Next, to install the required packages, add the following to your cerebrium.toml:
@@ -58,10 +58,10 @@ def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
 Then, run the following code to deploy it to the cloud:
 
 ```console
-$ cerebrium deploy
+cerebrium deploy
 ```
 
-If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case` /run`)
+If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case`/run`)
 
 ```python
 curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \

docs/source/deployment/frameworks/dstack.md

Lines changed: 5 additions & 5 deletions
@@ -13,16 +13,16 @@ vLLM can be run on a cloud based GPU machine with [dstack](https://dstack.ai/),
 To install dstack client, run:
 
 ```console
-$ pip install "dstack[all]
-$ dstack server
+pip install "dstack[all]
+dstack server
 ```
 
 Next, to configure your dstack project, run:
 
 ```console
-$ mkdir -p vllm-dstack
-$ cd vllm-dstack
-$ dstack init
+mkdir -p vllm-dstack
+cd vllm-dstack
+dstack init
 ```
 
 Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:

docs/source/deployment/frameworks/skypilot.md

Lines changed: 1 addition & 1 deletion
@@ -338,7 +338,7 @@ run: |
 sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm)
 ```
 
-2. Then, we can access the GUI at the returned gradio link:
+1. Then, we can access the GUI at the returned gradio link:
 
 ```console
 | INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live

docs/source/deployment/integrations/llamastack.md

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-sta
 To install Llama Stack, run
 
 ```console
-$ pip install llama-stack -q
+pip install llama-stack -q
 ```
 
 ## Inference using OpenAI Compatible API
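For reference, a minimal sketch of querying a vLLM OpenAI-compatible endpoint with the official `openai` Python client; the base URL, API key, and model name below are placeholder assumptions for a locally running server, not values taken from this commit or from Llama Stack:

```python
from openai import OpenAI

# Placeholder endpoint and credentials; point these at your running vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model name served by vLLM
    messages=[{"role": "user", "content": "Summarize what vLLM is in one sentence."}],
)
print(response.choices[0].message.content)
```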

docs/source/deployment/k8s.md

Lines changed: 5 additions & 4 deletions
@@ -14,7 +14,7 @@ Before you begin, ensure that you have the following:
 
 ## Deployment Steps
 
-1. **Create a PVC , Secret and Deployment for vLLM**
+1. Create a PVC, Secret and Deployment for vLLM
 
 PVC is used to store the model cache and it is optional, you can use hostPath or other storage options
 
@@ -49,7 +49,7 @@ stringData:
 
 Next to create the deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model.
 
-Here are two examples for using NVIDIA GPU and AMD GPU. 
+Here are two examples for using NVIDIA GPU and AMD GPU.
 
 - NVIDIA GPU
 
@@ -194,9 +194,10 @@ spec:
         - name: shm
           mountPath: /dev/shm
 ```
+
 You can get the full example with steps and sample yaml files from <https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve>.
 
-2. **Create a Kubernetes Service for vLLM**
+1. Create a Kubernetes Service for vLLM
 
 Next, create a Kubernetes Service file to expose the `mistral-7b` deployment:
 
@@ -219,7 +220,7 @@ spec:
   type: ClusterIP
 ```
 
-3. **Deploy and Test**
+1. Deploy and Test
 
 Apply the deployment and service configurations using `kubectl apply -f <filename>`:
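To illustrate the test step, a rough sketch of sending a completion request to the exposed Service with Python's `requests`; the in-cluster DNS name and the model name are assumptions based on the `mistral-7b` example above, not values from this commit:

```python
import requests

# Assumed in-cluster address of the `mistral-7b` Service created above (ClusterIP on port 80).
url = "http://mistral-7b.default.svc.cluster.local/v1/completions"

payload = {
    "model": "mistralai/Mistral-7B-Instruct-v0.3",  # assumed model served by the deployment
    "prompt": "San Francisco is a",
    "max_tokens": 16,
    "temperature": 0,
}

resp = requests.post(url, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```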

docs/source/design/automatic_prefix_caching.md

Lines changed: 4 additions & 7 deletions
@@ -6,27 +6,24 @@ The core idea of [PagedAttention](#design-paged-attention) is to partition the K
 
 To automatically cache the KV cache, we utilize the following key observation: Each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block.
 
-```
+```text
                     Block 1                  Block 2                  Block 3
         [A gentle breeze stirred] [the leaves as children] [laughed in the distance]
 Block 1: |<--- block tokens ---->|
 Block 2: |<------- prefix ------>| |<--- block tokens --->|
 Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->|
 ```
 
-
 In the example above, the KV cache in the first block can be uniquely identified with the tokens “A gentle breeze stirred”. The third block can be uniquely identified with the tokens in the block “laughed in the distance”, along with the prefix tokens “A gentle breeze stirred the leaves as children”. Therefore, we can build the following one-to-one mapping:
 
-```
+```text
 hash(prefix tokens + block tokens) <--> KV Block
 ```
 
 With this mapping, we can add another indirection in vLLM’s KV cache management. Previously, each sequence in vLLM maintained a mapping from their logical KV blocks to physical blocks. To achieve automatic caching of KV blocks, we map the logical KV blocks to their hash value and maintain a global hash table of all the physical blocks. In this way, all the KV blocks sharing the same hash value (e.g., shared prefix blocks across two requests) can be mapped to the same physical block and share the memory space.
 
-
 This design achieves automatic prefix caching without the need of maintaining a tree structure among the KV blocks. More specifically, all of the blocks are independent of each other and can be allocated and freed by itself, which enables us to manages the KV cache as ordinary caches in operating system.
 
-
 ## Generalized Caching Policy
 
 Keeping all the KV blocks in a hash table enables vLLM to cache KV blocks from earlier requests to save memory and accelerate the computation of future requests. For example, if a new request shares the system prompt with the previous request, the KV cache of the shared prompt can directly be used for the new request without recomputation. However, the total KV cache space is limited and we have to decide which KV blocks to keep or evict when the cache is full.
@@ -41,5 +38,5 @@ Note that this eviction policy effectively implements the exact policy as in [Ra
 
 However, the hash-based KV cache management gives us the flexibility to handle more complicated serving scenarios and implement more complicated eviction policies beyond the policy above:
 
-- Multi-LoRA serving. When serving requests for multiple LoRA adapters, we can simply let the hash of each KV block to also include the LoRA ID the request is querying for to enable caching for all adapters. In this way, we can jointly manage the KV blocks for different adapters, which simplifies the system implementation and improves the global cache hit rate and efficiency.
-- Multi-modal models. When the user input includes more than just discrete tokens, we can use different hashing methods to handle the caching of inputs of different modalities. For example, perceptual hashing for images to cache similar input images.
+* Multi-LoRA serving. When serving requests for multiple LoRA adapters, we can simply let the hash of each KV block to also include the LoRA ID the request is querying for to enable caching for all adapters. In this way, we can jointly manage the KV blocks for different adapters, which simplifies the system implementation and improves the global cache hit rate and efficiency.
+* Multi-modal models. When the user input includes more than just discrete tokens, we can use different hashing methods to handle the caching of inputs of different modalities. For example, perceptual hashing for images to cache similar input images.
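To make the mapping concrete, here is a small illustrative sketch of hashing each full block by its prefix plus its own tokens. The block size and the use of Python's built-in `hash` are stand-ins for vLLM's actual block size and hashing scheme, not the real implementation:

```python
BLOCK_SIZE = 4  # illustrative block size, not vLLM's default


def block_hashes(token_ids: list[int]) -> list[int]:
    """Hash each full block by (prefix tokens + block tokens), as described above."""
    hashes = []
    full_len = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full_len, BLOCK_SIZE):
        prefix_and_block = tuple(token_ids[: start + BLOCK_SIZE])
        hashes.append(hash(prefix_and_block))
    return hashes


# Two prompts sharing the same first block map it to the same hash,
# so both could reuse one physical KV block.
a = block_hashes([1, 2, 3, 4, 5, 6, 7, 8])
b = block_hashes([1, 2, 3, 4, 9, 10, 11, 12])
assert a[0] == b[0] and a[1] != b[1]
```

Because the hash of block i covers every token up to and including block i, two requests only share a physical block when they agree on the entire prefix, which is the property the sharing and eviction logic above relies on.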

docs/source/features/quantization/auto_awq.md

Lines changed: 2 additions & 2 deletions
@@ -15,7 +15,7 @@ The main benefits are lower latency and memory usage.
 You can quantize your own models by installing AutoAWQ or picking one of the [400+ models on Huggingface](https://huggingface.co/models?sort=trending&search=awq).
 
 ```console
-$ pip install autoawq
+pip install autoawq
 ```
 
 After installing AutoAWQ, you are ready to quantize a model. Here is an example of how to quantize `mistralai/Mistral-7B-Instruct-v0.2`:
@@ -47,7 +47,7 @@ print(f'Model is quantized and saved at "{quant_path}"')
 To run an AWQ model with vLLM, you can use [TheBloke/Llama-2-7b-Chat-AWQ](https://huggingface.co/TheBloke/Llama-2-7b-Chat-AWQ) with the following command:
 
 ```console
-$ python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
+python examples/offline_inference/llm_engine_example.py --model TheBloke/Llama-2-7b-Chat-AWQ --quantization awq
 ```
 
 AWQ models are also supported directly through the LLM entrypoint:
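As a rough sketch of that entrypoint usage (the sampling settings below are arbitrary, and `quantization="awq"` simply mirrors the `--quantization awq` flag from the command above):

```python
from vllm import LLM, SamplingParams

# Load the pre-quantized AWQ checkpoint referenced above.
llm = LLM(model="TheBloke/Llama-2-7b-Chat-AWQ", quantization="awq")

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
outputs = llm.generate(["What does AWQ quantization do?"], sampling_params)
print(outputs[0].outputs[0].text)
```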

docs/source/features/quantization/bnb.md

Lines changed: 4 additions & 3 deletions
@@ -9,15 +9,15 @@ Compared to other quantization methods, BitsAndBytes eliminates the need for cal
 Below are the steps to utilize BitsAndBytes with vLLM.
 
 ```console
-$ pip install bitsandbytes>=0.45.0
+pip install bitsandbytes>=0.45.0
 ```
 
 vLLM reads the model's config file and supports both in-flight quantization and pre-quantized checkpoint.
 
 You can find bitsandbytes quantized models on <https://huggingface.co/models?other=bitsandbytes>.
 And usually, these repositories have a config.json file that includes a quantization_config section.
 
-## Read quantized checkpoint.
+## Read quantized checkpoint
 
 ```python
 from vllm import LLM
@@ -37,10 +37,11 @@ model_id = "huggyllama/llama-7b"
 llm = LLM(model=model_id, dtype=torch.bfloat16, trust_remote_code=True, \
 quantization="bitsandbytes", load_format="bitsandbytes")
 ```
+
 ## OpenAI Compatible Server
 
 Append the following to your 4bit model arguments:
 
-```
+```console
 --quantization bitsandbytes --load-format bitsandbytes
 ```

docs/source/features/quantization/fp8.md

Lines changed: 2 additions & 2 deletions
@@ -41,7 +41,7 @@ Currently, we load the model at original precision before quantizing down to 8-b
 To produce performant FP8 quantized models with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
 
 ```console
-$ pip install llmcompressor
+pip install llmcompressor
 ```
 
 ## Quantization Process
@@ -98,7 +98,7 @@ tokenizer.save_pretrained(SAVE_DIR)
 Install `vllm` and `lm-evaluation-harness`:
 
 ```console
-$ pip install vllm lm-eval==0.4.4
+pip install vllm lm-eval==0.4.4
 ```
 
 Load and run the model in `vllm`:
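A minimal sketch of that step, assuming the quantized model was written to the `SAVE_DIR` directory passed to `save_pretrained` above; the directory name below is a placeholder:

```python
from vllm import LLM

# Placeholder for the directory produced by the llm-compressor quantization step above.
SAVE_DIR = "Meta-Llama-3-8B-Instruct-FP8-Dynamic"

llm = LLM(model=SAVE_DIR)
outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)
```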

docs/source/features/quantization/fp8_e4m3_kvcache.md

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ unquantized model through a quantizer tool (e.g. AMD quantizer or NVIDIA AMMO).
 To install AMMO (AlgorithMic Model Optimization):
 
 ```console
-$ pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo
+pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo
 ```
 
 Studies have shown that FP8 E4M3 quantization typically only minimally degrades inference accuracy. The most recent silicon

docs/source/features/quantization/gguf.md

Lines changed: 5 additions & 5 deletions
@@ -13,16 +13,16 @@ Currently, vllm only supports loading single-file GGUF models. If you have a mul
 To run a GGUF model with vLLM, you can download and use the local GGUF model from [TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF](https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF) with the following command:
 
 ```console
-$ wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
-$ # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
-$ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
+wget https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
+# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
+vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0
 ```
 
 You can also add `--tensor-parallel-size 2` to enable tensor parallelism inference with 2 GPUs:
 
 ```console
-$ # We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
-$ vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
+# We recommend using the tokenizer from base model to avoid long-time and buggy tokenizer conversion.
+vllm serve ./tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tensor-parallel-size 2
 ```
 
 ```{warning}

docs/source/features/quantization/int8.md

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ INT8 computation is supported on NVIDIA GPUs with compute capability > 7.5 (Turi
 To use INT8 quantization with vLLM, you'll need to install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
 
 ```console
-$ pip install llmcompressor
+pip install llmcompressor
 ```
 
 ## Quantization Process

docs/source/features/spec_decode.md

Lines changed: 2 additions & 8 deletions
@@ -192,11 +192,11 @@ A few important things to consider when using the EAGLE based draft models:
 
 1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) cannot be
    used directly with vLLM due to differences in the expected layer names and model definition.
-   To use these models with vLLM, use the [following script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d) 
+   To use these models with vLLM, use the [following script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d)
   to convert them. Note that this script does not modify the model's weights.
 
    In the above example, use the script to first convert
-   the [yuhuili/EAGLE-LLaMA3-Instruct-8B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-8B) model 
+   the [yuhuili/EAGLE-LLaMA3-Instruct-8B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-8B) model
    and then use the converted checkpoint as the draft model in vLLM.
 
 2. The EAGLE based draft models need to be run without tensor parallelism
@@ -207,7 +207,6 @@ A few important things to consider when using the EAGLE based draft models:
    reported in the reference implementation [here](https://github.com/SafeAILab/EAGLE). This issue is under
   investigation and tracked here: [https://github.com/vllm-project/vllm/issues/9565](https://github.com/vllm-project/vllm/issues/9565).
 
-
 A variety of EAGLE draft models are available on the Hugging Face hub:
 
 | Base Model | EAGLE on Hugging Face | # EAGLE Parameters |
@@ -224,7 +223,6 @@ A variety of EAGLE draft models are available on the Hugging Face hub:
 | Qwen2-7B-Instruct | yuhuili/EAGLE-Qwen2-7B-Instruct | 0.26B |
 | Qwen2-72B-Instruct | yuhuili/EAGLE-Qwen2-72B-Instruct | 1.05B |
 
-
 ## Lossless guarantees of Speculative Decoding
 
 In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
@@ -250,17 +248,13 @@ speculative decoding, breaking down the guarantees into three key areas:
    same request across runs. For more details, see the FAQ section
   titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).
 
-**Conclusion**
-
 While vLLM strives to ensure losslessness in speculative decoding, variations in generated outputs with and without speculative decoding
 can occur due to following factors:
 
 - **Floating-Point Precision**: Differences in hardware numerical precision may lead to slight discrepancies in the output distribution.
 - **Batch Size and Numerical Stability**: Changes in batch size may cause variations in logprobs and output probabilities, potentially
   due to non-deterministic behavior in batched operations or numerical instability.
 
-**Mitigation Strategies**
-
 For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).
 
 ## Resources for vLLM contributors
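Tying the EAGLE notes above together, a rough sketch of pointing vLLM at a converted EAGLE checkpoint as the draft model. The `speculative_model` and `num_speculative_tokens` arguments follow the engine options documented elsewhere on that page and may be grouped differently in newer releases (for example under a `speculative_config`); the model paths are placeholders:

```python
from vllm import LLM, SamplingParams

# Target model plus a converted EAGLE draft model (names and paths are placeholders).
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_model="path/to/converted/EAGLE-LLaMA3-Instruct-8B",
    num_speculative_tokens=5,
)

outputs = llm.generate(["The future of AI is"], SamplingParams(temperature=0.0, max_tokens=32))
print(outputs[0].outputs[0].text)
```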
