
Commit d427e5c

[Doc] Minor documentation fixes (#11580)
Signed-off-by: DarkLight1337 <[email protected]>
1 parent 42bb201 commit d427e5c

13 files changed: +27 -25 lines

docs/source/contributing/dockerfile/dockerfile.md

Lines changed: 3 additions & 3 deletions
@@ -11,11 +11,11 @@ Below is a visual representation of the multi-stage Dockerfile. The build graph
 
 The edges of the build graph represent:
 
-- FROM ... dependencies (with a solid line and a full arrow head)
+- `FROM ...` dependencies (with a solid line and a full arrow head)
 
-- COPY --from=... dependencies (with a dashed line and an empty arrow head)
+- `COPY --from=...` dependencies (with a dashed line and an empty arrow head)
 
-- RUN --mount=(.\*)from=... dependencies (with a dotted line and an empty diamond arrow head)
+- `RUN --mount=(.\*)from=...` dependencies (with a dotted line and an empty diamond arrow head)
 
 > ```{figure} ../../assets/dev/dockerfile-stages-dependency.png
 > :align: center

docs/source/contributing/overview.md

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ pytest tests/
 ```
 
 ```{note}
-Currently, the repository does not pass the `mypy` tests.
+Currently, the repository is not fully checked by `mypy`.
 ```
 
 # Contribution Guidelines

docs/source/getting_started/arm-installation.md

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ Contents:
 ## Requirements
 
 - **Operating System**: Linux or macOS
-- **Compiler**: gcc/g++ >= 12.3.0 (optional, but recommended)
+- **Compiler**: `gcc/g++ >= 12.3.0` (optional, but recommended)
 - **Instruction Set Architecture (ISA)**: NEON support is required
 
 (arm-backend-quick-start-dockerfile)=

docs/source/getting_started/cpu-installation.md

Lines changed: 2 additions & 2 deletions
@@ -24,7 +24,7 @@ Table of contents:
 ## Requirements
 
 - OS: Linux
-- Compiler: gcc/g++>=12.3.0 (optional, recommended)
+- Compiler: `gcc/g++>=12.3.0` (optional, recommended)
 - Instruction set architecture (ISA) requirement: AVX512 (optional, recommended)
 
 (cpu-backend-quick-start-dockerfile)=
@@ -69,7 +69,7 @@ $ VLLM_TARGET_DEVICE=cpu python setup.py install
 
 ```{note}
 - AVX512_BF16 is an extension ISA that provides native BF16 data type conversion and vector product instructions, which brings some performance improvement compared with pure AVX512. The CPU backend build script will check the host CPU flags to determine whether to enable AVX512_BF16.
-- If you want to force enable AVX512_BF16 for the cross-compilation, please set environment variable VLLM_CPU_AVX512BF16=1 before the building.
+- If you want to force enable AVX512_BF16 for cross-compilation, please set the environment variable `VLLM_CPU_AVX512BF16=1` before building.
 ```
 
 (env-intro)=

docs/source/getting_started/gaudi-installation.md

Lines changed: 5 additions & 3 deletions
@@ -167,6 +167,8 @@ Currently in vLLM for HPU we support four execution modes, depending on selected
 In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and should only be used for validating functional correctness. Their performance will be improved in the next releases. For obtaining the best performance in 1.18.0, please use HPU Graphs, or PyTorch lazy mode.
 ```
 
+(gaudi-bucketing-mechanism)=
+
 ### Bucketing mechanism
 
 Intel Gaudi accelerators work best when operating on models with fixed tensor shapes. [Intel Gaudi Graph Compiler](https://docs.habana.ai/en/latest/Gaudi_Overview/Intel_Gaudi_Software_Suite.html#graph-compiler-and-runtime) is responsible for generating optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be heavily dependent on input and output tensor shapes, and can require graph recompilation when encountering differently shaped tensors within the same topology. While the resulting binaries utilize Gaudi efficiently, the compilation itself may introduce a noticeable overhead in end-to-end execution.
@@ -185,7 +187,7 @@ INFO 08-01 21:37:59 hpu_model_runner.py:504] Decode bucket config (min, step, ma
 INFO 08-01 21:37:59 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
 ```
 
-`min` determines the lowest value of the bucket. `step` determines the interval between buckets, and `max` determines the upper bound of the bucket. Furthermore, interval between `min` and `step` has special handling - `min` gets multiplied by consecutive powers of two, until `step` gets reached. We call this the ramp-up phase and it is used for handling lower batch sizes with minimum wastage, while allowing larger padding on larger batch sizes.
+`min` determines the lowest value of the bucket. `step` determines the interval between buckets, and `max` determines the upper bound of the bucket. Furthermore, the interval between `min` and `step` has special handling -- `min` gets multiplied by consecutive powers of two, until `step` is reached. We call this the ramp-up phase and it is used for handling lower batch sizes with minimum wastage, while allowing larger padding on larger batch sizes.
 
 Example (with ramp-up)
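Editor's aside (not part of this diff): the ramp-up behaviour described in the changed paragraph can be sketched in a few lines of Python. The function name and the example values below are hypothetical, chosen only to illustrate the doubling-then-stepping pattern.

```python
def bucket_range(min_val: int, step: int, max_val: int) -> list[int]:
    """Sketch of the bucketing scheme described above: `min` is doubled
    (consecutive powers of two) until `step` is reached, after which buckets
    advance in fixed `step` increments up to `max`."""
    buckets = []
    value = min_val
    while value < step:      # ramp-up phase
        buckets.append(value)
        value *= 2
    value = step
    while value <= max_val:  # stable phase
        buckets.append(value)
        value += step
    return buckets

print(bucket_range(2, 32, 64))  # -> [2, 4, 8, 16, 32, 64]
```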

@@ -214,7 +216,7 @@ If a request exceeds maximum bucket size in any dimension, it will be processed
 As an example, if a request of 3 sequences, with max sequence length of 412, comes in to an idle vLLM server, it will be padded and executed as a `(4, 512)` prefill bucket, as `batch_size` (number of sequences) will be padded to 4 (closest batch_size dimension higher than 3), and max sequence length will be padded to 512 (closest sequence length dimension higher than 412). After the prefill stage, it will be executed as a `(4, 512)` decode bucket and will continue as that bucket until either the batch dimension changes (due to a request being finished) - in which case it will become a `(2, 512)` bucket - or the context length increases above 512 tokens, in which case it will become a `(4, 640)` bucket.
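Editor's aside (not part of this diff): the padding arithmetic above amounts to rounding each dimension up to the nearest bucket boundary. A minimal sketch, with assumed bucket values, is:

```python
def pad_to_bucket(value: int, buckets: list[int]) -> int:
    """Return the smallest bucket boundary that can hold `value`."""
    return next(b for b in sorted(buckets) if b >= value)

batch_buckets = [1, 2, 4, 8, 16]           # assumed batch-size buckets
seq_buckets = list(range(128, 2049, 128))  # assumed sequence-length buckets

# A request of 3 sequences with max sequence length 412 is padded to (4, 512).
print(pad_to_bucket(3, batch_buckets), pad_to_bucket(412, seq_buckets))
```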

 ```{note}
-Bucketing is transparent to a client - padding in sequence length dimension is never returned to the client, and padding in batch dimension does not create new requests.
+Bucketing is transparent to a client -- padding in sequence length dimension is never returned to the client, and padding in batch dimension does not create new requests.
 ```
 
### Warmup
@@ -235,7 +237,7 @@ INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][47/48] batch_size
 INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB
 ```
 
-This example uses the same buckets as in *Bucketing mechanism* section. Each output line corresponds to execution of a single bucket. When bucket is executed for the first time, its graph is compiled and can be reused later on, skipping further graph compilations.
+This example uses the same buckets as in the [Bucketing Mechanism](#gaudi-bucketing-mechanism) section. Each output line corresponds to the execution of a single bucket. When a bucket is executed for the first time, its graph is compiled and can be reused later on, skipping further graph compilations.
 
 ```{tip}
 Compiling all the buckets might take some time and can be turned off with the `VLLM_SKIP_WARMUP=true` environment variable. Keep in mind that if you do that, you may face graph compilations once executing a given bucket for the first time. It is fine to disable warmup for development, but it's highly recommended to enable it in deployment.

docs/source/getting_started/neuron-installation.md

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ Installation steps:
 (build-from-source-neuron)=
 
 ```{note}
-The currently supported version of Pytorch for Neuron installs `triton` version `2.1.0`. This is incompatible with vLLM >= 0.5.3. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run a `pip install --upgrade triton==3.0.0` after installing the vLLM wheel.
+The currently supported version of PyTorch for Neuron installs `triton` version `2.1.0`. This is incompatible with `vllm >= 0.5.3`. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run `pip install --upgrade triton==3.0.0` after installing the vLLM wheel.
 ```
 
 ## Build from source

docs/source/getting_started/quickstart.md

Lines changed: 2 additions & 2 deletions
@@ -114,7 +114,7 @@ $ "temperature": 0
 $ }'
 ```
 
-Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai` python package:
+Since this server is compatible with the OpenAI API, you can use it as a drop-in replacement for any application using the OpenAI API. For example, another way to query the server is via the `openai` Python package:
 
 ```python
 from openai import OpenAI
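Editor's aside (not part of this diff): a minimal sketch of such a query is shown below. The server address, API key, and model name are assumptions; substitute the model you are actually serving.

```python
from openai import OpenAI

# Assumes a vLLM server running locally on port 8000, serving the model below.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # hypothetical model name
    prompt="San Francisco is a",
)
print("Completion result:", completion.choices[0].text)
```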
@@ -151,7 +151,7 @@ $ ]
 $ }'
 ```
 
-Alternatively, you can use the `openai` python package:
+Alternatively, you can use the `openai` Python package:
 
 ```python
 from openai import OpenAI
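Editor's aside (not part of this diff): a chat request via the `openai` package could look roughly like the following, under the same assumptions as the sketch above.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # hypothetical model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me a joke."},
    ],
)
print("Chat response:", chat_response.choices[0].message.content)
```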

docs/source/getting_started/tpu-installation.md

Lines changed: 1 addition & 1 deletion
@@ -103,7 +103,7 @@ Connect to your TPU using SSH:
 gcloud compute tpus tpu-vm ssh TPU_NAME --zone ZONE
 ```
 
-Install Miniconda
+Install Miniconda:
 
 ```bash
 wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

docs/source/models/supported_models.md

Lines changed: 3 additions & 3 deletions
@@ -435,7 +435,7 @@ despite being described otherwise on its model card.
 ```
 
 If your model is not in the above list, we will try to automatically convert the model using
-:func:`vllm.model_executor.models.adapters.as_embedding_model`. By default, the embeddings
+{func}`vllm.model_executor.models.adapters.as_embedding_model`. By default, the embeddings
 of the whole prompt are extracted from the normalized hidden state corresponding to the last token.
 
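Editor's aside (not part of this diff): for illustration only, the adapter named above can be applied to an existing architecture roughly as sketched below; treat the exact class choice and usage as assumptions rather than documented API.

```python
from vllm.model_executor.models.adapters import as_embedding_model
from vllm.model_executor.models.llama import LlamaForCausalLM

# Hypothetical usage: wrap a generative architecture so that it exposes pooled
# embeddings (by default, the normalized hidden state of the last token).
MyLlamaEmbeddingModel = as_embedding_model(LlamaForCausalLM)
```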
#### Reward Modeling (`--task reward`)
@@ -468,7 +468,7 @@ of the whole prompt are extracted from the normalized hidden state corresponding
 ```
 
 If your model is not in the above list, we will try to automatically convert the model using
-:func:`vllm.model_executor.models.adapters.as_reward_model`. By default, we return the hidden states of each token directly.
+{func}`vllm.model_executor.models.adapters.as_reward_model`. By default, we return the hidden states of each token directly.
 
 ```{important}
 For process-supervised reward models such as {code}`peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly,
@@ -500,7 +500,7 @@ e.g.: {code}`--override-pooler-config '{"pooling_type": "STEP", "step_tag_id": 1
 ```
 
 If your model is not in the above list, we will try to automatically convert the model using
-:func:`vllm.model_executor.models.adapters.as_classification_model`. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
+{func}`vllm.model_executor.models.adapters.as_classification_model`. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
 
 #### Sentence Pair Scoring (`--task score`)

docs/source/serving/deploying_with_cerebrium.md

Lines changed: 3 additions & 3 deletions
@@ -33,7 +33,7 @@ docker_base_image_url = "nvidia/cuda:12.1.1-runtime-ubuntu22.04"
 vllm = "latest"
 ```
 
-Next, let us add our code to handle inference for the LLM of your choice(`mistralai/Mistral-7B-Instruct-v0.1` for this example), add the following code to your main.py`:
+Next, let us add our code to handle inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` for this example). Add the following code to your `main.py`:
 
 ```python
 from vllm import LLM, SamplingParams
@@ -55,13 +55,13 @@ def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
 return {"results": results}
 ```
 
-Then, run the following code to deploy it to the cloud
+Then, run the following command to deploy it to the cloud:
 
 ```console
 $ cerebrium deploy
 ```
 
-If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case /run)
+If successful, you should be returned a curl command that you can use to call inference. Just remember to end the URL with the function name you are calling (in our case, `/run`).
 
 ```python
 curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
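Editor's aside (not part of this diff): the `run` handler whose signature and return value appear in the hunk above might be filled in roughly as follows. The body is an assumption based only on those two lines, not the file's actual contents.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")

def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
    # Sample a completion for each prompt and return them as plain dicts.
    sampling_params = SamplingParams(temperature=temperature, top_p=top_p)
    outputs = llm.generate(prompts, sampling_params)
    results = [
        {"prompt": o.prompt, "generated_text": o.outputs[0].text} for o in outputs
    ]
    return {"results": results}
```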

docs/source/serving/deploying_with_dstack.md

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ $ cd vllm-dstack
 $ dstack init
 ```
 
-Next, to provision a VM instance with LLM of your choice(`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:
+Next, to provision a VM instance with the LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:
 
 ```yaml
 type: service

docs/source/serving/distributed_serving.md

Lines changed: 3 additions & 3 deletions
@@ -8,7 +8,7 @@ Before going into the details of distributed inference and serving, let's first
 
 - **Single GPU (no distributed inference)**: If your model fits in a single GPU, you probably don't need to use distributed inference. Just use the single GPU to run the inference.
 - **Single-Node Multi-GPU (tensor parallel inference)**: If your model is too large to fit in a single GPU, but it can fit in a single node with multiple GPUs, you can use tensor parallelism. The tensor parallel size is the number of GPUs you want to use. For example, if you have 4 GPUs in a single node, you can set the tensor parallel size to 4.
-- **Multi-Node Multi-GPU (tensor parallel plus pipeline parallel inference)**: If your model is too large to fit in a single node, you can use tensor parallel together with pipeline parallelism. The tensor parallel size is the number of GPUs you want to use in each node, and the pipeline parallel size is the number of nodes you want to use. For example, if you have 16 GPUs in 2 nodes (8GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2.
+- **Multi-Node Multi-GPU (tensor parallel plus pipeline parallel inference)**: If your model is too large to fit in a single node, you can use tensor parallel together with pipeline parallelism. The tensor parallel size is the number of GPUs you want to use in each node, and the pipeline parallel size is the number of nodes you want to use. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2.
 
 In short, you should increase the number of GPUs and the number of nodes until you have enough GPU memory to hold the model. The tensor parallel size should be the number of GPUs in each node, and the pipeline parallel size should be the number of nodes.
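Editor's aside (not part of this diff): as a quick reference for the single-node multi-GPU case above, tensor parallelism can also be enabled from the Python API; the model name below is only an example.

```python
from vllm import LLM

# Shard the model across 4 GPUs on one node via tensor parallelism.
llm = LLM(model="facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")
```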

@@ -77,15 +77,15 @@ Then you get a ray cluster of containers. Note that you need to keep the shells
 
 Then, on any node, use `docker exec -it node /bin/bash` to enter the container, and execute `ray status` to check the status of the Ray cluster. You should see the right number of nodes and GPUs.
 
-After that, on any node, you can use vLLM as usual, just as you have all the GPUs on one node. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:
+After that, on any node, you can use vLLM as usual, as if you had all the GPUs on one node. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:
 
 ```console
 $ vllm serve /path/to/the/model/in/the/container \
 $ --tensor-parallel-size 8 \
 $ --pipeline-parallel-size 2
 ```
 
-You can also use tensor parallel without pipeline parallel, just set the tensor parallel size to the number of GPUs in the cluster. For example, if you have 16 GPUs in 2 nodes (8GPUs per node), you can set the tensor parallel size to 16:
+You can also use tensor parallel without pipeline parallel; just set the tensor parallel size to the number of GPUs in the cluster. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 16:
 
 ```console
 $ vllm serve /path/to/the/model/in/the/container \

docs/source/serving/runai_model_streamer.md

Lines changed: 1 addition & 1 deletion
@@ -41,7 +41,7 @@ For reading from S3, it will be the number of client instances the host is openi
 $ vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"concurrency":16}'
 ```
 
-You can controls the size of the CPU Memory buffer to which tensors are read from the file, and limit this size.
+You can control the size of the CPU memory buffer into which tensors are read from the file, and limit this size.
 You can read further about CPU buffer memory limiting [here](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md#runai_streamer_memory_limit).
 
 ```console
