
Commit 4cb9dce

DarkLight1337 authored and rasmith committed
[Doc][3/N] Reorganize Serving section (vllm-project#11766)
Signed-off-by: DarkLight1337 <[email protected]>
1 parent b7500cd commit 4cb9dce

40 files changed (+248, -133 lines)

README.md

Lines changed: 1 addition & 1 deletion
@@ -77,7 +77,7 @@ pip install vllm
 Visit our [documentation](https://vllm.readthedocs.io/en/latest/) to learn more.
 - [Installation](https://vllm.readthedocs.io/en/latest/getting_started/installation.html)
 - [Quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html)
-- [Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)
+- [List of Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)

 ## Contributing

docs/source/contributing/dockerfile/dockerfile.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 # Dockerfile

 We provide a <gh-file:Dockerfile> to construct the image for running an OpenAI compatible server with vLLM.
-More information about deploying with Docker can be found [here](../../serving/deploying_with_docker.md).
+More information about deploying with Docker can be found [here](#deployment-docker).

 Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:

docs/source/contributing/model/registration.md

Lines changed: 2 additions & 2 deletions
@@ -3,7 +3,7 @@
 # Model Registration

 vLLM relies on a model registry to determine how to run each model.
-A list of pre-registered architectures can be found on the [Supported Models](#supported-models) page.
+A list of pre-registered architectures can be found [here](#supported-models).

 If your model is not on this list, you must register it to vLLM.
 This page provides detailed instructions on how to do so.
@@ -16,7 +16,7 @@ This gives you the ability to modify the codebase and test your model.
 After you have implemented your model (see [tutorial](#new-model-basic)), put it into the <gh-dir:vllm/model_executor/models> directory.
 Then, add your model class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is automatically registered upon importing vLLM.
 You should also include an example HuggingFace repository for this model in <gh-file:tests/models/registry.py> to run the unit tests.
-Finally, update the [Supported Models](#supported-models) documentation page to promote your model!
+Finally, update our [list of supported models](#supported-models) to promote your model!

 ```{important}
 The list of models in each section should be maintained in alphabetical order.
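
For models maintained outside the vLLM tree, the same registry can also be populated at runtime instead of editing `_VLLM_MODELS`; a minimal sketch, assuming a hypothetical package `your_package` that provides a `YourModelForCausalLM` implementation of vLLM's model interface:

```python
from vllm import ModelRegistry

# Hypothetical out-of-tree model class implementing vLLM's model interface.
from your_package.modeling import YourModelForCausalLM

# Map the architecture name (as it appears in the HF config's `architectures`
# field) to the class so vLLM can resolve it when loading the model.
ModelRegistry.register_model("YourModelForCausalLM", YourModelForCausalLM)
```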

docs/source/serving/deploying_with_docker.md renamed to docs/source/deployment/docker.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
-(deploying-with-docker)=
+(deployment-docker)=

-# Deploying with Docker
+# Using Docker

 ## Use vLLM's Official Docker Image

docs/source/serving/deploying_with_bentoml.md renamed to docs/source/deployment/frameworks/bentoml.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
-(deploying-with-bentoml)=
+(deployment-bentoml)=

-# Deploying with BentoML
+# BentoML

 [BentoML](https://github.com/bentoml/BentoML) allows you to deploy a large language model (LLM) server with vLLM as the backend, which exposes OpenAI-compatible endpoints. You can serve the model locally or containerize it as an OCI-compliant image and deploy it on Kubernetes.

docs/source/serving/deploying_with_cerebrium.md renamed to docs/source/deployment/frameworks/cerebrium.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
-(deploying-with-cerebrium)=
+(deployment-cerebrium)=

-# Deploying with Cerebrium
+# Cerebrium

 ```{raw} html
 <p align="center">

docs/source/serving/deploying_with_dstack.md renamed to docs/source/deployment/frameworks/dstack.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
-(deploying-with-dstack)=
+(deployment-dstack)=

-# Deploying with dstack
+# dstack

 ```{raw} html
 <p align="center">

docs/source/serving/deploying_with_helm.md renamed to docs/source/deployment/frameworks/helm.md

Lines changed: 3 additions & 3 deletions
@@ -1,6 +1,6 @@
-(deploying-with-helm)=
+(deployment-helm)=

-# Deploying with Helm
+# Helm

 A Helm chart to deploy vLLM for Kubernetes
@@ -38,7 +38,7 @@ chart **including persistent volumes** and deletes the release.

 ## Architecture

-```{image} architecture_helm_deployment.png
+```{image} /assets/deployment/architecture_helm_deployment.png
 ```

 ## Values

docs/source/deployment/frameworks/index.md

Lines changed: 13 additions & 0 deletions

@@ -0,0 +1,13 @@
+# Using other frameworks
+
+```{toctree}
+:maxdepth: 1
+
+bentoml
+cerebrium
+dstack
+helm
+lws
+skypilot
+triton
+```

docs/source/serving/deploying_with_lws.md renamed to docs/source/deployment/frameworks/lws.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
-(deploying-with-lws)=
+(deployment-lws)=

-# Deploying with LWS
+# LWS

 LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads.
 A major use case is for multi-host/multi-node distributed inference.

docs/source/serving/run_on_sky.md renamed to docs/source/deployment/frameworks/skypilot.md

Lines changed: 4 additions & 4 deletions
@@ -1,6 +1,6 @@
-(on-cloud)=
+(deployment-skypilot)=

-# Deploying and scaling up with SkyPilot
+# SkyPilot

 ```{raw} html
 <p align="center">
@@ -12,9 +12,9 @@ vLLM can be **run and scaled to multiple service replicas on clouds and Kubernet

 ## Prerequisites

-- Go to the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and request access to the model {code}`meta-llama/Meta-Llama-3-8B-Instruct`.
+- Go to the [HuggingFace model page](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) and request access to the model `meta-llama/Meta-Llama-3-8B-Instruct`.
 - Check that you have installed SkyPilot ([docs](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html)).
-- Check that {code}`sky check` shows clouds or Kubernetes are enabled.
+- Check that `sky check` shows clouds or Kubernetes are enabled.

 ```console
 pip install skypilot-nightly

docs/source/deployment/frameworks/triton.md

Lines changed: 2 additions & 2 deletions

@@ -1,5 +1,5 @@
-(deploying-with-triton)=
+(deployment-triton)=

-# Deploying with NVIDIA Triton
+# NVIDIA Triton

 The [Triton Inference Server](https://github.com/triton-inference-server) hosts a tutorial demonstrating how to quickly deploy a simple [facebook/opt-125m](https://huggingface.co/facebook/opt-125m) model using vLLM. Please see [Deploying a vLLM model in Triton](https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md#deploying-a-vllm-model-in-triton) for more details.

docs/source/deployment/integrations/index.md

Lines changed: 9 additions & 0 deletions

@@ -0,0 +1,9 @@
+# External Integrations
+
+```{toctree}
+:maxdepth: 1
+
+kserve
+kubeai
+llamastack
+```

docs/source/serving/deploying_with_kserve.md renamed to docs/source/deployment/integrations/kserve.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
-(deploying-with-kserve)=
+(deployment-kserve)=

-# Deploying with KServe
+# KServe

 vLLM can be deployed with [KServe](https://github.com/kserve/kserve) on Kubernetes for highly scalable distributed model serving.

docs/source/serving/deploying_with_kubeai.md renamed to docs/source/deployment/integrations/kubeai.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
-(deploying-with-kubeai)=
+(deployment-kubeai)=

-# Deploying with KubeAI
+# KubeAI

 [KubeAI](https://github.com/substratusai/kubeai) is a Kubernetes operator that enables you to deploy and manage AI models on Kubernetes. It provides a simple and scalable way to deploy vLLM in production. Functionality such as scale-from-zero, load based autoscaling, model caching, and much more is provided out of the box with zero external dependencies.

docs/source/serving/serving_with_llamastack.md renamed to docs/source/deployment/integrations/llamastack.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
-(run-on-llamastack)=
+(deployment-llamastack)=

-# Serving with Llama Stack
+# Llama Stack

 vLLM is also available via [Llama Stack](https://github.com/meta-llama/llama-stack).

docs/source/serving/deploying_with_k8s.md renamed to docs/source/deployment/k8s.md

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
-(deploying-with-k8s)=
+(deployment-k8s)=

-# Deploying with Kubernetes
+# Using Kubernetes

 Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.

docs/source/serving/deploying_with_nginx.md renamed to docs/source/deployment/nginx.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 (nginxloadbalancer)=

-# Deploying with Nginx Loadbalancer
+# Using Nginx

 This document shows how to launch multiple vLLM serving containers and use Nginx to act as a load balancer between the servers.

docs/source/design/arch_overview.md

Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@ More API details can be found in the {doc}`Offline Inference

 The code for the `LLM` class can be found in <gh-file:vllm/entrypoints/llm.py>.

-### OpenAI-compatible API server
+### OpenAI-Compatible API Server

 The second primary interface to vLLM is via its OpenAI-compatible API server.
 This server can be started using the `vllm serve` command.
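
Because that server exposes OpenAI-compatible endpoints, a standard OpenAI client can be pointed at it; a rough sketch, assuming a server already started locally with `vllm serve` and an illustrative model name:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server; vLLM only checks
# the API key if the server was started with one, so a placeholder works here.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative model name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(completion.choices[0].message.content)
```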

docs/source/features/disagg_prefill.md

Lines changed: 6 additions & 2 deletions
@@ -1,8 +1,12 @@
 (disagg-prefill)=

-# Disaggregated prefilling (experimental)
+# Disaggregated Prefilling (experimental)

-This page introduces you to the disaggregated prefilling feature in vLLM. This feature is experimental and subject to change.
+This page introduces you to the disaggregated prefilling feature in vLLM.
+
+```{note}
+This feature is experimental and subject to change.
+```

 ## Why disaggregated prefilling?

docs/source/features/spec_decode.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 (spec-decode)=

-# Speculative decoding
+# Speculative Decoding

 ```{warning}
 Please note that speculative decoding in vLLM is not yet optimized and does

docs/source/getting_started/installation/gpu-rocm.md

Lines changed: 1 addition & 1 deletion
@@ -148,7 +148,7 @@ $ export PYTORCH_ROCM_ARCH="gfx90a;gfx942"
 $ python3 setup.py develop
 ```

-This may take 5-10 minutes. Currently, {code}`pip install .` does not work for ROCm installation.
+This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation.

 ```{tip}
 - Triton flash attention is used by default. For benchmarking purposes, it is recommended to run a warm up step before collecting perf numbers.

docs/source/getting_started/installation/hpu-gaudi.md

Lines changed: 1 addition & 1 deletion
@@ -82,7 +82,7 @@ $ python setup.py develop

 ## Supported Features

-- [Offline batched inference](#offline-batched-inference)
+- [Offline inference](#offline-inference)
 - Online inference via [OpenAI-Compatible Server](#openai-compatible-server)
 - HPU autodetection - no need to manually select device within vLLM
 - Paged KV cache with algorithms enabled for Intel Gaudi accelerators

docs/source/getting_started/quickstart.md

Lines changed: 10 additions & 8 deletions
@@ -2,30 +2,32 @@

 # Quickstart

-This guide will help you quickly get started with vLLM to:
+This guide will help you quickly get started with vLLM to perform:

-- [Run offline batched inference](#offline-batched-inference)
-- [Run OpenAI-compatible inference](#openai-compatible-server)
+- [Offline batched inference](#quickstart-offline)
+- [Online inference using OpenAI-compatible server](#quickstart-online)

 ## Prerequisites

 - OS: Linux
 - Python: 3.9 -- 3.12
-- GPU: compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100, etc.)

 ## Installation

-You can install vLLM using pip. It's recommended to use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments.
+If you are using NVIDIA GPUs, you can install vLLM using [pip](https://pypi.org/project/vllm/) directly.
+It's recommended to use [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html) to create and manage Python environments.

 ```console
 $ conda create -n myenv python=3.10 -y
 $ conda activate myenv
 $ pip install vllm
 ```

-Please refer to the [installation documentation](#installation-index) for more details on installing vLLM.
+```{note}
+For non-CUDA platforms, please refer [here](#installation-index) for specific instructions on how to install vLLM.
+```

-(offline-batched-inference)=
+(quickstart-offline)=

 ## Offline Batched Inference

@@ -73,7 +75,7 @@ for output in outputs:
     print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
 ```

-(openai-compatible-server)=
+(quickstart-online)=

 ## OpenAI-Compatible Server
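
For reference, the offline path described above boils down to constructing an `LLM` and calling `generate` on a batch of prompts; a minimal sketch (model choice and sampling values are illustrative):

```python
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The capital of France is",
]
# Sampling settings are illustrative; the defaults also work.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Load the model, then generate completions for the whole batch in one call.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```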

docs/source/index.md

Lines changed: 28 additions & 21 deletions
@@ -65,32 +65,14 @@ getting_started/troubleshooting
 getting_started/faq
 ```

-```{toctree}
-:caption: Serving
-:maxdepth: 1
-
-serving/openai_compatible_server
-serving/deploying_with_docker
-serving/deploying_with_k8s
-serving/deploying_with_helm
-serving/deploying_with_nginx
-serving/distributed_serving
-serving/metrics
-serving/integrations
-serving/tensorizer
-serving/runai_model_streamer
-serving/engine_args
-serving/env_vars
-serving/usage_stats
-```
-
 ```{toctree}
 :caption: Models
 :maxdepth: 1

-models/supported_models
 models/generative_models
 models/pooling_models
+models/supported_models
+models/extensions/index
 ```

 ```{toctree}
@@ -99,7 +81,6 @@ models/pooling_models

 features/quantization/index
 features/lora
-features/multimodal_inputs
 features/tool_calling
 features/structured_outputs
 features/automatic_prefix_caching
@@ -108,6 +89,32 @@ features/spec_decode
 features/compatibility_matrix
 ```

+```{toctree}
+:caption: Inference and Serving
+:maxdepth: 1
+
+serving/offline_inference
+serving/openai_compatible_server
+serving/multimodal_inputs
+serving/distributed_serving
+serving/metrics
+serving/engine_args
+serving/env_vars
+serving/usage_stats
+serving/integrations/index
+```
+
+```{toctree}
+:caption: Deployment
+:maxdepth: 1
+
+deployment/docker
+deployment/k8s
+deployment/nginx
+deployment/frameworks/index
+deployment/integrations/index
+```
+
 ```{toctree}
 :caption: Performance
 :maxdepth: 1

docs/source/models/extensions/index.md

Lines changed: 8 additions & 0 deletions

@@ -0,0 +1,8 @@
+# Built-in Extensions
+
+```{toctree}
+:maxdepth: 1
+
+runai_model_streamer
+tensorizer
+```

docs/source/serving/runai_model_streamer.md renamed to docs/source/models/extensions/runai_model_streamer.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 (runai-model-streamer)=

-# Loading Models with Run:ai Model Streamer
+# Loading models with Run:ai Model Streamer

 Run:ai Model Streamer is a library to read tensors in concurrency, while streaming it to GPU memory.
 Further reading can be found in [Run:ai Model Streamer Documentation](https://github.com/run-ai/runai-model-streamer/blob/master/docs/README.md).

docs/source/serving/tensorizer.md renamed to docs/source/models/extensions/tensorizer.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 (tensorizer)=

-# Loading Models with CoreWeave's Tensorizer
+# Loading models with CoreWeave's Tensorizer

 vLLM supports loading models with [CoreWeave's Tensorizer](https://docs.coreweave.com/coreweave-machine-learning-and-ai/inference/tensorizer).
 vLLM model tensors that have been serialized to disk, an HTTP/HTTPS endpoint, or S3 endpoint can be deserialized

0 commit comments