Commit 299f02a

Move Community and API Reference to the bottom
Signed-off-by: DarkLight1337 <[email protected]>
1 parent 65097ca · commit 299f02a

File tree

3 files changed: 40 additions & 26 deletions

README.md

Lines changed: 1 addition & 1 deletion
@@ -41,7 +41,7 @@ vLLM is a fast and easy-to-use library for LLM inference and serving.
 vLLM is fast with:
 
 - State-of-the-art serving throughput
-- Efficient management of attention key and value memory with **PagedAttention**
+- Efficient management of attention key and value memory with [**PagedAttention**](https://vllm.ai)
 - Continuous batching of incoming requests
 - Fast model execution with CUDA/HIP graph
 - Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8.

docs/source/design/automatic_prefix_caching.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 
 # Automatic Prefix Caching
 
-The core idea of [PagedAttention](#design-paged-attention) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.
+The core idea of [PagedAttention](https://vllm.ai) is to partition the KV cache of each request into KV Blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory so that we can eliminate memory fragmentation by allocating the memory on demand.
 
 To automatically cache the KV cache, we utilize the following key observation: Each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block.
 
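As a quick illustration of the key observation in the paragraph above, here is a minimal Python sketch of identifying a KV block by hashing the prefix tokens before it together with the tokens inside it. This is not vLLM's actual implementation; the block size and function names are assumptions made for illustration.

```python
# Minimal sketch (not vLLM's implementation): identify each KV block by the
# tokens in the prefix before the block plus the tokens within the block.
import hashlib
from typing import List, Sequence

BLOCK_SIZE = 16  # tokens per KV block; illustrative value


def block_id(prefix_tokens: Sequence[int], block_tokens: Sequence[int]) -> str:
    """Return a stable identifier derived from the prefix and the block's own tokens."""
    payload = ",".join(map(str, prefix_tokens)) + "|" + ",".join(map(str, block_tokens))
    return hashlib.sha256(payload.encode()).hexdigest()


def block_ids(token_ids: Sequence[int]) -> List[str]:
    """Split a token sequence into fixed-size blocks and identify each one."""
    ids = []
    for start in range(0, len(token_ids), BLOCK_SIZE):
        ids.append(block_id(token_ids[:start], token_ids[start:start + BLOCK_SIZE]))
    return ids


# Two requests that share a prompt prefix produce identical identifiers for the
# shared leading blocks, so their cached KV blocks can be looked up and reused.
```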

docs/source/index.md

Lines changed: 38 additions & 24 deletions
@@ -26,7 +26,7 @@ vLLM is a fast and easy-to-use library for LLM inference and serving.
 vLLM is fast with:
 
 - State-of-the-art serving throughput
-- Efficient management of attention key and value memory with **PagedAttention**
+- Efficient management of attention key and value memory with [**PagedAttention**](https://vllm.ai)
 - Continuous batching of incoming requests
 - Fast model execution with CUDA/HIP graph
 - Quantization: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), INT4, INT8, and FP8
@@ -54,6 +54,8 @@ For more information, check out the following:
 
 ## Documentation
 
+% How to start using vLLM?
+
 ```{toctree}
 :caption: Getting Started
 :maxdepth: 1
@@ -65,6 +67,8 @@ getting_started/troubleshooting
 getting_started/faq
 ```
 
+% What does vLLM support?
+
 ```{toctree}
 :caption: Models
 :maxdepth: 1
@@ -75,6 +79,8 @@ models/supported_models
 models/extensions/index
 ```
 
+% Additional capabilities
+
 ```{toctree}
 :caption: Features
 :maxdepth: 1
@@ -89,6 +95,8 @@ features/spec_decode
 features/compatibility_matrix
 ```
 
+% Details about running vLLM
+
 ```{toctree}
 :caption: Inference and Serving
 :maxdepth: 1
@@ -104,6 +112,8 @@ serving/usage_stats
 serving/integrations/index
 ```
 
+% Scaling up vLLM for production
+
 ```{toctree}
 :caption: Deployment
 :maxdepth: 1
@@ -115,6 +125,8 @@ deployment/frameworks/index
 deployment/integrations/index
 ```
 
+% Making the most out of vLLM
+
 ```{toctree}
 :caption: Performance
 :maxdepth: 1
@@ -123,28 +135,7 @@ performance/optimization
 performance/benchmarks
 ```
 
-% Community: User community resources
-
-```{toctree}
-:caption: Community
-:maxdepth: 1
-
-community/meetups
-community/sponsors
-```
-
-```{toctree}
-:caption: API Reference
-:maxdepth: 2
-
-api/offline_inference/index
-api/engine/index
-api/inference_params
-api/multimodal/index
-api/model/index
-```
-
-% Design Documents: Details about vLLM internals
+% Explanation of vLLM internals
 
 ```{toctree}
 :caption: Design Documents
@@ -159,7 +150,7 @@ design/automatic_prefix_caching
 design/multiprocessing
 ```
 
-% Developer Guide: How to contribute to the vLLM project
+% How to contribute to the vLLM project
 
 ```{toctree}
 :caption: Developer Guide
@@ -172,6 +163,29 @@ contributing/model/index
 contributing/vulnerability_management
 ```
 
+% Technical API specifications
+
+```{toctree}
+:caption: API Reference
+:maxdepth: 2
+
+api/offline_inference/index
+api/engine/index
+api/inference_params
+api/multimodal/index
+api/model/index
+```
+
+% Latest news and acknowledgements
+
+```{toctree}
+:caption: Community
+:maxdepth: 1
+
+community/meetups
+community/sponsors
+```
+
 # Indices and tables
 
 - {ref}`genindex`
