
Commit 02b0c09

DarkLight1337 authored and mzusman committed
[Doc] Improve GitHub links (vllm-project#11491)
Signed-off-by: DarkLight1337 <[email protected]>
1 parent 57d0bea commit 02b0c09

Showing 31 changed files with 147 additions and 136 deletions.

docs/source/conf.py

Lines changed: 29 additions & 0 deletions
@@ -74,6 +74,35 @@
 html_static_path = ["_static"]
 html_js_files = ["custom.js"]
 
+myst_url_schemes = {
+    'http': None,
+    'https': None,
+    'mailto': None,
+    'ftp': None,
+    "gh-issue": {
+        "url":
+        "https://github.com/vllm-project/vllm/issues/{{path}}#{{fragment}}",
+        "title": "Issue #{{path}}",
+        "classes": ["github"],
+    },
+    "gh-pr": {
+        "url":
+        "https://github.com/vllm-project/vllm/pull/{{path}}#{{fragment}}",
+        "title": "Pull Request #{{path}}",
+        "classes": ["github"],
+    },
+    "gh-dir": {
+        "url": "https://github.com/vllm-project/vllm/tree/main/{{path}}",
+        "title": "{{path}}",
+        "classes": ["github"],
+    },
+    "gh-file": {
+        "url": "https://github.com/vllm-project/vllm/blob/main/{{path}}",
+        "title": "{{path}}",
+        "classes": ["github"],
+    },
+}
+
 # see https://docs.readthedocs.io/en/stable/reference/environment-variables.html # noqa
 READTHEDOCS_VERSION_TYPE = os.environ.get('READTHEDOCS_VERSION_TYPE')
 if READTHEDOCS_VERSION_TYPE == "tag":
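With these schemes registered, MyST expands the short `gh-*` links used throughout the rest of this commit into full GitHub URLs with readable titles. A rough usage sketch based on the configuration above (the targets in the comments are how the links should resolve; exact rendering depends on the MyST parser version):

```markdown
See <gh-file:Dockerfile> for the image build.      <!-- blob/main/Dockerfile, titled "Dockerfile" -->
Browse <gh-dir:examples> for runnable scripts.     <!-- tree/main/examples, titled "examples" -->
This was fixed in <gh-pr:6759>.                    <!-- pull/6759, titled "Pull Request #6759" -->
See <gh-issue:5723> for the discussion.            <!-- issues/5723, titled "Issue #5723" -->
```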

docs/source/contributing/dockerfile/dockerfile.md

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 # Dockerfile
 
-See [here](https://github.com/vllm-project/vllm/blob/main/Dockerfile) for the main Dockerfile to construct
-the image for running an OpenAI compatible server with vLLM. More information about deploying with Docker can be found [here](https://docs.vllm.ai/en/stable/serving/deploying_with_docker.html).
+We provide a <gh-file:Dockerfile> to construct the image for running an OpenAI compatible server with vLLM.
+More information about deploying with Docker can be found [here](../../serving/deploying_with_docker.md).
 
 Below is a visual representation of the multi-stage Dockerfile. The build graph contains the following nodes:
 

docs/source/contributing/overview.md

Lines changed: 7 additions & 7 deletions
@@ -13,11 +13,12 @@ Finally, one of the most impactful ways to support us is by raising awareness ab
 
 ## License
 
-See [LICENSE](https://github.com/vllm-project/vllm/tree/main/LICENSE).
+See <gh-file:LICENSE>.
 
 ## Developing
 
-Depending on the kind of development you'd like to do (e.g. Python, CUDA), you can choose to build vLLM with or without compilation. Check out the [building from source](https://docs.vllm.ai/en/latest/getting_started/installation.html#build-from-source) documentation for details.
+Depending on the kind of development you'd like to do (e.g. Python, CUDA), you can choose to build vLLM with or without compilation.
+Check out the [building from source](#build-from-source) documentation for details.
 
 ## Testing
 
@@ -43,7 +44,7 @@ Currently, the repository does not pass the `mypy` tests.
 If you encounter a bug or have a feature request, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
 
 ```{important}
-If you discover a security vulnerability, please follow the instructions [here](https://github.com/vllm-project/vllm/tree/main/SECURITY.md#reporting-a-vulnerability).
+If you discover a security vulnerability, please follow the instructions [here](gh-file:SECURITY.md#reporting-a-vulnerability).
 ```
 
 ## Pull Requests & Code Reviews
@@ -54,9 +55,9 @@ code quality and improve the efficiency of the review process.
 
 ### DCO and Signed-off-by
 
-When contributing changes to this project, you must agree to the [DCO](https://github.com/vllm-project/vllm/tree/main/DCO).
+When contributing changes to this project, you must agree to the <gh-file:DCO>.
 Commits must include a `Signed-off-by:` header which certifies agreement with
-the terms of the [DCO](https://github.com/vllm-project/vllm/tree/main/DCO).
+the terms of the DCO.
 
 Using `-s` with `git commit` will automatically add this header.
 
@@ -89,8 +90,7 @@ If the PR spans more than one category, please include all relevant prefixes.
 The PR needs to meet the following code quality standards:
 
 - We adhere to [Google Python style guide](https://google.github.io/styleguide/pyguide.html) and [Google C++ style guide](https://google.github.io/styleguide/cppguide.html).
-- Pass all linter checks. Please use [format.sh](https://github.com/vllm-project/vllm/blob/main/format.sh) to format your
-  code.
+- Pass all linter checks. Please use <gh-file:format.sh> to format your code.
 - The code needs to be well-documented to ensure future contributors can easily
   understand the code.
 - Include sufficient tests to ensure the project stays correct and robust. This

docs/source/contributing/profiling/profiling_index.md

Lines changed: 4 additions & 4 deletions
@@ -22,13 +22,13 @@ Set the env variable VLLM_RPC_TIMEOUT to a big number before you start the serve
 `export VLLM_RPC_TIMEOUT=1800000`
 ```
 
-## Example commands and usage:
+## Example commands and usage
 
-### Offline Inference:
+### Offline Inference
 
-Refer to [examples/offline_inference_with_profiler.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_with_profiler.py) for an example.
+Refer to <gh-file:examples/offline_inference_with_profiler.py> for an example.
 
-### OpenAI Server:
+### OpenAI Server
 
 ```bash
 VLLM_TORCH_PROFILER_DIR=./vllm_profile python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B

docs/source/design/arch_overview.md

Lines changed: 6 additions & 11 deletions
@@ -55,7 +55,7 @@ for output in outputs:
 More API details can be found in the {doc}`Offline Inference
 </dev/offline_inference/offline_index>` section of the API docs.
 
-The code for the `LLM` class can be found in [vllm/entrypoints/llm.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/llm.py).
+The code for the `LLM` class can be found in <gh-file:vllm/entrypoints/llm.py>.
 
 ### OpenAI-compatible API server
 
@@ -66,7 +66,7 @@ This server can be started using the `vllm serve` command.
 vllm serve <model>
 ```
 
-The code for the `vllm` CLI can be found in [vllm/scripts.py](https://github.com/vllm-project/vllm/blob/main/vllm/scripts.py).
+The code for the `vllm` CLI can be found in <gh-file:vllm/scripts.py>.
 
 Sometimes you may see the API server entrypoint used directly instead of via the
 `vllm` CLI command. For example:
@@ -75,7 +75,7 @@ Sometimes you may see the API server entrypoint used directly instead of via the
 python -m vllm.entrypoints.openai.api_server --model <model>
 ```
 
-That code can be found in [vllm/entrypoints/openai/api_server.py](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py).
+That code can be found in <gh-file:vllm/entrypoints/openai/api_server.py>.
 
 More details on the API server can be found in the {doc}`OpenAI Compatible
 Server </serving/openai_compatible_server>` document.
@@ -105,7 +105,7 @@ processing.
 - **Output Processing**: Processes the outputs generated by the model, decoding the
   token IDs from a language model into human-readable text.
 
-The code for `LLMEngine` can be found in [vllm/engine/llm_engine.py].
+The code for `LLMEngine` can be found in <gh-file:vllm/engine/llm_engine.py>.
 
 ### AsyncLLMEngine
 
@@ -115,10 +115,9 @@ incoming requests. The `AsyncLLMEngine` is designed for online serving, where it
 can handle multiple concurrent requests and stream outputs to clients.
 
 The OpenAI-compatible API server uses the `AsyncLLMEngine`. There is also a demo
-API server that serves as a simpler example in
-[vllm/entrypoints/api_server.py].
+API server that serves as a simpler example in <gh-file:vllm/entrypoints/api_server.py>.
 
-The code for `AsyncLLMEngine` can be found in [vllm/engine/async_llm_engine.py].
+The code for `AsyncLLMEngine` can be found in <gh-file:vllm/engine/async_llm_engine.py>.
 
 ## Worker
 
@@ -252,7 +251,3 @@ big problem.
 
 In summary, the complete config object `VllmConfig` can be treated as an
 engine-level global state that is shared among all vLLM classes.
-
-[vllm/engine/async_llm_engine.py]: https://github.com/vllm-project/vllm/tree/main/vllm/engine/async_llm_engine.py
-[vllm/engine/llm_engine.py]: https://github.com/vllm-project/vllm/tree/main/vllm/engine/llm_engine.py
-[vllm/entrypoints/api_server.py]: https://github.com/vllm-project/vllm/tree/main/vllm/entrypoints/api_server.py

docs/source/design/multiprocessing.md

Lines changed: 14 additions & 13 deletions
@@ -2,13 +2,14 @@
 
 ## Debugging
 
-Please see the [Debugging
-Tips](https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing)
+Please see the [Debugging Tips](#debugging-python-multiprocessing)
 page for information on known issues and how to solve them.
 
 ## Introduction
 
-*Note that source code references are to the state of the code at the time of writing in December, 2024.*
+```{important}
+The source code references are to the state of the code at the time of writing in December, 2024.
+```
 
 The use of Python multiprocessing in vLLM is complicated by:
 
@@ -20,7 +21,7 @@ This document describes how vLLM deals with these challenges.
 
 ## Multiprocessing Methods
 
-[Python multiprocessing methods](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods) include:
+[Python multiprocessing methods](https://docs.python.org/3/library/multiprocessing.html.md#contexts-and-start-methods) include:
 
 - `spawn` - spawn a new Python process. This will be the default as of Python
   3.14.
@@ -82,7 +83,7 @@ There are other miscellaneous places hard-coding the use of `spawn`:
 
 Related PRs:
 
-- <https://github.com/vllm-project/vllm/pull/8823>
+- <gh-pr:8823>
 
 ## Prior State in v1
 
@@ -96,7 +97,7 @@ engine core.
 
 - <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/v1/engine/llm_engine.py#L93-L95>
 - <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/v1/engine/llm_engine.py#L70-L77>
-- https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/v1/engine/core_client.py#L44-L45
+- <https://github.com/vllm-project/vllm/blob/d05f88679bedd73939251a17c3d785a354b2946c/vllm/v1/engine/core_client.py#L44-L45>
 
 It was off by default for all the reasons mentioned above - compatibility with
 dependencies and code using vLLM as a library.
@@ -119,17 +120,17 @@ instruct users to either add a `__main__` guard or to disable multiprocessing.
 If that known-failure case occurs, the user will see two messages that explain
 what is happening. First, a log message from vLLM:
 
-```
-WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
-initialized. We must use the `spawn` multiprocessing start method. Setting
-VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
-https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing
-for more information.
+```console
+WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
+initialized. We must use the `spawn` multiprocessing start method. Setting
+VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
+https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing
+for more information.
 ```
 
 Second, Python itself will raise an exception with a nice explanation:
 
-```
+```console
 RuntimeError:
 An attempt has been made to start a new process before the
 current process has finished its bootstrapping phase.

docs/source/generate_examples.py

Lines changed: 1 addition & 2 deletions
@@ -36,11 +36,10 @@ def generate_examples():
 
     # Generate the example docs for each example script
     for script_path, doc_path in zip(script_paths, doc_paths):
-        script_url = f"https://github.com/vllm-project/vllm/blob/main/examples/{script_path.name}"
         # Make script_path relative to doc_path and call it include_path
         include_path = '../../../..' / script_path.relative_to(root_dir)
         content = (f"{generate_title(doc_path.stem)}\n\n"
-                   f"Source: <{script_url}>.\n\n"
+                   f"Source: <gh-file:examples/{script_path.name}>.\n\n"
                    f"```{{literalinclude}} {include_path}\n"
                    ":language: python\n"
                    ":linenos:\n```")
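For illustration, with this change the generated page for a script such as examples/offline_inference.py would look roughly like the following (assuming `generate_title` produces a title-cased heading from the file stem):

````markdown
# Offline Inference

Source: <gh-file:examples/offline_inference.py>.

```{literalinclude} ../../../../examples/offline_inference.py
:language: python
:linenos:
```
````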

docs/source/getting_started/amd-installation.md

Lines changed: 2 additions & 2 deletions
@@ -22,7 +22,7 @@ Installation options:
 
 You can build and install vLLM from source.
 
-First, build a docker image from [Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/Dockerfile.rocm) and launch a docker container from the image.
+First, build a docker image from <gh-file:Dockerfile.rocm> and launch a docker container from the image.
 It is important that the user kicks off the docker build using buildkit. Either the user put DOCKER_BUILDKIT=1 as environment variable when calling docker build command, or the user needs to setup buildkit in the docker daemon configuration /etc/docker/daemon.json as follows and restart the daemon:
 
 ```console
@@ -33,7 +33,7 @@ It is important that the user kicks off the docker build using buildkit. Either
 }
 ```
 
-[Dockerfile.rocm](https://github.com/vllm-project/vllm/blob/main/Dockerfile.rocm) uses ROCm 6.2 by default, but also supports ROCm 5.7, 6.0 and 6.1 in older vLLM branches.
+<gh-file:Dockerfile.rocm> uses ROCm 6.2 by default, but also supports ROCm 5.7, 6.0 and 6.1 in older vLLM branches.
 It provides flexibility to customize the build of docker image using the following arguments:
 
 - `BASE_IMAGE`: specifies the base image used when running `docker build`, specifically the PyTorch on ROCm base image.

docs/source/getting_started/cpu-installation.md

Lines changed: 2 additions & 2 deletions
@@ -145,10 +145,10 @@ $ python examples/offline_inference.py
 
 - On CPU based setup with NUMA enabled, the memory access performance may be largely impacted by the [topology](https://github.com/intel/intel-extension-for-pytorch/blob/main/docs/tutorials/performance_tuning/tuning_guide.md#non-uniform-memory-access-numa). For NUMA architecture, two optimizations are to recommended: Tensor Parallel or Data Parallel.
 
-- Using Tensor Parallel for a latency constraints deployment: following GPU backend design, a Megatron-LM's parallel algorithm will be used to shard the model, based on the number of NUMA nodes (e.g. TP = 2 for a two NUMA node system). With [TP feature on CPU](https://github.com/vllm-project/vllm/pull/6125) merged, Tensor Parallel is supported for serving and offline inferencing. In general each NUMA node is treated as one GPU card. Below is the example script to enable Tensor Parallel = 2 for serving:
+- Using Tensor Parallel for a latency constraints deployment: following GPU backend design, a Megatron-LM's parallel algorithm will be used to shard the model, based on the number of NUMA nodes (e.g. TP = 2 for a two NUMA node system). With [TP feature on CPU](gh-pr:6125) merged, Tensor Parallel is supported for serving and offline inferencing. In general each NUMA node is treated as one GPU card. Below is the example script to enable Tensor Parallel = 2 for serving:
 
 ```console
 $ VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp
 ```
 
-- Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like [Nginx](../serving/deploying_with_nginx) or HAProxy are recommended. Anyscale Ray project provides the feature on LLM [serving](https://docs.ray.io/en/latest/serve/index.html). Here is the example to setup a scalable LLM serving with [Ray Serve](https://github.com/intel/llm-on-ray/blob/main/docs/setup.md).
+- Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like [Nginx](../serving/deploying_with_nginx.md) or HAProxy are recommended. Anyscale Ray project provides the feature on LLM [serving](https://docs.ray.io/en/latest/serve/index.html). Here is the example to setup a scalable LLM serving with [Ray Serve](https://github.com/intel/llm-on-ray/blob/main/docs/setup.md).

docs/source/getting_started/debugging.md

Lines changed: 4 additions & 3 deletions
@@ -24,7 +24,7 @@ To isolate the model downloading and loading issue, you can use the `--load-form
 
 ## Model is too large
 
-If the model is too large to fit in a single GPU, you might want to [consider tensor parallelism](https://docs.vllm.ai/en/latest/serving/distributed_serving.html#distributed-inference-and-serving) to split the model across multiple GPUs. In that case, every process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). You can convert the model checkpoint to a sharded checkpoint using [this example](https://docs.vllm.ai/en/latest/getting_started/examples/save_sharded_state.html) . The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
+If the model is too large to fit in a single GPU, you might want to [consider tensor parallelism](#distributed-serving) to split the model across multiple GPUs. In that case, every process will read the whole model and split it into chunks, which makes the disk reading time even longer (proportional to the size of tensor parallelism). You can convert the model checkpoint to a sharded checkpoint using <gh-file:examples/save_sharded_state.py>. The conversion process might take some time, but later you can load the sharded checkpoint much faster. The model loading time should remain constant regardless of the size of tensor parallelism.
 
 ## Enable more logging
 
@@ -139,6 +139,7 @@ A multi-node environment is more complicated than a single-node one. If you see
 Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
 ```
 
+(debugging-python-multiprocessing)=
 ## Python multiprocessing
 
 ### `RuntimeError` Exception
@@ -195,5 +196,5 @@ if __name__ == '__main__':
 
 ## Known Issues
 
-- In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000) , which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](https://github.com/vllm-project/vllm/pull/6759).
-- To circumvent a NCCL [bug](https://github.com/NVIDIA/nccl/issues/1234) , all vLLM processes will set an environment variable ``NCCL_CUMEM_ENABLE=0`` to disable NCCL's ``cuMem`` allocator. It does not affect performance but only gives memory benefits. When external processes want to set up a NCCL connection with vLLM's processes, they should also set this environment variable, otherwise, inconsistent environment setup will cause NCCL to hang or crash, as observed in the [RLHF integration](https://github.com/OpenRLHF/OpenRLHF/pull/604) and the [discussion](https://github.com/vllm-project/vllm/issues/5723#issuecomment-2554389656) .
+- In `v0.5.2`, `v0.5.3`, and `v0.5.3.post1`, there is a bug caused by [zmq](https://github.com/zeromq/pyzmq/issues/2000) , which can occasionally cause vLLM to hang depending on the machine configuration. The solution is to upgrade to the latest version of `vllm` to include the [fix](gh-pr:6759).
+- To circumvent a NCCL [bug](https://github.com/NVIDIA/nccl/issues/1234) , all vLLM processes will set an environment variable ``NCCL_CUMEM_ENABLE=0`` to disable NCCL's ``cuMem`` allocator. It does not affect performance but only gives memory benefits. When external processes want to set up a NCCL connection with vLLM's processes, they should also set this environment variable, otherwise, inconsistent environment setup will cause NCCL to hang or crash, as observed in the [RLHF integration](https://github.com/OpenRLHF/OpenRLHF/pull/604) and the [discussion](gh-issue:5723#issuecomment-2554389656) .
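Several of the replacements above swap absolute docs.vllm.ai URLs for in-tree MyST cross-references. The pattern, as used by this commit, is an explicit target in the destination page plus a `#`-style link anywhere else; a minimal sketch using the target added here:

```markdown
<!-- docs/source/getting_started/debugging.md: define the target -->
(debugging-python-multiprocessing)=
## Python multiprocessing

<!-- e.g. docs/source/design/multiprocessing.md: reference it -->
Please see the [Debugging Tips](#debugging-python-multiprocessing) page.
```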

docs/source/getting_started/gaudi-installation.md

Lines changed: 2 additions & 4 deletions
@@ -80,10 +80,8 @@ $ python setup.py develop
 
 ## Supported Features
 
-- [Offline batched
-  inference](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#offline-batched-inference)
-- Online inference via [OpenAI-Compatible
-  Server](https://docs.vllm.ai/en/latest/getting_started/quickstart.html#openai-compatible-server)
+- [Offline batched inference](#offline-batched-inference)
+- Online inference via [OpenAI-Compatible Server](#openai-compatible-server)
 - HPU autodetection - no need to manually select device within vLLM
 - Paged KV cache with algorithms enabled for Intel Gaudi accelerators
 - Custom Intel Gaudi implementations of Paged Attention, KV cache ops,
