
Commit 8a69e0e

[CI/Build] Auto-fix Markdown files (#12941)
1 parent: 4c8dd12

File tree

20 files changed: +158 −141 lines


.buildkite/nightly-benchmarks/README.md (+18 −28)

@@ -1,15 +1,13 @@
 # vLLM benchmark suite
 
-
 ## Introduction
 
 This directory contains two sets of benchmark for vllm.
+
 - Performance benchmark: benchmark vllm's performance under various workload, for **developers** to gain clarity on whether their PR improves/degrades vllm's performance
 - Nightly benchmark: compare vllm's performance against alternatives (tgi, trt-llm and lmdeploy), for **the public** to know when to choose vllm.
 
-
-See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for latest nightly benchmark results.
-
+See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for latest nightly benchmark results.
 
 ## Performance benchmark quick overview
 
@@ -19,17 +17,14 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performan
 
 **For benchmarking developers**: please try your best to constraint the duration of benchmarking to about 1 hr so that it won't take forever to run.
 
-
 ## Nightly benchmark quick overview
 
-**Benchmarking Coverage**: Fix-qps serving on A100 (the support for FP8 benchmark on H100 is coming!) on Llama-3 8B, 70B and Mixtral 8x7B.
+**Benchmarking Coverage**: Fix-qps serving on A100 (the support for FP8 benchmark on H100 is coming!) on Llama-3 8B, 70B and Mixtral 8x7B.
 
 **Benchmarking engines**: vllm, TGI, trt-llm and lmdeploy.
 
 **Benchmarking Duration**: about 3.5hrs.
 
-
-
 ## Trigger the benchmark
 
 Performance benchmark will be triggered when:
@@ -39,16 +34,11 @@ Performance benchmark will be triggered when:
 Nightly benchmark will be triggered when:
 - Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label.
 
-
-
-
 ## Performance benchmark details
 
-
 See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
 
-
-#### Latency test
+### Latency test
 
 Here is an example of one test inside `latency-tests.json`:
 
@@ -68,23 +58,25 @@ Here is an example of one test inside `latency-tests.json`:
 ```
 
 In this example:
-- The `test_name` attributes is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
-- The `parameters` attribute control the command line arguments to be used for `benchmark_latency.py`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-performance-benchmarks.sh` will convert the underline to dash when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
+
+- The `test_name` attributes is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
+- The `parameters` attribute control the command line arguments to be used for `benchmark_latency.py`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-performance-benchmarks.sh` will convert the underline to dash when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
 
 Note that the performance numbers are highly sensitive to the value of the parameters. Please make sure the parameters are set correctly.
 
 WARNING: The benchmarking script will save json results by itself, so please do not configure `--output-json` parameter in the json file.
 
+### Throughput test
 
-#### Throughput test
 The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except for that the parameters will be fed forward to `benchmark_throughput.py`.
 
 The number of this test is also stable -- a slight change on the value of this number might vary the performance numbers by a lot.
 
-#### Serving test
+### Serving test
+
 We test the throughput by using `benchmark_serving.py` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:
 
-```
+```json
 [
     {
         "test_name": "serving_llama8B_tp1_sharegpt",
@@ -109,6 +101,7 @@ We test the throughput by using `benchmark_serving.py` with request rate = inf t
 ```
 
 Inside this example:
+
 - The `test_name` attribute is also a unique identifier for the test. It must start with `serving_`.
 - The `server-parameters` includes the command line arguments for vLLM server.
 - The `client-parameters` includes the command line arguments for `benchmark_serving.py`.
@@ -118,36 +111,33 @@ The number of this test is less stable compared to the delay and latency benchma
 
 WARNING: The benchmarking script will save json results by itself, so please do not configure `--save-results` or other results-saving-related parameters in `serving-tests.json`.
 
-#### Visualizing the results
+### Visualizing the results
+
 The `convert-results-json-to-markdown.py` helps you put the benchmarking results inside a markdown table, by formatting [descriptions.md](tests/descriptions.md) with real benchmarking results.
 You can find the result presented as a table inside the `buildkite/performance-benchmark` job page.
 If you do not see the table, please wait till the benchmark finish running.
 The json version of the table (together with the json version of the benchmark) will be also attached to the markdown file.
 The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking.
 
-
-
 ## Nightly test details
 
 See [nightly-descriptions.md](nightly-descriptions.md) for the detailed description on test workload, models and docker containers of benchmarking other llm engines.
 
+### Workflow
 
-#### Workflow
-
-- The [nightly-pipeline.yaml](nightly-pipeline.yaml) specifies the docker containers for different LLM serving engines.
+- The [nightly-pipeline.yaml](nightly-pipeline.yaml) specifies the docker containers for different LLM serving engines.
 - Inside each container, we run [run-nightly-suite.sh](run-nightly-suite.sh), which will probe the serving engine of the current container.
 - The `run-nightly-suite.sh` will redirect the request to `tests/run-[llm serving engine name]-nightly.sh`, which parses the workload described in [nightly-tests.json](tests/nightly-tests.json) and performs the benchmark.
 - At last, we run [scripts/plot-nightly-results.py](scripts/plot-nightly-results.py) to collect and plot the final benchmarking results, and update the results to buildkite.
 
-#### Nightly tests
+### Nightly tests
 
 In [nightly-tests.json](tests/nightly-tests.json), we include the command line arguments for benchmarking commands, together with the benchmarking test cases. The format is highly similar to performance benchmark.
 
-#### Docker containers
+### Docker containers
 
 The docker containers for benchmarking are specified in `nightly-pipeline.yaml`.
 
 WARNING: the docker versions are HARD-CODED and SHOULD BE ALIGNED WITH `nightly-descriptions.md`. The docker versions need to be hard-coded as there are several version-specific bug fixes inside `tests/run-[llm serving engine name]-nightly.sh`.
 
 WARNING: populating `trt-llm` to latest version is not easy, as it requires updating several protobuf files in [tensorrt-demo](https://github.com/neuralmagic/tensorrt-demo.git).
-
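As a side note to the diffed README above, the underscore-to-dash convention it describes can be illustrated with a minimal Python sketch. This is not code from the commit: the `latency_llama8B_tp1` name and the `to_cli_args` helper are hypothetical, and the parameter values are simply copied from the command line quoted in the README.

```python
import json

# Hypothetical latency-tests.json entry, mirroring the CLI flags quoted in the
# README above (JSON keys use underscores, per the README's convention).
latency_test = {
    "test_name": "latency_llama8B_tp1",  # must start with "latency_"
    "parameters": {
        "model": "meta-llama/Meta-Llama-3-8B",
        "tensor_parallel_size": 1,
        "load_format": "dummy",
        "num_iters_warmup": 5,
        "num_iters": 15,
    },
}


def to_cli_args(parameters: dict) -> list[str]:
    """Mimic the underscore-to-dash conversion that run-performance-benchmarks.sh
    applies before invoking benchmark_latency.py (illustrative only)."""
    args: list[str] = []
    for key, value in parameters.items():
        args += ["--" + key.replace("_", "-"), str(value)]
    return args


print(json.dumps([latency_test], indent=4))
print(to_cli_args(latency_test["parameters"]))
# -> ['--model', 'meta-llama/Meta-Llama-3-8B', '--tensor-parallel-size', '1', ...]
```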

.buildkite/nightly-benchmarks/nightly-annotation.md (+10 −11)

@@ -9,20 +9,19 @@ This file contains the downloading link for benchmarking results.
 
 Please download the visualization scripts in the post
 
-
 ## Results reproduction
 
 - Find the docker we use in `benchmarking pipeline`
 - Deploy the docker, and inside the docker:
-- Download `nightly-benchmarks.zip`.
-- In the same folder, run the following code
-```
-export HF_TOKEN=<your HF token>
-apt update
-apt install -y git
-unzip nightly-benchmarks.zip
-VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
-```
+- Download `nightly-benchmarks.zip`.
+- In the same folder, run the following code:
 
-And the results will be inside `./benchmarks/results`.
+```console
+export HF_TOKEN=<your HF token>
+apt update
+apt install -y git
+unzip nightly-benchmarks.zip
+VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
+```
 
+And the results will be inside `./benchmarks/results`.

.buildkite/nightly-benchmarks/nightly-descriptions.md (+3 −3)

@@ -2,14 +2,14 @@
 # Nightly benchmark
 
 This benchmark aims to:
+
 - Provide performance clarity: Provide clarity on which one (vllm, tensorrt-llm, lmdeploy and SGLang) leads in performance in what workload.
 - Be reproducible: one can run the exact same set of benchmarking commands inside the exact same docker by following reproducing instructions.
 
 Latest results: [results link](https://blog.vllm.ai/2024/09/05/perf-update.html), scroll to the end.
 
 Latest reproduction guilde: [github issue link](https://github.com/vllm-project/vllm/issues/8176)
 
-
 ## Setup
 
 - Docker images:
@@ -33,7 +33,7 @@ Latest reproduction guilde: [github issue link](https://github.com/vllm-project/
 - Queries are randomly sampled, and arrival patterns are determined via Poisson process, but all with fixed random seed.
 - Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).
 
-# Known issues
+## Known issues
 
 - TRT-LLM crashes with Llama 3.1 8B [issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105).
-- TGI does not support `ignore-eos` flag.
+- TGI does not support `ignore-eos` flag.
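The arrival pattern described in this file (Poisson process with a fixed random seed) can be sketched in a few lines. This is an illustrative snippet, not code from the commit; the QPS value and the seed are assumptions.

```python
import numpy as np

# Illustrative only: a Poisson arrival process has exponentially distributed
# inter-arrival times. Fixing the seed makes the trace identical across engines.
qps = 2.0                       # assumed request rate
rng = np.random.default_rng(0)  # assumed fixed seed
inter_arrival = rng.exponential(scale=1.0 / qps, size=200)
arrival_times = np.cumsum(inter_arrival)  # seconds at which requests are sent
print(arrival_times[:5])
```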

.buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md (+2 −8)

@@ -7,10 +7,8 @@
 - Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
 - Evaluation metrics: end-to-end latency (mean, median, p99).
 
-
 {latency_tests_markdown_table}
 
-
 ## Throughput tests
 
 - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
@@ -19,10 +17,8 @@
 - Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
 - Evaluation metrics: throughput.
 
-
 {throughput_tests_markdown_table}
 
-
 ## Serving tests
 
 - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
@@ -33,13 +29,11 @@
 - We also added a speculative decoding test for llama-3 70B, under QPS 2
 - Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
 
-
 {serving_tests_markdown_table}
 
-
 ## json version of the benchmarking tables
 
-This section contains the data of the markdown tables above in JSON format.
+This section contains the data of the markdown tables above in JSON format.
 You can load the benchmarking tables into pandas dataframes as follows:
 
 ```python
@@ -54,9 +48,9 @@ serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
 ```
 
 The json string for all benchmarking tables:
+
 ```json
 {benchmarking_results_in_json_string}
 ```
 
 You can also check the raw experiment data in the Artifact tab of the Buildkite page.
-
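Only the last line of the embedded Python block is visible in the hunk above (`serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])`). A minimal sketch of loading such a table might look as follows; the file name is an assumption (the real document embeds the JSON string inline), and only the `"serving"` key is confirmed by the diff.

```python
import json

import pandas as pd

# Assumed file name; the document above embeds the JSON string inline via
# {benchmarking_results_in_json_string} rather than reading it from disk.
with open("benchmarking_results.json") as f:
    benchmarking_results = json.load(f)

# Only the "serving" key appears in the diff above; other keys are assumptions.
serving_results = pd.DataFrame.from_dict(benchmarking_results["serving"])
print(serving_results.head())
```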

.github/PULL_REQUEST_TEMPLATE.md (+2 −1)

@@ -2,4 +2,5 @@ FILL IN THE PR DESCRIPTION HERE
 
 FIX #xxxx (*link existing issues this PR will resolve*)
 
-**BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing/overview.html **
+<!--- pyml disable-next-line no-emphasis-as-heading -->
+**BEFORE SUBMITTING, PLEASE READ <https://docs.vllm.ai/en/latest/contributing/overview.html>**

.pre-commit-config.yaml (+1 −1)

@@ -33,7 +33,7 @@ repos:
   rev: v0.9.27
   hooks:
   - id: pymarkdown
-    files: docs/.*
+    args: [fix]
 - repo: https://github.com/rhysd/actionlint
   rev: v1.7.7
   hooks:

CODE_OF_CONDUCT.md (−1)

@@ -125,4 +125,3 @@ Community Impact Guidelines were inspired by
 For answers to common questions about this code of conduct, see the
 [Contributor Covenant FAQ](https://www.contributor-covenant.org/faq). Translations are available at
 [Contributor Covenant translations](https://www.contributor-covenant.org/translations).
-

README.md (+9 −5)

@@ -16,6 +16,7 @@ Easy, fast, and cheap LLM serving for everyone
 ---
 
 *Latest News* 🔥
+
 - [2025/01] We are excited to announce the alpha release of vLLM V1: A major architectural upgrade with 1.7x speedup! Clean code, optimized execution loop, zero-overhead prefix caching, enhanced multimodal support, and more. Please check out our blog post [here](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html).
 - [2025/01] We hosted [the eighth vLLM meetup](https://lu.ma/zep56hui) with Google Cloud! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1epVkt4Zu8Jz_S5OhEHPc798emsYh2BwYfRuDDVEF7u4/edit?usp=sharing), and Google Cloud team [here](https://drive.google.com/file/d/1h24pHewANyRL11xy5dXUbvRC9F9Kkjix/view?usp=sharing).
 - [2024/12] vLLM joins [pytorch ecosystem](https://pytorch.org/blog/vllm-joins-pytorch)! Easy, Fast, and Cheap LLM Serving for Everyone!
@@ -33,7 +34,9 @@ Easy, fast, and cheap LLM serving for everyone
 - [2023/06] We officially released vLLM! FastChat-vLLM integration has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post](https://vllm.ai).
 
 ---
+
 ## About
+
 vLLM is a fast and easy-to-use library for LLM inference and serving.
 
 Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
@@ -127,6 +130,7 @@ We also have an official fundraising venue through [OpenCollective](https://open
 ## Citation
 
 If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):
+
 ```bibtex
 @inproceedings{kwon2023efficient,
   title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
@@ -138,11 +142,11 @@ If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs
 
 ## Contact Us
 
-* For technical questions and feature requests, please use Github issues or discussions.
-* For discussing with fellow users and coordinating contributions and development, please use Slack.
-* For security disclosures, please use Github's security advisory feature.
-* For collaborations and partnerships, please contact us at vllm-questions AT lists.berkeley.edu.
+- For technical questions and feature requests, please use Github issues or discussions.
+- For discussing with fellow users and coordinating contributions and development, please use Slack.
+- For security disclosures, please use Github's security advisory feature.
+- For collaborations and partnerships, please contact us at vllm-questions AT lists.berkeley.edu.
 
 ## Media Kit
 
-* If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit).
+- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit).

benchmarks/README.md (+2)

@@ -3,6 +3,7 @@
 ## Downloading the ShareGPT dataset
 
 You can download the dataset by running:
+
 ```bash
 wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
 ```
@@ -11,6 +12,7 @@ wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/r
 
 The json file refers to several image datasets (coco, llava, etc.). The benchmark scripts
 will ignore a datapoint if the referred image is missing.
+
 ```bash
 wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json
 mkdir coco -p
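The skipping behaviour described in the diffed README ("will ignore a datapoint if the referred image is missing") could be sketched roughly as below. The file name comes from the wget command above, but the `image` field and the flat list-of-dicts layout are assumptions about the dataset schema, not something this commit specifies.

```python
import json
from pathlib import Path

# Rough sketch of "ignore a datapoint if the referred image is missing".
# The "image" key and the list-of-dicts layout are assumed, not taken from the repo.
with open("sharegpt4v_instruct_gpt4-vision_cap100k.json") as f:
    datapoints = json.load(f)

usable = [d for d in datapoints if "image" in d and Path(d["image"]).exists()]
print(f"kept {len(usable)} of {len(datapoints)} datapoints")
```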
