Commit 1d0fab0

Merge branch 'main' into mm-dummy-data-builder
2 parents 213ce33 + a1b2b86, commit 1d0fab0

38 files changed: +514 -115 lines changed

.gitignore

Lines changed: 1 addition & 4 deletions
@@ -79,10 +79,7 @@ instance/
 
 # Sphinx documentation
 docs/_build/
-docs/source/getting_started/examples/*.rst
-!**/*.template.rst
-docs/source/getting_started/examples/*.md
-!**/*.template.md
+docs/source/getting_started/examples/
 
 # PyBuilder
 .pybuilder/

Dockerfile.openvino

Lines changed: 1 addition & 0 deletions
@@ -14,6 +14,7 @@ ARG GIT_REPO_CHECK=0
 RUN --mount=type=bind,source=.git,target=.git \
     if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh ; fi
 
+RUN python3 -m pip install -U pip
 # install build requirements
 RUN PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cpu" python3 -m pip install -r /workspace/requirements-build.txt
 # build vLLM with OpenVINO backend

Dockerfile.ppc64le

Lines changed: 2 additions & 3 deletions
@@ -4,7 +4,7 @@ USER root
 
 ENV PATH="/usr/local/cargo/bin:$PATH:/opt/conda/bin/"
 
-RUN apt-get update -y && apt-get install -y git wget curl vim libnuma-dev libsndfile-dev libprotobuf-dev build-essential ffmpeg libsm6 libxext6 libgl1
+RUN apt-get update -y && apt-get install -y git wget curl vim libnuma-dev libsndfile-dev libprotobuf-dev build-essential ffmpeg libsm6 libxext6 libgl1 libssl-dev
 
 # Some packages in requirements-cpu are installed here
 # IBM provides optimized packages for ppc64le processors in the open-ce project for mamba
@@ -18,9 +18,8 @@ ARG GIT_REPO_CHECK=0
 RUN --mount=type=bind,source=.git,target=.git \
     if [ "$GIT_REPO_CHECK" != 0 ]; then bash tools/check_repo.sh; fi
 
-# These packages will be in rocketce eventually
 RUN --mount=type=cache,target=/root/.cache/pip \
-    pip install -v --prefer-binary --extra-index-url https://repo.fury.io/mgiessing \
+    RUSTFLAGS='-L /opt/conda/lib' pip install -v --prefer-binary --extra-index-url https://repo.fury.io/mgiessing \
     'cmake>=3.26' ninja packaging 'setuptools-scm>=8' wheel jinja2 \
     torch==2.3.1 \
     -r requirements-cpu.txt \

README.md

Lines changed: 10 additions & 5 deletions
@@ -90,28 +90,33 @@ vLLM is a community project. Our compute resources for development and testing a
 
 <!-- Note: Please sort them in alphabetical order. -->
 <!-- Note: Please keep these consistent with docs/source/community/sponsors.md -->
-
+Cash Donations:
 - a16z
+- Dropbox
+- Sequoia Capital
+- Skywork AI
+- ZhenFund
+
+Compute Resources:
 - AMD
 - Anyscale
 - AWS
 - Crusoe Cloud
 - Databricks
 - DeepInfra
-- Dropbox
 - Google Cloud
 - Lambda Lab
 - Nebius
+- Novita AI
 - NVIDIA
 - Replicate
 - Roblox
 - RunPod
-- Sequoia Capital
-- Skywork AI
 - Trainy
 - UC Berkeley
 - UC San Diego
-- ZhenFund
+
+Slack Sponsor: Anyscale
 
 We also have an official fundraising venue through [OpenCollective](https://opencollective.com/vllm). We plan to use the fund to support the development, maintenance, and adoption of vLLM.

benchmarks/benchmark_latency.py

Lines changed: 1 addition & 1 deletion
@@ -52,7 +52,7 @@ def run_to_completion(profile_dir: Optional[str] = None):
             llm.generate(dummy_prompts,
                          sampling_params=sampling_params,
                          use_tqdm=False)
-        print(p.key_averages())
+        print(p.key_averages().table(sort_by="self_cuda_time_total"))
     else:
         start_time = time.perf_counter()
         llm.generate(dummy_prompts,
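
For context on this change: `p.key_averages()` prints a raw list of averaged profiler events, while `p.key_averages().table(...)` renders a sorted, human-readable summary. The sketch below is a standalone illustration of that difference (it is not part of this commit and does not use vLLM; the matrix multiply is just a stand-in for `llm.generate`):

```python
# Minimal standalone sketch of the profiler output change above.
# Assumes only that PyTorch is installed; CUDA is used if available.
import torch
from torch.profiler import profile, ProfilerActivity

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device)

with profile(activities=activities) as p:
    for _ in range(10):
        y = x @ x  # stand-in for llm.generate(...)

# Old style in the benchmark: raw EventList, hard to scan.
print(p.key_averages())

# New style: formatted table sorted by self CUDA time
# (fall back to CPU time when no GPU is present).
sort_key = "self_cuda_time_total" if torch.cuda.is_available() else "self_cpu_time_total"
print(p.key_averages().table(sort_by=sort_key, row_limit=10))
```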

docs/Makefile

Lines changed: 4 additions & 0 deletions
@@ -18,3 +18,7 @@ help:
 # "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
 %: Makefile
 	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+clean:
+	@$(SPHINXBUILD) -M clean "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+	rm -rf "$(SOURCEDIR)/getting_started/examples"

docs/requirements-docs.txt

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@ sphinx-book-theme==1.0.1
 sphinx-copybutton==0.5.2
 myst-parser==3.0.1
 sphinx-argparse==0.4.0
+sphinx-togglebutton==0.3.2
 msgspec
 cloudpickle

docs/source/community/sponsors.md

Lines changed: 10 additions & 4 deletions
@@ -5,26 +5,32 @@ vLLM is a community project. Our compute resources for development and testing a
 <!-- Note: Please sort them in alphabetical order. -->
 <!-- Note: Please keep these consistent with README.md. -->
 
+Cash Donations:
 - a16z
+- Dropbox
+- Sequoia Capital
+- Skywork AI
+- ZhenFund
+
+Compute Resources:
 - AMD
 - Anyscale
 - AWS
 - Crusoe Cloud
 - Databricks
 - DeepInfra
-- Dropbox
 - Google Cloud
 - Lambda Lab
 - Nebius
+- Novita AI
 - NVIDIA
 - Replicate
 - Roblox
 - RunPod
-- Sequoia Capital
-- Skywork AI
 - Trainy
 - UC Berkeley
 - UC San Diego
-- ZhenFund
+
+Slack Sponsor: Anyscale
 
 We also have an official fundraising venue through [OpenCollective](https://opencollective.com/vllm). We plan to use the fund to support the development, maintenance, and adoption of vLLM.

docs/source/conf.py

Lines changed: 4 additions & 0 deletions
@@ -43,6 +43,10 @@
     "sphinx.ext.autosummary",
     "myst_parser",
     "sphinxarg.ext",
+    "sphinx_togglebutton",
+]
+myst_enable_extensions = [
+    "colon_fence",
 ]
 
 # Add any paths that contain templates here, relative to this directory.

docs/source/features/spec_decode.md

Lines changed: 66 additions & 0 deletions
@@ -159,6 +159,72 @@ A variety of speculative models of this type are available on HF hub:
 - [granite-7b-instruct-accelerator](https://huggingface.co/ibm-granite/granite-7b-instruct-accelerator)
 - [granite-20b-code-instruct-accelerator](https://huggingface.co/ibm-granite/granite-20b-code-instruct-accelerator)
 
+## Speculating using EAGLE based draft models
+
+The following code configures vLLM to use speculative decoding where proposals are generated by
+an [EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)](https://arxiv.org/pdf/2401.15077) based draft model.
+
+```python
+from vllm import LLM, SamplingParams
+
+prompts = [
+    "The future of AI is",
+]
+sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
+
+llm = LLM(
+    model="meta-llama/Meta-Llama-3-8B-Instruct",
+    tensor_parallel_size=4,
+    speculative_model="path/to/modified/eagle/model",
+    speculative_draft_tensor_parallel_size=1,
+)
+
+outputs = llm.generate(prompts, sampling_params)
+
+for output in outputs:
+    prompt = output.prompt
+    generated_text = output.outputs[0].text
+    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
+
+```
+
+A few important things to consider when using the EAGLE based draft models:
+
+1. The EAGLE draft models available in the [HF repository for EAGLE models](https://huggingface.co/yuhuili) cannot be
+   used directly with vLLM due to differences in the expected layer names and model definition.
+   To use these models with vLLM, use the [following script](https://gist.github.com/abhigoyal1997/1e7a4109ccb7704fbc67f625e86b2d6d)
+   to convert them. Note that this script does not modify the model's weights.
+
+   In the above example, use the script to first convert
+   the [yuhuili/EAGLE-LLaMA3-Instruct-8B](https://huggingface.co/yuhuili/EAGLE-LLaMA3-Instruct-8B) model
+   and then use the converted checkpoint as the draft model in vLLM.
+
+2. The EAGLE based draft models need to be run without tensor parallelism
+   (i.e. speculative_draft_tensor_parallel_size is set to 1), although
+   it is possible to run the main model using tensor parallelism (see example above).
+
+3. When using EAGLE-based speculators with vLLM, the observed speedup is lower than what is
+   reported in the reference implementation [here](https://github.com/SafeAILab/EAGLE). This issue is under
+   investigation and tracked here: [https://github.com/vllm-project/vllm/issues/9565](https://github.com/vllm-project/vllm/issues/9565).
+
+A variety of EAGLE draft models are available on the Hugging Face hub:
+
+| Base Model                 | EAGLE on Hugging Face               | # EAGLE Parameters |
+|----------------------------|-------------------------------------|--------------------|
+| Vicuna-7B-v1.3             | yuhuili/EAGLE-Vicuna-7B-v1.3        | 0.24B              |
+| Vicuna-13B-v1.3            | yuhuili/EAGLE-Vicuna-13B-v1.3       | 0.37B              |
+| Vicuna-33B-v1.3            | yuhuili/EAGLE-Vicuna-33B-v1.3       | 0.56B              |
+| LLaMA2-Chat 7B             | yuhuili/EAGLE-llama2-chat-7B        | 0.24B              |
+| LLaMA2-Chat 13B            | yuhuili/EAGLE-llama2-chat-13B       | 0.37B              |
+| LLaMA2-Chat 70B            | yuhuili/EAGLE-llama2-chat-70B       | 0.99B              |
+| Mixtral-8x7B-Instruct-v0.1 | yuhuili/EAGLE-mixtral-instruct-8x7B | 0.28B              |
+| LLaMA3-Instruct 8B         | yuhuili/EAGLE-LLaMA3-Instruct-8B    | 0.25B              |
+| LLaMA3-Instruct 70B        | yuhuili/EAGLE-LLaMA3-Instruct-70B   | 0.99B              |
+| Qwen2-7B-Instruct          | yuhuili/EAGLE-Qwen2-7B-Instruct     | 0.26B              |
+| Qwen2-72B-Instruct         | yuhuili/EAGLE-Qwen2-72B-Instruct    | 1.05B              |
+
+
 ## Lossless guarantees of Speculative Decoding
 
 In vLLM, speculative decoding aims to enhance inference efficiency while maintaining accuracy. This section addresses the lossless guarantees of
