
Commit 1661a9c

[Doc][Neuron] Update documentation for Neuron (#18868)
Signed-off-by: Elaine Zhao <[email protected]>
1 parent 8e882ff commit 1661a9c

4 files changed: +100 additions, -95 deletions

docs/features/compatibility_matrix.md

Lines changed: 3 additions & 0 deletions
@@ -75,3 +75,6 @@ th:not(:first-child) {
 | multi-step |||||| [](gh-issue:8477) ||
 | best-of ||||||||
 | beam-search ||||||||
+
+!!! note
+    Please refer to [Feature support through NxD Inference backend][feature-support-through-nxd-inference-backend] for features supported on AWS Neuron hardware

docs/features/quantization/supported_hardware.md

Lines changed: 3 additions & 3 deletions
@@ -5,13 +5,13 @@ title: Supported Hardware
 
 The table below shows the compatibility of various quantization implementations with different hardware platforms in vLLM:
 
-| Implementation | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | x86 CPU | AWS Inferentia | Google TPU |
+| Implementation | Volta | Turing | Ampere | Ada | Hopper | AMD GPU | Intel GPU | x86 CPU | AWS Neuron | Google TPU |
 |-----------------------|---------|----------|----------|-------|----------|-----------|-------------|-----------|------------------|--------------|
 | AWQ || ✅︎ | ✅︎ | ✅︎ | ✅︎ || ✅︎ | ✅︎ |||
 | GPTQ | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ || ✅︎ | ✅︎ |||
 | Marlin (GPTQ/AWQ/FP8) ||| ✅︎ | ✅︎ | ✅︎ ||||||
-| INT8 (W8A8) || ✅︎ | ✅︎ | ✅︎ | ✅︎ ||| ✅︎ | | ✅︎ |
-| FP8 (W8A8) |||| ✅︎ | ✅︎ | ✅︎ ||| ||
+| INT8 (W8A8) || ✅︎ | ✅︎ | ✅︎ | ✅︎ ||| ✅︎ | ✅︎ | ✅︎ |
+| FP8 (W8A8) |||| ✅︎ | ✅︎ | ✅︎ ||| ✅︎ ||
 | BitBLAS (GPTQ) | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ ||||||
 | AQLM | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ ||||||
 | bitsandbytes | ✅︎ | ✅︎ | ✅︎ | ✅︎ | ✅︎ ||||||

Lines changed: 93 additions & 91 deletions
@@ -1,8 +1,9 @@
 # --8<-- [start:installation]
 
-vLLM 0.3.3 onwards supports model inferencing and serving on AWS Trainium/Inferentia with Neuron SDK with continuous batching.
-Paged Attention and Chunked Prefill are currently in development and will be available soon.
-Data types currently supported in Neuron SDK are FP16 and BF16.
+[AWS Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/) is the software development kit (SDK) used to run deep learning and
+generative AI workloads on AWS Inferentia and AWS Trainium powered Amazon EC2 instances and UltraServers (Inf1, Inf2, Trn1, Trn2,
+and Trn2 UltraServer). Both Trainium and Inferentia are powered by fully-independent heterogeneous compute-units called NeuronCores.
+This tab describes how to set up your environment to run vLLM on Neuron.
 
 !!! warning
     There are no pre-built wheels or images for this device, so you must build vLLM from source.
@@ -11,58 +12,30 @@ Data types currently supported in Neuron SDK are FP16 and BF16.
 # --8<-- [start:requirements]
 
 - OS: Linux
-- Python: 3.9 -- 3.11
-- Accelerator: NeuronCore_v2 (in trn1/inf2 instances)
-- Pytorch 2.0.1/2.1.1
-- AWS Neuron SDK 2.16/2.17 (Verified on python 3.8)
+- Python: 3.9 or newer
+- Pytorch 2.5/2.6
+- Accelerator: NeuronCore-v2 (in trn1/inf2 chips) or NeuronCore-v3 (in trn2 chips)
+- AWS Neuron SDK 2.23
 
 ## Configure a new environment
 
-### Launch Trn1/Inf2 instances
+### Launch a Trn1/Trn2/Inf2 instance and verify Neuron dependencies
 
-Here are the steps to launch trn1/inf2 instances, in order to install [PyTorch Neuron ("torch-neuronx") Setup on Ubuntu 22.04 LTS](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu22.html).
+The easiest way to launch a Trainium or Inferentia instance with pre-installed Neuron dependencies is to follow this
+[quick start guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/multiframework/multi-framework-ubuntu22-neuron-dlami.html#setup-ubuntu22-multi-framework-dlami) using the Neuron Deep Learning AMI (Amazon machine image).
 
-- Please follow the instructions at [launch an Amazon EC2 Instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html#ec2-launch-instance) to launch an instance. When choosing the instance type at the EC2 console, please make sure to select the correct instance type.
-- To get more information about instances sizes and pricing see: [Trn1 web page](https://aws.amazon.com/ec2/instance-types/trn1/), [Inf2 web page](https://aws.amazon.com/ec2/instance-types/inf2/)
-- Select Ubuntu Server 22.04 TLS AMI
-- When launching a Trn1/Inf2, please adjust your primary EBS volume size to a minimum of 512GB.
 - After launching the instance, follow the instructions in [Connect to your instance](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AccessingInstancesLinux.html) to connect to the instance
-
-### Install drivers and tools
-
-The installation of drivers and tools wouldn't be necessary, if [Deep Learning AMI Neuron](https://docs.aws.amazon.com/dlami/latest/devguide/appendix-ami-release-notes.html) is installed. In case the drivers and tools are not installed on the operating system, follow the steps below:
-
+- Once inside your instance, activate the pre-installed virtual environment for inference by running
 ```console
-# Configure Linux for Neuron repository updates
-. /etc/os-release
-sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <<EOF
-deb https://apt.repos.neuron.amazonaws.com ${VERSION_CODENAME} main
-EOF
-wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB \
-| sudo apt-key add -
-
-# Update OS packages
-sudo apt-get update -y
-
-# Install OS headers
-sudo apt-get install linux-headers-$(uname -r) -y
-
-# Install git
-sudo apt-get install git -y
-
-# install Neuron Driver
-sudo apt-get install aws-neuronx-dkms=2.* -y
-
-# Install Neuron Runtime
-sudo apt-get install aws-neuronx-collectives=2.* -y
-sudo apt-get install aws-neuronx-runtime-lib=2.* -y
+source /opt/aws_neuronx_venv_pytorch_2_6_nxd_inference/bin/activate
+```
 
-# Install Neuron Tools
-sudo apt-get install aws-neuronx-tools=2.* -y
+Refer to the [NxD Inference Setup Guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/nxdi-setup.html)
+for alternative setup instructions including using Docker and manually installing dependencies.
 
-# Add PATH
-export PATH=/opt/aws/neuron/bin:$PATH
-```
+!!! note
+    NxD Inference is the default recommended backend to run inference on Neuron. If you are looking to use the legacy [transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx)
+    library, refer to [Transformers NeuronX Setup](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/setup/index.html).
 
 # --8<-- [end:requirements]
 # --8<-- [start:set-up-using-python]
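
Editor's aside on the requirements above: before building vLLM it can help to sanity-check the activated virtual environment from Python. The sketch below is not part of this commit; the pip distribution names are assumptions based on the Neuron SDK / NxD Inference stack referenced in this diff, so adjust them if your environment differs.

```python
# Sanity-check the activated Neuron virtual environment before building vLLM.
# The distribution names below are assumed from the Neuron SDK stack; adjust as needed.
from importlib.metadata import version

for pkg in ("torch", "torch-neuronx", "neuronx-cc", "neuronx-distributed-inference"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except Exception as exc:  # package missing or named differently
        print(f"{pkg}: not found ({exc})")
```
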
@@ -75,60 +48,37 @@ Currently, there are no pre-built Neuron wheels.
 # --8<-- [end:pre-built-wheels]
 # --8<-- [start:build-wheel-from-source]
 
-!!! note
-    The currently supported version of Pytorch for Neuron installs `triton` version `2.1.0`. This is incompatible with `vllm >= 0.5.3`. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run a `pip install --upgrade triton==3.0.0` after installing the vLLM wheel.
-
-Following instructions are applicable to Neuron SDK 2.16 and beyond.
-
-#### Install transformers-neuronx and its dependencies
+#### Install vLLM from source
 
-[transformers-neuronx](https://github.com/aws-neuron/transformers-neuronx) will be the backend to support inference on trn1/inf2 instances.
-Follow the steps below to install transformers-neuronx package and its dependencies.
+Install vLLM as follows:
 
 ```console
-# Install Python venv
-sudo apt-get install -y python3.10-venv g++
-
-# Create Python venv
-python3.10 -m venv aws_neuron_venv_pytorch
-
-# Activate Python venv
-source aws_neuron_venv_pytorch/bin/activate
-
-# Install Jupyter notebook kernel
-pip install ipykernel
-python3.10 -m ipykernel install \
-    --user \
-    --name aws_neuron_venv_pytorch \
-    --display-name "Python (torch-neuronx)"
-pip install jupyter notebook
-pip install environment_kernels
-
-# Set pip repository pointing to the Neuron repository
-python -m pip config set \
-    global.extra-index-url \
-    https://pip.repos.neuron.amazonaws.com
-
-# Install wget, awscli
-python -m pip install wget
-python -m pip install awscli
-
-# Update Neuron Compiler and Framework
-python -m pip install --upgrade neuronx-cc==2.* --pre torch-neuronx==2.1.* torchvision transformers-neuronx
+git clone https://github.com/vllm-project/vllm.git
+cd vllm
+pip install -U -r requirements/neuron.txt
+VLLM_TARGET_DEVICE="neuron" pip install -e .
 ```
 
-#### Install vLLM from source
+AWS Neuron maintains a [GitHub fork of vLLM](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2) at
+[https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2), which contains several features in addition to what's
+available on vLLM V0. Please utilize the AWS fork for the following features:
+
+- Llama-3.2 multi-modal support
+- Multi-node distributed inference
 
-Once neuronx-cc and transformers-neuronx packages are installed, we will be able to install vllm as follows:
+Refer to the [vLLM User Guide for NxD Inference](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/vllm-user-guide.html)
+for more details and usage examples.
+
+To install the AWS Neuron fork, run the following:
 
 ```console
-git clone https://github.com/vllm-project/vllm.git
-cd vllm
-pip install -U -r requirements/neuron.txt
-VLLM_TARGET_DEVICE="neuron" pip install .
+git clone -b neuron-2.23-vllm-v0.7.2 https://github.com/aws-neuron/upstreaming-to-vllm.git
+cd upstreaming-to-vllm
+pip install -r requirements/neuron.txt
+VLLM_TARGET_DEVICE="neuron" pip install -e .
 ```
 
-If neuron packages are detected correctly in the installation process, `vllm-0.3.0+neuron212` will be installed.
+Note that the AWS Neuron fork is only intended to support Neuron hardware; compatibility with other hardware is not tested.
 
 # --8<-- [end:build-wheel-from-source]
 # --8<-- [start:set-up-using-docker]
@@ -148,5 +98,57 @@ Make sure to use <gh-file:docker/Dockerfile.neuron> in place of the default Dock
 # --8<-- [end:build-image-from-source]
 # --8<-- [start:extra-information]
 
-There is no extra information for this device.
+[](){ #feature-support-through-nxd-inference-backend }
+### Feature support through NxD Inference backend
+
+The current vLLM and Neuron integration relies on either the `neuronx-distributed-inference` (preferred) or `transformers-neuronx` backend
+to perform most of the heavy lifting which includes PyTorch model initialization, compilation, and runtime execution. Therefore, most
+[features supported on Neuron](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html) are also available via the vLLM integration.
+
+To configure NxD Inference features through the vLLM entrypoint, use the `override_neuron_config` setting. Provide the configs you want to override
+as a dictionary (or JSON object when starting vLLM from the CLI). For example, to disable auto bucketing, include
+```console
+override_neuron_config={
+    "enable_bucketing": False,
+}
+```
+or when launching vLLM from the CLI, pass
+```console
+--override-neuron-config "{\"enable_bucketing\":false}"
+```
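
Editor's aside: for context on the `override_neuron_config` mechanism added above, here is a minimal offline-inference sketch using the Python API. The model name, sequence limits, and tensor-parallel degree are illustrative placeholders, not values taken from this commit; the CLI flag shown above passes the same dictionary serialized as JSON.

```python
# Minimal sketch: pass an NxD Inference override through the Python entrypoint.
# Model name, max_num_seqs, max_model_len, and tensor_parallel_size are illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",             # placeholder model
    max_num_seqs=4,
    max_model_len=2048,
    tensor_parallel_size=2,
    override_neuron_config={"enable_bucketing": False},    # same override as the CLI flag above
)
outputs = llm.generate(["Hello, Neuron!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```
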
+
+Alternatively, you can directly call the NxDI library to trace and compile your model, then load the pre-compiled artifacts
+(via the `NEURON_COMPILED_ARTIFACTS` environment variable) in vLLM to run inference workloads.
+
+### Known limitations
+
+- EAGLE speculative decoding: NxD Inference requires the EAGLE draft checkpoint to include the LM head weights from the target model. Refer to this
+  [guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/feature-guide.html#eagle-checkpoint-compatibility)
+  for how to convert pretrained EAGLE model checkpoints to be compatible with NxDI.
+- Quantization: the native quantization flow in vLLM is not well supported on NxD Inference. It is recommended to follow this
+  [Neuron quantization guide](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nxd-inference/developer_guides/custom-quantization.html)
+  to quantize and compile your model using NxD Inference, and then load the compiled artifacts into vLLM.
+- Multi-LoRA serving: NxD Inference only supports loading of LoRA adapters at server startup. Dynamic loading of LoRA adapters at
+  runtime is not currently supported. Refer to the [multi-lora example](https://github.com/aws-neuron/upstreaming-to-vllm/blob/neuron-2.23-vllm-v0.7.2/examples/offline_inference/neuron_multi_lora.py).
+- Multi-modal support: multi-modal support is only available through the AWS Neuron fork. This feature has not been upstreamed
+  to vLLM main because NxD Inference currently relies on certain adaptations to the core vLLM logic to support this feature.
+- Multi-node support: distributed inference across multiple Trainium/Inferentia instances is only supported on the AWS Neuron fork. Refer
+  to this [multi-node example](https://github.com/aws-neuron/upstreaming-to-vllm/tree/neuron-2.23-vllm-v0.7.2/examples/neuron/multi_node)
+  to run. Note that tensor parallelism (distributed inference across NeuronCores) is available in vLLM main.
+- Known edge case bug in speculative decoding: an edge case failure may occur in speculative decoding when the sequence length approaches
+  the max model length (e.g. when requesting max tokens up to the max model length and ignoring eos). In this scenario, vLLM may attempt
+  to allocate an additional block to ensure there is enough memory for the number of lookahead slots, but since we do not have good support
+  for paged attention, there isn't another Neuron block for vLLM to allocate. A workaround fix (to terminate 1 iteration early) is
+  implemented in the AWS Neuron fork but is not upstreamed to vLLM main as it modifies core vLLM logic.
+
+### Environment variables
+
+- `NEURON_COMPILED_ARTIFACTS`: set this environment variable to point to your pre-compiled model artifacts directory to avoid
+  compilation time upon server initialization. If this variable is not set, the Neuron module will perform compilation and save the
+  artifacts under a `neuron-compiled-artifacts/{unique_hash}/` sub-directory in the model path. If this environment variable is set,
+  but the directory does not exist or the contents are invalid, Neuron will also fall back to a new compilation and store the artifacts
+  under this specified path.
+- `NEURON_CONTEXT_LENGTH_BUCKETS`: bucket sizes for context encoding. (Only applicable to the `transformers-neuronx` backend).
+- `NEURON_TOKEN_GEN_BUCKETS`: bucket sizes for token generation. (Only applicable to the `transformers-neuronx` backend).
+
 # --8<-- [end:extra-information]
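
Editor's aside: to illustrate the `NEURON_COMPILED_ARTIFACTS` variable documented above, the sketch below points vLLM at a pre-compiled artifacts directory before engine start-up. The directory and model paths are placeholders, not values from this commit.

```python
# Load pre-compiled NxD Inference artifacts instead of compiling at start-up.
# The artifacts directory and model path below are placeholders.
import os

# Set before importing vLLM so the Neuron backend picks it up at engine init.
os.environ["NEURON_COMPILED_ARTIFACTS"] = "/home/ubuntu/neuron_artifacts/my-model"

from vllm import LLM

llm = LLM(model="/home/ubuntu/models/my-model", max_num_seqs=4, max_model_len=2048)
```
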

vllm/config.py

Lines changed: 1 addition & 1 deletion
@@ -380,7 +380,7 @@ class ModelConfig:
     """Initialize non-default neuron config or override default neuron config
     that are specific to Neuron devices, this argument will be used to
     configure the neuron config that can not be gathered from the vllm
-    arguments. e.g. `{"cast_logits_dtype": "bloat16"}`."""
+    arguments. e.g. `{"cast_logits_dtype": "bfloat16"}`."""
     pooler_config: Optional["PoolerConfig"] = field(init=False)
     """Pooler config which controls the behaviour of output pooling in pooling
     models."""
