Commit 22677bc

Reorganize adding model docs
Signed-off-by: DarkLight1337 <[email protected]>
1 parent 024cd90 commit 22677bc

9 files changed: +174 -164 lines

docs/source/contributing/model/adding_model.md

Lines changed: 0 additions & 155 deletions
This file was deleted.

docs/source/contributing/model/basic.md

Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
(new-model-basic)=

# Basic Implementation

This guide walks you through the steps to implement a basic vLLM model.

## 1. Bring your model code

Start by forking our [GitHub repository](https://github.com/vllm-project/vllm) and then [build it from source](#build-from-source).
This gives you the ability to modify the codebase and test your model.

Clone the PyTorch model code from the HuggingFace Transformers repository and put it into the <gh-dir:vllm/model_executor/models> directory.
For instance, vLLM's [OPT model](gh-file:vllm/model_executor/models/opt.py) was adapted from HuggingFace's [modeling_opt.py](https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py) file.

```{warning}
When copying the model code, make sure to review and adhere to the code's copyright and licensing terms.
```

```{tip}
If you don't want to fork the repository and modify vLLM's codebase, please refer to [Out-of-Tree Model Integration](#new-model-oot).
```
## 2. Make your code compatible with vLLM

To ensure compatibility with vLLM, your model must meet the following requirements:

### Initialization Code

All vLLM modules within the model must include a `prefix` argument in their constructors. This `prefix` is typically the full name of the module in the model's state dictionary and is crucial for:

- Runtime support: vLLM's attention operators are registered in a model's state by their full names. Each attention operator must have a unique prefix as its layer name to avoid conflicts.
- Non-uniform quantization support: A quantized checkpoint can selectively quantize certain layers while keeping others in full precision. By providing the `prefix` during initialization, vLLM can match the current layer's `prefix` with the quantization configuration to determine whether the layer should be initialized in quantized mode.

The initialization code should look like this:

```python
from torch import nn
from vllm.config import VllmConfig
from vllm.attention import Attention

class MyAttention(nn.Module):
    def __init__(self, vllm_config: VllmConfig, prefix: str):
        super().__init__()
        self.attn = Attention(prefix=f"{prefix}.attn")

class MyDecoderLayer(nn.Module):
    def __init__(self, vllm_config: VllmConfig, prefix: str):
        super().__init__()
        self.self_attn = MyAttention(vllm_config, prefix=f"{prefix}.self_attn")

class MyModel(nn.Module):
    def __init__(self, vllm_config: VllmConfig, prefix: str):
        super().__init__()
        self.layers = nn.ModuleList(
            [MyDecoderLayer(vllm_config, prefix=f"{prefix}.layers.{i}") for i in range(vllm_config.model_config.hf_config.num_hidden_layers)]
        )

class MyModelForCausalLM(nn.Module):
    def __init__(self, vllm_config: VllmConfig, prefix: str = ""):
        super().__init__()
        self.model = MyModel(vllm_config, prefix=f"{prefix}.model")
```
### Computation Code

Rewrite the {meth}`~torch.nn.Module.forward` method of your model to remove any unnecessary code, such as training-specific code. Modify the input parameters to treat `input_ids` and `positions` as flattened tensors with a single batch-size dimension, without a maximum-sequence-length dimension.

```python
def forward(
    self,
    input_ids: torch.Tensor,
    positions: torch.Tensor,
    kv_caches: List[torch.Tensor],
    attn_metadata: AttentionMetadata,
) -> torch.Tensor:
    ...
```

```{note}
Currently, vLLM supports the basic multi-head attention mechanism and its variant with rotary positional embeddings.
If your model employs a different attention mechanism, you will need to implement a new attention layer in vLLM.
```

For reference, check out our [Llama implementation](gh-file:vllm/model_executor/models/llama.py). vLLM already supports a large number of models. It is recommended to find a model similar to yours and adapt it to your model's architecture. Check out <gh-dir:vllm/model_executor/models> for more examples.
## 3. (Optional) Implement tensor parallelism and quantization support

If your model is too large to fit into a single GPU, you can use tensor parallelism to manage it.
To do this, substitute your model's linear and embedding layers with their tensor-parallel versions.
For the embedding layer, you can simply replace {class}`torch.nn.Embedding` with `VocabParallelEmbedding`. For the output LM head, you can use `ParallelLMHead`.
When it comes to the linear layers, we provide the following options to parallelize them (see the sketch after this list for how they fit together):

- `ReplicatedLinear`: Replicates the inputs and weights across multiple GPUs. No memory saving.
- `RowParallelLinear`: The input tensor is partitioned along the hidden dimension. The weight matrix is partitioned along the rows (input dimension). An *all-reduce* operation is performed after the matrix multiplication to reduce the results. Typically used for the second FFN layer and the output linear transformation of the attention layer.
- `ColumnParallelLinear`: The input tensor is replicated. The weight matrix is partitioned along the columns (output dimension). The result is partitioned along the column dimension. Typically used for the first FFN layer and the separated QKV transformation of the attention layer in the original Transformer.
- `MergedColumnParallelLinear`: Column-parallel linear that merges multiple `ColumnParallelLinear` operators. Typically used for the first FFN layer with weighted activation functions (e.g., SiLU). This class handles the sharded weight loading logic of multiple weight matrices.
- `QKVParallelLinear`: Parallel linear layer for the query, key, and value projections of the multi-head and grouped-query attention mechanisms. When the number of key/value heads is less than the world size, this class replicates the key/value heads properly. This class handles the weight loading and replication of the weight matrices.

Note that all the linear layers above take `linear_method` as an input. vLLM will set this parameter according to different quantization schemes to support weight quantization.
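To make the substitution concrete, here is a minimal sketch of an MLP block wired with these layers, loosely modeled on the referenced Llama implementation. The constructor arguments shown are assumptions and may differ between vLLM versions; in particular, the quantization argument (called `linear_method` above, `quant_config` in newer releases) and the `prefix` would normally be threaded through as well:

```python
import torch
from torch import nn
from vllm.model_executor.layers.activation import SiluAndMul
from vllm.model_executor.layers.linear import (MergedColumnParallelLinear,
                                               RowParallelLinear)

class MyMLP(nn.Module):
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # Fused gate/up projection: each GPU holds a column shard of both matrices.
        self.gate_up_proj = MergedColumnParallelLinear(
            hidden_size, [intermediate_size] * 2, bias=False)
        # Row-parallel projection: partial results are combined with an all-reduce.
        self.down_proj = RowParallelLinear(
            intermediate_size, hidden_size, bias=False)
        self.act_fn = SiluAndMul()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # vLLM's parallel linear layers return (output, output_bias).
        gate_up, _ = self.gate_up_proj(x)
        x = self.act_fn(gate_up)
        x, _ = self.down_proj(x)
        return x
```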
## 4. Implement the weight loading logic

You now need to implement the `load_weights` method in your `*ForCausalLM` class.
This method should load the weights from the HuggingFace checkpoint file and assign them to the corresponding layers in your model. Specifically, for `MergedColumnParallelLinear` and `QKVParallelLinear` layers, if the original model has separated weight matrices, you need to load the different parts separately.
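As a rough guide, here is a condensed sketch of the pattern used by the referenced Llama implementation; treat the exact names and import path as assumptions to verify against the current codebase. It loads the checkpoint's separate `q_proj`/`k_proj`/`v_proj` and `gate_proj`/`up_proj` matrices into the fused `QKVParallelLinear` and `MergedColumnParallelLinear` parameters, and falls back to a default loader for everything else:

```python
from typing import Iterable, Tuple

import torch

from vllm.model_executor.model_loader.weight_utils import default_weight_loader

# Method of MyModelForCausalLM (shown standalone for brevity).
def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
    # (vLLM parameter name, checkpoint weight name, shard id)
    stacked_params_mapping = [
        ("qkv_proj", "q_proj", "q"),
        ("qkv_proj", "k_proj", "k"),
        ("qkv_proj", "v_proj", "v"),
        ("gate_up_proj", "gate_proj", 0),
        ("gate_up_proj", "up_proj", 1),
    ]
    params_dict = dict(self.named_parameters())
    for name, loaded_weight in weights:
        for param_name, weight_name, shard_id in stacked_params_mapping:
            if weight_name not in name:
                continue
            # This weight is one part of a fused layer: load it into the
            # corresponding shard of the merged parameter.
            param = params_dict[name.replace(weight_name, param_name)]
            param.weight_loader(param, loaded_weight, shard_id)
            break
        else:
            # Everything else maps one-to-one onto a vLLM parameter.
            param = params_dict[name]
            weight_loader = getattr(param, "weight_loader", default_weight_loader)
            weight_loader(param, loaded_weight)
```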
## 5. Register your model

Finally, add your `*ForCausalLM` class to `_VLLM_MODELS` in <gh-file:vllm/model_executor/models/registry.py> so that it is available by default.
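For example, the entry might look like the following. The `(module, class)` tuple format is an assumption based on the existing entries in the registry, where the module name refers to the file you added under <gh-dir:vllm/model_executor/models>:

```python
# vllm/model_executor/models/registry.py (sketch)
_VLLM_MODELS = {
    # ... existing entries ...
    # Maps the architecture name from the HF config to (module, class).
    "MyModelForCausalLM": ("my_model", "MyModelForCausalLM"),
}
```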
docs/source/contributing/model/index.md

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
(new-model)=

# Adding a New Model

This section provides more information on how to integrate a [HuggingFace Transformers](https://github.com/huggingface/transformers) model into vLLM.

```{toctree}
:caption: Contents
:maxdepth: 1

basic
multimodal
oot
```

```{note}
The complexity of adding a new model depends heavily on the model's architecture.
The process is considerably more straightforward if the model shares a similar architecture with an existing model in vLLM.
However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex.
```

```{tip}
If you encounter issues while integrating your model into vLLM, feel free to open a [GitHub issue](https://github.com/vllm-project/vllm/issues)
or ask on our [developer slack](https://slack.vllm.ai).
We will be happy to help you out!
```

docs/source/contributing/model/enabling_multimodal_inputs.md renamed to docs/source/contributing/model/multimodal.md

Lines changed: 2 additions & 6 deletions
@@ -2,15 +2,11 @@

# Enabling Multimodal Inputs

-This document walks you through the steps to extend a vLLM model so that it accepts [multi-modal inputs](#multimodal-inputs).
-
-```{seealso}
-[Adding a New Model](adding-a-new-model)
-```
+This document walks you through the steps to extend a basic model so that it accepts [multi-modal inputs](#multimodal-inputs).

## 1. Update the base vLLM model

-It is assumed that you have already implemented the model in vLLM according to [these steps](#adding-a-new-model).
+It is assumed that you have already implemented the model in vLLM according to [these steps](#new-model-basic).
Further update the model as follows:

- Implement the {class}`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface.

docs/source/contributing/model/oot.md

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
(new-model-oot)=

# Out-of-Tree Model Integration

You can integrate a model using a plugin without modifying the vLLM codebase.

```{seealso}
[vLLM's Plugin System](#plugin-system)
```

To register the model, use the following code:

```python
from vllm import ModelRegistry
from your_code import YourModelForCausalLM
ModelRegistry.register_model("YourModelForCausalLM", YourModelForCausalLM)
```

If your model imports modules that initialize CUDA, consider lazy-importing it to avoid errors like `RuntimeError: Cannot re-initialize CUDA in forked subprocess`:

```python
from vllm import ModelRegistry

ModelRegistry.register_model("YourModelForCausalLM", "your_code:YourModelForCausalLM")
```

```{important}
If your model is a multimodal model, ensure the model class implements the {class}`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface.
Read more about that [here](#enabling-multimodal-inputs).
```

```{note}
Although you can directly put these code snippets in your script using `vllm.LLM`, the recommended way is to place these snippets in a vLLM plugin. This ensures compatibility with various vLLM features like distributed inference and the API server.
```

docs/source/dev/offline_inference/offline_index.md

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
# Offline Inference

```{toctree}
+:caption: Contents
:maxdepth: 1

llm

docs/source/features/quantization/index.md

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@
Quantization trades off model precision for smaller memory footprint, allowing large models to be run on a wider range of devices.

```{toctree}
+:caption: Contents
:maxdepth: 1

supported_hardware

docs/source/index.md

Lines changed: 1 addition & 2 deletions
@@ -161,8 +161,7 @@ design/multiprocessing
contributing/overview
contributing/profiling/profiling_index
contributing/dockerfile/dockerfile
-contributing/model/adding_model
-contributing/model/enabling_multimodal_inputs
+contributing/model/index
```

# Indices and tables

docs/source/models/supported_models.md

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ print(output)
If vLLM successfully returns text (for generative models) or hidden states (for pooling models), it indicates that your model is supported.
````

-Otherwise, please refer to [Adding a New Model](#adding-a-new-model) and [Enabling Multimodal Inputs](#enabling-multimodal-inputs) for instructions on how to implement your model in vLLM.
+Otherwise, please refer to [Adding a New Model](#new-model) for instructions on how to implement your model in vLLM.
Alternatively, you can [open an issue on GitHub](https://github.com/vllm-project/vllm/issues/new/choose) to request vLLM support.

### ModelScope
