docs/source/models/supported_models.md

You can force the use of `TransformersForCausalLM` by setting `model_impl="transformers"`.

:::{note}
vLLM may not fully optimise the Transformers implementation so you may see degraded performance if comparing a native model to a Transformers model in vLLM.
:::
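
For instance, here is a minimal sketch of forcing the Transformers implementation from the offline API (the model name is only an illustration; any Transformers-compatible checkpoint works):

```python
from vllm import LLM

# model_impl="transformers" forces the Transformers modeling backend even for
# models that vLLM also implements natively. The model name is illustrative.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", model_impl="transformers")
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```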

#### Custom models

If a model is supported natively by neither vLLM nor Transformers, it can still be used in vLLM!

For a model to be compatible with the Transformers backend for vLLM it must:

- be a Transformers compatible custom model (see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models)):
  * The model directory must have the correct structure (e.g. `config.json` is present).
  * `config.json` must contain `auto_map.AutoModel` (one way to produce this is sketched after this list).
- be a Transformers backend for vLLM compatible model (see <project:#writing-custom-models>):
  * Customisation should be done in the base model (e.g. in `MyModel`, not `MyModelForCausalLM`).
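
If you are writing the custom model yourself, one way to get `auto_map.AutoModel` into `config.json` is to register your classes with the Auto API before saving. This is only a rough sketch of the Transformers custom-model workflow; the module and class names are placeholders for your own code:

```python
# Placeholder imports: these stand in for your own custom model code, which must
# live in importable module files (e.g. configuration_my_model.py / modeling_my_model.py).
from configuration_my_model import MyConfig
from modeling_my_model import MyModel

# Registering for the Auto API makes save_pretrained()/push_to_hub() record the
# classes under "auto_map" in config.json and copy the source files alongside it.
MyConfig.register_for_auto_class()
MyModel.register_for_auto_class("AutoModel")

model = MyModel(MyConfig())
model.save_pretrained("my_custom_model")  # config.json now contains auto_map.AutoModel
```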

If the compatible model is:

- on the Hugging Face Model Hub, simply set `trust_remote_code=True` for <project:#offline-inference> or `--trust-remote-code` for the <project:#openai-compatible-server>.
- in a local directory, simply pass the directory path to `model=<MODEL_DIR>` for <project:#offline-inference> or `vllm serve <MODEL_DIR>` for the <project:#openai-compatible-server>.

This means that, with the Transformers backend for vLLM, new models can be used before they are officially supported in Transformers or vLLM!
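
For example, here is a minimal offline-inference sketch (the model ID is a placeholder for your own Hub repository or local directory):

```python
from vllm import LLM

# Placeholder model ID: substitute your own Hub repository or a local directory path.
# trust_remote_code=True allows the custom code referenced by auto_map to be loaded.
llm = LLM(model="my-org/my-custom-model", task="generate", trust_remote_code=True)

outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)
```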

#### Writing custom models

This section details the modifications needed to make a Transformers compatible custom model also compatible with the Transformers backend for vLLM. (We assume that a Transformers compatible custom model has already been created; see [Transformers - Customizing models](https://huggingface.co/docs/transformers/en/custom_models).)

To make your model compatible with the Transformers backend, it needs:

1. `kwargs` must be passed down through all modules from `MyModel` to `MyAttention`.
2. `MyAttention` must use `ALL_ATTENTION_FUNCTIONS` to call attention (a rough sketch follows the code block below).
3. `MyModel` must contain `_supports_attention_backend = True`.

```{code-block} python
:caption: modeling_my_model.py

from torch import nn
from transformers import PreTrainedModel

class MyAttention(nn.Module):

    def forward(self, hidden_states, **kwargs):  # <- kwargs are required
        ...

class MyModel(PreTrainedModel):
    _supports_attention_backend = True
```
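
The `...` above elides the rest of the attention module. As a rough illustration of requirement 2 only (the projection layers, shapes, and config fields below are assumptions for the sketch, not code taken from vLLM or any particular model), the attention call in recent Transformers models looks roughly like this:

```python
import torch
from torch import nn
from transformers import PretrainedConfig
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS


class MyAttention(nn.Module):
    def __init__(self, config: PretrainedConfig):
        super().__init__()
        self.config = config
        self.head_dim = config.hidden_size // config.num_attention_heads
        self.q_proj = nn.Linear(config.hidden_size, config.hidden_size)
        self.k_proj = nn.Linear(config.hidden_size, config.hidden_size)
        self.v_proj = nn.Linear(config.hidden_size, config.hidden_size)
        self.o_proj = nn.Linear(config.hidden_size, config.hidden_size)

    def forward(self, hidden_states: torch.Tensor, attention_mask=None, **kwargs):
        batch, seq_len, _ = hidden_states.shape
        shape = (batch, seq_len, -1, self.head_dim)
        query = self.q_proj(hidden_states).view(shape).transpose(1, 2)
        key = self.k_proj(hidden_states).view(shape).transpose(1, 2)
        value = self.v_proj(hidden_states).view(shape).transpose(1, 2)

        # Look up the attention function selected via config._attn_implementation.
        # When the model runs under vLLM this is set to "vllm", so vLLM's attention
        # layers are used. (Real models usually special-case the default "eager"
        # implementation before this lookup.)
        attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]

        # **kwargs must be forwarded unchanged so the chosen backend receives
        # everything it needs (requirement 1).
        attn_output, attn_weights = attention_interface(
            self,
            query,
            key,
            value,
            attention_mask,
            scaling=self.head_dim**-0.5,
            **kwargs,
        )
        attn_output = attn_output.reshape(batch, seq_len, -1).contiguous()
        return self.o_proj(attn_output), attn_weights
```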

Here is what happens in the background when this model is loaded:

1. The config is loaded.
2. The `MyModel` Python class is loaded from the `auto_map` in the config, and we check that the model `is_backend_compatible()`.
3. `MyModel` is loaded into `TransformersForCausalLM` (see <gh-file:vllm/model_executor/models/transformers.py>), which sets `self.config._attn_implementation = "vllm"` so that vLLM's attention layer is used.