docs/source/contributing/model/basic.md (11 additions, 1 deletion)

@@ -57,7 +57,17 @@ class MyModelForCausalLM(nn.Module):
### Computation Code

-Rewrite the {meth}`~torch.nn.Module.forward` method of your model to remove any unnecessary code, such as training-specific code. Modify the input parameters to treat `input_ids` and `positions` as flattened tensors with a single batch size dimension, without a max-sequence length dimension.
+- Add a `get_input_embeddings` method inside the `MyModel` module that returns the text embeddings given `input_ids`. This is equivalent to directly calling the text embedding layer, but provides a unified interface in case `MyModel` is used within a composite multimodal model.
+- Rewrite the {meth}`~torch.nn.Module.forward` method of your model to remove any unnecessary code, such as training-specific code. Modify the input parameters to treat `input_ids` and `positions` as flattened tensors with a single batch size dimension, without a max-sequence length dimension.
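
For illustration, a minimal sketch of how these two added steps might look inside a model file. The attribute names and the simplified `forward` signature are assumptions for this sketch, not code taken from the diff (the real signature also carries vLLM-specific arguments such as `kv_caches` and `attn_metadata`, as in the `multimodal.md` example further down):

```python
import torch
import torch.nn as nn


class MyModel(nn.Module):
    """Hypothetical decoder backbone; only the parts relevant to the two steps above are shown."""

    def __init__(self, vocab_size: int, hidden_size: int) -> None:
        super().__init__()
        # Assumed attribute name for the text embedding layer.
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)

    def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Unified interface for text embeddings, so a composite multimodal model
        # can obtain them without knowing this layer's attribute name.
        return self.embed_tokens(input_ids)

    def forward(
        self,
        input_ids: torch.Tensor,   # flattened: shape (num_tokens,), no max-seq-len dimension
        positions: torch.Tensor,   # flattened: shape (num_tokens,)
    ) -> torch.Tensor:
        # Inference-only path: no loss computation or other training-specific code.
        hidden_states = self.get_input_embeddings(input_ids)
        # ... run the decoder layers over hidden_states using `positions` ...
        return hidden_states
```

The flattened `input_ids`/`positions` shape reflects how vLLM batches execution: tokens from all in-flight requests are packed along a single dimension rather than padded to a maximum sequence length.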

docs/source/contributing/model/multimodal.md (72 additions, 15 deletions)

@@ -9,7 +9,78 @@ This document walks you through the steps to extend a basic model so that it acc
It is assumed that you have already implemented the model in vLLM according to [these steps](#new-model-basic).
Further update the model as follows:

-- Implement the {class}`~vllm.model_executor.models.interfaces.SupportsMultiModal` interface.
+- Reserve a keyword parameter in {meth}`~torch.nn.Module.forward` for each input tensor that corresponds to a multi-modal input, as shown in the following example:
+
+  ```diff
+    def forward(
+        self,
+        input_ids: torch.Tensor,
+        positions: torch.Tensor,
+        kv_caches: List[torch.Tensor],
+        attn_metadata: AttentionMetadata,
+  +     pixel_values: torch.Tensor,
+    ) -> SamplerOutput:
+  ```
+
+  More conveniently, you can simply pass `**kwargs` to the {meth}`~torch.nn.Module.forward` method and retrieve the keyword parameters for multimodal inputs from it.
+
+- Implement {meth}`~vllm.model_executor.models.interfaces.SupportsMultiModal.get_multimodal_embeddings` that returns the embeddings from running the multimodal inputs through the multimodal tokenizer of the model. Below we provide a boilerplate of a typical implementation pattern, but feel free to adjust it to your own needs.
+
+  The returned `multimodal_embeddings` must be either a **3D {class}`torch.Tensor`** of shape `(num_items, feature_size, hidden_size)`, or a **list/tuple of 2D {class}`torch.Tensor`'s** of shape `(feature_size, hidden_size)`, so that `multimodal_embeddings[i]` retrieves the embeddings generated from the `i`-th multimodal data item (e.g. image) of the request.
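
As a stand-in for the boilerplate referenced in the bullet above, here is a minimal sketch of the typical pattern. The `language_model`, `vision_encoder`, and `projector` attributes and the `_parse_and_validate_image_input` helper are names assumed for this sketch, not code taken from this change:

```python
from typing import Optional

import torch
import torch.nn as nn


class MyModelForCausalLM(nn.Module):
    """Hypothetical multimodal model; only get_multimodal_embeddings is fleshed out."""

    def __init__(self, language_model: nn.Module, vision_encoder: nn.Module,
                 projector: nn.Module) -> None:
        super().__init__()
        self.language_model = language_model  # the MyModel backbone from basic.md
        self.vision_encoder = vision_encoder  # e.g. a ViT producing per-patch features
        self.projector = projector            # maps vision features to the LM hidden size

    def _parse_and_validate_image_input(self, **kwargs: object) -> Optional[torch.Tensor]:
        # Recover the keyword reserved in forward() (here: pixel_values) and validate it.
        pixel_values = kwargs.pop("pixel_values", None)
        if pixel_values is None:
            return None
        if not isinstance(pixel_values, torch.Tensor):
            raise ValueError(
                f"Expected pixel_values to be a torch.Tensor, got {type(pixel_values)}")
        return pixel_values

    def get_multimodal_embeddings(self, **kwargs: object) -> Optional[torch.Tensor]:
        # Returns embeddings shaped (num_items, feature_size, hidden_size),
        # or None when the request carries no multimodal data.
        image_input = self._parse_and_validate_image_input(**kwargs)
        if image_input is None:
            return None
        image_features = self.vision_encoder(image_input)
        return self.projector(image_features)
```

Accepting `**kwargs` here mirrors the note above about passing `**kwargs` through `forward`: the reserved keyword (`pixel_values` in the running example) is simply looked up by name and validated before being run through the encoder.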
+- Implement {meth}`~vllm.model_executor.models.interfaces.SupportsMultiModal.get_input_embeddings` to merge `multimodal_embeddings` with the text embeddings from `input_ids`. If input processing for the model is implemented correctly (see sections below), then you can leverage the utility function we provide to easily merge the embeddings.
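
Similarly, a sketch of `get_input_embeddings` for the last bullet, continuing the class above. The import path and call signature of `merge_multimodal_embeddings`, as well as the placeholder token id, are assumptions to verify against the vLLM version you are targeting:

```python
from typing import List, Optional, Union

import torch
import torch.nn as nn

# Assumed import path and signature for the merging utility; check the vLLM source
# you are targeting before relying on either.
from vllm.model_executor.models.utils import merge_multimodal_embeddings

IMAGE_TOKEN_ID = 32000  # hypothetical placeholder-token id for image positions


class MyModelForCausalLM(nn.Module):  # continuing the sketch above; __init__ as before
    def get_input_embeddings(
        self,
        input_ids: torch.Tensor,
        multimodal_embeddings: Optional[Union[torch.Tensor, List[torch.Tensor]]] = None,
    ) -> torch.Tensor:
        # Text embeddings first, via the unified interface added to MyModel in basic.md.
        inputs_embeds = self.language_model.get_input_embeddings(input_ids)
        if multimodal_embeddings is not None:
            # Scatter the multimodal embeddings into the placeholder-token positions.
            inputs_embeds = merge_multimodal_embeddings(
                input_ids, inputs_embeds, multimodal_embeddings, IMAGE_TOKEN_ID)
        return inputs_embeds
```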