
Commit 1c25c63

alex-jw-brooks authored and wuisawesome committed
[Doc] Split dummy_processor_inputs() in Multimodal Docs (vllm-project#16915)
Signed-off-by: Alex-Brooks <[email protected]>
1 parent c17e677 commit 1c25c63

File tree

2 files changed: +35 -29 lines changed


docs/source/contributing/model/multimodal.md

Lines changed: 34 additions & 28 deletions
@@ -128,11 +128,9 @@ HF processing as well as memory profiling.
 
 ### For memory profiling
 
-Override the abstract method {meth}`~vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_processor_inputs`
-to construct dummy inputs for memory profiling. This dummy input should result in the worst-case memory usage of
-the model so that vLLM can reserve the correct amount of memory for it.
+Override the abstract methods {meth}`~vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_text` and {meth}`~vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_mm_data` to construct dummy inputs for memory profiling. These dummy inputs should result in the worst-case memory usage of the model so that vLLM can reserve the correct amount of memory for it.
 
-Assuming that the memory usage increases with the number of tokens, the dummy input can be constructed to maximize the number of output embeddings, which is the same number as placeholder feature tokens.
+Assuming that the memory usage increases with the number of tokens, the dummy inputs can be constructed to maximize the number of output embeddings, which is the same number as placeholder feature tokens.
 
 ::::{tab-set}
 :::{tab-item} Basic example: LLaVA
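
For orientation, a skeleton of a builder that overrides both of the new abstract methods might look roughly like the following. This is a minimal sketch, not taken from this commit: `MyDummyInputsBuilder`, the `"<image>"` placeholder string, and the fixed 336x336 image size are illustrative assumptions.

```python
# Minimal sketch of a dummy inputs builder under the split API.
# The class name, the "<image>" placeholder, and the 336x336 size are
# illustrative assumptions, not part of this commit.
from collections.abc import Mapping

from vllm.multimodal.inputs import MultiModalDataDict
from vllm.multimodal.processing import BaseProcessingInfo
from vllm.multimodal.profiling import BaseDummyInputsBuilder


class MyDummyInputsBuilder(BaseDummyInputsBuilder[BaseProcessingInfo]):

    def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
        # One placeholder token per dummy image.
        num_images = mm_counts.get("image", 0)
        return "<image>" * num_images

    def get_dummy_mm_data(
        self,
        seq_len: int,
        mm_counts: Mapping[str, int],
    ) -> MultiModalDataDict:
        # Use the largest image size the model supports so that
        # profiling reserves memory for the worst case.
        num_images = mm_counts.get("image", 0)
        return {
            "image":
            self._get_dummy_images(width=336,
                                   height=336,
                                   num_images=num_images),
        }
```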
@@ -244,38 +242,45 @@ def get_num_image_tokens(
 ```
 
 Notice that the number of image tokens doesn't depend on the image width and height.
-We can simply use a dummy `image_size`:
+We can simply use a dummy `image_size` to calculate the multimodal profiling data:
 
 ```python
+# NOTE: In actuality, this is usually implemented as part of the
+# model's subclass of `BaseProcessingInfo`, but we show it as is
+# here for simplicity.
 def get_image_size_with_most_features(self) -> ImageSize:
     hf_config = self.get_hf_config()
     width = height = hf_config.image_size
     return ImageSize(width=width, height=height)
 
-def get_dummy_processor_inputs(
+def get_dummy_mm_data(
     self,
     seq_len: int,
     mm_counts: Mapping[str, int],
-) -> ProcessorInputs:
+) -> MultiModalDataDict:
     num_images = mm_counts.get("image", 0)
 
-    processor = self.info.get_hf_processor()
-    image_token = processor.image_token
-
-    hf_config = self.get_hf_config()
-    target_width, target_height = self.info.get_image_size_with_most_features()
+    target_width, target_height = \
+        self.info.get_image_size_with_most_features()
 
-    mm_data = {
+    return {
         "image":
         self._get_dummy_images(width=target_width,
                                height=target_height,
                                num_images=num_images)
     }
+```
 
-    return ProcessorInputs(
-        prompt_text=image_token * num_images,
-        mm_data=mm_data,
-    )
+For the text, we simply expand the multimodal image token from the model config to match the desired number of images.
+
+```python
+def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
+    num_images = mm_counts.get("image", 0)
+
+    processor = self.info.get_hf_processor()
+    image_token = processor.image_token
+
+    return image_token * num_images
 ```
 
 :::
@@ -412,29 +417,30 @@ def get_image_size_with_most_features(self) -> ImageSize:
 
 Fuyu does not expect image placeholders in the inputs to HF processor, so
 the dummy prompt text is empty regardless of the number of images.
-Otherwise, the logic of this method is very similar to LLaVA:
 
 ```python
-def get_dummy_processor_inputs(
+def get_dummy_text(self, mm_counts: Mapping[str, int]) -> str:
+    return ""
+```
+
+For the multimodal image profiling data, the logic is very similar to LLaVA:
+
+```python
+def get_dummy_mm_data(
     self,
     seq_len: int,
     mm_counts: Mapping[str, int],
-) -> ProcessorInputs:
+) -> MultiModalDataDict:
     target_width, target_height = \
         self.info.get_image_size_with_most_features()
     num_images = mm_counts.get("image", 0)
 
-    mm_data = {
+    return {
         "image":
         self._get_dummy_images(width=target_width,
-                               height=target_height,
-                               num_images=num_images)
+                               height=target_height,
+                               num_images=num_images)
     }
-
-    return ProcessorInputs(
-        prompt_text="",
-        mm_data=mm_data,
-    )
 ```
 
 :::
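
Not shown in this diff is how the two halves are recombined. Presumably the base class, rather than each model, now assembles the same `ProcessorInputs` that `get_dummy_processor_inputs()` used to return. A rough sketch under that assumption follows; the helper name `build_dummy_processor_inputs` and the import location of `ProcessorInputs` are assumptions, not taken from this commit.

```python
# Sketch of how the split pieces are presumably recombined for profiling;
# the actual base-class implementation is not shown in this commit, and
# the import location of ProcessorInputs is an assumption.
from collections.abc import Mapping

from vllm.multimodal.profiling import BaseDummyInputsBuilder, ProcessorInputs


def build_dummy_processor_inputs(
    builder: BaseDummyInputsBuilder,
    seq_len: int,
    mm_counts: Mapping[str, int],
) -> ProcessorInputs:
    # Equivalent to what each model's get_dummy_processor_inputs() built by hand.
    dummy_text = builder.get_dummy_text(mm_counts)
    dummy_mm_data = builder.get_dummy_mm_data(seq_len, mm_counts)
    return ProcessorInputs(prompt_text=dummy_text, mm_data=dummy_mm_data)
```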

docs/source/design/mm_processing.md

Lines changed: 1 addition & 1 deletion
@@ -47,7 +47,7 @@ Moreover, since the tokenized text has not passed through the HF processor, we h
 
 ### Dummy text
 
-We work around the first issue by requiring each model to define how to generate dummy text based on the number of multi-modal inputs, via {meth}`~vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_processor_inputs`. This lets us generate dummy text corresponding to the multi-modal inputs and input them together to obtain the processed multi-modal data.
+We work around the first issue by requiring each model to define how to generate dummy text based on the number of multi-modal inputs, via {meth}`~vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_text`. This lets us generate dummy text corresponding to the multi-modal inputs and input them together to obtain the processed multi-modal data.
 
 (mm-automatic-prompt-updating)=
 