Commit 8d1f28b
Add docs for disabling unused modalities
Signed-off-by: DarkLight1337 <[email protected]>
1 parent 0a038dc commit 8d1f28b

docs/source/serving/offline_inference.md

Lines changed: 24 additions & 0 deletions
@@ -110,6 +110,30 @@ If you run out of CPU RAM, try the following options:
- (Multi-modal models only) you can set the size of multi-modal input cache using `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
- (CPU backend only) you can set the size of KV cache using `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
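
The two cache-related environment variables above can also be set from Python before constructing the engine. The snippet below is a minimal sketch, not part of this commit, and uses illustrative sizes rather than the defaults:

```python
import os

# Illustrative sizes, not the defaults: shrink the multi-modal input cache
# to 2 GiB and (CPU backend only) reserve 8 GiB for the KV cache.
os.environ["VLLM_MM_INPUT_CACHE_GIB"] = "2"
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "8"

from vllm import LLM

# Set the variables before constructing the LLM so the engine picks them up.
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")
```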

#### Disable unused modalities
You can disable unused modalities (except for text) by setting their limit to zero.
For example, if your application only accepts image input, there is no need to allocate any memory for videos.
```python
from vllm import LLM
# Accept images but not videos
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
124+
limit_mm_per_prompt={"video": 0})
```
You can even run a multi-modal model for text-only inference:
```python
from vllm import LLM
# Don't accept images. Just text.
llm = LLM(model="google/gemma-3-27b-it",
134+
limit_mm_per_prompt={"image": 0})
```
### Performance optimization and tuning

You can potentially improve the performance of vLLM by fine-tuning various options.
