If you run out of CPU RAM, try the following options:

- (Multi-modal models only) you can set the size of the multi-modal input cache using the `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
- (CPU backend only) you can set the size of the KV cache using the `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB). A short sketch of setting both variables is shown below.

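Both variables are read from the environment, so you can export them in your shell before launching vLLM or set them from Python before constructing the engine. The snippet below is a minimal sketch of the latter approach; the 8 GiB values and the model name are placeholders, not recommendations.

```python
import os

# Set the cache sizes (in GiB) before vLLM is imported and the engine is built.
# 8 GiB is an arbitrary example value, not a recommendation.
os.environ["VLLM_MM_INPUT_CACHE_GIB"] = "8"   # multi-modal input cache (multi-modal models only)
os.environ["VLLM_CPU_KVCACHE_SPACE"] = "8"    # KV cache size (CPU backend only)

from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct")
```
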
#### Disable unused modalities

You can disable unused modalities (except for text) by setting their limit to zero.

For example, if your application only accepts image input, there is no need to allocate any memory for videos.

```python
from vllm import LLM

# Accept images but not videos
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
          limit_mm_per_prompt={"video": 0})
```

You can even run a multi-modal model for text-only inference:

```python
from vllm import LLM

# Don't accept images. Just text.
llm = LLM(model="google/gemma-3-27b-it",
          limit_mm_per_prompt={"image": 0})
```

### Performance optimization and tuning

You can potentially improve the performance of vLLM by tuning various options.