Commit 5460d18
feat: trtllm-serve multimodal support (#3590)
* feat: trtllm-serve multimodal support
* remove disable argument
* remove disable
* add and separate tests and move the doc
* remove block_resue arg from serve.py

Signed-off-by: yechank <[email protected]>
Co-authored-by: Haohang Huang <[email protected]>
1 parent ce83296 commit 5460d18

14 files changed: +665 −41 lines

docs/source/commands/trtllm-serve.rst

+37
@@ -66,6 +66,43 @@ Another example uses ``curl``:
    :language: bash
    :linenos:
 
+Multimodal Serving
+~~~~~~~~~~~~~~~~~~
+
+For multimodal models (e.g., Qwen2-VL), you'll need to create a configuration file and start the server with additional options.
+
+First, create a configuration file:
+
+.. code-block:: bash
+
+   cat > ./extra-llm-api-config.yml <<EOF
+   kv_cache_config:
+     enable_block_reuse: false
+   EOF
+
+Then, start the server with the configuration file:
+
+.. code-block:: bash
+
+   trtllm-serve Qwen/Qwen2-VL-7B-Instruct \
+       --extra_llm_api_options ./extra-llm-api-config.yml \
+       --backend pytorch
+
+Completions API
+~~~~~~~~~~~~~~~
+
+You can query the Completions API with any HTTP client; a typical example is the OpenAI Python client:
+
+.. literalinclude:: ../../../examples/serve/openai_completion_client_for_multimodal.py
+   :language: python
+   :linenos:
+
+Another example uses ``curl``:
+
+.. literalinclude:: ../../../examples/serve/curl_completion_client_for_multimodal.sh
+   :language: bash
+   :linenos:
+
 Benchmark
 ---------
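Once the server reports ready, a quick smoke test confirms the model is being served. A minimal sketch, assuming the default port 8000 and that trtllm-serve exposes the standard OpenAI-compatible `/v1/models` route (the API key value is arbitrary for a local server):

# Smoke test for a freshly started trtllm-serve instance (sketch).
# Assumes default host/port and an OpenAI-compatible /v1/models route.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="tensorrt_llm")

# The served multimodal model should show up in the model list.
for model in client.models.list():
    print(model.id)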

examples/serve/curl_chat_client.sh

+1 −1

@@ -3,7 +3,7 @@
 curl http://localhost:8000/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
-        "model": TinyLlama-1.1B-Chat-v1.0,
+        "model": "TinyLlama-1.1B-Chat-v1.0",
         "messages":[{"role": "system", "content": "You are a helpful assistant."},
                     {"role": "user", "content": "Where is New York?"}],
         "max_tokens": 16,
examples/serve/… (new file)

@@ -0,0 +1,28 @@
+#! /usr/bin/env bash
+
+# Single image inference
+curl http://localhost:8000/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "Qwen2-VL-7B-Instruct",
+        "messages":[{
+            "role": "system",
+            "content": "You are a helpful assistant."
+        }, {
+            "role": "user",
+            "content": [
+                {
+                    "type": "text",
+                    "text": "Describe the natural environment in the image."
+                },
+                {
+                    "type": "image_url",
+                    "image_url": {
+                        "url": "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png"
+                    }
+                }
+            ]
+        }],
+        "max_tokens": 64,
+        "temperature": 0
+    }'

examples/serve/curl_completion_client.sh

+1 −1

@@ -3,7 +3,7 @@
 curl http://localhost:8000/v1/completions \
     -H "Content-Type: application/json" \
     -d '{
-        "model": TinyLlama-1.1B-Chat-v1.0,
+        "model": "TinyLlama-1.1B-Chat-v1.0",
         "prompt": "Where is New York?",
         "max_tokens": 16,
         "temperature": 0
examples/serve/… (new file)

@@ -0,0 +1,36 @@
+### OpenAI Chat Client
+
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="tensorrt_llm",
+)
+
+# Single image inference
+response = client.chat.completions.create(
+    model="Qwen2-VL-7B-Instruct",
+    messages=[{
+        "role": "system",
+        "content": "you are a helpful assistant"
+    }, {
+        "role": "user",
+        "content": [{
+            "type": "text",
+            "text": "Describe the natural environment in the image."
+        }, {
+            "type": "image_url",
+            "image_url": {
+                "url": "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png"
+            }
+        }]
+    }],
+    max_tokens=64,
+)
+print(response)
+
+# TODO
+# multi-image inference
+# video inference

tensorrt_llm/_torch/models/modeling_llava_next.py

+1 −1

@@ -136,7 +136,7 @@ def __call__(
         self, inputs: TextPrompt, sampling_params: SamplingParams
     ) -> Tuple[List[int], Optional[ExtraProcessedInputs]]:
         text_prompt, mm_data = inputs.get("prompt"), inputs.get(
-            "multi_modal_data")
+            "multi_modal_data", {})
         assert 'image' in mm_data
 
         input_ids = self.tokenizer(
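The new `{}` default changes the failure mode for requests that carry no multimodal payload: `dict.get` without a default returns `None`, and the `'image' in mm_data` membership test then raises a `TypeError` instead of a clean `AssertionError`. A standalone sketch of both behaviors, with a plain dict standing in for `TextPrompt`:

inputs = {"prompt": "Describe the image."}  # no "multi_modal_data" key

# Old behavior: .get() returns None, and membership tests on None raise.
mm_data = inputs.get("multi_modal_data")
try:
    assert "image" in mm_data
except TypeError as err:
    print("TypeError:", err)  # argument of type 'NoneType' is not iterable

# New behavior: the {} default keeps the assert well-defined.
mm_data = inputs.get("multi_modal_data", {})
try:
    assert "image" in mm_data
except AssertionError:
    print("AssertionError: request carried no image")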

tensorrt_llm/_torch/models/modeling_qwen2vl.py

+1 −1

@@ -312,7 +312,7 @@ def __call__(
         sampling_params: SamplingParams,
     ) -> Tuple[List[int], Optional[ExtraProcessedInputs]]:
         text_prompt, mm_data, mm_processor_kwargs = inputs.get("prompt"), \
-            inputs.get("multi_modal_data"), inputs.get("mm_processor_kwargs", {})
+            inputs.get("multi_modal_data", {}), inputs.get("mm_processor_kwargs", {})
 
         # NOTE: Since we are passed in Tensor images, we don't need to rescale them.
         mm_processor_kwargs['do_rescale'] = False
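The `do_rescale` override pairs with the NOTE above: images reaching this processor are already float tensors in `[0, 1]`, so letting the image processor rescale again would divide by 255 twice. A minimal sketch of that arithmetic in plain `torch`, independent of the actual processor:

import torch

raw = torch.randint(0, 256, (3, 4, 4), dtype=torch.uint8)  # raw pixels: need one 1/255 rescale
scaled = raw.float() / 255.0                                # what this code path receives

# Rescaling a second time squashes values into [0, 1/255] and washes out the image.
double_scaled = scaled / 255.0
print(scaled.max().item(), double_scaled.max().item())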

tensorrt_llm/_torch/models/modeling_vila.py

+2 −2

@@ -1093,8 +1093,8 @@ def __call__(
         (3) passed input_ids and mm_embed via LlmRequest's prompt_token_ids and prompt_embedding_table fields respectively. LlmRequests can be inflight batched, and the mm_embed is passed to LLM model as `multi_modal_data` which is List[torch.Tensor] for batched requests.
         """
 
-        text_prompt = inputs["prompt"]
-        mm_data = inputs["multi_modal_data"]
+        text_prompt, mm_data = inputs.get("prompt"), inputs.get(
+            "multi_modal_data", {})
         mm_processor_kwargs = inputs.get("mm_processor_kwargs", {})
 
         text_prompt = _apply_chat_template(text_prompt, self.conv_mode,
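Unlike the two changes above, this one also replaces hard indexing, so a text-only request now degrades gracefully instead of raising `KeyError` before any handling can occur. A small sketch of the before/after behavior:

inputs = {"prompt": "Hello"}  # text-only request: no "multi_modal_data" key

# Old behavior: hard indexing raises immediately.
try:
    mm_data = inputs["multi_modal_data"]
except KeyError as err:
    print("KeyError:", err)

# New behavior: .get() with a {} default lets the prompt flow through.
text_prompt, mm_data = inputs.get("prompt"), inputs.get("multi_modal_data", {})
print(text_prompt, mm_data)  # Hello {}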
