
Commit 17e91fe

gaocegege, rafvasq, DarkLight1337, and mgoin authored and committed
[Frontend] Support reasoning content for deepseek r1 (vllm-project#12473)
Signed-off-by: Ce Gao <[email protected]>
Co-authored-by: Rafael Vasquez <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Signed-off-by: Isotr0py <[email protected]>
1 parent f427e52 commit 17e91fe

16 files changed: +977 -5 lines changed
+151
@@ -0,0 +1,151 @@
(reasoning-outputs)=

# Reasoning Outputs

vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions.

Reasoning models return an additional `reasoning_content` field in their outputs, which contains the reasoning steps that led to the final conclusion. This field is not present in the outputs of other models.

## Supported Models

vLLM currently supports the following reasoning models:

- [DeepSeek R1 series](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d) (`deepseek_r1`, which looks for `<think> ... </think>`; see the sketch below)
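
As a rough illustration of what "looks for `<think> ... </think>`" means, the snippet below splits a completed response on those tags. This is only a sketch, not the parser vLLM ships: the example string is made up, and the real `deepseek_r1` parser also handles streaming deltas and token IDs.

```python
from typing import Optional, Tuple

# Illustrative only: split a DeepSeek-R1-style completion into reasoning and answer.
def split_think_tags(model_output: str) -> Tuple[Optional[str], Optional[str]]:
    start_tag, end_tag = "<think>", "</think>"
    if end_tag not in model_output:
        # No closing tag: treat the whole output as the final answer.
        return None, model_output
    before, _, after = model_output.partition(end_tag)
    reasoning = before.replace(start_tag, "", 1).strip()
    return reasoning or None, after.strip() or None


reasoning, answer = split_think_tags(
    "<think>Compare the decimal parts: 0.8 > 0.11.</think>9.8 is greater.")
print("reasoning:", reasoning)  # Compare the decimal parts: 0.8 > 0.11.
print("answer:", answer)        # 9.8 is greater.
```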
## Quickstart

To use reasoning models, specify the `--enable-reasoning` and `--reasoning-parser` flags when starting the vLLM server. The `--reasoning-parser` flag selects the reasoning parser used to extract reasoning content from the model output.

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --enable-reasoning --reasoning-parser deepseek_r1
```

Next, make a request to the model; the response will include the reasoning content.

```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Round 1
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
response = client.chat.completions.create(model=model, messages=messages)

reasoning_content = response.choices[0].message.reasoning_content
content = response.choices[0].message.content

print("reasoning_content:", reasoning_content)
print("content:", content)
```

The `reasoning_content` field contains the reasoning steps that led to the final conclusion, while the `content` field contains the final conclusion itself.
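
If you continue the conversation, the multi-turn example script added in this commit appends only the final `content` (not the `reasoning_content`) back into the message history before asking a follow-up question. A condensed sketch, reusing the `client`, `model`, `messages`, and `content` variables from the snippet above:

```python
# Round 2: feed only the final answer back into the history, then ask a follow-up.
messages.append({"role": "assistant", "content": content})
messages.append({
    "role": "user",
    "content": "How many Rs are there in the word 'strawberry'?",
})
response = client.chat.completions.create(model=model, messages=messages)

print("reasoning_content:", response.choices[0].message.reasoning_content)
print("content:", response.choices[0].message.content)
```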
## Streaming chat completions

Streaming chat completions are also supported for reasoning models. The `reasoning_content` field is available in the `delta` field of [chat completion response chunks](https://platform.openai.com/docs/api-reference/chat/streaming).

```json
{
    "id": "chatcmpl-123",
    "object": "chat.completion.chunk",
    "created": 1694268190,
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "system_fingerprint": "fp_44709d6fcb",
    "choices": [
        {
            "index": 0,
            "delta": {
                "role": "assistant",
                "reasoning_content": "is"
            },
            "logprobs": null,
            "finish_reason": null
        }
    ]
}
```

Please note that streaming `reasoning_content` is not compatible with the OpenAI Python client library. You can use the `requests` library to make streaming requests instead.
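
For reference, a condensed version of the `requests`-based streaming loop (the full script is included in this commit; the server address, API key, and model name below are the same placeholders used elsewhere on this page):

```python
import json

import requests

# Stream a chat completion from a local vLLM server.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    headers={"Authorization": "Bearer EMPTY"},
    json={
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
        "messages": [{"role": "user", "content": "9.11 and 9.8, which is greater?"}],
        "stream": True,
    },
    stream=True,
)

for line in response.iter_lines():
    if not line:
        continue  # skip keep-alive newlines
    data = line.decode("utf-8").removeprefix("data:").strip()
    if data == "[DONE]":
        break
    delta = json.loads(data)["choices"][0]["delta"]
    # Reasoning tokens and answer tokens arrive in separate delta fields.
    print(delta.get("reasoning_content") or delta.get("content") or "", end="", flush=True)
```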
## How to support a new reasoning model

You can add a new `ReasoningParser` similar to `vllm/entrypoints/openai/reasoning_parsers/deepseek_r1_reasoning_parser.py`.

```python
# import the required packages
from typing import Optional, Sequence, Tuple, Union

from vllm.entrypoints.openai.reasoning_parsers.abs_reasoning_parsers import (
    ReasoningParser, ReasoningParserManager)
from vllm.entrypoints.openai.protocol import (ChatCompletionRequest,
                                              DeltaMessage)
# AnyTokenizer is vLLM's tokenizer union type (import path may vary by version).
from vllm.transformers_utils.tokenizer import AnyTokenizer

# define a reasoning parser and register it to vllm
# the name list in register_module can be used
# in --reasoning-parser.
@ReasoningParserManager.register_module(["example"])
class ExampleParser(ReasoningParser):
    def __init__(self, tokenizer: AnyTokenizer):
        super().__init__(tokenizer)

    def extract_reasoning_content_streaming(
        self,
        previous_text: str,
        current_text: str,
        delta_text: str,
        previous_token_ids: Sequence[int],
        current_token_ids: Sequence[int],
        delta_token_ids: Sequence[int],
    ) -> Union[DeltaMessage, None]:
        """
        Instance method that should be implemented for extracting reasoning
        from an incomplete response; for use when handling reasoning calls and
        streaming. Has to be an instance method because it requires state -
        the current tokens/diffs, but also the information about what has
        previously been parsed and extracted (see constructor).
        """

    def extract_reasoning_content(
            self, model_output: str, request: ChatCompletionRequest
    ) -> Tuple[Optional[str], Optional[str]]:
        """
        Extract reasoning content from a complete model-generated string.

        Used for non-streaming responses where we have the entire model
        response available before sending to the client.

        Parameters:
        model_output: str
            The model-generated string to extract reasoning content from.

        request: ChatCompletionRequest
            The request object that was used to generate the model_output.

        Returns:
        Tuple[Optional[str], Optional[str]]
            A tuple containing the reasoning content and the content.
        """
```

After defining the reasoning parser, you can use it by specifying the `--reasoning-parser` flag when starting the vLLM server.

```bash
vllm serve <model_tag> \
    --enable-reasoning --reasoning-parser example
```

## Limitations

- The reasoning content is only available for online serving's chat completion endpoint (`/v1/chat/completions`).
- It is not compatible with the [`structured_outputs`](#structured_outputs) and [`tool_calling`](#tool_calling) features.
- The reasoning content is not available for all models. Check the model's documentation to see if it supports reasoning.

docs/source/index.md

+1
@@ -90,6 +90,7 @@ models/extensions/index
 features/quantization/index
 features/lora
 features/tool_calling
+features/reasoning_outputs
 features/structured_outputs
 features/automatic_prefix_caching
 features/disagg_prefill
@@ -0,0 +1,53 @@
"""
An example showing how to generate chat completions from reasoning models
like DeepSeek R1.

To run this example, you need to start the vLLM server with the reasoning
parser:

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --enable-reasoning --reasoning-parser deepseek_r1
```

This example demonstrates how to generate chat completions from reasoning
models using the OpenAI Python client library.
"""

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Round 1
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
response = client.chat.completions.create(model=model, messages=messages)

reasoning_content = response.choices[0].message.reasoning_content
content = response.choices[0].message.content

print("reasoning_content:", reasoning_content)
print("content:", content)

# Round 2
messages.append({"role": "assistant", "content": content})
messages.append({
    "role": "user",
    "content": "How many Rs are there in the word 'strawberry'?",
})
response = client.chat.completions.create(model=model, messages=messages)

reasoning_content = response.choices[0].message.reasoning_content
content = response.choices[0].message.content

print("reasoning_content:", reasoning_content)
print("content:", content)
@@ -0,0 +1,90 @@
"""
An example showing how to generate chat completions from reasoning models
like DeepSeek R1.

To run this example, you need to start the vLLM server with the reasoning
parser:

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --enable-reasoning --reasoning-parser deepseek_r1
```

Unlike openai_chat_completion_with_reasoning.py, this example demonstrates the
streaming chat completions feature.

The streaming chat completions feature allows you to receive chat completions
in real time as they are generated by the model. This is useful for scenarios
where you want to display chat completions to the user as they are generated
by the model.

Here we do not use the OpenAI Python client library, because it does not
support `reasoning_content` fields in the response.
"""

import json

import requests

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

models = requests.get(
    f"{openai_api_base}/models",
    headers={
        "Authorization": f"Bearer {openai_api_key}"
    },
).json()
model = models["data"][0]["id"]

# Streaming chat completions
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]

response = requests.post(
    f"{openai_api_base}/chat/completions",
    headers={"Authorization": f"Bearer {openai_api_key}"},
    json={
        "model": model,
        "messages": messages,
        "stream": True
    },
    # stream=True lets requests yield response lines as they arrive.
    stream=True,
)

print("client: Start streaming chat completions...")
printed_reasoning_content = False
printed_content = False
# Make the streaming request
if response.status_code == 200:
    # Process the streaming response
    for line in response.iter_lines():
        if line:  # Filter out keep-alive new lines
            # Decode the line and parse the JSON
            decoded_line = line.decode("utf-8")
            if decoded_line.startswith("data:"):
                data = decoded_line[5:].strip()  # Remove "data:" prefix
                if data == "[DONE]":  # End of stream
                    print("\nclient: Stream completed.")
                    break
                try:
                    # Parse the JSON data
                    chunk = json.loads(data)
                    reasoning_content = chunk["choices"][0]["delta"].get(
                        "reasoning_content", "")
                    content = chunk["choices"][0]["delta"].get("content", "")

                    if reasoning_content:
                        if not printed_reasoning_content:
                            printed_reasoning_content = True
                            print("reasoning_content:", end="", flush=True)
                        print(reasoning_content, end="", flush=True)
                    elif content:
                        if not printed_content:
                            printed_content = True
                            print("\ncontent:", end="", flush=True)
                        # Extract and print the content
                        print(content, end="", flush=True)
                except json.JSONDecodeError:
                    print("Error decoding JSON:", decoded_line)
else:
    print(f"Error: {response.status_code} - {response.text}")

tests/entrypoints/openai/reasoning_parsers/__init__.py

Whitespace-only changes.
