[Frontend] Support reasoning content for deepseek r1 #12473
Merged
Changes from all commits (19 commits, all authored by gaocegege):

- 70c81ad feat: Support reasoning
- bfd64fc fix: Address comments
- f716096 chore: Add more tests
- bfa38ec fix: Fix type check
- 129005a fix: Fix types
- 8545a52 feat: Add examples for full generation
- 88694da feat: Add streaming examples with requests lib
- 565b701 chore: Add docs into features
- 2187bfc chhore: Address comments
- ce0a258 chore: Address comments
- 11df3ad Update docs/source/features/reasoning_outputs.md
- daeb818 Update docs/source/features/reasoning_outputs.md
- 3968d3b Update docs/source/features/reasoning_outputs.md
- a9c278c Update vllm/entrypoints/openai/cli_args.py
- f0b0a06 fix: Address comments
- 723172f Update tests/entrypoints/openai/reasoning_parsers/utils.py
- 386af32 Update docs/source/features/reasoning_outputs.md
- a6e69d6 Update docs/source/features/reasoning_outputs.md
- b6339ac fix: Address comments
151 changes: 151 additions & 0 deletions
docs/source/features/reasoning_outputs.md
(reasoning-outputs)=

# Reasoning Outputs

vLLM offers support for reasoning models like [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1), which are designed to generate outputs containing both reasoning steps and final conclusions.

Reasoning models return an additional `reasoning_content` field in their outputs, which contains the reasoning steps that led to the final conclusion. This field is not present in the outputs of other models.
## Supported Models

vLLM currently supports the following reasoning models:

- [DeepSeek R1 series](https://huggingface.co/collections/deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d) (`deepseek_r1`, which looks for `<think> ... </think>`)
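As a rough, made-up illustration of what that parser extracts (a sketch, not the parser's actual code), a completion wrapped in `<think> ... </think>` splits into the two fields like this:

```python
# Made-up raw completion; the deepseek_r1 parser looks for <think> ... </think>.
raw_output = "<think>Compare 9.11 and 9.8 digit by digit.</think>9.8 is greater."

reasoning_content, _, content = raw_output.removeprefix("<think>").partition("</think>")
print(reasoning_content)  # Compare 9.11 and 9.8 digit by digit.
print(content)            # 9.8 is greater.
```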
## Quickstart

To use reasoning models, you need to specify the `--enable-reasoning` and `--reasoning-parser` flags when starting the vLLM server. The `--reasoning-parser` flag specifies the reasoning parser to use for extracting reasoning content from the model output.

```bash
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --enable-reasoning --reasoning-parser deepseek_r1
```

Next, make a request to the model; the response should include the reasoning content.

```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Round 1
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
response = client.chat.completions.create(model=model, messages=messages)

reasoning_content = response.choices[0].message.reasoning_content
content = response.choices[0].message.content

print("reasoning_content:", reasoning_content)
print("content:", content)
```

The `reasoning_content` field contains the reasoning steps that led to the final conclusion, while the `content` field contains the final conclusion.
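For multi-turn conversations, the example script added in this PR (`examples/online_serving/openai_chat_completion_with_reasoning.py`) feeds only the final `content` back into the conversation history. A sketch continuing the quickstart snippet above:

```python
# Round 2: append only the final `content` to the conversation history
# (the reasoning steps are not sent back to the model), then ask a follow-up.
messages.append({"role": "assistant", "content": content})
messages.append({
    "role": "user",
    "content": "How many Rs are there in the word 'strawberry'?",
})
response = client.chat.completions.create(model=model, messages=messages)

print("reasoning_content:", response.choices[0].message.reasoning_content)
print("content:", response.choices[0].message.content)
```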
## Streaming chat completions

Streaming chat completions are also supported for reasoning models. The `reasoning_content` field is available in the `delta` field in [chat completion response chunks](https://platform.openai.com/docs/api-reference/chat/streaming).

```json
{
    "id": "chatcmpl-123",
    "object": "chat.completion.chunk",
    "created": 1694268190,
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    "system_fingerprint": "fp_44709d6fcb",
    "choices": [
        {
            "index": 0,
            "delta": {
                "role": "assistant",
                "reasoning_content": "is"
            },
            "logprobs": null,
            "finish_reason": null
        }
    ]
}
```

Note that streaming `reasoning_content` is not compatible with the OpenAI Python client library, so you can use the `requests` library to make streaming requests instead.
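As a minimal sketch (assuming the quickstart server above is running on `localhost:8000`), the reasoning steps and the final answer can be read from each chunk's `delta` like this; the full example lives in `examples/online_serving/openai_chat_completion_with_reasoning_streaming.py`, which is part of this PR:

```python
import json

import requests

# Minimal sketch: assumes the quickstart server above is running locally.
response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
        "messages": [{
            "role": "user",
            "content": "9.11 and 9.8, which is greater?"
        }],
        "stream": True,
    },
    stream=True,
)

for line in response.iter_lines():
    if not line:  # skip keep-alive newlines
        continue
    decoded = line.decode("utf-8")
    if not decoded.startswith("data:"):
        continue
    data = decoded[len("data:"):].strip()
    if data == "[DONE]":
        break
    delta = json.loads(data)["choices"][0]["delta"]
    # Reasoning tokens arrive in `reasoning_content`, the answer in `content`.
    print(delta.get("reasoning_content") or delta.get("content") or "",
          end="",
          flush=True)
```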
## How to support a new reasoning model

You can add a new `ReasoningParser` similar to `vllm/entrypoints/openai/reasoning_parsers/deepseek_r1_reasoning_parser.py`.

```python
# import the required packages
from typing import Optional, Sequence, Tuple, Union

from vllm.entrypoints.openai.reasoning_parsers.abs_reasoning_parsers import (
    ReasoningParser, ReasoningParserManager)
from vllm.entrypoints.openai.protocol import (ChatCompletionRequest,
                                              DeltaMessage)
from vllm.transformers_utils.tokenizer import AnyTokenizer


# define a reasoning parser and register it to vllm
# the name list in register_module can be used
# in --reasoning-parser.
@ReasoningParserManager.register_module(["example"])
class ExampleParser(ReasoningParser):
    def __init__(self, tokenizer: AnyTokenizer):
        super().__init__(tokenizer)

    def extract_reasoning_content_streaming(
        self,
        previous_text: str,
        current_text: str,
        delta_text: str,
        previous_token_ids: Sequence[int],
        current_token_ids: Sequence[int],
        delta_token_ids: Sequence[int],
    ) -> Union[DeltaMessage, None]:
        """
        Instance method that should be implemented for extracting reasoning
        from an incomplete response; for use when handling reasoning calls and
        streaming. Has to be an instance method because it requires state -
        the current tokens/diffs, but also the information about what has
        previously been parsed and extracted (see constructor)
        """

    def extract_reasoning_content(
            self, model_output: str, request: ChatCompletionRequest
    ) -> Tuple[Optional[str], Optional[str]]:
        """
        Extract reasoning content from a complete model-generated string.

        Used for non-streaming responses where we have the entire model response
        available before sending to the client.

        Parameters:
        model_output: str
            The model-generated string to extract reasoning content from.

        request: ChatCompletionRequest
            The request object that was used to generate the model_output.

        Returns:
        Tuple[Optional[str], Optional[str]]
            A tuple containing the reasoning content and the content.
        """
```
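For illustration only, here is one possible body for the `extract_reasoning_content` stub above, assuming a model that wraps its reasoning in `<think> ... </think>` tags. This is a hypothetical sketch, not the actual DeepSeek R1 parser; see `deepseek_r1_reasoning_parser.py` for the real implementation.

```python
# Sketch only: a possible body for the extract_reasoning_content stub,
# to be placed inside ExampleParser above. It assumes the model wraps its
# reasoning in <think> ... </think> tags.
def extract_reasoning_content(
        self, model_output: str, request: ChatCompletionRequest
) -> Tuple[Optional[str], Optional[str]]:
    if "</think>" not in model_output:
        # No closing tag: treat the whole output as final content.
        return None, model_output
    reasoning, _, content = model_output.partition("</think>")
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning or None, content.strip() or None
```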
After defining the reasoning parser, you can use it by specifying the `--reasoning-parser` flag when starting the server.

```bash
vllm serve <model_tag> \
    --enable-reasoning --reasoning-parser example
```

## Limitations

- The reasoning content is only available for the online serving chat completion endpoint (`/v1/chat/completions`).
- It is not compatible with the [`structured_outputs`](#structured_outputs) and [`tool_calling`](#tool_calling) features.
- The reasoning content is not available for all models. Check the model's documentation to see if it supports reasoning.
53 changes: 53 additions & 0 deletions
examples/online_serving/openai_chat_completion_with_reasoning.py
""" | ||
An example shows how to generate chat completions from reasoning models | ||
like DeepSeekR1. | ||
|
||
To run this example, you need to start the vLLM server with the reasoning | ||
parser: | ||
|
||
```bash | ||
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \ | ||
--enable-reasoning --reasoning-parser deepseek_r1 | ||
``` | ||
|
||
This example demonstrates how to generate chat completions from reasoning models | ||
using the OpenAI Python client library. | ||
""" | ||
|
||
from openai import OpenAI | ||
|
||
# Modify OpenAI's API key and API base to use vLLM's API server. | ||
openai_api_key = "EMPTY" | ||
openai_api_base = "http://localhost:8000/v1" | ||
|
||
client = OpenAI( | ||
api_key=openai_api_key, | ||
base_url=openai_api_base, | ||
) | ||
|
||
models = client.models.list() | ||
model = models.data[0].id | ||
|
||
# Round 1 | ||
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}] | ||
response = client.chat.completions.create(model=model, messages=messages) | ||
|
||
reasoning_content = response.choices[0].message.reasoning_content | ||
content = response.choices[0].message.content | ||
|
||
print("reasoning_content:", reasoning_content) | ||
print("content:", content) | ||
|
||
# Round 2 | ||
messages.append({"role": "assistant", "content": content}) | ||
messages.append({ | ||
"role": "user", | ||
"content": "How many Rs are there in the word 'strawberry'?", | ||
}) | ||
response = client.chat.completions.create(model=model, messages=messages) | ||
|
||
reasoning_content = response.choices[0].message.reasoning_content | ||
content = response.choices[0].message.content | ||
|
||
print("reasoning_content:", reasoning_content) | ||
print("content:", content) |
90 changes: 90 additions & 0 deletions
examples/online_serving/openai_chat_completion_with_reasoning_streaming.py
""" | ||
An example shows how to generate chat completions from reasoning models | ||
like DeepSeekR1. | ||
|
||
To run this example, you need to start the vLLM server with the reasoning | ||
parser: | ||
|
||
```bash | ||
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \ | ||
--enable-reasoning --reasoning-parser deepseek_r1 | ||
``` | ||
|
||
Unlike openai_chat_completion_with_reasoning.py, this example demonstrates the | ||
streaming chat completions feature. | ||
|
||
The streaming chat completions feature allows you to receive chat completions | ||
in real-time as they are generated by the model. This is useful for scenarios | ||
where you want to display chat completions to the user as they are generated | ||
by the model. | ||
|
||
Here we do not use the OpenAI Python client library, because it does not support | ||
`reasoning_content` fields in the response. | ||
""" | ||
|
||
import json | ||
|
||
import requests | ||
|
||
# Modify OpenAI's API key and API base to use vLLM's API server. | ||
openai_api_key = "EMPTY" | ||
openai_api_base = "http://localhost:8000/v1" | ||
|
||
models = requests.get( | ||
f"{openai_api_base}/models", | ||
headers={ | ||
"Authorization": f"Bearer {openai_api_key}" | ||
}, | ||
).json() | ||
model = models["data"][0]["id"] | ||
|
||
# Streaming chat completions | ||
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}] | ||
|
||
response = requests.post( | ||
f"{openai_api_base}/chat/completions", | ||
headers={"Authorization": f"Bearer {openai_api_key}"}, | ||
json={ | ||
"model": model, | ||
"messages": messages, | ||
"stream": True | ||
}, | ||
) | ||
|
||
print("client: Start streaming chat completions...") | ||
printed_reasoning_content = False | ||
printed_content = False | ||
# Make the streaming request | ||
if response.status_code == 200: | ||
# Process the streaming response | ||
for line in response.iter_lines(): | ||
if line: # Filter out keep-alive new lines | ||
# Decode the line and parse the JSON | ||
decoded_line = line.decode("utf-8") | ||
if decoded_line.startswith("data:"): | ||
data = decoded_line[5:].strip() # Remove "data:" prefix | ||
if data == "[DONE]": # End of stream | ||
print("\nclient: Stream completed.") | ||
break | ||
try: | ||
# Parse the JSON data | ||
chunk = json.loads(data) | ||
reasoning_content = chunk["choices"][0]["delta"].get( | ||
"reasoning_content", "") | ||
content = chunk["choices"][0]["delta"].get("content", "") | ||
|
||
if reasoning_content: | ||
if not printed_reasoning_content: | ||
printed_reasoning_content = True | ||
print("reasoning_content:", end="", flush=True) | ||
print(reasoning_content, end="", flush=True) | ||
elif content: | ||
if not printed_content: | ||
printed_content = True | ||
print("\ncontent:", end="", flush=True) | ||
# Extract and print the content | ||
print(content, end="", flush=True) | ||
except json.JSONDecodeError: | ||
print("Error decoding JSON:", decoded_line) | ||
else: | ||
print(f"Error: {response.status_code} - {response.text}") |
Empty file.