[Model] Add Reasoning Parser for Granite Models #14202

Merged: 21 commits merged into vllm-project:main on Mar 26, 2025

Conversation

alex-jw-brooks
Contributor

@alex-jw-brooks alex-jw-brooks commented Mar 4, 2025

This PR adds a reasoning parser for Granite 3.2 models! These models have an optional chat template kwarg thinking that changes the system prompt to enable reasoning. 😄

The format of the text is expected to be:

Here is my thought process: <reasoning_content> Here is my response: <content>

There have been reports of quantized versions of the model emitting "Here's" instead of "Here is", though, so this PR matches both variants.
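For illustration, the split can be done with a single regex that accepts both prefixes. The sketch below is not the actual parser code from this PR (the helper name and exact pattern are hypothetical); it only shows the idea using the markers from the format above:

import re

# Accept both the "Here is" and "Here's" variants of the section markers.
_GRANITE_RE = re.compile(
    r"Here(?: is|'s) my thought process:\s*(?P<reasoning>.*?)"
    r"Here(?: is|'s) my response:\s*(?P<response>.*)",
    re.DOTALL,
)

def split_granite_output(text: str) -> tuple[str | None, str]:
    """Hypothetical helper: split one completion into (reasoning_content, content)."""
    match = _GRANITE_RE.search(text)
    if match is None:
        # No reasoning markers found; treat the whole text as regular content.
        return None, text
    return match.group("reasoning").strip(), match.group("response").strip()

# split_granite_output("Here is my thought process: compare the decimals. Here's my response: 9.8 is greater.")
# -> ("compare the decimals.", "9.8 is greater.")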

Examples

Start the server with a Granite 3.2 language model that supports reasoning, using the granite reasoning parser:

python vllm/entrypoints/openai/api_server.py \
    --device cuda \
    --model ibm-granite/granite-3.2-8b-instruct \
    --tokenizer ibm-granite/granite-3.2-8b-instruct \
    --enable-reasoning \
    --reasoning-parser granite

The snippets below are copied from the docs, with the only change being the addition of chat_template_kwargs with thinking=True. Without this, reasoning is disabled, and everything is generally parsed into content.

No streaming:

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Round 1
messages = [
    {
        "role": "user",
        "content": "9.11 and 9.8, which is greater?"
    }
]
response = client.chat.completions.create(model=model, messages=messages, extra_body={"chat_template_kwargs": {"thinking": True}})

reasoning_content = response.choices[0].message.reasoning_content
content = response.choices[0].message.content

print("reasoning_content:", reasoning_content)
print("content:", content)

With streaming:

import json

import requests

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

models = requests.get(
    f"{openai_api_base}/models",
    headers={
        "Authorization": f"Bearer {openai_api_key}"
    },
).json()
model = models["data"][0]["id"]

# Streaming chat completions
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]

response = requests.post(
    f"{openai_api_base}/chat/completions",
    headers={"Authorization": f"Bearer {openai_api_key}"},
    json={
        "model": model,
        "messages": messages,
        "chat_template_kwargs": {"thinking": True},
        "stream": True
    },
)

print("client: Start streaming chat completions...")
printed_reasoning_content = False
printed_content = False
# Make the streaming request
if response.status_code == 200:
    # Process the streaming response
    for line in response.iter_lines():
        if line:  # Filter out keep-alive new lines
            # Decode the line and parse the JSON
            decoded_line = line.decode("utf-8")
            if decoded_line.startswith("data:"):
                data = decoded_line[5:].strip()  # Remove "data:" prefix
                if data == "[DONE]":  # End of stream
                    print("\nclient: Stream completed.")
                    break
                try:
                    # Parse the JSON data
                    chunk = json.loads(data)
                    reasoning_content = chunk["choices"][0]["delta"].get(
                        "reasoning_content", "")
                    content = chunk["choices"][0]["delta"].get("content", "")

                    if reasoning_content:
                        if not printed_reasoning_content:
                            printed_reasoning_content = True
                            print("reasoning_content:", end="", flush=True)
                        print(reasoning_content, end="", flush=True)
                    elif content:
                        if not printed_content:
                            printed_content = True
                            print("\ncontent:", end="", flush=True)
                        # Extract and print the content
                        print(content, end="", flush=True)
                except json.JSONDecodeError:
                    print("Error decoding JSON:", decoded_line)
else:
    print(f"Error: {response.status_code} - {response.text}")

Example output (run from the streaming snippet above)

reasoning_content:
This is a straightforward comparison of two numbers. The task is to determine which is larger: 9.11 or 9.8. 

I need to recall the value of these decimal numbers and compare them. Given both are very close, it requires precise comprehension to understand which has the larger value—specifically focusing on the tenths and hundredths places.


content:

9.8 is greater than 9.11. 

Let's break down the comparison:

- Both numbers are above 9, so we're comparing the decimal parts.
- 9.11 has a '11' in the hundredths place.
- 9.8 has an '80' in the hundredths place, which is larger (even if it's ten times, 80 > 11).

Therefore, 9.8 > 9.11.
client: Stream completed.


github-actions bot commented Mar 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added documentation Improvements or additions to documentation frontend labels Mar 4, 2025
@DarkLight1337 DarkLight1337 requested a review from mgoin March 4, 2025 15:28
@mgoin
Member

mgoin commented Mar 4, 2025

Nice use of this new feature! Will try in a bit cc @gaocegege

Contributor

@gaocegege gaocegege left a comment


Thanks for the contribution!

Could you please rebase onto upstream? In a previous PR to support reasoning outputs in structured outputs (https://github.com/vllm-project/vllm/pull/12955/files#diff-ea8b8ff63961713ccb62d78e53e96404b587b7828cb9fee08a9e5576bf563673R1065), we moved the CLI argument --reasoning-parser to https://github.com/vllm-project/vllm/blob/main/vllm/engine/arg_utils.py#L1076

Thus, you may need to add a new choice there.
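For reference, a hypothetical argparse-style sketch of what adding the new choice could look like; the real definition lives in vllm/engine/arg_utils.py and is wired into vLLM's engine arguments, so the names and defaults here are illustrative only, not the actual vLLM code:

import argparse

parser = argparse.ArgumentParser()
# Illustrative sketch only; not the actual vLLM argument definition.
parser.add_argument(
    "--reasoning-parser",
    type=str,
    choices=["deepseek_r1", "granite"],  # "granite" is the new choice added by this PR
    default=None,
    help="Reasoning parser backend used to separate reasoning from final content.",
)

args = parser.parse_args(["--reasoning-parser", "granite"])
print(args.reasoning_parser)  # -> granite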

@gaocegege
Contributor

Hi, I updated the docs in this PR #14114

Maybe you should rebase the docs too. Just FYI

@@ -19,6 +19,10 @@ def get_reasoner(tokenizer: PreTrainedTokenizer,
return None
elif reasoning_backend == "deepseek_r1":
return DeepSeekReasoner.from_tokenizer(tokenizer)
elif reasoning_backend == "granite":
logger.warning(
Contributor Author


Adding a warning for now since this is already a large PR, but I think adding a GraniteReasoner for guided decoding could be a follow-up later?

Contributor


SGTM!

Collaborator


We can just fail for now.

Contributor Author


@aarnphm This will behave the same way as models with no reasoner. The intention here was mostly to clarify that there isn't a reasoning backend for granite, in case users conflate it with --enable-reasoning / --reasoning-parser granite being supported for these models.

Collaborator


wfm.

@alex-jw-brooks
Contributor Author

Awesome, thanks @gaocegege! It's been rebased 😄

@alex-jw-brooks alex-jw-brooks requested a review from gaocegege March 6, 2025 08:40
Contributor

@gaocegege gaocegege left a comment


Thanks for your contribution! 🎉 👍

@gaocegege
Contributor

@mgoin Please give it another review, thanks!

response_start)
reasoning_content = current_text[
start_reasoning_content:end_reasoning_content]
response_content = current_text[current_chunk_end + 1:]
Contributor

@b8zhong b8zhong Mar 7, 2025


Suggested change
response_content = current_text[current_chunk_end + 1:]
response_content = current_text[current_chunk_end + 1:]
parsed_content = True

The parsed_content flag doesn't seem to be updated here, so it might be helpful to set it?
Very minor suggestion, totally optional.

Contributor Author


Hey @b8zhong, thanks for the suggestion! For now, I'd prefer to keep it as is since it returns immediately after parsing the response content. I.e., once this condition is met, there is no need to keep going, so updating the flag won't do anything 🙂


mergify bot commented Mar 7, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @alex-jw-brooks.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 7, 2025
@gaocegege
Contributor

@alex-jw-brooks Hi, could you please resolve the conflicts?

@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 11, 2025
Collaborator

@aarnphm aarnphm left a comment


Tiny comments, otherwise LGTM.

Comment on lines 180 to 181
* The reasoning content is only available for online serving's chat completion endpoint (`/v1/chat/completions`).
* It is not compatible with [`tool_calling`](#tool_calling).
Collaborator


Can we revert this to reduce the code change? Thanks.

Contributor Author


Sure, done!

@@ -19,6 +19,10 @@ def get_reasoner(tokenizer: PreTrainedTokenizer,
return None
elif reasoning_backend == "deepseek_r1":
return DeepSeekReasoner.from_tokenizer(tokenizer)
elif reasoning_backend == "granite":
logger.warning(
Collaborator


We can just fail for now.

Co-authored-by: Joe Runde <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
@alex-jw-brooks
Contributor Author

Thanks @aarnphm - it's ready for another look when you have a moment 🙂

Collaborator

@aarnphm aarnphm left a comment


great work!

@alex-jw-brooks
Contributor Author

Hi @mgoin, can you please take a look at this PR when you have a moment?

@gaocegege
Contributor

@mgoin @simon-mo Could you please take a look at this?

Member

@DarkLight1337 DarkLight1337 left a comment


Since this is basically coming from the model vendor, I'll just stamp it. The code looks reasonable to me.

@DarkLight1337
Member

Can you fix the merge conflicts?


mergify bot commented Mar 26, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @alex-jw-brooks.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 26, 2025
@mergify mergify bot removed the needs-rebase label Mar 26, 2025
Signed-off-by: Alex-Brooks <[email protected]>
@alex-jw-brooks
Contributor Author

Thanks for the review, @DarkLight1337! Sure, it should be resolved now 🤞

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) March 26, 2025 12:20
@DarkLight1337 DarkLight1337 merged commit 1711b92 into vllm-project:main Mar 26, 2025
40 checks passed
lengrongfu pushed a commit to lengrongfu/vllm that referenced this pull request Apr 2, 2025
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Apr 2, 2025
Signed-off-by: Alex-Brooks <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Alex4210987 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Apr 5, 2025
Signed-off-by: Alex-Brooks <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Signed-off-by: xinyuxiao <[email protected]>
lulmer pushed a commit to lulmer/vllm that referenced this pull request Apr 7, 2025
Signed-off-by: Alex-Brooks <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Signed-off-by: Louis Ulmer <[email protected]>
nishith-fujitsu pushed a commit to nishith-fujitsu/vllm that referenced this pull request Apr 9, 2025
lk-chen pushed a commit to lk-chen/vllm that referenced this pull request Apr 29, 2025
shreyankg pushed a commit to shreyankg/vllm that referenced this pull request May 3, 2025
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
Signed-off-by: Alex-Brooks <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Signed-off-by: Mu Huai <[email protected]>
Labels: documentation, frontend, ready, structured-output