`server`: streaming of tool calls and thoughts when `--jinja` is on #12379

ochafik · 2025-03-14T04:45:40Z

This PR is still WIP (see todos at the bottom) but welcoming early feedback / testing

Support streaming of tool calls in OpenAI format
Improve handling of thinking model (DeepSeek R1 Distills, QwQ, Command R7B):
- Stream <think> reasoning content inside the content (same output for all thinking models when using the default --reasoning-content deepseek, even for those not using the <think> syntax like Command R7B), and even if the <think> tag was added at the end of the prompt by the template (as for DeepSeek R1 & QwQ).
- Avoid spurious lazy (tool call) grammar triggers from "thoughts about tool calls" (only trigger after closing any unclosed thoughts)
Improves Functionary v3.2 support (allow raw python code, preferred by models over {"code": "json-encoded code"} for multiline programs)
Support truncated outputs incl. reasoning_content & tool_calls (returns salvageable fields when finish_reason = length)

Follow up to #9639

How to test / use

Get and build this PR's branch

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git remote add ochafik https://github.com/ochafik/llama.cpp
git fetch ochafik
git checkout ochafik/tool-diffs
cmake -B build -DLLAMA_CURL=1 # -DGGML_CUDA=1 ...
cmake --build build -t llama-server --parallel --config Release
alias llama-server=./build/bin/llama-server

Run llama-server w/ any model (see more details in the tool calling docs; note that some GGUFs require a chat template override!):

# Thoughts of Command R7B / DeepSeek R1 / QwQ will be streamed in the content inside <think> tags
llama-server --jinja -fa -hf bartowski/Qwen_QwQ-32B-GGUF

# Models w/ generic tool call support now return clean interrupted output when hitting token limit
llama-server --jinja -fa -hf bartowski/microsoft_Phi-4-mini-instruct-GGUF

Call the chat completions endpoint in streamed mode with any OpenAI-compatible library, or plain curl:

curl http://localhost:8080/v1/chat/completions -d '{
  "model": "gpt-3.5-turbo",
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "python",
        "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
        "parameters": {
          "type": "object",
          "properties": {
            "code": {
              "type": "string",
              "description": "The code to run in the ipython interpreter."
            }
          },
          "required": ["code"]
        }
      }
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": "Print a hello world message with python."
    }
  ],
  "stream": true
}'

You can also open http://localhost:8080/ to see thoughts being streamed back properly even for models which template add an opening <think> tag to the end of the prompt (QwQ, now DeepSeek R1 too although most GGUFs have their initial version) and models like Cohere Command R7B that natively use a different thinking tags syntax (now normalized, since —reasoning-format deepseek is the default)

Context

Supporting OpenAI's streaming delta format was a bit tricky, as it returns chunks of JSON-encoded arguments for each function call, but that's not necessarily what models give us.

While tool calls are returned in a standard format, each w/ a function name, tool call id and JSON encoded arguments, model outputs vary greatly in their syntax. That syntax mostly uses JSON for arguments but not always.

Function calls and their arguments can be at various levels:

JSON array of tool calls (e.g. Mistral Nemo: [TOOL_CALLS][{"name": "special_function", "arguments": {"arg1": 1}, "id": "123456789"}])
Standalone JSON tool call (e.g. Hermes syntax: <tool_call>{"name": "special_function", "arguments": {"arg1": 1}}</tool_call>; note that some models use other keys here, e.g. tool_name, parameters, and may have the tool call id too)
JSON arguments object w/ name in some prefix (e.g. Deepseek: <｜tool▁calls▁begin｜><｜tool▁call▁begin｜>function<｜tool▁sep｜>special_function\n```json\n{"arg1": 1}\n```<｜tool▁call▁end｜><｜tool▁calls▁end｜>, or functionary v3.2: special_function\n{"arg1": 1})
Nested JSON for the generic mode {"tool_call": {"name": "special_function", "arguments": {"arg1": 1}}} (or inside tool_calls array if parallel_tool_calls is on)
No JSON / raw code string for python tool call, with two variants:
- Unconstrained verbatim code: <|python_tag|>multiline python code here (functionary v3.1), python\nmultiline python code here (functionary v3.2; w/ prefix >>> if after textual response)
- Constrained pythonish syntax for "builtin tools" (Llama 3.x, quite widespread): <|python_tag|>python.call(code="multiline\npython\ncode\nhere")

Side note about raw python code: <|python_tag>foo.call(bar="baz") in Llama 3.x style will return "tool_calls": [{"name": "foo", "arguments": "{\"bar\": \"baz\"}"}], while the same output from Functionary would be parsed as "tool_calls": [{"name": "python", "arguments": "{\"code\": \"foo.call(bar=\\\"baz\\\")\"}"}].

Now when streaming, we may have sampled only a prefix of the aforementioned output, and want ideally to parse what can be parsed out of it, and send a JSON-encoded arguments object that is cut at a safe place, so that the sum of all the deltas adds up to the full arguments JSON string.

(A primary use case for partial JSON arguments streaming is streaming large multiline diff tool arguments in tools such as RooCode / Cline / Cursor)

The cleanest option would have been to create a unified parser / state machine that can be drip-fed tokens, and preserve its state in the server slot. But I figured the complexity was too high for now (see notes on speeding up below), and instead I've implemented something definitely inefficient but relatively simple (chat.cpp it still the same size): for every token coming in, I try and parse the entire output so far, with partial regex & json parsing support, which allows recovering cleanly cut-off JSON-encoded function arguments (regardless of the original format of said arguments). I then compare the full common_chat_msg against the last one we sent back, and compute OpenAI-compatible deltas out of this.

Location, location, location 🏡

Note that the output of the model may be truncated (max token output length reached or streaming in progress), and that may fall inside an expected literal (e.g. <think> isn't a single token on QwQ-32B), inside a regex (used for some matchers), or inside some JSON.

But more interesting is where it happens, esp. for partial JSON:

If it happens inside an arguments object or a contents string (for generic mode), we should return it partial / truncated (and json-dumped in the case of the arguments), and diffed from the last parsed value for the streamed case
If it happens inside the wrapper of the arguments, then it depends. We don't want to get a half-function name, but as soon as we have a complete function name we can send a diff. So we try and heal the JSON (we identify which json paths can be partially healed - because they're inside the arguments, and which ones must be dropped), and only populate a tool call if we have at least a name). Likewise, if there is an array of function calls with the first complete, and the next partial, we want to make sure the client can start calling the first function.

tests/test-chat-parser.cpp should make this a bit clearer, and I'm in the process of adding partial examples w/ the actual formats in tests/test-chat.cpp (look out for /* is_partial= */ true)

See examples of streamed tool call deltas

curl https://api.openai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-3.5-turbo",
    "tools": [
        {
        "type":"function",
        "function":{
            "name":"python",
            "description":"Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
            "parameters":{
            "type":"object",
            "properties":{
                "code":{
                "type":"string",
                "description":"The code to run in the ipython interpreter."
                }
            },
            "required":["code"]
            }
        }
        }
    ],
    "messages": [
        {
        "role": "user",
        "content": "Print a hello world message with python."
        }
    ], "stream": true
}'

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"role":"assistant","content":null,"tool_calls":[{"index":0,"id":"call_aqwOReHDKPnqiF7NbRxzDTY1","type":"function","function":{"name":"python","arguments":""}}],"refusal":null},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"{\""}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"code"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\":\""}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"print"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"('"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"Hello"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":","}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":" World"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"!"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"')"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{"tool_calls":[{"index":0,"function":{"arguments":"\"}"}}]},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-BArAOzrqdSYW1UJvOwWb9nXAy1qo4","object":"chat.completion.chunk","created":1741927352,"model":"gpt-3.5-turbo-0125","service_tier":"default","system_fingerprint":null,"choices":[{"index":0,"delta":{},"logprobs":null,"finish_reason":"tool_calls"}]}

data: [DONE]

Implementation notes

Partial parsing utils

I added a common_chat_msg_parser utility with syntax reminiscent of @ngxson's suggestions in #11607 (comment), but relying on control flow to allow more flexibility:

Supports partial regex parsing
- Since the STL still doesn't have partial matching support (unlike Boost), I had to implement my own in common_regex (see common/regex-partial.cpp).
- The trick = transform the original regex to a regex that matches in reverse from the end of the string (e.g. /abc/ gives /((?:(?:c)?b)?a)[\s\S]*/, with a single capturing group which end indicates - in reverse - where the partial match started)
Supports partial JSON parsing:
- Used nlohmann/json's SAX interface to build location awareness / stack to know how to heal a JSON that fails to parse
- Healing the JSON w/ a healing marker that can then be found when visiting the resulting JSON (to remove things we don't want to heal - e.g. function name - and cut any JSON encoded result at the "right" place, which must be somewhere inside function arguments: consume_json accepts a list of json paths under which to expect arguments objects; could be from the root = empty path if the entire json object is an arguments object)
Supports control flow w/ try_* parsing methods. This makes the code relatively easy to read and debug. No exotic syntax (apart from optionals, they really help here imho), which should make it easier to convert to coroutines when we wanna make it all incremental.
Supports full or partial parsing w/ same code (throws partial exceptions to interrupt the control flow without making parsing code more complex)

This allows parsing of partial model outputs, whether in streaming mode or when reaching the token limit (currently, tool calls give ugly unparsed outputs when finish_reason != tool_call).

To think or not to think... what is the prompt?

I've also introduced common_chat_syntax which wraps common_reasoning_format, common_chat_format together with:

thinking_forced_open: whether the prompt was detected to end w/ a (model-specific) <think> tag to force thinking mode
reasoning_in_content: whether the thinking tags should be left in the content, which is currently the case in streaming mode as the DeepSeek API does.

This allows streaming back a standard <think>... syntax even for models that use a different set of tags (e.g. Command R7B). And of course, --reasoning-format none is still allowed to get the raw output.

Note: Ideally, we'd stream the thoughts as a reasoning_content delta (now trivial to implement), but for now we are just aiming for compatibility w/ DeepSeek's API (if --reasoning-format deepseek, which is the default).

Triggering thoughts 😓

I noticed DeepSeek R1 Qwen 7B sometimes obsesses over the tool call syntax and "thinks" about how it's gonna call it... which triggers the lazy grammars for said calls before the thoughts are closed.

To address this, I made it possible for common_chat_templates_apply to create trigger regexes that match on the entire output (this was already the case in the sampler). COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL (renamed from _START) is now expected to have a single capturing group from the start of which the grammar sampler will be activated.

Functionary v3.2 w/ raw python

Ask bartowski/functionary-small-v3.2-GGUF:Q4_K_M to write a hello world in Python and it outputs python\n{"code": "print('hey')"}.

But ask it to print a hello world in python w/ matplotlib, and it uses its raw multiline python syntax python\nprint('hey')\n# many other lines. This is now supported.

TODOs

Future follow ups:

To make this faster, I suggest two options:
- Wait for the project to switch to C++20 & turn all the parser functions into resumable coroutines (feed them tokens and persist their state in the slot)
- Only compute and send deltas after N milliseconds

cc/ @jpohhhh

…al based thinking tags parsing)

llowrey · 2025-03-21T16:20:48Z

Thanks for the quick response @ochafik

Here's the console output:

srv  update_chat_: Parsing chat message: {"tool_call": {"name": "puppeteer_screenshot", "arguments": {"name": "servethehome_homepage",
Parsing input with format Generic: {"tool_call": {"name": "puppeteer_screenshot", "arguments": {"name": "servethehome_homepage",
Failed to parse up to error: [json.exception.parse_error.101] parse error at line 1, column 94: syntax error while parsing object key - unexpected end of input; expected string literal: <<<{"tool_call": {"name": "puppeteer_screenshot", "arguments": {"name": "servethehome_homepage",>>>
Parsed partial JSON: {"tool_call":{"name":"puppeteer_screenshot","arguments":{"name":"278722862"}}} (json_healing_marker: "278722862)
Cleaned up JSON {"tool_call":{"name":"puppeteer_screenshot","arguments":{"name":"278722862"}}} to {"tool_call":{"name":"puppeteer_screenshot","arguments":"{\"name\":"}} (json_healing_marker : '"278722862')
Partial parse: incomplete tool call
Parsed message: {"role":"assistant","content":null,"tool_calls":[{"type":"function","function":{"name":"puppeteer_screenshot","arguments":"{\"name\":"}}]}
terminate called after throwing an instance of 'std::runtime_error'
  what():  Invalid diff: '{"name":"servethehome_homepage' not found at start of '{"name":'
Aborted (core dumped)

Here's the POST body that causes this: crash.json

I can send with postman and get a crash every time

After more experimenting, it works most of the time. I just got unlucky with my first attempt. I really appreciate the work you are doing and hope this info helps.

ochafik · 2025-03-23T14:18:42Z

@llowrey that specific crash should now be fixed, thanks again for the full details!

Column01 · 2025-04-03T19:16:31Z

Trying this out for myself, specifically the streamed tool calls using Qwen2.5 14B I get the following behavior

There is no error in the llama-server log but here it is: https://gist.github.com/Column01/bdce2d58e53e2d440d8bb3f124e64131

ochafik · 2025-04-04T20:28:45Z

Trying this out for myself, specifically the streamed tool calls using Qwen2.5 14B I get the following behavior

There is no error in the llama-server log but here it is: https://gist.github.com/Column01/bdce2d58e53e2d440d8bb3f124e64131

@Column01 thanks for sharing this! I would really advise against extreme KV quantizations (esp. K) as it seems to severely degrade tool call performance in most models I tested (in your case just switching to a less aggressive -ctk q8_0 might do the trick).

(I've updated docs/function-calling.md accordingly in this branch; also, tied up a few more loose ends that should make the Qwen2.5 14B experience smoother, please give it another go if you have a chance!)

ggml-org#12729

f-krull · 2025-04-05T13:21:20Z

common/chat.h

@@ -3,6 +3,7 @@
 #pragma once

 #include "common.h"
+#include <functional>
 #include <string>
 #include <vector>


+#include <chrono>

ggml-org#12729

…heir non-tool mode

ochafik added 12 commits March 12, 2025 23:51

add common_regex w/ support for partial final matches

16c9c63

add common_json w/ support for truncated json healing

6dcff43

renaming: string_find_partial_stop (moved to common.cpp)

a95fe78

add common_chat_msg_diff

ce2f593

partial common_chat_parse

cd3681d

refactor parser w/ optionals

9462365

server: wire chat diffs in stream mode

6ed8a8f

fix trigger of thinking models (must happen after thoughts are closed)

eaeed7d

nits + docs

d6e680a

fix functionary v3.2 raw python!

64ea080

rename: common_chat_syntax (now contains format)

c46d4da

rm common_regex.at_start

4358d5d

github-actions bot added documentation Improvements or additions to documentation testing Everything test related examples python python script changes server labels Mar 14, 2025

ochafik added 2 commits March 14, 2025 11:55

Merge remote-tracking branch 'origin/master' into tool-diffs

f477288

fix gcc compilation

e0202b3

ochafik force-pushed the tool-diffs branch from 94f8b38 to e0202b3 Compare March 14, 2025 12:07

ochafik added 10 commits March 14, 2025 12:41

fix unreachable code warning after [[noreturn]] annotation

f840e3a

fix / refactor test-regex-partial

af7391e

fix test-chat

449917b

rm spaces

b428b5c

fix command r7b partial parsing (lacked args path)

668fc90

Update test_chat_completion.py

b48ab23

refactor + test chat parser (try_consume_json_with_dumped_args, liter…

aefc8a4

…al based thinking tags parsing)

return partial msg from server

22428a4

refactor partial json

5b9c5a4

don't return empty <think></think>

3fbe84f

ochafik added 4 commits March 23, 2025 11:40

Merge remote-tracking branch 'origin/master' into tool-diffs

1d25178

fix partial json crash after comma

42cb16f

fix test-chat.cpp

37b4a3a

fix gcc build of test

13d725d

ochafik added 3 commits March 26, 2025 00:13

Merge remote-tracking branch 'origin/master' into tool-diffs

a40aead

Merge remote-tracking branch 'origin/master' into tool-diffs

329d943

Merge remote-tracking branch 'origin/master' into tool-diffs

e63e542

ochafik added 4 commits April 4, 2025 10:29

fix regex-partial (drop reluctant repetitions conversions)

21cd34c

partial regex: allow newlines in prefixes

5f0450d

tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)

36ecb01

Update function-calling.md

68eeff1

ochafik added 4 commits April 4, 2025 13:32

nit: spaces

12deff6

Update tool_bench.py

d0a686b

Merge remote-tracking branch 'origin/master' into tool-diffs

a604b2d

Inject date_string in llama 3.x + test it & functionary v2

90789cd

ggml-org#12729

github-actions bot added the script Script related label Apr 5, 2025

f-krull reviewed Apr 5, 2025

View reviewed changes

ochafik added 5 commits April 7, 2025 08:50

Inject date_string in llama 3.x + fix for functionary v2

71435cf

ggml-org#12729

add missing chrono include

543b73e

move/fix detection of functionary v3.1 before llama 3.x, fix & test t…

e3c372c

…heir non-tool mode

Merge branch 'date' into tool-diffs

387611a

Merge remote-tracking branch 'origin/master' into tool-diffs

01a3e31

This was referenced Apr 7, 2025

common: add partial regex support #12808

Open

Misc. bug: "Unexpected empty grammar stack after accepting piece" tool crash #12597

Closed

tool-call: fix non-tool-calling grammar crashes w/ Qwen / Hermes 2 templates #12900

Merged

ggerganov added the tool calling label Apr 16, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`server`: streaming of tool calls and thoughts when `--jinja` is on #12379

`server`: streaming of tool calls and thoughts when `--jinja` is on #12379

ochafik commented Mar 14, 2025 •

edited

Loading

llowrey commented Mar 21, 2025

ochafik commented Mar 23, 2025

Column01 commented Apr 3, 2025 •

edited

Loading

ochafik commented Apr 4, 2025 •

edited

Loading

f-krull Apr 5, 2025

server: streaming of tool calls and thoughts when --jinja is on #12379

Are you sure you want to change the base?

server: streaming of tool calls and thoughts when --jinja is on #12379

Conversation

ochafik commented Mar 14, 2025 • edited Loading

How to test / use

Context

Location, location, location 🏡

Implementation notes

Partial parsing utils

To think or not to think... what is the prompt?

Triggering thoughts 😓

Functionary v3.2 w/ raw python

TODOs

llowrey commented Mar 21, 2025

ochafik commented Mar 23, 2025

Column01 commented Apr 3, 2025 • edited Loading

ochafik commented Apr 4, 2025 • edited Loading

f-krull Apr 5, 2025

Choose a reason for hiding this comment

`server`: streaming of tool calls and thoughts when `--jinja` is on #12379

`server`: streaming of tool calls and thoughts when `--jinja` is on #12379

ochafik commented Mar 14, 2025 •

edited

Loading

Column01 commented Apr 3, 2025 •

edited

Loading

ochafik commented Apr 4, 2025 •

edited

Loading