server: streaming of tool calls and thoughts when --jinja is on #12379
base: master
Conversation
Thanks for the quick response @ochafik. Here's the console output:
Here's the POST body that causes this: crash.json. I can send it with Postman and get a crash every time. After more experimenting, it works most of the time; I just got unlucky with my first attempt. I really appreciate the work you are doing and hope this info helps.
@llowrey that specific crash should now be fixed, thanks again for the full details!
Trying this out for myself, specifically the streamed tool calls using Qwen2.5 14B, I get the following behavior. There is no error in the llama-server log, but here it is: https://gist.github.com/Column01/bdce2d58e53e2d440d8bb3f124e64131
@Column01 thanks for sharing this! I would really advise against extreme KV quantizations (esp. K), as they seem to severely degrade tool call performance in most models I tested; in your case, just switching to a less aggressive quantization should help. (I've updated docs/function-calling.md accordingly in this branch; also, tied up a few more loose ends that should make the Qwen2.5 14B experience smoother, please give it another go if you have a chance!)
```diff
@@ -3,6 +3,7 @@
 #pragma once

 #include "common.h"
 #include <functional>
 #include <string>
 #include <vector>
+#include <chrono>
```
This PR is still WIP (see todos at the bottom) but welcoming early feedback / testing:

- Streams `<think>` reasoning content inside the content (same output for all thinking models when using the default `--reasoning-format deepseek`, even for those not using the `<think>` syntax like Command R7B), and even if the `<think>` tag was added at the end of the prompt by the template (as for DeepSeek R1 & QwQ).
- Streams tool calls with raw / multiline code arguments as JSON-encoded `{"code": "json-encoded code"}` arguments (needed for multiline programs).

This fixes #12107, #10920, #11861

Follow up to #9639
How to test / use
- Get and build this PR's branch
- Run `llama-server` w/ any model (see more details in the tool calling docs; note that some GGUFs require a chat template override!)
- Call the chat completions endpoint in streamed mode with any OpenAI-compatible library, or plain curl (for example, the Python sketch below):
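A rough sketch with the `openai` Python client (the tool definition and model name here are placeholders, not part of this PR; the server is assumed to be on its default port):

```python
# Illustrative only: assumes the `openai` Python package (>= 1.0) and a
# llama-server instance already running locally with --jinja.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "description": "Get the current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

stream = client.chat.completions.create(
    model="placeholder",  # llama-server serves whatever model it was started with
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    stream=True,
)

for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta
    if delta.content:  # regular (and, with the default format, thinking) content
        print(delta.content, end="", flush=True)
    for tc in delta.tool_calls or []:  # streamed tool call deltas
        if tc.function and tc.function.name:
            print(f"\n[tool call: {tc.function.name}] ", end="", flush=True)
        if tc.function and tc.function.arguments:
            print(tc.function.arguments, end="", flush=True)
print()
```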
You can also open http://localhost:8080/ to see thoughts being streamed back properly, even for models whose template adds an opening `<think>` tag to the end of the prompt (QwQ, now DeepSeek R1 too, although most GGUFs have their initial version) and models like Cohere Command R7B that natively use a different thinking tags syntax (now normalized, since `--reasoning-format deepseek` is the default).

Context
Supporting OpenAI's streaming delta format was a bit tricky, as it returns chunks of JSON-encoded arguments for each function call, but that's not necessarily what models give us.
While tool calls are returned in a standard format, each w/ a function name, tool call id and JSON encoded arguments, model outputs vary greatly in their syntax. That syntax mostly uses JSON for arguments but not always.
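For illustration, here's the shape of that mapping with made-up values: a Hermes-style model emits the tool call as inline text, while the OpenAI-compatible response has to stream it as deltas of a JSON-encoded `arguments` string.

```python
# Shape-only sketch (values made up): model output vs. the successive
# `choices[0].delta` payloads of the streamed chat completion chunks.
model_output = '<tool_call>{"name": "special_function", "arguments": {"arg1": 1}}</tool_call>'

openai_deltas = [
    {"tool_calls": [{"index": 0, "id": "call_123", "type": "function",
                     "function": {"name": "special_function", "arguments": ""}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": '{"arg1":'}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": " 1}"}}]},
]
```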
Function calls and their arguments can be at various levels:

- [TOOL_CALLS][{"name": "special_function", "arguments": {"arg1": 1}, "id": "123456789"}]
- <tool_call>{"name": "special_function", "arguments": {"arg1": 1}}</tool_call> (note that some models use other keys here, e.g. `tool_name`, `parameters`, and may have the tool call id too)
- <|tool▁calls▁begin|><|tool▁call▁begin|>function<|tool▁sep|>special_function\n```json\n{"arg1": 1}\n```<|tool▁call▁end|><|tool▁calls▁end|>, or functionary v3.2: special_function\n{"arg1": 1}
- {"tool_call": {"name": "special_function", "arguments": {"arg1": 1}}} (or inside a `tool_calls` array if `parallel_tool_calls` is on)
- A `python` tool call, with two variants:
  - <|python_tag|>multiline python code here (functionary v3.1), python\nmultiline python code here (functionary v3.2; w/ prefix `>>>` if after a textual response)
  - <|python_tag|>python.call(code="multiline\npython\ncode\nhere")
Side note about raw python code: `<|python_tag|>foo.call(bar="baz")` in Llama 3.x style will return `"tool_calls": [{"name": "foo", "arguments": "{\"bar\": \"baz\"}"}]`, while the same output from Functionary would be parsed as `"tool_calls": [{"name": "python", "arguments": "{\"code\": \"foo.call(bar=\\\"baz\\\")\"}"}]`.

Now when streaming, we may have sampled only a prefix of the aforementioned output, and we ideally want to parse what can be parsed out of it, and send a JSON-encoded arguments object that is cut at a safe place, so that the sum of all the deltas adds up to the full arguments JSON string.
(A primary use case for partial JSON arguments streaming is streaming large multiline diff tool arguments in tools such as RooCode / Cline / Cursor)
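A minimal sketch of that invariant, seen from the client side (the fragment boundaries are made up):

```python
# Concatenating all streamed `arguments` deltas reproduces the full
# JSON-encoded arguments string; each fragment ends at a "safe" cut point
# chosen by the server. Fragments below are hypothetical.
import json

argument_deltas = ['{"code": "print(', "'hey')", '"}']
full_arguments = "".join(argument_deltas)

assert full_arguments == '{"code": "print(\'hey\')"}'
assert json.loads(full_arguments) == {"code": "print('hey')"}
```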
The cleanest option would have been to create a unified parser / state machine that can be drip-fed tokens, and preserve its state in the server slot. But I figured the complexity was too high for now (see notes on speeding up below), and instead I've implemented something definitely inefficient but relatively simple (chat.cpp is still the same size): for every token coming in, I try and parse the entire output so far, with partial regex & JSON parsing support, which allows recovering cleanly cut-off JSON-encoded function arguments (regardless of the original format of said arguments). I then compare the full `common_chat_msg` against the last one we sent back, and compute OpenAI-compatible deltas out of this.

Location, location, location 🏡
Note that the output of the model may be truncated (max token output length reached or streaming in progress), and that may fall inside an expected literal (e.g. `<think>` isn't a single token on QwQ-32B), inside a regex (used for some matchers), or inside some JSON.

But more interesting is where it happens, esp. for partial JSON:
tests/test-chat-parser.cpp should make this a bit clearer, and I'm in the process of adding partial examples w/ the actual formats in tests/test-chat.cpp (look out for `/* is_partial= */ true`).

See examples of streamed tool call deltas
Implementation notes
Partial parsing utils
I added a `common_chat_msg_parser` utility with syntax reminiscent of @ngxson's suggestions in #11607 (comment), but relying on control flow to allow more flexibility:

- `common_regex` (see `common/regex-partial.cpp`): partial regex support (`/abc/` gives `/((?:(?:c)?b)?a)[\s\S]*/`, with a single capturing group which end indicates - in reverse - where the partial match started; a rough sketch below illustrates the idea)
- `nlohmann/json`'s SAX interface to build location awareness / a stack to know how to heal a JSON that fails to parse (`consume_json` accepts a list of JSON paths under which to expect arguments objects; could be from the root = empty path if the entire JSON object is an arguments object)
- `try_*` parsing methods. This makes the code relatively easy to read and debug. No exotic syntax (apart from `optional`s, they really help here imho), which should make it easier to convert to coroutines when we wanna make it all incremental.

This allows parsing of partial model outputs, whether in streaming mode or when reaching the token limit (currently, tool calls give ugly unparsed outputs when `finish_reason` != `tool_call`).
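To illustrate the reversed-pattern trick from the first bullet above, here's a rough Python sketch (illustrative only; the actual implementation is the C++ `common_regex` in `common/regex-partial.cpp`):

```python
# To find whether a truncated output ends with a *prefix* of the literal
# "abc", reverse the text and match the derived pattern from the start; the
# end of the capturing group tells us, in reverse, where the partial match
# began in the original text.
import re

partial_abc = re.compile(r"((?:(?:c)?b)?a)[\s\S]*")

def partial_match_start(text: str):
    m = partial_abc.match(text[::-1])
    return None if m is None else len(text) - m.end(1)

print(partial_match_start("some output ab"))  # 12: "ab" could still grow into "abc"
print(partial_match_start("no match here"))   # None
```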
To think or not to think... what is the prompt?
I've also introduced `common_chat_syntax`, which wraps `common_reasoning_format` and `common_chat_format` together with:

- `thinking_forced_open`: whether the prompt was detected to end w/ a (model-specific) `<think>` tag to force thinking mode
- `reasoning_in_content`: whether the thinking tags should be left in the content, which is currently the case in streaming mode, as the DeepSeek API does.

This allows streaming back a standard `<think>...` syntax even for models that use a different set of tags (e.g. Command R7B). And of course, `--reasoning-format none` is still allowed to get the raw output.

Note: Ideally, we'd stream the thoughts as a `reasoning_content` delta (now trivial to implement), but for now we are just aiming for compatibility w/ DeepSeek's API (if `--reasoning-format deepseek`, which is the default).

Triggering thoughts 😓
I noticed DeepSeek R1 Qwen 7B sometimes obsesses over the tool call syntax and "thinks" about how it's gonna call it... which triggers the lazy grammars for said calls before the thoughts are closed.
To address this, I made it possible for `common_chat_templates_apply` to create trigger regexes that match on the entire output (this was already the case in the sampler). `COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL` (renamed from `_START`) is now expected to have a single capturing group from the start of which the grammar sampler will be activated.
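A rough sketch of the idea (the trigger pattern below is hypothetical, not taken from an actual template):

```python
# The trigger regex is matched against everything generated so far, and the
# grammar sampler is only activated at the start of the single capturing
# group, so tool syntax mentioned inside the thoughts doesn't trigger it.
import re

trigger = re.compile(r"</think>\s*(<tool_call>)")

output_so_far = "<think>I could call special_function here...</think>\n<tool_call>"
m = trigger.search(output_so_far)
if m:
    print("activate grammar at offset", m.start(1))
```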
Functionary v3.2 w/ raw python
Ask `bartowski/functionary-small-v3.2-GGUF:Q4_K_M` to write a hello world in Python and it outputs `python\n{"code": "print('hey')"}`.

But ask it to print a hello world in python w/ matplotlib, and it uses its raw multiline python syntax `python\nprint('hey')\n# many other lines`. This is now supported.

TODOs
- tool-call: ensure there's always a non-empty tool call id #12292
- `logprobs` for tools mode (right now, forbidden; we don't return diffs for every token, for instance if a function name is in multiple tokens we don't want to send its name in chunks)
- … (`llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L`)
- Split the partial regex support (`common_regex`) into a separate PR: common: add partial regex support #12808
- Split the partial JSON support (`common_json`) into a separate PR(?) or fold into `chat-parser.cpp`
- Handle templates that add `<|START_RESPONSE|>` at the end of the prompt: output will contain an `<|END_RESPONSE|>` that needs handling (would fit nicely in the new `common_chat_syntax` struct). Maybe combine w/ forced/disabled thinking modes as a follow up PR
- Run `scripts/tool_bench.sh` to compare against `master` (+ compare timings)

Future follow ups:
cc/ @jpohhhh