Reasoning effort experiments #12339
-
I found simple prompt engineering can go a long way toward influencing the R1 distills. If you can't inject a think bootstrap as I showed in #11351, it might be possible to just add something like "limit thinking as much as possible" to your prompt. Examples: sally.txt = "Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have? Use step by step reasoning to solve problem."

Freestyle think (R1 7B), very verbose:
Limit thinking with prompt:
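
As a concrete sketch of this prompt-only approach, something like the following against a llama.cpp server's OpenAI-compatible endpoint should reproduce it (the server address, model name, and temperature are placeholders, not the exact setup used above):

```python
# Minimal sketch: the sally.txt question sent through llama.cpp's OpenAI-compatible
# chat completions endpoint with the extra "limit thinking" instruction appended.
# Server address, model name, and temperature are assumptions.
import requests

question = ("Sally (a girl) has 3 brothers. Each brother has 2 sisters. "
            "How many sisters does Sally have? "
            "Use step by step reasoning to solve problem. "
            "Limit thinking as much as possible.")

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",      # assumed server address
    json={
        "model": "deepseek-r1-distill-qwen-7b",       # placeholder model name
        "messages": [{"role": "user", "content": question}],
        "temperature": 0.6,
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])
```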
Another way to do this, which is deterministic, is to use triggered prompt injection on one of the continuation phrases, as I discussed in #11351. I added a TRIG function to my experimental server to accomplish this; it is guaranteed to halt the thinking process. In this example I trigger on one of the continuation phrases once at least 256 tokens have already been generated. In general, TRIG can be an array of trigger records covering a whole set of the continuation phrases the model uses. This will most likely reduce accuracy on harder problems, though, since the model's intended generation is truncated and there is no guarantee the thinking isn't being cut off before the (harder) problem has actually been solved.
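
That TRIG implementation lives in the experimental server, but the trigger-then-inject idea can be approximated from the client side. A rough sketch against a stock llama.cpp /completion endpoint follows; this is not the actual TRIG code, and the server address, continuation phrases, and wrap-up wording are all assumptions:

```python
# Rough illustration only: not the TRIG feature from the experimental server,
# just one way to approximate trigger-then-inject client-side with the stock
# llama.cpp /completion endpoint.
import requests

URL = "http://localhost:8080/completion"            # assumed server address
CONTINUATIONS = ["Wait,", "Alternatively,", "Hmm,"]  # phrases that restart a thinking loop
WRAPUP = "\nTime to stop deliberating and give the final answer.\n</think>\n\n"
MIN_TOKENS = 256                                     # let the model think at least this much

def generate(prompt, n_predict=2048, stop=None):
    # prompt must already be formatted with the model's chat template,
    # since /completion does not apply one.
    payload = {"prompt": prompt, "n_predict": n_predict}
    if stop:
        payload["stop"] = stop
    r = requests.post(URL, json=payload, timeout=600)
    return r.json()["content"]

def solve(formatted_prompt):
    text = formatted_prompt + "<think>\n"
    # Phase 1: let the model think freely for roughly MIN_TOKENS tokens.
    # (A real implementation would also stop here if </think> shows up early.)
    text += generate(text, n_predict=MIN_TOKENS)
    # Phase 2: continue until the next continuation phrase appears, then cut there.
    text += generate(text, stop=CONTINUATIONS)
    # Phase 3: splice in the wrap-up so the think block is forcibly closed...
    text += WRAPUP
    # ...and let the model produce the final answer after the injected </think>.
    return generate(text)
```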
-
Thanks for your suggestions @steampunque. I'm using the OAI chat completions endpoint of vanilla llama.cpp, so this may limit what I can do.

I think one of your suggestions is to supply an incomplete thinking prelude, containing hints about how to think, which the autoregressive model will continue as if it had generated it itself. This feels to me like a refinement of simply giving a system prompt in chat completions. (I tried that superficially and didn't notice much of a difference, but I must confess I didn't try very hard.) To do it this way, I presumably would need to use the (non-chat) completions endpoint. I'm not hugely familiar with how OAI chat completions decomposes into lower-level completions, so it might be hard to adapt my chat+instruct+grammar based approach.

The other suggestion appears to involve triggers: during generation, upon seeing some state (e.g. certain tokens appear, or a certain length of thinking is reached), hijack the current completion and insert extra text like "Hang on, I've been thinking too long, I'd better wrap this up" (crudely speaking). I'm not sure how you are doing this.
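
For concreteness, here is a sketch of what the thinking-prelude idea might look like against the raw /completion endpoint; the hand-rolled DeepSeek-R1 chat template and the prelude wording are assumptions, so the model's actual template should be checked before relying on this:

```python
# Sketch of the "incomplete thinking prelude" idea via the raw /completion
# endpoint (not chat completions).  The chat template string and prelude text
# are assumptions, not taken from the thread above.
import requests

question = ("Sally (a girl) has 3 brothers. Each brother has 2 sisters. "
            "How many sisters does Sally have?")

# Prelude the model will continue as if it had written it itself.
prelude = "<think>\nI need to keep this short, so I will reason as briefly as possible.\n"

prompt = "<｜begin▁of▁sentence｜><｜User｜>" + question + "<｜Assistant｜>" + prelude

resp = requests.post("http://localhost:8080/completion",   # assumed server address
                     json={"prompt": prompt, "n_predict": 512},
                     timeout=600)
print(prelude + resp.json()["content"])  # full think block plus the answer
```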
-
I'm trying to find a way to influence thinking time in DeepSeek distills that emit `<think>...</think>` before the main output, similar to OAI's reasoning_effort.

Motivation: I want to limit the amount of thinking. It often thinks for too long, but I don't want to disable it altogether (I can already do that with grammar).
I came across this:
#11351
And this:
https://www.reddit.com/r/LocalLLaMA/comments/1j85snw/experimental_control_the_thinking_effort_of_qwq/
I've been playing with llama.cpp to try and get this to work. I'm using the built-in web chat interface with a custom JSON config.
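The config is a logit_bias entry on the `</think>` token; a representative example (the bias value is whichever number is being swept in the experiments below):

```json
{
  "logit_bias": [["</think>", -1.5]]
}
```

(A positive bias should make the closing tag more likely, i.e. shorter thinking; a negative bias should make it less likely, i.e. longer thinking.)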
(I note it has a different structure to the OAI property of the same name).
It doesn't seem to have any effect, until I increase it to about 15 or 20, when I suddenly get an endless stream of tokens.
Similarly, -15 makes it never generate `</think>`. You get the same amount of thinking, but it just ends without the closing tag, and there is no regular output.
(I guess the threshold of "interpreted as infinity" is around +/- 15 or so. That's fine.)
Smaller numbers (1.5, zero, -1.5) seem to have no effect whatsoever.
As an experiment, I also tried asking "list 10 colours" with an analogous bias against the colour word itself.
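Something along these lines (the exact value shouldn't matter, as long as it's strongly negative):

```json
{
  "logit_bias": [["Red", -15], ["red", -15]]
}
```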
And I got plenty of Red and red in both the thinking and regular output! Something's not working as I'd expect.


(I also tried with the numeric token IDs instead of strings. No difference.)
This is a build from source from about a week ago (commit 5e43f10).