
Commit 90a28b9

Fsanic, GindaChen, Viol2000, and AndyDai-nv authored
feat: Add Dynasor-CoT in scaffolding examples. (#3501)
Signed-off-by: Zheyu Fu <[email protected]>
Co-authored-by: Junda Chen <[email protected]>
Co-authored-by: Yichao Fu <[email protected]>
Co-authored-by: Andy Dai <[email protected]>
1 parent 4fedf0b commit 90a28b9

File tree

6 files changed: +1356 −1 lines changed

examples/scaffolding/contrib/Dynasor/README.md
@@ -0,0 +1,56 @@
# Dynasor

This document shows how to speed up reasoning models without training or fine-tuning by using **Dynasor** ([Efficiently Serving LLM Reasoning Programs with Certaindex](https://arxiv.org/abs/2412.20993)) in TensorRT-LLM.

## Overview

Reasoning models often exhibit poor token efficiency, wasting tokens by second-guessing themselves. **Dynasor** is a certainty-based approach that dynamically allocates inference compute for reasoning models and stops inference as soon as the LLM has enough information to make a decision.

Currently, this folder provides only **Dynasor-CoT**, which applies to Chain-of-Thought (CoT) reasoning. It optimizes models such as `Deepseek-R1` and its distilled variants. Support for additional reasoning algorithms (Self-Consistency, Monte Carlo Tree Search, and Rebase) will be added later.

## Usage

The core logic of **Dynasor-CoT** lives in the `DynasorGenerationController` class in `dynasor_controller.py`. It extends the base `Controller` and implements certainty-based stopping.

You can adjust the compute-saving level by initializing `DynasorGenerationController` with different values for:

- `certainty_threshold`: Number of consecutive identical and confident probe answers required to consider the generation certain.
- `chunk_size`: Number of tokens to generate per proposal round.

Lowering either value saves more tokens but may cost accuracy; see the sketch below.
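
As a rough guide, here is a minimal initialization sketch. The constructor arguments mirror `DynasorGenerationController.__init__` in `dynasor_controller.py` (defaults: `max_tokens=8192`, `certainty_threshold=3`, `chunk_size=64`); the import path and model directory are illustrative assumptions:

```python
from dynasor_controller import DynasorGenerationController

# A more aggressive configuration than the defaults; the model path is a
# placeholder, substitute your own checkpoint directory.
controller = DynasorGenerationController(
    generation_dir="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",  # hypothetical path
    max_tokens=8192,
    certainty_threshold=2,  # accept after 2 consecutive identical, confident probes
    chunk_size=32,  # probe every 32 generated tokens instead of 64
)
```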
### Quick Start

1. **Basic usage**

   `DynasorGenerationController` is a compute-saving alternative to `NativeGenerationController`. To try it, run:

   ```bash
   python examples/scaffolding/contrib/Dynasor/scaffolding_dynasor_run.py
   ```

2. **Add an aggregation method**

   You can wrap `DynasorGenerationController` with other controllers, for example `MajorityVoteController` to perform majority voting (see the sketch after this command):

   ```bash
   python examples/scaffolding/contrib/Dynasor/scaffolding_dynasor_run.py --majority_vote
   ```
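
Conceptually, the `--majority_vote` flag wraps the Dynasor controller in a majority-vote aggregator. The sketch below shows the idea; `MajorityVoteController` comes from the scaffolding framework, but the constructor parameters shown here are assumptions rather than the verified API:

```python
from tensorrt_llm.scaffolding import MajorityVoteController

from dynasor_controller import DynasorGenerationController

# Assumed wiring: the wrapper fans out several samples through the inner
# Dynasor controller and keeps the most common answer. Both parameter
# names below are hypothetical.
dynasor = DynasorGenerationController(generation_dir="path/to/model")
majority_vote = MajorityVoteController(
    generation_controller=dynasor,  # hypothetical parameter name
    default_sample_num=5,  # hypothetical parameter name
)
```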
## References

- Blog post: [Dynasor: More Efficient Chain-of-Thought Through Certainty Probing](https://hao-ai-lab.github.io/blogs/dynasor-cot/)
- Paper: https://arxiv.org/abs/2412.20993
- Codebase: https://github.com/hao-ai-lab/Dynasor

If you use Dynasor for your research, please cite our [paper](https://arxiv.org/abs/2412.20993):

```
@article{fu2024efficiently,
  title={Efficiently Serving LLM Reasoning Programs with Certaindex},
  author={Fu, Yichao and Chen, Junda and Zhu, Siqi and Fu, Zheyu and Dai, Zhongdongming and Qiao, Aurick and Zhang, Hao},
  journal={arXiv preprint arXiv:2412.20993},
  year={2024}
}
```

## Acknowledgments

**Dynasor** in TensorRT-LLM is built upon the `tensorrt_llm/scaffolding` framework, which supports a variety of inference-time compute methods, such as chain-of-thought, majority voting, best-of-N sampling, and MCTS. We are grateful to the original `scaffolding` contributors for their excellent work.

If you are researching in this area and interested in extending it, you are warmly invited to contribute your own inference-time compute methods to `scaffolding`.
examples/scaffolding/contrib/Dynasor/__init__.py

Whitespace-only changes.
examples/scaffolding/contrib/Dynasor/dynasor_controller.py
@@ -0,0 +1,160 @@
1+
from enum import Enum
2+
from typing import List
3+
4+
from evaluator import equal_group
5+
from transformers import AutoTokenizer
6+
7+
from tensorrt_llm.scaffolding import Controller, GenerationTask
8+
9+
10+
class DynasorGenerationController(Controller):
11+
12+
class WorkerTag(Enum):
13+
GENERATION = "generation_with_dynasor_cot"
14+
15+
# Certainty_threshold and chunk_size controls the compute saving level
16+
# Decreasing the certainty_threshold and chunk_size will save tokens but may risk at compromising accuracy.
17+
def __init__(self,
18+
generation_dir,
19+
max_tokens=8192,
20+
certainty_threshold=3,
21+
chunk_size=64):
22+
"""
23+
Initializes the controller with parameters controlling token limits and certainty thresholds.
24+
25+
Args:
26+
max_tokens (int): Maximum number of tokens to generate in total.
27+
certainty_threshold (int): Number of consecutive identical and confident probe answers
28+
required to consider the generation as certain.
29+
chunk_size (int): Number of tokens to generate per proposal round.
30+
"""
31+
super().__init__()
32+
self.generation_dir = generation_dir
33+
self.max_tokens = max_tokens
34+
self.certainty_threshold = certainty_threshold
35+
self.chunk_size = chunk_size
36+
self.uncertain_words = ["wait", "hold", "but", "okay", "no", "hmm"]
37+
self.probe_suffix = "... Oh, I suddenly got the answer to the whole problem, **Final Answer**\n\n\\[ \\boxed{"
38+
self.answer_suffix = "\n\n... Oh, I have got the answer to the whole problem\n**Final Answer:**\n\\[\n \\boxed{"
39+
self.answer_suffix_with_marker = "\n\n...</think>\n Oh, I have got the answer to the whole problem\n**Final Answer:**\n\\[\n \\boxed{"
40+
self.tokenizer = AutoTokenizer.from_pretrained(
41+
self.generation_dir,
42+
legacy=False,
43+
padding_side='left',
44+
truncation_side='left',
45+
trust_remote_code=False,
46+
use_fast=True,
47+
)
48+
49+
def process(self, tasks: List[GenerationTask], **kwargs):
50+
"""
51+
Process the generation task using an iterative approach:
52+
1. Generate a probe response with an extra suffix to simulate chain-of-thought.
53+
2. Evaluate the probe response to extract a potential answer.
54+
3. Check for consistency over several rounds (using certainty_threshold).
55+
4. If consistent, finalize the answer and return. Otherwise, continue appending new proposals.
56+
57+
Args:
58+
tasks (List[GenerationTask]): A list of generation tasks to process.
59+
The first task is assumed to hold the initial prompt.
60+
61+
Yields:
62+
A list of GenerationTask objects to be executed in further processing steps.
63+
"""
64+
# Start with the initial prompt provided by the first task.
65+
initial_prompt = tasks[0].input_str
66+
67+
proposer_task = GenerationTask()
68+
proposer_task.max_tokens = self.chunk_size
69+
proposer_task.temperature = 0.6
70+
proposer_task.top_p = 0.95
71+
proposer_task.worker_tag = self.WorkerTag.GENERATION
72+
73+
probe_task = GenerationTask()
74+
probe_task.max_tokens = 20
75+
probe_task.temperature = 0.6
76+
probe_task.top_p = 0.95
77+
probe_task.worker_tag = self.WorkerTag.GENERATION
78+
79+
probe_answers = []
80+
probe_responses = []
81+
82+
initial_prompt_token_num = len(
83+
self.tokenizer.encode(initial_prompt, add_special_tokens=False))
84+
probe_suffix_token_num = len(
85+
self.tokenizer.encode(self.probe_suffix, add_special_tokens=False))
86+
87+
current_prompt = initial_prompt
88+
89+
# Iterate over generation rounds until the maximum tokens limit is reached.
90+
# Make sure length of prefilling is always smaller than the max_tokens in TRTLLMWorker.init_with_new_llm
91+
# Otherwise it will through an assertion fail, stated in issue #3576
92+
for _ in range(initial_prompt_token_num + probe_suffix_token_num,
93+
self.max_tokens, self.chunk_size):
94+
proposer_task.input_str = current_prompt
95+
probe_task.input_str = current_prompt + self.probe_suffix
96+
97+
# For the probe task, append the suffix to force a chain-of-thought leading to an answer.
98+
yield [probe_task]
99+
100+
# Retrieve the output from the probe task.
101+
probe_text = probe_task.output_str
102+
103+
# Extract the potential answer from the probe response.
104+
answer = self.obtain_answer(probe_text)
105+
probe_answers.append(answer)
106+
probe_responses.append(probe_text)
107+
108+
# Determine if the last few probe responses are considered confident enough.
109+
# A response is flagged as confident if it does not contain any of the uncertain words.
110+
probe_certain_count = [
111+
not any(word in res.lower() for word in self.uncertain_words)
112+
for res in probe_responses[-self.certainty_threshold:]
113+
]
114+
115+
# Check if the last 'certainty_threshold' probe answers are identical (by equal_group)
116+
# and they are not empty, and all responses are confident.
117+
if (equal_group(probe_answers[-self.certainty_threshold:])
118+
and self.count_not_empty(
119+
probe_answers[-self.certainty_threshold:])
120+
== self.certainty_threshold
121+
and sum(probe_certain_count) == self.certainty_threshold):
122+
# If the current prompt indicates the chain-of-thought phase has ended, use one type of suffix.
123+
if "</think>" in current_prompt:
124+
tasks[0].output_str = (current_prompt + self.answer_suffix +
125+
probe_answers[-1] + "}\n\\]")
126+
return
127+
else:
128+
# Otherwise, use the suffix with marker to transition clearly.
129+
tasks[0].output_str = (current_prompt +
130+
self.answer_suffix_with_marker +
131+
probe_answers[-1] + "}\n\\]")
132+
return
133+
134+
# if not confident, do another round of generation
135+
yield [proposer_task]
136+
137+
# Append the newly generated text from the proposer to the current prompt for the next iteration.
138+
current_prompt += proposer_task.output_str
139+
140+
# If the maximum token limit is reached without satisfying the certainty condition,
141+
# output the accumulated prompt as the final output.
142+
tasks[0].output_str = current_prompt
143+
return
144+
145+
@staticmethod
146+
def obtain_answer(s):
147+
# Find first unpaired } by counting { and }
148+
stack = []
149+
for i, c in enumerate(s):
150+
if c == "{":
151+
stack.append(c)
152+
elif c == "}":
153+
if not stack: # No matching { found
154+
return s[:i]
155+
stack.pop()
156+
return ""
157+
158+
@staticmethod
159+
def count_not_empty(answers):
160+
return sum(1 for answer in answers if answer != "")
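
# A quick illustration of obtain_answer on a hypothetical probe completion:
# the probe prompt ends with "\boxed{", so a typical continuation looks like
# "42} \] ...", and the first unpaired "}" delimits the extracted answer.
#
#   text = "42} \\] So the final answer is 42."
#   DynasorGenerationController.obtain_answer(text)  # returns "42"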
