[SpecDecode] Support EAGLE in V1 #15901

Open
7 of 10 tasks
WoosukKwon opened this issue Apr 1, 2025 · 21 comments

@WoosukKwon
Collaborator

WoosukKwon commented Apr 1, 2025

  • 1. Correctly initializing and loading the EAGLE draft model
  • 2. Consider the lookahead slots in the KV cache manager (see the sketch below)
  • 3. Cache draft_probs inside the model runner and correctly feed it to the rejection sampler in the next step (temporary workaround: [V1][Spec Decode] Always use argmax for sampling draft tokens #16899)
  • 4. Handle the edge cases like when the draft model generates beyond max_pos_embeddings
  • 5. Handle the seeds correctly
  • 6. Do E2E correctness and performance tests
  • 7. Support prefix caching. Eagle requires special handling because Eagle's i-th KV cache is coupled with the i+1-th token ID. (@LiuXiaoxuanPKU)
  • 8. Properly handle the sampling parameters that are not (currently) compatible with spec decoding (e.g., min_p).
  • 9. Use CUDA graphs for draft model. (@luyuzhe111)
  • 10. Support Eagle 3 ([V1][Spec Decode] EAGLE-3 Support #16937)

Originally posted by @WoosukKwon in #15729 (comment)
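
For item 2 above, here is a minimal sketch of the idea, using a hypothetical helper rather than vLLM's actual KV cache manager API: when speculative decoding is enabled, block allocation has to reserve space for the lookahead slots that the draft tokens will occupy, not just for the tokens already scheduled.

```python
# Illustrative sketch only -- blocks_needed() is a hypothetical helper, not
# part of vLLM's KVCacheManager.
BLOCK_SIZE = 16

def blocks_needed(num_computed_tokens: int, num_new_tokens: int,
                  num_lookahead_tokens: int) -> int:
    """Blocks required so that draft tokens proposed this step have KV slots."""
    total = num_computed_tokens + num_new_tokens + num_lookahead_tokens
    return -(-total // BLOCK_SIZE)  # ceiling division

# At a block boundary, ignoring the lookahead slots would under-allocate:
print(blocks_needed(127, 1, 0))  # 8 blocks -- no room left for draft tokens
print(blocks_needed(127, 1, 4))  # 9 blocks -- lookahead slots reserved
```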

@ekagra-ranjan
Contributor

I will take up Task 3

@wwl2755
Contributor

wwl2755 commented Apr 6, 2025

Hi, I would like to give task 4 a try.

@WoosukKwon
Collaborator Author

Hi @wwl2755, task 4 has already been addressed by #16087 and task 1 is being handled by #16035. Would you be interested in any of the others (particularly 5, 6, or 7)?

@wwl2755
Contributor

wwl2755 commented Apr 6, 2025

Hi @wwl2755, task 4 has already been addressed by #16087 and task 1 is being handled by #16035. Would you be interested in any of the others (particularly 5, 6, or 7)?

Hi @WoosukKwon, sorry for the typo, I meant task 5 (random seed). I'm happy to take this up.

@wwl2755
Contributor

wwl2755 commented Apr 9, 2025

Hi @WoosukKwon, while trying to understand task 7: is it mainly because the draft model won't use the first token's cache in its auto-regressive head? But it seems f_how (in the figure) is used in the draft model?

From what I can see, we are supposed to take the KV cache from the target model, shift it by 1, and then append to it during the proposing process in the draft model until num_speculative_tokens is reached. Is that correct? Thank you!

EDIT: I revisited the details of the paper and now understand that f_how and e_can are concatenated (which we can imagine as the first token in the draft model) and f_can and e_I are concatenated (the second token); that's why there is a shift-by-1. The KV cache comes from processing these new-type tokens. Please correct me if my understanding is wrong. Thank you!
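
To make the shift-by-1 concrete, here is a minimal sketch of that pairing (illustrative tensor code, not vLLM's implementation; the shapes and the fusion layer are assumptions based on the paper): the draft model's input at position i fuses the target model's feature f_i with the embedding of token t_{i+1}, which is why the draft's i-th KV cache entry is coupled with the (i+1)-th token ID.

```python
import torch

hidden_size, vocab_size = 4096, 32000
embed = torch.nn.Embedding(vocab_size, hidden_size)
fuse = torch.nn.Linear(2 * hidden_size, hidden_size)  # EAGLE-style fusion layer (assumed shape)

target_features = torch.randn(1, 5, hidden_size)   # f_0 .. f_4 from the target model
token_ids = torch.randint(0, vocab_size, (1, 6))    # t_0 .. t_5 (t_5 just sampled)

# Pair feature f_i with the embedding of token t_{i+1} -- the "shift by 1".
draft_inputs = fuse(torch.cat([target_features,           # f_0 .. f_4
                               embed(token_ids[:, 1:])],  # e(t_1) .. e(t_5)
                              dim=-1))
# draft_inputs[:, i] is what the draft model processes at position i, so its
# KV cache entry for position i depends on token i+1 (relevant for prefix caching).
```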

Follow-up questions:

  1. Do we consider a tree-based KV cache (multiple possibilities, as in the figure) or simply a contiguous one?
  2. It seems EAGLE is not fully integrated yet, and for now it's hard to write unit tests for the cache. Would it be better to delay this task until the functional EAGLE proposer has been merged into V1?

Second EDIT: On second thought, I think what I mentioned above relates to task 2. Prefix caching, I assume, relates to sharing the same prefix across requests.

[Figure: EAGLE paper diagram showing the feature/embedding concatenation (f_how + e_can, f_can + e_I) referenced above]

@Greatpanc

Which version of the EAGLE algorithm is implemented in vLLM? EAGLE-2 or EAGLE-3?

@LiuXiaoxuanPKU
Collaborator

Which version of the EAGLE algorithm is implemented in vLLM? EAGLE-2 or EAGLE-3?

Currently, just EAGLE-2 without tree attention. EAGLE-3 support will be merged soon, here.

@wwl2755
Contributor

wwl2755 commented Apr 25, 2025

Currently, just EAGLE-2 without tree attention. EAGLE-3 support will be merged soon, here.

If no one is working on it, I will read up on tree attention and work on it.

@LiuXiaoxuanPKU
Collaborator

Currently, just EAGLE-2 without tree attention. EAGLE-3 support will be merged soon, here.

If no one is working on it, I will read up on tree attention and work on it.

That would be great. Also, since it's a big change, could you put together a design doc first so that we can align on the design?

@wwl2755
Contributor

wwl2755 commented Apr 27, 2025

Currently, just EAGLE-2 without tree attention. EAGLE-3 support will be merged soon, here.

If no one is working on it, I will read up on tree attention and work on it.

That would be great. Also, since it's a big change, could you put together a design doc first so that we can align on the design?

Yes, for sure. I will draft a WIP PR when I have a detailed design.

@wwl2755
Contributor

wwl2755 commented Apr 27, 2025

I haven't made many code changes so far, so I've linked the design doc here: https://docs.google.com/document/d/1mMoSicPPMMzaE_T5Zk2SnTderw1OXRUs2T16JxfVGCQ/edit?usp=sharing

Please feel free to leave any comments and I will keep this doc updated. Much appreciated! @LiuXiaoxuanPKU @WoosukKwon

@Greatpanc

[Image: error screenshot] With the latest vLLM v0.8.5, the following error occurs; the older v0.8.4 runs EAGLE fine. Can you help me look into this problem?

@wwl2755
Contributor

wwl2755 commented Apr 29, 2025

Hi @Greatpanc, are you using V1? I'm asking because you are using vllm/model_executor/models/eagle.py, which belongs to V0. If the error occurs after switching to V1, could you please attach a reproducible script? Thanks!

@Greatpanc

Greatpanc commented Apr 30, 2025

I'm using V0, running the Qwen2.5 model. v0.8.4 can run it, but v0.8.5 does not support it. Switching to V1 gets dispatched to the llama_eagle.py file, which is not supported. Is this normal?

Reference code:

```python
# SPDX-License-Identifier: Apache-2.0

import argparse
import json
import os

from transformers import AutoTokenizer

from vllm import LLM, SamplingParams


def load_prompts(dataset_path, num_prompts):
    if os.path.exists(dataset_path):
        prompts = []
        try:
            with open(dataset_path) as f:
                for line in f:
                    data = json.loads(line)
                    prompts.append(data["turns"][0])
        except Exception as e:
            print(f"Error reading dataset: {e}")
            return []
    else:
        prompts = [
            "The future of AI is", "The president of the United States is"
        ]

    return prompts[:num_prompts]


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--dataset",
        type=str,
        default="./examples/data/gsm8k.jsonl",
        help="downloaded from the eagle repo "
        "https://github.com/SafeAILab/EAGLE/blob/main/eagle/data/"
    )
    parser.add_argument("--max_num_seqs", type=int, default=8)
    parser.add_argument("--num_prompts", type=int, default=80)
    parser.add_argument("--num_spec_tokens", type=int, default=2)
    parser.add_argument("--tp", type=int, default=1)
    parser.add_argument("--draft_tp", type=int, default=1)
    parser.add_argument("--enforce_eager", action='store_true')
    parser.add_argument("--enable_chunked_prefill", action='store_true')
    parser.add_argument("--max_num_batched_tokens", type=int, default=2048)
    parser.add_argument("--temp", type=float, default=0)
    return parser.parse_args()


def main():
    args = parse_args()

    model_dir = "Qwen/Qwen2.5-0.5B"
    eagle_dir = "Qwen/Qwen2.5-0.5B"

    max_model_len = 2048

    tokenizer = AutoTokenizer.from_pretrained(model_dir)

    prompts = load_prompts(args.dataset, args.num_prompts)

    prompt_ids = [
        tokenizer.apply_chat_template([{
            "role": "user",
            "content": prompt
        }],
                                      add_generation_prompt=True)
        for prompt in prompts
    ]

    llm = LLM(
        model=model_dir,
        trust_remote_code=True,
        tensor_parallel_size=args.tp,
        enable_chunked_prefill=args.enable_chunked_prefill,
        max_num_batched_tokens=args.max_num_batched_tokens,
        enforce_eager=args.enforce_eager,
        max_model_len=max_model_len,
        max_num_seqs=args.max_num_seqs,
        gpu_memory_utilization=0.8,
        speculative_config={
            "method": "eagle3" if "eagle3" in eagle_dir.lower() else "eagle",
            "model": eagle_dir,
            "num_speculative_tokens": args.num_spec_tokens,
            "draft_tensor_parallel_size": args.draft_tp,
            "max_model_len": max_model_len,
        },
        disable_log_stats=False,
    )

    sampling_params = SamplingParams(temperature=args.temp, max_tokens=256)

    outputs = llm.generate(prompt_token_ids=prompt_ids,
                           sampling_params=sampling_params)

    if not hasattr(outputs, "metrics") or outputs.metrics is None:
        return

    # calculate the average number of accepted tokens per forward pass, +1 is
    # to account for the token from the target model that's always going to be
    # accepted
    acceptance_counts = [0] * (args.num_spec_tokens + 1)
    for output in outputs:
        for step, count in enumerate(
                output.metrics.spec_token_acceptance_counts):
            acceptance_counts[step] += count

    print("-" * 50)
    print(f"mean acceptance length: \
        {sum(acceptance_counts) / acceptance_counts[0]:.2f}")
    print("-" * 50)

    # print acceptance at each token position
    for i in range(len(acceptance_counts)):
        print(f"acceptance at token {i}:"
              f"{acceptance_counts[i] / (acceptance_counts[0]):.2f}")


if __name__ == "__main__":
    main()
```

@wwl2755
Contributor

wwl2755 commented May 1, 2025

@Greatpanc EAGLE can only be used with the specific group of draft models that were trained using the EAGLE method.

See also: https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/eagle.py#L62 and https://github.com/SafeAILab/EAGLE?tab=readme-ov-file#eagle-weights
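
To make that concrete, here is a minimal sketch of a config that pairs a target model with a matching EAGLE-trained draft head; the model names below are illustrative examples, so substitute whichever target/draft pair from the weights list above you actually use.

```python
# Sketch only: the draft under "model" must be an EAGLE-trained head for the
# chosen target, not a plain base model such as Qwen/Qwen2.5-0.5B.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",      # target model (example)
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",  # EAGLE-trained draft (example)
        "num_speculative_tokens": 2,
    },
    max_model_len=2048,
)
outputs = llm.generate(["The future of AI is"],
                       SamplingParams(temperature=0, max_tokens=64))
```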

@ekagra-ranjan
Contributor

Compiled all the benchmarks we have done so far for the EAGLE method in V1 here for easier tracking: #17812

@oreo-wjx

oreo-wjx commented May 9, 2025

I have a concern about the current implementation. In V1 spec decode with tp > 1, under random sampling, each rank generates its own uniform probs in the rejection sampler. This can lead to varying numbers of accepted tokens across ranks, which in turn causes random stalls when I run benchmarks locally. In the V0 engine, however, the uniform probs are only generated on the driver worker, which avoids this issue. @WoosukKwon @LiuXiaoxuanPKU

@WoosukKwon
Collaborator Author

@oreo-wjx Thanks for bringing it up. Yes, we are aware of the issue.
Fundamentally, it's because we set the random seed to None by default. Setting the random seed to 0 or any other number will fix the issue.
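
To illustrate why the seed matters, here is a small sketch (illustrative only, not vLLM's rejection sampler): with seed=None each rank draws its own uniform samples and can accept a different number of draft tokens, while a fixed seed makes every rank draw identical uniforms and agree on the accepted prefix.

```python
import torch

def accepted_prefix_len(draft_probs, target_probs, seed=None):
    """Length of the accepted draft prefix under standard rejection sampling."""
    if seed is not None:
        gen = torch.Generator().manual_seed(seed)  # same uniforms on every rank
    else:
        gen = None  # process-local RNG; ranks can draw different uniforms
    u = torch.rand(len(draft_probs), generator=gen)
    accept = u < (target_probs / draft_probs).clamp(max=1.0)
    rejected = (~accept).nonzero()
    return int(rejected[0]) if len(rejected) > 0 else len(accept)

draft_p = torch.tensor([0.5, 0.4, 0.3])    # q(token) under the draft model
target_p = torch.tensor([0.45, 0.1, 0.3])  # p(token) under the target model
# With seed=None, ranks may disagree on this length; with seed=0 they always match.
print(accepted_prefix_len(draft_p, target_p, seed=0))
```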

@oreo-wjx

oreo-wjx commented May 9, 2025

Seems reasonable. By the way, what's the plan for supporting MTP? Any help needed? @WoosukKwon

@Greatpanc

@wwl2755 I want to run the EAGLE algorithm on the Qwen series of models. Can I add support for the model and train a draft model myself?

@wwl2755
Contributor

wwl2755 commented May 12, 2025

@wwl2755 I want to run the EAGLE algorithm on the Qwen series of models. Can I add support for the model and train a draft model myself?

@Greatpanc IIUC, yes. You could possibly train a new draft model on your own. Ref: https://github.com/SafeAILab/EAGLE
