[SpecDecode] Support EAGLE in V1 #15901

Open
7 of 10 tasks
WoosukKwon opened this issue Apr 1, 2025 · 21 comments

@WoosukKwon
Collaborator

WoosukKwon commented Apr 1, 2025

  • 1. Correctly initializing and loading the EAGLE draft model
  • 2. Consider the lookahead slots in the KV cache manager (see the sketch below)
  • 3. Cache draft_probs inside the model runner and correctly feed it to the rejection sampler in the next step (temporary workaround: [V1][Spec Decode] Always use argmax for sampling draft tokens #16899)
  • 4. Handle the edge cases like when the draft model generates beyond max_pos_embeddings
  • 5. Handle the seeds correctly
  • 6. Do E2E correctness and performance tests
  • 7. Support prefix caching. Eagle requires special handling because Eagle's i-th KV cache is coupled with the i+1-th token ID. (@LiuXiaoxuanPKU)
  • 8. Properly handle the sampling parameters that are not (currently) compatible with spec decoding (e.g., min_p).
  • 9. Use CUDA graphs for draft model. (@luyuzhe111)
  • 10. Support Eagle 3 ([V1][Spec Decode] EAGLE-3 Support #16937)

Originally posted by @WoosukKwon in #15729 (comment)
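
For item 2 above, here is a minimal sketch of the idea, using a hypothetical helper rather than vLLM's actual KV cache manager API: when speculative decoding is enabled, block allocation has to reserve space for the lookahead slots that the draft tokens will occupy, not just for the tokens already scheduled.

```python
# Illustrative sketch only -- blocks_needed() is a hypothetical helper, not
# part of vLLM's KVCacheManager.
BLOCK_SIZE = 16

def blocks_needed(num_computed_tokens: int, num_new_tokens: int,
                  num_lookahead_tokens: int) -> int:
    """Blocks required so that draft tokens proposed this step have KV slots."""
    total = num_computed_tokens + num_new_tokens + num_lookahead_tokens
    return -(-total // BLOCK_SIZE)  # ceiling division

# At a block boundary, ignoring the lookahead slots would under-allocate:
print(blocks_needed(127, 1, 0))  # 8 blocks -- no room left for draft tokens
print(blocks_needed(127, 1, 4))  # 9 blocks -- lookahead slots reserved
```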

@ekagra-ranjan
Contributor

I will take up Task 3

@wwl2755
Contributor

wwl2755 commented Apr 6, 2025

Hi, I would like to give task 4 a try.

@WoosukKwon
Collaborator Author

Hi @wwl2755, task 4 has already been addressed by #16087 and task 1 is being handled by #16035. Would you be interested in any of the others (particularly 5, 6, or 7)?

@wwl2755
Contributor

wwl2755 commented Apr 6, 2025

Hi @wwl2755, task 4 has already been addressed by #16087 and task 1 is being handled by #16035. Would you be interested in any of the others (particularly 5, 6, or 7)?

Hi @WoosukKwon, sorry for the typo, I meant task 5 (random seed). I'm happy to take this up.

@wwl2755
Contributor

wwl2755 commented Apr 9, 2025

Hi @WoosukKwon, while trying to understand task 7: is it mainly because the draft model won't use the first token's cache in its auto-regressive head? But it seems f_how (in the figure) is used in the draft model?

From what I can see, we are supposed to take the KV cache from the target model, shift it by 1, and then append to it during the proposing process in the draft model until num_speculative_tokens is reached. Is that correct? Thank you!

EDIT: I revisited the details of the paper and now understand that f_how and e_can are concatenated (which we can imagine as the first token in the draft model) and f_can and e_I are concatenated (the second token); that's why there is a shift-by-1. The KV cache comes from processing these new-type tokens. Please correct me if my understanding is wrong. Thank you!
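
To make the shift-by-1 concrete, here is a minimal sketch of that pairing (illustrative tensor code, not vLLM's implementation; the shapes and the fusion layer are assumptions based on the paper): the draft model's input at position i fuses the target model's feature f_i with the embedding of token t_{i+1}, which is why the draft's i-th KV cache entry is coupled with the (i+1)-th token ID.

```python
import torch

hidden_size, vocab_size = 4096, 32000
embed = torch.nn.Embedding(vocab_size, hidden_size)
fuse = torch.nn.Linear(2 * hidden_size, hidden_size)  # EAGLE-style fusion layer (assumed shape)

target_features = torch.randn(1, 5, hidden_size)   # f_0 .. f_4 from the target model
token_ids = torch.randint(0, vocab_size, (1, 6))    # t_0 .. t_5 (t_5 just sampled)

# Pair feature f_i with the embedding of token t_{i+1} -- the "shift by 1".
draft_inputs = fuse(torch.cat([target_features,           # f_0 .. f_4
                               embed(token_ids[:, 1:])],  # e(t_1) .. e(t_5)
                              dim=-1))
# draft_inputs[:, i] is what the draft model processes at position i, so its
# KV cache entry for position i depends on token i+1 (relevant for prefix caching).
```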

Follow-up questions:

  1. Do we consider a tree-based KV cache (multiple possibilities, as in the figure) or simply a contiguous one?
  2. It seems EAGLE is not fully integrated yet, and for now it's hard to write unit tests for the cache. Would it be better to delay this task until the functional EAGLE proposer has been merged into V1?

Second EDIT: On second thought, I think what I mentioned above relates to task 2. Prefix caching, I assume, relates to sharing the same prefix across requests.

[Figure: EAGLE paper diagram showing the feature/embedding concatenation (f_how + e_can, f_can + e_I) referenced above]

@Greatpanc

Which version of the EAGLE algorithm is implemented in vLLM? EAGLE-2 or EAGLE-3?

@LiuXiaoxuanPKU
Collaborator

Which version of the EAGLE algorithm is implemented in vLLM? EAGLE-2 or EAGLE-3?

Currently, just EAGLE-2 without tree attention. EAGLE-3 support will be merged soon, here.

@wwl2755
Contributor

wwl2755 commented Apr 25, 2025

Currently, just EAGLE-2 without tree attention. EAGLE-3 support will be merged soon, here.

If no one is working on it, I will read up on tree attention and work on it.

@LiuXiaoxuanPKU
Collaborator

Currently, just EAGLE-2 without tree attention. EAGLE-3 support will be merged soon, here.

If no one is working on it, I will read up on tree attention and work on it.

That would be great. Also, since it's a big change, could you put together a design doc first so that we can align on the design?

@wwl2755
Contributor

wwl2755 commented Apr 27, 2025

Currently, just EAGLE-2 without tree attention. EAGLE-3 support will be merged soon, here.

If no one is working on it, I will read up on tree attention and work on it.

That would be great. Also, since it's a big change, could you put together a design doc first so that we can align on the design?

Yes, for sure. I will draft a WIP PR when I have a detailed design.

@wwl2755
Contributor

wwl2755 commented Apr 27, 2025

I haven't made many code changes so far, so I've linked the design doc here: https://docs.google.com/document/d/1mMoSicPPMMzaE_T5Zk2SnTderw1OXRUs2T16JxfVGCQ/edit?usp=sharing

Please feel free to leave any comments and I will keep this doc updated. Much appreciated! @LiuXiaoxuanPKU @WoosukKwon

@Greatpanc

[Image: error screenshot] With the latest vLLM v0.8.5, the following error occurs; the older v0.8.4 runs EAGLE fine. Can you help me look into this problem?

@wwl2755
Contributor

wwl2755 commented Apr 29, 2025

Hi @Greatpanc, are you using V1? I'm asking because you are using vllm/model_executor/models/eagle.py, which belongs to V0. If the error occurs after switching to V1, could you please attach a reproducible script? Thanks!

@Greatpanc

Greatpanc commented Apr 30, 2025

I'm using V0, running the Qwen2.5 model. v0.8.4 can run it, but v0.8.5 does not support it. Switching to V1 gets dispatched to the llama_eagle.py file, which is not supported. Is this normal?

Reference code:

```python
# SPDX-License-Identifier: Apache-2.0

import argparse
import json
import os

from transformers import AutoTokenizer

from vllm import LLM, SamplingParams


def load_prompts(dataset_path, num_prompts):
    if os.path.exists(dataset_path):
        prompts = []
        try:
            with open(dataset_path) as f:
                for line in f:
                    data = json.loads(line)
                    prompts.append(data["turns"][0])
        except Exception as e:
            print(f"Error reading dataset: {e}")
            return []
    else:
        prompts = [
            "The future of AI is", "The president of the United States is"
        ]

    return prompts[:num_prompts]


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--dataset",
        type=str,
        default="./examples/data/gsm8k.jsonl",
        help="downloaded from the eagle repo "
        "https://github.com/SafeAILab/EAGLE/blob/main/eagle/data/"
    )
    parser.add_argument("--max_num_seqs", type=int, default=8)
    parser.add_argument("--num_prompts", type=int, default=80)
    parser.add_argument("--num_spec_tokens", type=int, default=2)
    parser.add_argument("--tp", type=int, default=1)
    parser.add_argument("--draft_tp", type=int, default=1)
    parser.add_argument("--enforce_eager", action='store_true')
    parser.add_argument("--enable_chunked_prefill", action='store_true')
    parser.add_argument("--max_num_batched_tokens", type=int, default=2048)
    parser.add_argument("--temp", type=float, default=0)
    return parser.parse_args()


def main():
    args = parse_args()

    model_dir = "Qwen/Qwen2.5-0.5B"
    eagle_dir = "Qwen/Qwen2.5-0.5B"

    max_model_len = 2048

    tokenizer = AutoTokenizer.from_pretrained(model_dir)

    prompts = load_prompts(args.dataset, args.num_prompts)

    prompt_ids = [
        tokenizer.apply_chat_template([{
            "role": "user",
            "content": prompt
        }],
                                      add_generation_prompt=True)
        for prompt in prompts
    ]

    llm = LLM(
        model=model_dir,
        trust_remote_code=True,
        tensor_parallel_size=args.tp,
        enable_chunked_prefill=args.enable_chunked_prefill,
        max_num_batched_tokens=args.max_num_batched_tokens,
        enforce_eager=args.enforce_eager,
        max_model_len=max_model_len,
        max_num_seqs=args.max_num_seqs,
        gpu_memory_utilization=0.8,
        speculative_config={
            "method": "eagle3" if "eagle3" in eagle_dir.lower() else "eagle",
            "model": eagle_dir,
            "num_speculative_tokens": args.num_spec_tokens,
            "draft_tensor_parallel_size": args.draft_tp,
            "max_model_len": max_model_len,
        },
        disable_log_stats=False,
    )

    sampling_params = SamplingParams(temperature=args.temp, max_tokens=256)

    outputs = llm.generate(prompt_token_ids=prompt_ids,
                           sampling_params=sampling_params)

    if not hasattr(outputs, "metrics") or outputs.metrics is None:
        return

    # calculate the average number of accepted tokens per forward pass, +1 is
    # to account for the token from the target model that's always going to be
    # accepted
    acceptance_counts = [0] * (args.num_spec_tokens + 1)
    for output in outputs:
        for step, count in enumerate(
                output.metrics.spec_token_acceptance_counts):
            acceptance_counts[step] += count

    print("-" * 50)
    print(f"mean acceptance length: \
        {sum(acceptance_counts) / acceptance_counts[0]:.2f}")
    print("-" * 50)

    # print acceptance at each token position
    for i in range(len(acceptance_counts)):
        print(f"acceptance at token {i}:"
              f"{acceptance_counts[i] / (acceptance_counts[0]):.2f}")


if __name__ == "__main__":
    main()
```

@wwl2755
Contributor

wwl2755 commented May 1, 2025

@Greatpanc EAGLE can only be used with the specific group of draft models that were trained using the EAGLE method.

See also: https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/eagle.py#L62 and https://github.com/SafeAILab/EAGLE?tab=readme-ov-file#eagle-weights
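
To make that concrete, here is a minimal sketch of a config that pairs a target model with a matching EAGLE-trained draft head; the model names below are illustrative examples, so substitute whichever target/draft pair from the weights list above you actually use.

```python
# Sketch only: the draft under "model" must be an EAGLE-trained head for the
# chosen target, not a plain base model such as Qwen/Qwen2.5-0.5B.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",      # target model (example)
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",  # EAGLE-trained draft (example)
        "num_speculative_tokens": 2,
    },
    max_model_len=2048,
)
outputs = llm.generate(["The future of AI is"],
                       SamplingParams(temperature=0, max_tokens=64))
```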

@ekagra-ranjan
Contributor

Compiled all the benchmarks we have done so far for the EAGLE method in V1 here for easier tracking: #17812

@oreo-wjx

oreo-wjx commented May 9, 2025

I have a concern about the current implementation. In V1 spec decode with tp > 1, under random sampling, each rank generates its own uniform probs in the rejection sampler. This can lead to varying numbers of accepted tokens across ranks, which in turn causes random stalls when I run benchmarks locally. In the V0 engine, however, the uniform probs are only generated on the driver worker, which avoids this issue. @WoosukKwon @LiuXiaoxuanPKU

@WoosukKwon
Collaborator Author

@oreo-wjx Thanks for bringing it up. Yes, we are aware of the issue.
Fundamentally, it's because we set the random seed to None by default. Setting the random seed to 0 or any other number will fix the issue.
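
To illustrate why the seed matters, here is a small sketch (illustrative only, not vLLM's rejection sampler): with seed=None each rank draws its own uniform samples and can accept a different number of draft tokens, while a fixed seed makes every rank draw identical uniforms and agree on the accepted prefix.

```python
import torch

def accepted_prefix_len(draft_probs, target_probs, seed=None):
    """Length of the accepted draft prefix under standard rejection sampling."""
    if seed is not None:
        gen = torch.Generator().manual_seed(seed)  # same uniforms on every rank
    else:
        gen = None  # process-local RNG; ranks can draw different uniforms
    u = torch.rand(len(draft_probs), generator=gen)
    accept = u < (target_probs / draft_probs).clamp(max=1.0)
    rejected = (~accept).nonzero()
    return int(rejected[0]) if len(rejected) > 0 else len(accept)

draft_p = torch.tensor([0.5, 0.4, 0.3])    # q(token) under the draft model
target_p = torch.tensor([0.45, 0.1, 0.3])  # p(token) under the target model
# With seed=None, ranks may disagree on this length; with seed=0 they always match.
print(accepted_prefix_len(draft_p, target_p, seed=0))
```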

@oreo-wjx

oreo-wjx commented May 9, 2025

Seems reasonable. By the way, what's the plan for supporting MTP? Any help needed? @WoosukKwon

@Greatpanc

@wwl2755 I want to run the EAGLE algorithm on the Qwen series of models. Can I add support for the model and train a draft model myself?

@wwl2755
Contributor

wwl2755 commented May 12, 2025

@wwl2755 I want to run the EAGLE algorithm on the Qwen series of models. Can I add support for the model and train a draft model myself?

@Greatpanc IIUC, yes. You could possibly train a new draft model on your own. Ref: https://github.com/SafeAILab/EAGLE
