[SpecDecode] Support EAGLE in V1 #15901
Comments
I will take up Task 3.
Hi, I would like to give task 4 a try.
Hi @WoosukKwon, sorry for the typo, I meant task 5 (random seed). I'm happy to take this up.
Hi @WoosukKwon, when I try to understand task 7, is it mainly because the draft model won't use the first token's cache in its auto-regressive head? From what I could see, we are supposed to take the KV cache from the target model, shift it by 1, and then append to it during the proposing process in the draft model. EDIT: I re-visited the details of the paper and now understand that part. Follow-up questions:
Second EDIT: On second thought, I think what I mentioned above is related to task 2. Prefix caching, I assume, is about sharing the same prefix across requests.
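To make the shift-by-one point above concrete, here is a minimal sketch of how the draft head's inputs could be aligned, assuming the EAGLE-style pairing of the target model's hidden state at position i with the token at position i+1. The function and tensor names are illustrative, not vLLM's actual model-runner code.

```python
import torch


def build_eagle_draft_inputs(target_hidden_states: torch.Tensor,
                             token_ids: torch.Tensor):
    """Pair hidden state h_i with token id t_{i+1} (illustrative only).

    target_hidden_states: [seq_len, hidden_size] from the target model.
    token_ids:            [seq_len] token ids of the same sequence.
    """
    # Shift the tokens left by one so position i sees the "next" token,
    # which is why the first position of the cache is effectively unused
    # by the draft head.
    shifted_tokens = token_ids[1:]
    aligned_hidden = target_hidden_states[:-1]
    return aligned_hidden, shifted_tokens
```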
Which version of the EAGLE algorithm is implemented in vLLM? EAGLE-2 or EAGLE-3?
Currently, just EAGLE-2 without tree attention. EAGLE-3 support will be merged soon, here.
If no one is working on it, I will look into it and work on tree attention.
That would be great. Also, since it's a big change, could you put together a design doc first so that we can align on the design?
Yes, for sure. I will draft a WIP PR once I have a detailed design.
I haven't made many code changes so far, so I linked the design doc here: https://docs.google.com/document/d/1mMoSicPPMMzaE_T5Zk2SnTderw1OXRUs2T16JxfVGCQ/edit?usp=sharing Please feel free to leave any comments and I will keep this doc updated. Much appreciated! @LiuXiaoxuanPKU @WoosukKwon
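Since the design doc itself isn't reproduced in this thread, here is a minimal sketch of the core idea behind a tree attention mask for draft-token trees: each draft token attends only to itself and its ancestors in the tree (the committed prefix, which is fully visible, is omitted). The parent-index encoding and the function name are illustrative assumptions, not taken from the linked doc.

```python
import torch


def tree_attention_mask(parents: list[int]) -> torch.Tensor:
    """Build an [n, n] boolean mask for n draft tokens arranged in a tree.

    parents[i] is the index of token i's parent, or -1 for a tree root.
    mask[i, j] is True iff token i may attend to token j, i.e. j is i
    itself or one of i's ancestors.
    """
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = True
            j = parents[j]
    return mask


# Example tree: root 0 with children 1 and 2; token 3 is a child of 2.
print(tree_attention_mask([-1, 0, 0, 2]).int())
```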
Hi @Greatpanc, are you using V1? I'm asking because you are using vllm/model_executor/models/eagle.py, which refers to V0. If the error occurs after switching to V1, could you please attach a reproducible script? Thanks!
I use V0, running the Qwen2.5 model. v0.8.4 runs fine, but v0.8.5 does not support it. After switching to V1, it gets dispatched to the llama_eagle.py file, which is not supported. Is this normal? Reference code:

```python
# SPDX-License-Identifier: Apache-2.0
import argparse

from transformers import AutoTokenizer

from vllm import LLM, SamplingParams


def load_prompts(dataset_path, num_prompts):
    ...


def parse_args():
    ...


def main():
    ...


if __name__ == "__main__":
    main()
```
@Greatpanc EAGLE can only be used with the specific group of models that are trained with the EAGLE method. See also: https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/eagle.py#L62 and https://github.com/SafeAILab/EAGLE?tab=readme-ov-file#eagle-weights
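For reference, a rough sketch of how such a pairing is typically wired up for offline inference, loosely following the linked example; the target/draft model names and the exact `speculative_config` keys are assumptions that may differ across vLLM releases.

```python
from vllm import LLM, SamplingParams

# Hypothetical pairing of a target model with an EAGLE-trained draft head;
# the speculative_config keys shown here may vary between vLLM versions.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3-Instruct-8B",
        "num_speculative_tokens": 2,
    },
)

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```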
Compiled all the benchmarks we have done so far for the EAGLE method in V1 here for easier tracking: #17812
I have a concern about the current implementation. In V1 spec decode with TP > 1, under random sampling, each rank generates its own uniform probs in the rejection sampler. This can lead to varying numbers of accepted tokens across ranks, which in turn causes random stalls when I run benchmarks locally. In the V0 engine, however, uniform probs are generated only on the driver worker, which avoids this issue. @WoosukKwon @LiuXiaoxuanPKU
@oreo-wjx Thanks for bringing it up. Yeah, we are aware of the issue.
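A toy illustration of the failure mode described above (not vLLM's rejection-sampler code): if each TP rank draws its own uniform samples, ranks can disagree on how many draft tokens are accepted, whereas using the same seed on every rank (or sampling on one rank and broadcasting) keeps them in lockstep.

```python
import torch


def num_accepted(draft_p: torch.Tensor, target_p: torch.Tensor,
                 generator: torch.Generator) -> int:
    """Toy rejection loop: accept draft token i while u_i < min(1, p_target/p_draft)."""
    u = torch.rand(draft_p.shape[0], generator=generator)
    accept = u < torch.clamp(target_p / draft_p, max=1.0)
    # Stop at the first rejection, as in speculative decoding.
    rejected = (~accept).nonzero()
    return int(rejected[0]) if len(rejected) > 0 else len(accept)


draft_p = torch.tensor([0.6, 0.5, 0.4])    # draft prob of each sampled token
target_p = torch.tensor([0.5, 0.45, 0.1])  # target prob of the same tokens

# Different per-rank seeds -> ranks may disagree on the acceptance count.
print([num_accepted(draft_p, target_p, torch.Generator().manual_seed(s)) for s in (0, 1, 2)])
# Same seed on every rank -> identical acceptance counts.
print([num_accepted(draft_p, target_p, torch.Generator().manual_seed(0)) for _ in range(3)])
```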
Seems reasonable. BTW, what's the plan for supporting MTP? Any help needed? @WoosukKwon
@wwl2755 I want to run the EAGLE algorithm on the Qwen series of models. Can I add support for the model and create a draft model myself?
@Greatpanc IIUC, yes. You could train a new draft model on your own. Ref: https://github.com/SafeAILab/EAGLE
- Keep `draft_probs` inside the model runner and correctly feed it to the rejection sampler in the next step (temporary workaround: [V1][Spec Decode] Always use argmax for sampling draft tokens #16899)
- `max_pos_embeddings`
Originally posted by @WoosukKwon in #15729 (comment)
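For context on why `draft_probs` matters: in speculative-decoding rejection sampling, a draft token x is accepted with probability min(1, p_target(x) / p_draft(x)), so the sampler at step t+1 needs the draft distribution produced at step t; the argmax workaround above avoids carrying that distribution across steps. Below is a minimal sketch of a per-request buffer that could carry it; the class and method names are hypothetical, not vLLM's actual model-runner interface.

```python
import torch


class DraftProbsBuffer:
    """Illustrative only: keep each request's draft-token probabilities from
    step t so the rejection sampler can consume them at step t+1, once the
    target model has verified those tokens."""

    def __init__(self) -> None:
        self._probs: dict[str, torch.Tensor] = {}

    def save(self, req_id: str, draft_probs: torch.Tensor) -> None:
        # draft_probs: [num_draft_tokens, vocab_size] for this request.
        self._probs[req_id] = draft_probs

    def pop(self, req_id: str) -> torch.Tensor | None:
        # Returns None for requests that had no draft tokens last step.
        return self._probs.pop(req_id, None)
```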