[V1][Spec Decode] Non greedy sample with EAGLE / Reduce memory allocation for Rejection Sampler #16077

Open
wants to merge 13 commits into
base: main
from

Conversation

ekagra-ranjan
Contributor

@ekagra-ranjan ekagra-ranjan commented Apr 4, 2025

Task 3 of #15901
Enable non greedy sampling in Eagle

  • Avoid the intermediate tensor allocation for the rejection sampler (RS) buffers that torch.stack was causing
  • Cache draft_probs inside the model runner and correctly feed them to the rejection sampler in the next step

This PR changes the proposer() interface: instead of returning the draft_token_ids or their probabilities, it saves them in internal variables that are accessed through getters such as get_draft_token_ids() and get_draft_probs().
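
A minimal sketch of the getter-style interface described above, in plain Python. Everything here except the getter names get_draft_token_ids() and get_draft_probs() is hypothetical (the class, the propose() body, the call counter), and argmax is used only to keep the toy deterministic; it is an illustration of the pattern, not the PR's code.

```python
class ToyProposer:
    """Hypothetical stand-in for a draft proposer with getter-style access."""

    def __init__(self):
        self._draft_token_ids = []
        self._draft_probs = []
        self.num_propose_calls = 0

    def propose(self, draft_dists):
        """Pick one draft token per distribution and cache tokens and probs."""
        self.num_propose_calls += 1
        # Argmax keeps the toy deterministic; a real proposer would sample.
        self._draft_token_ids = [
            max(range(len(d)), key=d.__getitem__) for d in draft_dists
        ]
        self._draft_probs = [list(d) for d in draft_dists]

    def get_draft_token_ids(self):
        # Returns the cached result; does not propose again.
        return self._draft_token_ids

    def get_draft_probs(self):
        # Cached probabilities, to be handed to the rejection sampler later.
        return self._draft_probs


proposer = ToyProposer()
proposer.propose([[0.1, 0.7, 0.2], [0.5, 0.4, 0.1]])
ids_first = proposer.get_draft_token_ids()
ids_again = proposer.get_draft_token_ids()
print(ids_first, ids_first is ids_again, proposer.num_propose_calls)
```

In this sketch, repeated getter calls hand back the same cached object, so the proposing work runs once per step no matter how often the results are read.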

github-actions bot commented Apr 4, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label Apr 4, 2025
@ekagra-ranjan ekagra-ranjan marked this pull request as ready for review April 7, 2025 21:30
@ekagra-ranjan ekagra-ranjan changed the title from "[V1][Spec Decode] Fix and Optimize Rejection Sampler" to "[V1][Spec Decode] Enable non greedy sampling with EAGLE and reduce memory allocation for Rejection Sampler" Apr 7, 2025
@LiuXiaoxuanPKU LiuXiaoxuanPKU self-assigned this Apr 8, 2025

mergify bot commented Apr 8, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ekagra-ranjan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 8, 2025
@mergify mergify bot removed the needs-rebase label Apr 8, 2025
vocab_size,
dtype=torch.float32,
device=device)
self._draft_probs_buffer_shape = self._draft_probs_buffer.shape
Collaborator

Just want to point this out; we might need to fix it. The draft_probs_buffer has the following size (plugging in the numbers for Llama3-8B):
256 * 10 * 128256 * 4 / 1024 / 1024 ≈ 1252.5 MiB (about 1.3 GB)
There is a small chance that this triggers an OOM if the allocation happens after vLLM preallocates all memory for the KV cache, but it should not be a big problem.
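
The estimate above can be reproduced with a short calculation. This is a rough sketch: reading 256 as the maximum number of sequences and 10 as the maximum number of draft tokens per sequence is an assumption (the comment does not label the factors); 128256 is the Llama 3 8B vocabulary size and 4 is the byte width of float32.

```python
# Rough size estimate for a dense draft-probability buffer.
# The meaning of the first two factors is an assumption, not stated above.
max_num_seqs = 256       # assumption: maximum concurrent sequences
max_draft_tokens = 10    # assumption: maximum draft tokens per sequence
vocab_size = 128256      # Llama 3 8B vocabulary size
bytes_per_elem = 4       # float32

total_bytes = max_num_seqs * max_draft_tokens * vocab_size * bytes_per_elem
size_mib = total_bytes / 1024 / 1024
size_gib = size_mib / 1024
size_gb = total_bytes / 1e9

print(f"{total_bytes} bytes = {size_mib:.1f} MiB "
      f"= {size_gib:.2f} GiB = {size_gb:.2f} GB")
```

The "1.3G" figure corresponds to decimal gigabytes; in binary units the same buffer is a little over 1.2 GiB.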

Contributor Author

Is there a way to allocate this before vLLM preallocates memory for KVC?

Contributor Author

The RS buffer is preallocated here, which is when the GPUModelRunner is created. Could you point me to the line of code that computes the available GPU memory and allocates the KVC based on it?


# restore shape of buffers if it has been
# changed by any future operation
if (self._draft_probs_buffer.shape != self._draft_probs_buffer_shape):
Collaborator

Why might this buffer be reshaped by any operation?

Contributor Author

Since we pass the buffer out of the function, the caller gets a handle to it and might accidentally reshape it. I am assuming that someone might do this in the future, since it's not obvious that they shouldn't. The check will help in those cases. Let me know if this check should be removed.

device=device)
self._draft_probs_buffer_shape = self._draft_probs_buffer.shape

def get_draft_token_ids(self) -> torch.Tensor:
Collaborator

Since we have the get_draft_token_ids API, I'm wondering if it might be cleaner to move all proposing logic (https://github.com/vllm-project/vllm/blob/660a6b0ed756bb7ca0459786fd8302b9ede2c280/vllm/v1/worker/gpu_model_runner.py#L1171C8-L1229C14) under this function?

Contributor Author

@ekagra-ranjan ekagra-ranjan Apr 10, 2025

The get_draft_token_ids API is more like a getter for the right slice of the preallocated buffer, so repeated calls just return a handle to the buffer. If we move the proposer logic here, then repeated calls will propose again. We could refactor the code and put that section under a new API if that makes sense.

mergify bot commented Apr 10, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ekagra-ranjan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 10, 2025
@LiuXiaoxuanPKU
Collaborator

Also, can we have a test for this PR? Appreciate it!

@mergify mergify bot removed the needs-rebase label Apr 11, 2025
@ekagra-ranjan
Contributor Author

ekagra-ranjan commented Apr 11, 2025

Update: I changed the draft prob buffer layout to be packed. I tested the AL (acceptance length) with my metric hack ekagra-ranjan#1 on MTBench (same setup).

With K=2:

  • temp=0, AL=1.90 (the previous benchmark on the above setup was 1.89, so it matches)
  • temp=0.3, AL=1.89
  • temp=0.75, AL=1.86
  • temp=1, AL=1.82

So it seems to work. Next, I will add some unit tests.
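
For readers unfamiliar with the term, one common meaning of a "packed" layout is sketched below. This is an assumption about what packed means here, not the PR's code: instead of reserving max_draft_tokens rows for every request, rows for all requests are stored back to back and each request is located by a running offset. The function name and the example batch are hypothetical.

```python
from itertools import accumulate


def packed_offsets(num_draft_tokens):
    """Return (start, end) row ranges into a packed buffer, one per request."""
    ends = list(accumulate(num_draft_tokens))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


# Hypothetical batch: three requests with 2, 0 and 1 draft tokens.
num_draft_tokens = [2, 0, 1]
ranges = packed_offsets(num_draft_tokens)
print(ranges)

# Row counts for a padded layout versus a packed layout.
max_draft_tokens = 2
padded_rows = len(num_draft_tokens) * max_draft_tokens
packed_rows = sum(num_draft_tokens)
print(padded_rows, packed_rows)
```

Each row would hold one vocabulary-sized probability vector, so dropping the padding rows saves memory whenever requests carry fewer draft tokens than the maximum.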

@ekagra-ranjan ekagra-ranjan changed the title from "[V1][Spec Decode] Enable non greedy sampling with EAGLE and reduce memory allocation for Rejection Sampler" to "[V1][Spec Decode] Non greedy sample with EAGLE / Reduce memory allocation for Rejection Sampler" Apr 11, 2025
@ekagra-ranjan
Contributor Author

I added a non-greedy sampling test to test_eagle_correctness. However, it fails for T != 0: the number of exact matches drops from 100 to 20 when T changes from 0 to 1.0, and for T=0.3 and T=0.75 the matches are around 40. Even the ngram test fails when I add a non-greedy sampling parameter to the test. I am not sure exact matching is a robust way to test this.
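
To illustrate why exact matching is fragile at T != 0, here is a toy sketch of the standard single-token speculative (rejection) sampling rule: accept the draft token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual max(p - q, 0). It is not vLLM's implementation; the distributions, function names and seed are made up. The accepted tokens follow the target distribution p, but individual draws depend on how the random numbers are consumed, so a token-for-token match with plain sampling is not expected.

```python
import random
from collections import Counter


def sample(probs, rng):
    """Draw an index from a discrete distribution given as a list of weights."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def speculative_step(target_probs, draft_probs, rng):
    """One-token speculative sampling: accept the draft token or resample."""
    draft_token = sample(draft_probs, rng)
    p = target_probs[draft_token]
    q = draft_probs[draft_token]
    if rng.random() < min(1.0, p / q):
        return draft_token
    # Rejected: resample from the (unnormalized) residual max(p - q, 0).
    residual = [max(pt - qt, 0.0) for pt, qt in zip(target_probs, draft_probs)]
    return sample(residual, rng)


# Toy target (p) and draft (q) distributions over a 3-token vocabulary.
target_probs = [0.6, 0.3, 0.1]
draft_probs = [0.3, 0.5, 0.2]

rng = random.Random(0)
n = 50_000
counts = Counter(
    speculative_step(target_probs, draft_probs, rng) for _ in range(n)
)
empirical = [counts[i] / n for i in range(len(target_probs))]
print(empirical)  # should be close to target_probs
```

A distribution-level comparison like this (or a statistical test on token frequencies) is one possible alternative to exact-match counts for non-greedy settings.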

I also logged the output for the sample prompts and compared SD (EAGLE) with the vanilla output at various temperatures below. While T=0 gives an exact match, T != 0 does not give an exact match between vanilla and SD; however, the answers are coherent.

Output for various temperatures (0, 0.3, 0.75, 1.0):

0:
SD:
[
    {
        "input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nThe future of AI is<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "output": "What a fascinating topic! The future of AI is likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some potential developments that could shape the future of AI:\n\n1. **Increased Adoption**: AI will become more ubiquitous, with applications in various industries, such as healthcare, finance, education, and transportation.\n2. **Advancements in Machine Learning**: Machine learning algorithms will continue to improve, enabling AI systems to learn from data, adapt to new situations, and make more accurate predictions.\n3. **Natural Language Processing (NLP)**: NLP will become more sophisticated, allowing humans to interact with AI systems using natural language, and enabling AI-powered chatbots and virtual assistants to understand and respond to complex queries.\n4. **Computer Vision**: Computer vision will continue to improve, enabling AI systems to interpret and understand visual data, such as images and videos, and apply this knowledge to various applications, such as self-driving cars and medical diagnosis.\n5. **Edge AI**: Edge AI will become more prevalent, enabling AI processing to occur closer to the source of the data, reducing latency and improving real-time decision-making.\n6. **Explainability and Transparency**: As AI becomes more pervasive, there will be a growing need for"
    },
    {
        "input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nThe president of the United States is<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "output": "As of my knowledge cutoff, the President of the United States is Joe Biden."
    }
]

vanilla:
[
    {
        "input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nThe future of AI is<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "output": "What a fascinating topic! The future of AI is likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some potential developments that could shape the future of AI:\n\n1. **Increased Adoption**: AI will become more ubiquitous, with applications in various industries, such as healthcare, finance, education, and transportation.\n2. **Advancements in Machine Learning**: Machine learning algorithms will continue to improve, enabling AI systems to learn from data, adapt to new situations, and make more accurate predictions.\n3. **Natural Language Processing (NLP)**: NLP will become more sophisticated, allowing humans to interact with AI systems using natural language, and enabling AI-powered chatbots and virtual assistants to understand and respond to complex queries.\n4. **Computer Vision**: Computer vision will continue to improve, enabling AI systems to interpret and understand visual data, such as images and videos, and apply this knowledge to various applications, such as self-driving cars and medical diagnosis.\n5. **Edge AI**: Edge AI will become more prevalent, enabling AI processing to occur closer to the source of the data, reducing latency and improving real-time decision-making.\n6. **Explainability and Transparency**: As AI becomes more pervasive, there will be a growing need for"
    },
    {
        "input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nThe president of the United States is<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "output": "As of my knowledge cutoff, the President of the United States is Joe Biden."
    }
]

0.3:
SD:
[

    {
        "input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nThe future of AI is<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "output": "What a fascinating topic! The future of AI is likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some potential developments that could shape the future of AI:\n\n1. **Increased Adoption**: AI is expected to become even more ubiquitous, with applications in various industries, such as healthcare, finance, transportation, and education.\n2. **Advancements in Machine Learning**: Machine learning algorithms will continue to improve, enabling AI systems to learn from data, adapt to new situations, and make more accurate predictions.\n3. **Natural Language Processing (NLP)**: NLP will become more sophisticated, allowing humans to interact with AI systems more naturally, using voice commands, text, or gestures.\n4. **Computer Vision**: Computer vision will continue to improve, enabling AI systems to interpret and understand visual data, such as images and videos, with greater accuracy.\n5. **Edge AI**: Edge AI, which involves processing data closer to the source, will become more prevalent, reducing latency and improving real-time decision-making.\n6. **Explainability and Transparency**: As AI systems become more complex, there will be a growing need for explainability and transparency, to ensure accountability and trust.\n7. **Human-AI Collaboration**: AI will"
    },
    {
        "input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nThe president of the United States is<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "output": "As of my knowledge cutoff in 2021, the President of the United States is Joe Biden."
    }
]
vanilla:
[
    {
        "input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nThe future of AI is<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "output": "A topic that sparks much excitement and curiosity! The future of AI is likely to be shaped by various factors, including technological advancements, societal needs, and ethical considerations. Here are some potential developments that could shape the future of AI:\n\n1. **Increased Adoption**: AI will become more ubiquitous, with applications in various industries, such as healthcare, finance, education, and transportation.\n2. **Advancements in Machine Learning**: Machine learning algorithms will continue to improve, enabling AI systems to learn from data, adapt to new situations, and make more accurate predictions.\n3. **Natural Language Processing (NLP)**: NLP will become more sophisticated, allowing humans to interact with AI systems more naturally, using voice commands, text, or gestures.\n4. **Computer Vision**: Computer vision will improve, enabling AI systems to interpret and understand visual data, such as images and videos, with greater accuracy.\n5. **Edge AI**: AI will be deployed at the edge of the network, closer to the source of the data, to reduce latency and improve real-time decision-making.\n6. **Explainable AI**: As AI becomes more pervasive, there will be a growing need to understand how AI systems make decisions, leading to the development of explainable AI (XAI) techniques.\n"
    },
    {
        "input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nThe president of the United States is<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "output": "As of my knowledge cutoff, the President of the United States is Joe Biden."
    }
]


0.75:
SD:
[
    {
        "input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nThe future of AI is<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "output": "What a fascinating topic! The future of Artificial Intelligence (AI) is expected to be shaped by numerous factors, including advancements in technology, human innovation, and societal needs. Here are some potential developments that could shape the future of AI:\n\n1. **Increased Adoption**: AI is expected to become even more ubiquitous, with applications in various industries, such as healthcare, finance, transportation, and education.\n2. **Advances in Machine Learning**: Machine learning, a subset of AI, will continue to improve, enabling machines to learn from data, make decisions, and adapt to new situations.\n3. **Natural Language Processing (NLP)**: NLP will become more sophisticated, allowing humans to interact with AI systems more naturally, using voice commands, text, or gestures.\n4. **Computer Vision**: Computer vision will continue to improve, enabling AI systems to interpret and understand visual data, such as images and videos.\n5. **Robotics and Robotics Intelligence (RI)**: RI will become more prevalent, enabling robots to perform tasks that require intelligence, dexterity, and flexibility.\n6. **Autonomous Systems**: Autonomous vehicles, drones, and robots will become more common, transforming industries like logistics, transportation, and manufacturing.\n7. **Explainability and Transparency**: As AI"
    },
    {
        "input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nThe president of the United States is<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "output": "As of my knowledge cutoff in August 2022, the President of the United States is Joe Biden."
    }
]

vanilla:
[
    {
        "input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nThe future of AI is<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "output": "What a fascinating topic!\n\nThe future of AI is inherently uncertain, but based on current trends, research, and expert opinions, here are some potential developments that could shape the future of AI:\n\n1. **Artificial General Intelligence (AGI)**: AGI is AI that possesses human-like intelligence, able to reason, learn, and apply knowledge across various tasks. If achieved, AGI could revolutionize many industries and aspects of our lives.\n2. **Explainable AI (XAI)**: As AI becomes more prevalent, there will be a growing need for transparency and explainability. XAI will help users understand AI decision-making processes, increasing trust and accountability.\n3. **Edge AI**: The proliferation of IoT devices and the need for real-time processing will drive the development of Edge AI, which processes data closer to where it's generated, reducing latency and improving efficiency.\n4. **Multimodal AI**: AI will become more adept at processing and understanding various forms of data, such as images, videos, audio, and natural language, enabling more effective human-machine interaction.\n5. **Autonomous Systems**: Autonomous vehicles, drones, and robots will become increasingly common, transforming industries like logistics, transportation, and healthcare.\n6. **AI-Powered Medicine**: AI"
    },
    {
        "input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nThe president of the United States is<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "output": "Joe Biden"
    }
]

1.0:
SD:
[
    {
        "input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nThe future of AI is<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "output": "What a great topic!\n\nThe future of AI is full of uncertainty andpromise. Here are some potential developments that could shape the future of AI:\n\n1. **Advancements in Machine Learning**: Machine learning is the foundation of many AI systems. Future advancements in this area could lead to even more sophisticated AI models that can learn faster, better, and more efficiently.\n2. **Edge AI**: With the rise of IoT devices and edge computing, we can expect to see more AI processing happening at the edge, closer to the source of the data, rather than in the cloud.\n3. **Explainable AI (XAI)**: As AI becomes more pervasive, there is a growing need for transparency and explainability. XAI could help us understand how AI models make decisions and reduce the risk of bias and misinformation.\n4. **Human-AI collaboration**: AI will increasingly be used to augment human capabilities, rather than replace them. This could lead to new forms of collaboration between humans and AI, enabling us to work more efficiently and effectively.\n5. **Natural Language Processing (NLP)**: NLP is a key area of AI research, enabling computers to understand and process human language. Expect to see significant advancements in this area, leading to more conversational AI interfaces.\n"
    },
    {
        "input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nThe president of the United States is<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "output": "As of my knowledge cutoff in August 2022, the 46th and current President of the United States is Joe Biden."
    }
]

vanilla:
[
    {
        "input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nThe future of AI is<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "output": "The future of AI is a topic of much debate and speculation! As AI continues to evolve, we can expect to see significant advancements in several areas. Here are some potential developments that could shape the future of AI:\n\n1. **Increased Adoption**: AI will become more ubiquitous, with applications in various industries, including healthcare, finance, education, and more. Expect to see more AI-powered chatbots, virtual assistants, and automated decision-making systems.\n2. **Explainability and Transparency**: As AI systems become more sophisticated, there will be a greater emphasis on explainability and transparency. Researchers will focus on developing methods to understand and interpret AI decision-making processes, improving trust and accountability.\n3. **Edge AI**: The proliferation of IoT devices and edge computing will lead to the development of AI algorithms that can process data in real-time, closer to the source, reducing latency and improving performance.\n4. **Meta-Learning**: AI systems will learn to learn, adapting to new tasks and environments more efficiently. This could revolutionize areas like robotics, gaming, and natural language processing.\n5. **Human-AI Collaboration**: As AI becomes more capable, humans and machines will work together more closely. Expect to see AI-assisted creative work, such as art, music, and writing"
    },
    {
        "input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nThe president of the United States is<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
        "output": "I'm happy to help! However, I must clarify that the presidency is a dynamic position and is likely to change every four years or sooner due to elections or other circumstances.\n\nAs of my last update (January 2021), the 46th and current President of the United States is Joe Biden. If you're looking for the most current information, I recommend checking out reputable news sources or the official White House website for the latest updates.\n\nIf you have any specific questions or topics related to the presidency or current events, I'm here to help!"
    }
]

@LiuXiaoxuanPKU could you share your thoughts on how non-greedy sampling was tested in V0, or how we should test this PR? Thanks for the review!

@mergify mergify bot added the documentation Improvements or additions to documentation label Apr 11, 2025
Collaborator

@WoosukKwon WoosukKwon left a comment

@ekagra-ranjan Thanks for the PR!

I haven't had a chance to look through the entire PR, but one thing I noticed is that the logic handling the draft probabilities is implemented inside EAGLE. IIUC, this logic is general to all spec decode methods (e.g., Medusa), so I think it should be implemented as a shared utility.

@ekagra-ranjan
Contributor Author

ekagra-ranjan commented Apr 12, 2025

I noticed is that the logic handling the draft probability is implemented inside the eagle. IIUC, this logic is general for all spec methods (e.g., medusa) so I think it should be implemented as a shared utility.

@WoosukKwon - Got it, will make the change. Could you confirm whether you are referring to the existing compute_probs_and_sample_next_token() and the newly added draft buffers, or to something else as well?

mergify bot commented Apr 16, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ekagra-ranjan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@wwl2755
Contributor

wwl2755 commented Apr 25, 2025

Just want to check in. What's the status of this PR? As mentioned in #15901 (comment), we should still properly handle the non-argmax sampling parameters.

@ekagra-ranjan
Contributor Author

@wwl2755 I ran some tests and am waiting for @LiuXiaoxuanPKU to confirm whether the results look fine. Once confirmed, I will bring this PR up to date with the recent changes.

Labels
documentation Improvements or additions to documentation needs-rebase speculative-decoding v1
5 participants