[V1][Spec Decode] Non greedy sample with EAGLE / Reduce memory allocation for Rejection Sampler #16077
base: main
Conversation
Signed-off-by: ekagra <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: ekagra <[email protected]>
vocab_size,
dtype=torch.float32,
device=device)
self._draft_probs_buffer_shape = self._draft_probs_buffer.shape
Just want to point this out; we might need to fix it. For the draft_probs_buffer, plugging in the numbers for Llama-3-8B, it has size:

256 * 10 * 128256 * 4 bytes ≈ 1.3 GB

There is a low probability that this triggers OOM if we allocate it after vLLM preallocates all memory for the KV cache, but it should not be a big problem.
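The arithmetic above can be checked quickly (the variable names here are illustrative, not actual vLLM config fields):

```python
# Back-of-the-envelope check of the buffer size quoted above.
max_num_seqs = 256        # batch size used in the estimate
num_spec_tokens = 10      # draft tokens per sequence
vocab_size = 128256       # Llama-3-8B vocabulary
bytes_per_elem = 4        # torch.float32

size_bytes = max_num_seqs * num_spec_tokens * vocab_size * bytes_per_elem
print(f"{size_bytes / 1e9:.2f} GB")  # prints "1.31 GB"
```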
Is there a way to allocate this before vLLM preallocates memory for KVC?
The preallocation of the RS buffer happens here, which is when the GPUModelRunner is created. Could you point me to the line of code that computes the available GPU memory and allocates the KV cache based on it?
# restore shape of buffers if it has been
# changed by any future operation
if (self._draft_probs_buffer.shape != self._draft_probs_buffer_shape):
Why might this buffer be reshaped by any operation?
Since we are passing the buffer outside of the function, the caller gets a handle to this buffer and might accidentally reshape it. I am assuming someone might do this in the future, since it is not obvious that they shouldn't. The check will help in those cases. Let me know if this check should be removed.
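To illustrate the concern, here is a small sketch with NumPy standing in for torch (the failure mode is the same for any array library that allows in-place shape mutation; all names here are illustrative, not the PR's code):

```python
import numpy as np

# The owner hands out a handle to its preallocated buffer; a caller can
# mutate the shape in place, so the owner restores the recorded shape
# before reusing the buffer.
draft_probs_buffer = np.zeros((4, 6), dtype=np.float32)
draft_probs_buffer_shape = draft_probs_buffer.shape

handle = draft_probs_buffer      # what the caller receives
handle.shape = (24,)             # caller accidentally reshapes in place

# Owner-side guard before the next use:
if draft_probs_buffer.shape != draft_probs_buffer_shape:
    draft_probs_buffer.shape = draft_probs_buffer_shape

print(draft_probs_buffer.shape)  # (4, 6)
```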
device=device)
self._draft_probs_buffer_shape = self._draft_probs_buffer.shape

def get_draft_token_ids(self) -> torch.Tensor:
Since we have the get_draft_token_ids API, I'm wondering if it might be cleaner to move all the proposing logic (https://github.com/vllm-project/vllm/blob/660a6b0ed756bb7ca0459786fd8302b9ede2c280/vllm/v1/worker/gpu_model_runner.py#L1171C8-L1229C14) under this function?
The get_draft_token_ids API is more like a getter for the right slice of the preallocated buffer, so repeated calls just return a handle to the buffer. If we moved the proposer logic under it, repeated calls would propose again. We could refactor the code and put that section under a new API if that makes sense.
Also, can we have a test for this PR? Appreciate it!
Signed-off-by: ekagra <[email protected]>
Co-authored-by: Lily Liu <[email protected]>
Update: I updated the draft prob buffer layout to be packed. I tested the AL with my metric hack ekagra-ranjan#1 on MTBench (same setup) with K=2, and it seems to work. Next, I will add some unit tests.
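For readers unfamiliar with the packed layout mentioned above, here is a hedged sketch (illustrative NumPy, not the PR's code): instead of a padded [max_seqs, k, vocab] tensor, the draft rows of all sequences are stored contiguously in a [total_tokens, vocab] buffer.

```python
import numpy as np

vocab = 8
draft_lens = [3, 1, 2]                 # draft tokens per sequence
total = sum(draft_lens)
packed = np.zeros((total, vocab), dtype=np.float32)

# Prefix-sum offsets let each sequence locate its slice in the
# packed buffer.
offsets = np.cumsum([0] + draft_lens)  # [0, 3, 4, 6]
seq1_probs = packed[offsets[1]:offsets[2]]
print(seq1_probs.shape)  # (1, 8)
```

The packed form avoids allocating vocab-sized rows for padding positions when sequences have different numbers of draft tokens.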
I added a non-greedy sampling test. I also logged the output for the sample prompt and compared SD (EAGLE) with vanilla output at various temperatures below. While T=0 gives an exact match, T!=0 does not match vanilla exactly, but the answers are coherent. Output for various temperatures (0, 0.3, 0.75, 1.0):
@LiuXiaoxuanPKU could you share your thoughts on how non-greedy sampling was tested in V0, or how we should test this PR? Thanks for the review!
@ekagra-ranjan Thanks for the PR!
I haven't had a chance to look through the entire PR, but one thing I noticed is that the logic handling the draft probabilities is implemented inside the EAGLE proposer. IIUC, this logic is general to all spec methods (e.g., Medusa), so I think it should be implemented as a shared utility.
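For context, the rule that makes non-greedy speculative decoding exact is small and method-agnostic, which is why it can live in a shared utility. This is a generic sketch of the acceptance step, not vLLM's actual RejectionSampler:

```python
import random

# Accept a drafted token with probability min(1, p_target / p_draft),
# where p_target and p_draft are the probabilities the target and draft
# models assign to that token.
def accept_draft(p_target: float, p_draft: float, rng: random.Random) -> bool:
    return rng.random() < min(1.0, p_target / p_draft)

rng = random.Random(0)
print(accept_draft(0.5, 0.4, rng))  # True: ratio >= 1, always accepted
print(accept_draft(0.0, 0.4, rng))  # False: target gives the token no mass
```

On rejection, a replacement token is sampled from the normalized residual max(0, p_target - p_draft), which is what preserves the target distribution regardless of which draft model proposed the token.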
@WoosukKwon - Got it, will make the change. Could you confirm whether you are referring to the already existing
Just want to check in: what's the status of this PR? As mentioned in #15901 (comment), we still need to properly handle the non-argmax sampling parameters.
@wwl2755 I did some tests and am waiting for @LiuXiaoxuanPKU to confirm the results look fine. Once confirmed, I will bring this PR up to date with the recent changes.
Task 3 of #15901
Enable non-greedy sampling in EAGLE.

This PR changes the proposer() interface: instead of returning the draft_token_ids or their probabilities, it saves them in internal buffers that are accessed through getters such as get_draft_token_ids() and get_draft_probs().
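A minimal sketch of that getter-based shape (NumPy in place of torch; the method names follow the description above, but the internals are invented for illustration):

```python
import numpy as np

class EagleProposer:
    """Sketch of a proposer that writes drafts into preallocated
    buffers and exposes getters that return views, not copies."""

    def __init__(self, max_tokens: int, vocab_size: int):
        # Allocated once up front; getters hand out slices of these.
        self._draft_token_ids = np.zeros(max_tokens, dtype=np.int64)
        self._draft_probs = np.zeros((max_tokens, vocab_size),
                                     dtype=np.float32)
        self._num_drafted = 0

    def propose(self, k: int) -> None:
        # Stand-in for the draft-model forward pass: fill k positions.
        self._draft_probs[:k] = 1.0 / self._draft_probs.shape[1]
        self._draft_token_ids[:k] = np.arange(k)
        self._num_drafted = k

    def get_draft_token_ids(self) -> np.ndarray:
        # Pure getter: repeated calls return the same buffer slice
        # without re-running the proposer.
        return self._draft_token_ids[:self._num_drafted]

    def get_draft_probs(self) -> np.ndarray:
        return self._draft_probs[:self._num_drafted]

proposer = EagleProposer(max_tokens=8, vocab_size=10)
proposer.propose(k=3)
print(proposer.get_draft_token_ids().tolist())  # [0, 1, 2]
```

Separating propose() from the getters is what keeps repeated getter calls side-effect free, as discussed in the review thread.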