[V1] Ensure using int64 for sampled token ids #15065
Conversation
cc @houseroad
```python
# with subsequent operations that may use these values as indices.
# This conversion is necessary because FlashInfer sampling operations
# return int32 (while PyTorch argmax and topk return int64).
sampled = sampled.long()
```
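For context, a minimal sketch of the failure this cast avoids. The tensor names and shapes below are illustrative assumptions, not code from the PR: PyTorch indexing ops such as `torch.gather` require int64 indices, so int32 token ids coming out of a sampling kernel break as soon as they are used for indexing.

```python
import torch

# Illustrative shapes only; names are hypothetical, not from the PR.
logprobs = torch.randn(4, 32000)  # [num_reqs, vocab_size]
sampled_int32 = torch.randint(0, 32000, (4, 1), dtype=torch.int32)

# torch.gather requires an int64 index tensor, so int32 sampled ids
# fail as soon as they are used for indexing.
try:
    logprobs.gather(-1, sampled_int32)
except RuntimeError as exc:
    print(f"int32 index rejected: {exc}")

# Casting once right after sampling makes every downstream indexing
# op (e.g. logprob gathering) safe.
sampled = sampled_int32.long()
token_logprobs = logprobs.gather(-1, sampled)
print(token_logprobs.shape)  # torch.Size([4, 1])
```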
Actually, I'm debating whether to keep this in `gather_logprobs`, since we could skip the conversion when `gather_logprobs` is not called. What do you think?
- I think that's error-prone. For example, other ops in the future might try to use this tensor for indexing and hit the same error.
- The op should be very cheap; it's a no-op in the common case (no top-p or top-k) where `sampled` is already a long tensor. Even if it isn't, the `sampled` tensor here is pretty small, so I don't think its overhead will matter.
Agreed with @WoosukKwon here; `sampled` is pretty small even in the worst case (1024).
Okay, if we see overhead, we can always optimize it :-)
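To make the "cheap in the common case" point above concrete, here is a small sketch under assumed shapes (not vLLM code): when sampling is greedy, `argmax` already yields int64 and `.long()` hands back the same tensor; otherwise the cast copies only one id per request.

```python
import torch

logits = torch.randn(1024, 32000)  # assumed worst case: 1024 requests

# Greedy path: argmax already returns int64, so .long() is a no-op
# that returns the same tensor object.
sampled = logits.argmax(dim=-1)
assert sampled.dtype == torch.int64
assert sampled.long() is sampled

# Top-p/top-k path (e.g. FlashInfer kernels) may return int32; the
# cast then copies just num_reqs elements (here at most 1024), which
# is negligible next to the sampling work itself.
sampled_int32 = sampled.to(torch.int32)
sampled = sampled_int32.long()
assert sampled.dtype == torch.int64
```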
A more fundamental solution to the bug reported in #14999 and #15049.