Add quickreduce as alternative to custom allreduce #16804
hi, @ilmarkov , |
This pull request has merge conflicts that must be resolved before it can be |
Signed-off-by: ilmarkov <[email protected]>
What are the cases where custom allreduce performs better than quickreduce? It would be better if quickreduce could surpass custom allreduce in all cases; then we could use quickreduce as a drop-in replacement for custom allreduce without a new user-facing flag.
@youkaichao It is slower for smaller input sizes. We could take an approach similar to the one custom allreduce uses: one-shot for small buffers and two-shot for larger ones.
That would be great, can you implement it? We can use either quickreduce or custom allreduce at the engine level, instead of dynamically switching based on the input size.
Yes, we can try to implement this approach.
you can use an environment variable, like |
Add quickreduce as an alternative to custom allreduce.
The collective is only enabled on AMD. The kernels support a quantized allreduce collective with int4/int8 symmetric and asymmetric quantization algorithms.
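For readers unfamiliar with the distinction, here is a minimal pure-Python sketch of symmetric versus asymmetric integer quantization. It is an illustration of the general technique only, not the kernels in this PR; the function names and the per-buffer (rather than per-block) scale are assumptions made for the example.

```python
def quantize_symmetric(values, bits=8):
    """Signed quantization: one scale, zero-point fixed at 0."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8, 7 for int4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize_symmetric(q, scale):
    return [x * scale for x in q]


def quantize_asymmetric(values, bits=8):
    """Unsigned quantization: a scale plus an offset (the buffer minimum)."""
    qmax = 2 ** bits - 1  # 255 for 8 bits, 15 for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = [max(0, min(qmax, round((v - lo) / scale))) for v in values]
    return q, scale, lo


def dequantize_asymmetric(q, scale, lo):
    return [x * scale + lo for x in q]
```

Symmetric quantization carries only a scale, so it is cheaper to transmit and decode; asymmetric quantization also carries an offset, which uses the integer range more fully when the data is not centered on zero.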
The kernels can be enabled with the quick_reduce_allreduce_algo config parameter. On AMD, we first check whether it is profitable to use quickreduce; otherwise we fall back to custom allreduce.
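The fallback logic described above could be sketched roughly as follows. This is a hypothetical illustration, not the code in this PR: `DispatchConfig`, `choose_allreduce`, and the `min_quickreduce_bytes` threshold are invented names, and treating message size as the profitability test is an assumption based on the comment that quickreduce is slower for smaller inputs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DispatchConfig:
    is_amd: bool
    # None means quickreduce is disabled (hypothetical representation).
    quick_reduce_algo: Optional[str] = None
    # Hypothetical cutoff; a real value would come from benchmarking.
    min_quickreduce_bytes: int = 1 << 20


def choose_allreduce(cfg: DispatchConfig, nbytes: int) -> str:
    """Pick quickreduce only when it is enabled, on AMD, and likely profitable."""
    enabled = cfg.is_amd and cfg.quick_reduce_algo is not None
    if enabled and nbytes >= cfg.min_quickreduce_bytes:
        return "quickreduce"
    # Everything else falls back to the existing custom allreduce path.
    return "custom_allreduce"
```

With these assumed defaults, `choose_allreduce(DispatchConfig(is_amd=True, quick_reduce_algo="int8"), 4 << 20)` selects quickreduce, while a 4 KiB buffer under the same config falls back to custom allreduce.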