
feat: move AR fusion kernels from trtllm #1061


Closed
yyihuang wants to merge 36 commits

Conversation


@yyihuang yyihuang commented May 15, 2025

Move All-reduce fusion kernels from trtllm to flashinfer

Changes:

  • Add the trtllm AR-fusion kernels
  • Add the Python interface (a reference sketch follows this list)
  • Add quantization utils
  • (optional) Add trtllm-style checker/assertion utilities; currently unused and replaced by torch checks. Please evaluate this choice during review.
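The fused op bundles the collective with the epilogue that usually follows it in a transformer layer. Below is a minimal sketch of the intended semantics, assuming the common allreduce + residual-add + RMSNorm pattern from trtllm and an already-initialized process group; the name `allreduce_fusion_op` and its parameters are illustrative, not this PR's exact signature.

```python
import torch
import torch.distributed as dist

def allreduce_fusion_op(
    inp: torch.Tensor,       # local partial result, all-reduced in place
    residual: torch.Tensor,  # residual added after the reduce
    gamma: torch.Tensor,     # RMSNorm weight applied to the normalized sum
    eps: float = 1e-6,
) -> torch.Tensor:
    """Reference semantics: RMSNorm(allreduce(inp) + residual) * gamma."""
    dist.all_reduce(inp)  # stand-in for the fused CUDA kernel
    hidden = (inp + residual).float()
    rms = torch.rsqrt(hidden.pow(2).mean(-1, keepdim=True) + eps)
    return (hidden * rms).to(inp.dtype) * gamma
```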

To be discussed:

  • Keep trtllm-style assertions/checks, or stay with the current torch checks?
  • Expose the interface at the allreduce_fusion_kernel_XXXX level or at the allreduce_fusion_op level? (currently allreduce_fusion_op; a sketch follows this list)
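To make the interface question concrete, the sketch below shows the op-level choice: one public function that dispatches to private kernel bindings. Every name and the size threshold here are hypothetical. Exposing allreduce_fusion_kernel_XXXX instead would surface each variant directly and move this dispatch onto callers.

```python
import torch

def _allreduce_fusion_kernel_one_shot(inp, residual, gamma):
    raise NotImplementedError  # placeholder for a per-kernel binding

def _allreduce_fusion_kernel_two_shot(inp, residual, gamma):
    raise NotImplementedError  # placeholder for a per-kernel binding

def allreduce_fusion_op(inp: torch.Tensor, residual, gamma, *, strategy="auto"):
    # Op-level interface: kernel selection stays internal, so callers never
    # depend on individual kernel entry points.
    if strategy == "auto":
        strategy = "one_shot" if inp.numel() <= (1 << 20) else "two_shot"
    kernels = {
        "one_shot": _allreduce_fusion_kernel_one_shot,
        "two_shot": _allreduce_fusion_kernel_two_shot,
    }
    return kernels[strategy](inp, residual, gamma)
```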

Next TODOs:

  • fix compilation
  • minimize dependencies
  • add a flashinfer logger, exceptions, and checks (possibly torch-check style; a sketch follows this list)
  • design a unified interface for the communication module
  • unit tests for the Python interface
  • benchmarks (optional?)
  • unit tests for the C++ interface (not planned)
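On the logger/exception/check item: a minimal sketch of Python-level input validation in the spirit of C++ TORCH_CHECK, failing fast with readable messages. The helper name and the supported-dtype set are assumptions, not the eventual flashinfer helpers.

```python
import torch

def _check_allreduce_inputs(inp: torch.Tensor, residual: torch.Tensor) -> None:
    # Validate early so kernel launches never see malformed inputs.
    if not (inp.is_cuda and residual.is_cuda):
        raise ValueError("allreduce fusion expects CUDA tensors")
    if inp.shape != residual.shape:
        raise ValueError(
            f"shape mismatch: input {tuple(inp.shape)} vs "
            f"residual {tuple(residual.shape)}"
        )
    if inp.dtype not in (torch.float16, torch.bfloat16):
        raise ValueError(f"unsupported dtype {inp.dtype}; expected fp16 or bf16")
```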


yzh119 commented May 28, 2025

Closed for now because it introduces a deep trtllm dependency and is hard to maintain.
We will split the trtllm comm kernels into three pieces:

  1. one- and two-shot allreduce kernels (w/ rmsnorm fusion): feat: add trtllm all-reduce (non-MoE) #1096 (a reference sketch follows this list)
  2. low-precision allreduce kernels
  3. moe allreduce kernels
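For reference, the two strategies in item 1 compute the same result but move data differently; the sketch below simulates both on host tensors, as illustrative semantics only, not the CUDA kernels.

```python
import torch

def one_shot_allreduce(shards: list[torch.Tensor]) -> list[torch.Tensor]:
    # One-shot: each rank reads every peer's buffer and reduces the whole
    # tensor locally; a single communication round, best for small messages.
    total = torch.stack(shards).sum(dim=0)
    return [total.clone() for _ in shards]

def two_shot_allreduce(shards: list[torch.Tensor]) -> list[torch.Tensor]:
    # Two-shot: reduce-scatter (each rank reduces only its chunk), then
    # all-gather; two rounds, but less redundant work on large messages.
    world = len(shards)
    chunks = [s.chunk(world) for s in shards]  # assumes numel divisible by world
    reduced = [sum(c[r] for c in chunks) for r in range(world)]
    return [torch.cat(reduced) for _ in shards]
```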

@yzh119 yzh119 closed this May 28, 2025
yzh119 added a commit that referenced this pull request Jun 2, 2025

## 📌 Description

We add the trt-llm custom all-reduce kernels to the flashinfer comm module.
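As a hedged illustration of what the module's entry point computes, the sketch below uses plain torch.distributed as a semantic stand-in for the custom kernel rather than guessing flashinfer-side names; the trt-llm path targets the same reduction with lower latency on small and medium messages.

```python
import torch
import torch.distributed as dist

def main() -> None:
    # One process per GPU, e.g. launched with `torchrun --nproc_per_node=8`.
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank())
    x = torch.randn(8192, 4096, dtype=torch.float16, device="cuda")
    # NCCL all-reduce has the same semantics as the custom kernel; the
    # custom kernels specialize the latency-bound message regime.
    dist.all_reduce(x)

if __name__ == "__main__":
    main()
```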

## 🔍 Related Issues

We split the original PR (#1061) into multiple PRs.
The MoE kernels are also in progress.

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).


---------

Co-authored-by: Zihao Ye <[email protected]>