
[Kernel] DeepEP dispatch-combine kernel integration #18434

Merged · 5 commits · Jun 3, 2025

Conversation

Contributor

@varun-sundar-rabindranath commented May 20, 2025

Integrate DeepEP dispatch-combine kernels

  • Integrate the DeepEP high-throughput and low-latency dispatch-combine kernels
  • Integrate the DeepEP high-throughput kernel with the corresponding DeepGemm kernel

Correctness:

  • Tested correctness using lm_eval on H100 with:
    Models: deepseek-ai/DeepSeek-V2-Lite, RedHatAI/DeepSeek-Coder-V2-Lite-Instruct-FP8, Qwen/Qwen3-30B-A3B-FP8
    ALL2ALL Backend: deepep_high_throughput
    Case: DP=2 TP=1
  • Tested correctness using lm_eval on H100 with:
    Models: deepseek-ai/DeepSeek-V2-Lite, RedHatAI/DeepSeek-Coder-V2-Lite-Instruct-FP8
    ALL2ALL Backend: deepep_low_latency
    Case: DP=2 TP=1
    Note: the DeepEP low-latency kernels are compiled only for a fixed set of hidden sizes, and DeepSeek-V2-Lite's hidden size is not among them. I had to update DeepEP to support this hidden size in order to run the test.
  • Tested correctness using lm_eval on A100 with:
    Models: deepseek-ai/DeepSeek-V2-Lite
    ALL2ALL Backend: pplx
    Case: DP=2 TP=1


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@varun-sundar-rabindranath marked this pull request as draft on May 20, 2025, 20:00

mergify bot commented May 23, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @varun-sundar-rabindranath.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


mergify bot commented May 29, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @varun-sundar-rabindranath.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label May 29, 2025
Comment on lines 462 to 465
# TODO (varun) : deepgemm integration
self.use_batched_experts = False
if envs.VLLM_ALL2ALL_BACKEND == "deepep_ll":
self.use_batched_experts = True
Contributor

It might be better to add a method to prepare_finalize that says whether the format is batched, instead of using an env var or checking the instance type.

Contributor Author

Introduced a function max_num_tokens_per_rank() on the prepare_finalize objects - we can use it to determine batching 👍
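
For illustration, a minimal sketch of how such a method could drive the batching decision (the "None means non-batched" convention is an assumption for illustration, not necessarily the PR's contract):

from typing import Optional

def use_batched_experts(prepare_finalize) -> bool:
    # Hypothetical helper: a fixed per-rank token capacity implies the
    # expert-batched activation format (an assumption for illustration).
    max_tokens: Optional[int] = prepare_finalize.max_num_tokens_per_rank()
    return max_tokens is not None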

@mergify mergify bot removed the needs-rebase label May 30, 2025
@varun-sundar-rabindranath marked this pull request as ready for review on May 30, 2025, 21:57
Collaborator

@tlrmchlsmth left a comment

Looks good. I just left a few minor comments. Going to try this out in a multinode setup

# weights have already been applied.
combine_topk_weights = torch.ones_like(topk_weights)

# TODO (varun) : Enable zero copy mode
Contributor

Still TODO?

Contributor Author

Yeah. It could be a fast-follow.

and _valid_deep_gemm(hidden_states, w1, w2, expert_map)):
return self.deep_gemm_expert.apply(
and _valid_deep_gemm(hidden_states, w1, w2)):
return self.deep_gemm_expert.apply( #type: ignore
Collaborator

Why do we need the #type: ignore? Could you add a comment (or better: resolve the type issues)?

Contributor Author

deep_gemm_expert is an Optional (its existence depends on the self.allow_deep_gemm flag, which is checked right above this line). Let me see if an assert right above fixes it - I also didn't want to unnecessarily introduce an assert on the hot path.
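
As a self-contained illustration of the assert-based narrowing being discussed (the class and attribute names below are made up, not the PR's code):

from typing import Optional

class DeepGemmExpert:
    def apply(self, x: float) -> float:
        return 2.0 * x

class ExpertsWrapper:
    def __init__(self, allow_deep_gemm: bool):
        self.allow_deep_gemm = allow_deep_gemm
        # Only constructed when DeepGemm is allowed, hence Optional.
        self.deep_gemm_expert: Optional[DeepGemmExpert] = (
            DeepGemmExpert() if allow_deep_gemm else None)

    def forward(self, x: float) -> float:
        if self.allow_deep_gemm:
            # Narrows Optional[DeepGemmExpert] to DeepGemmExpert for mypy,
            # avoiding the "# type: ignore".
            assert self.deep_gemm_expert is not None
            return self.deep_gemm_expert.apply(x)
        return x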

Comment on lines 324 to 325
num_nvl_bytes=1024 * 1024 * 1024, # 1 GiB
num_rdma_bytes=0,
Collaborator

Is this right for the multinode case? Thinking we might need to set num_rdma_bytes > 0 in that case.

Contributor Author

You are correct - it does need to be nonzero for the internode case. Let me fix that 👍 Nice catch!

Contributor Author

@tlrmchlsmth - fixed in b841fac. I got the defaults from the DeepEP tests; we can update them if need be.
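
A rough sketch of the kind of switch being discussed (the byte sizes are the DeepEP-test defaults mentioned above, used here as assumptions rather than tuned values):

def make_all_to_all_args(num_nodes: int) -> dict:
    internode = num_nodes > 1
    return dict(
        num_nvl_bytes=1024 * 1024 * 1024,  # 1 GiB NVLink (intranode) buffer
        # The RDMA buffer must be non-zero once ranks span multiple nodes.
        num_rdma_bytes=1024 * 1024 * 1024 if internode else 0,
        low_latency_mode=False,
        num_qps_per_rank=1,
    )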

num_nvl_bytes=1024 * 1024 * 1024, # 1 GiB
num_rdma_bytes=0,
low_latency_mode=False,
num_qps_per_rank=1)
Collaborator

@varun-sundar-rabindranath do you know what this argument is for?

Contributor Author

from the docs, this is the number of parallel RDMA connections each rank can establish.

Contributor Author

It is tied to the NVSHMEM_IBGDA_NUM_RC_PER_PE environment variable in the code.

Collaborator

Do you mean that we should be reading the NVSHMEM_IBGDA_NUM_RC_PER_PE env and passing it in here?

Comment on lines +90 to +99
n_tiles_w1 = ((2 * n) + block_n - 1) // block_n
k_tiles_w1 = (k + block_k - 1) // block_k
n_tiles_w2 = (k + block_n - 1) // block_n
k_tiles_w2 = (n + block_k - 1) // block_k
Collaborator

nit: More readable to use e.g. k_tiles_w1 = round_up(k, block_k)

vllm/vllm/utils.py, lines 729 to 730 in c57d577:

def round_up(x: int, y: int) -> int:
    return ((x + y - 1) // y) * y
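
The expressions in the snippet above are ceiling divisions (tile counts), so a small helper states that intent directly; cdiv and the shapes below are illustrative only:

def cdiv(a: int, b: int) -> int:
    """Ceiling division: the number of b-sized tiles needed to cover a."""
    return (a + b - 1) // b

# Example shapes, for illustration only.
n, k = 2048, 7168
block_n, block_k = 128, 128

n_tiles_w1 = cdiv(2 * n, block_n)  # == ((2 * n) + block_n - 1) // block_n
k_tiles_w1 = cdiv(k, block_k)
n_tiles_w2 = cdiv(k, block_n)
k_tiles_w2 = cdiv(n, block_k)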

Collaborator

@tlrmchlsmth left a comment

Looks good overall.

I had some questions on the construction of the all_to_all_args for the HT and LL cases -- want to make sure we're good on num_rdma_bytes, num_qps_per_rank before landing.

Other stuff is pretty minor

apply_router_weight_on_input: bool,
output_dtype: torch.dtype):

if fused_expert_output.ndim == 2:
Contributor

Why would fused_expert_output have varying ndim?

Contributor Author

The DeepEP high-throughput dispatch kernel does not produce batched output; as a result we end up using TritonOrDeepGemmExperts, and the output of those "experts" is 2-dimensional.
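
For illustration only (the sizes are made up), the shape difference being described:

import torch

hidden = 2048
# Non-batched path (DeepEP HT dispatch + TritonOrDeepGemmExperts): [num_tokens, hidden]
ht_expert_output = torch.randn(32, hidden)
# Batched path: [num_local_experts, max_tokens_per_rank, hidden]
batched_expert_output = torch.randn(8, 16, hidden)

assert ht_expert_output.ndim == 2
assert batched_expert_output.ndim == 3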

max_tokens_per_rank: int,
quant_dtype: Optional[torch.dtype] = None,
block_shape: Optional[list[int]] = None,
use_fp8_dispatch: bool = False):
Contributor

Does this flag indicate that the input has already been quantized?

Contributor Author

No. This is a performance-related option in the low-latency kernels.
The low-latency dispatch kernel takes bfloat16 inputs.
This option tells the kernel to quantize the inputs internally and dispatch them as fp8. The kernel outputs tokens and scales, which we dequantize on the receiving end.
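
A minimal, self-contained sketch of that quantize-on-dispatch / dequantize-on-receive round trip (per-token scaling here is a simplification for illustration, not DeepEP's actual scaling granularity or kernel interface):

import torch

def quantize_for_dispatch(x: torch.Tensor):
    # x: [num_tokens, hidden] bfloat16 activations.
    amax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-4).float()
    scale = amax / torch.finfo(torch.float8_e4m3fn).max
    x_fp8 = (x.float() / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale  # both the tokens and the scales travel over the network

def dequantize_on_receive(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (x_fp8.float() * scale).to(torch.bfloat16)

x = torch.randn(16, 2048, dtype=torch.bfloat16)
x_fp8, scale = quantize_for_dispatch(x)
x_roundtrip = dequantize_on_receive(x_fp8, scale)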

Comment on lines +96 to +101
if apply_router_weight_on_input:
topk = rank_topk_ids.size(1)
# TODO: this only works for topK=1, will need to update for topK>1
assert topk == 1, (
"apply_router_weight_on_input is only implemented for topk=1")
a1 = a1 * rank_topk_weights.to(a1.dtype)
Contributor

I feel like this snippet has been repeated enough that we should make a utility out of it at some point. We can save it for a later PR tho.

Contributor Author

Yeah, I felt the same - something like maybe_apply_router_weight_on_input. But I agree we can put it in a later PR.
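
A sketch of such a utility, based on the snippet above (the name and signature are just the ones floated in this thread, not the PR's code):

import torch

def maybe_apply_router_weight_on_input(
        a1: torch.Tensor,
        topk_weights: torch.Tensor,
        topk_ids: torch.Tensor,
        apply_router_weight_on_input: bool) -> torch.Tensor:
    if not apply_router_weight_on_input:
        return a1
    topk = topk_ids.size(1)
    # TODO: this only works for topk=1 and will need updating for topk>1.
    assert topk == 1, (
        "apply_router_weight_on_input is only implemented for topk=1")
    return a1 * topk_weights.to(a1.dtype)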

Comment on lines 438 to 442
use_batched_experts = (
isinstance(prepare_finalize, BatchedPrepareAndFinalize) or
(has_pplx and isinstance(prepare_finalize, PplxPrepareAndFinalize))
or (has_deepep
and isinstance(prepare_finalize, DeepEPLLPrepareAndFinalize)))
Contributor

We should add a method to the prepare and finalize class that returns the activation format, e.g. expert-batched vs. non-batched. Then we won't need all the isinstance checks. I'm fine with doing this in another PR tho.

Contributor Author

fixed it 👍
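
For reference, a minimal sketch of the kind of activation-format hook being suggested (the enum and method names are illustrative, not the PR's final API):

from enum import Enum

class ActivationFormat(Enum):
    STANDARD = "standard"          # [num_tokens, hidden]
    BATCHED_EXPERTS = "batched"    # [num_experts, max_tokens_per_rank, hidden]

class PrepareAndFinalizeBase:
    @property
    def activation_format(self) -> ActivationFormat:
        return ActivationFormat.STANDARD

class DeepEPLLPrepareAndFinalize(PrepareAndFinalizeBase):
    @property
    def activation_format(self) -> ActivationFormat:
        return ActivationFormat.BATCHED_EXPERTS

def use_batched_experts(pf: PrepareAndFinalizeBase) -> bool:
    # Replaces the chain of isinstance checks shown above.
    return pf.activation_format is ActivationFormat.BATCHED_EXPERTS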

Comment on lines +926 to +915
if (self.moe_parallel_config.use_pplx_kernels
or self.moe_parallel_config.use_deepep_ll_kernels):
Contributor

This isn't applicable for the ht kernels?

Contributor Author

No. The HT kernels are not batched.

Comment on lines +1322 to +1311
if (self.use_pplx_kernels or self.use_deepep_ht_kernels
or self.use_deepep_ll_kernels):
Contributor

Maybe we should add a new umbrella property for all these types of kernels?

@@ -1305,12 +1408,17 @@ def process_chunk(chunk_start, chunk_end, skip_result_store=False):
def forward_impl(self, hidden_states: torch.Tensor,
router_logits: torch.Tensor):
assert self.quant_method is not None
if self.moe_parallel_config.use_pplx_kernels:
if (self.moe_parallel_config.use_pplx_kernels
or self.moe_parallel_config.use_deepep_ll_kernels):
Contributor

No ht kernels here either?

Contributor Author

No. The HT kernels aren't batched.

Comment on lines +1415 to +1406
do_naive_dispatch_combine: bool = (
self.dp_size > 1
and not self.moe_parallel_config.use_deepep_ht_kernels)
if do_naive_dispatch_combine:
Contributor

Another future TODO: put the naive dispatch/combine into a NaivePrepareAndFinalize class. (I was planning on doing this but thought I'd mention it for posterity.)

@@ -459,8 +462,10 @@ def __init__(self, quant_config: Fp8Config):
logger.warning_once(
"DeepGemm not supported on the current platform.")

self.topk_indices_dtype = None
Contributor

Is this initialized in the base class and set when select_gemm_impl is called? Maybe we can do away with this?

Contributor Author

There is actually an edge case - with DP=1 TP=2 and enable_expert_parallel, select_gemm_impl isn't called at all. I ran into this when testing.

Collaborator

@tlrmchlsmth left a comment

Looks good to me after the latest commit

@tlrmchlsmth added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Jun 3, 2025
Varun Sundar Rabindranath added 2 commits June 3, 2025 04:47
Signed-off-by: Varun <[email protected]>
Signed-off-by: Varun <[email protected]>
Varun added 3 commits June 3, 2025 05:39
Signed-off-by: Varun <[email protected]>
Signed-off-by: Varun <[email protected]>
@simon-mo merged commit fa98d77 into vllm-project:main on Jun 3, 2025
93 of 97 checks passed
Labels
ready (ONLY add when PR is ready to merge/full CI is needed), v1
4 participants