FlashMLA from DeepSeek #892
Comments
I came here for it! @zhyncs was really fast
#887 — how about this? Compare vs https://github.com/deepseek-ai/FlashMLA ?
The pipeline design is a little bit different from #887; I'll check what we can learn from it.
@zhyncs @celsowm @MichoChan here are the results I got on H100 by running the latest flashinfer and FlashMLA mainline (higher is better); for flashinfer we use page_size=1 and FlashMLA uses page_size=64.
Here's my benchmark code and results on H100: https://gist.github.com/abcdabcd987/b215c5f00f4b5e8399b95d7933bcf475 https://docs.google.com/spreadsheets/d/1t0Txa7Ph9u7Su9LyWpS24vqr9A5FB-FyL0EZNpYOqwg/edit?gid=0#gid=0 Both are using page size 64. FlashMLA is faster in general, and way faster on small batch sizes.
As pointed out in #892 (comment), the second stage of split-k seems to have a huge overhead. This PR is the first step in addressing these issues, by changing the vector size from 4 to 8.
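For readers unfamiliar with what the vector-size change means in practice, here is a minimal, hypothetical CUDA sketch (not FlashInfer's actual kernel; the function name, layouts, and shapes are made up for illustration) of a split-k combine loop using vector size 8: each thread issues one 16-byte load (8 fp16 values) per split instead of an 8-byte load (4 values), halving the number of load instructions for the same data.

```cuda
// Hypothetical sketch of a split-k combine with vector size 8 (not FlashInfer's code).
// Each thread accumulates 8 contiguous fp16 output elements across all splits.
// The real combine also rescales each split by its softmax statistics; that is
// omitted here to keep the example focused on the memory-access width.
#include <cuda_fp16.h>

__global__ void combine_splits_vec8(const __half* __restrict__ partial,  // [num_splits, num_rows, d]
                                    __half* __restrict__ out,            // [num_rows, d]
                                    int num_splits, int num_rows, int d) {
  int row = blockIdx.x;          // one CTA per output row
  int col = threadIdx.x * 8;     // 8 fp16 elements per thread
  if (row >= num_rows || col >= d) return;

  float acc[8] = {0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f};
  for (int s = 0; s < num_splits; ++s) {
    // One 16-byte load = 8 halves (vector size 8); with vector size 4 this
    // would be an 8-byte (uint2) load and twice as many load instructions.
    uint4 v = *reinterpret_cast<const uint4*>(
        partial + ((size_t)s * num_rows + row) * d + col);
    const __half* h = reinterpret_cast<const __half*>(&v);
#pragma unroll
    for (int i = 0; i < 8; ++i) acc[i] += __half2float(h[i]);
  }
#pragma unroll
  for (int i = 0; i < 8; ++i) out[(size_t)row * d + col + i] = __float2half(acc[i]);
}

// Example launch (assuming d is a multiple of 8 and pointers are 16-byte aligned):
//   combine_splits_vec8<<<num_rows, d / 8>>>(partial, out, num_splits, num_rows, d);
```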
Hi @abcdabcd987, yes, I didn't profile the low-batch-size use cases, and I just realized we get low performance for small batch and long context. #894 alleviates the issue a little bit. Regarding the cases where qo_len * num_heads >= 128, the current flashinfer implementation is not good at this, because we prioritize
I found DeepSeek FlashMLA is much faster than flashinfer when q_head_num equals 128 (tp1): almost 100% faster when bs=32. But when q_head_num is 16, 32, or 64, it is only 10%-20% faster.
We will try out the FlashMLA-style warp specialization in the next release. Created an issue for performance tracking: #897
As observed in #892, we found flashinfer MLA's second stage of split-k is very slow (when batch size is small); this is because our scheduler only uses one CTA for the second stage of split-k. This PR fixes the issue.
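As a rough illustration of why one CTA is a bottleneck: the second stage of split-k merges the per-split partial outputs using their log-sum-exp (LSE) statistics, and there is one such merge per (request, head) output row, so the work parallelizes naturally across CTAs. Below is a hedged CUDA sketch (hypothetical names and layouts, not the actual FlashInfer scheduler or kernel) of a merge stage that assigns one CTA per output row instead of running everything in a single CTA.

```cuda
// Hypothetical sketch (not FlashInfer's code): merge split-k partial attention
// outputs, one CTA per output row. Each split s contributes a partial output
// o_s and its log-sum-exp lse_s; the merged output is sum_s softmax(lse)_s * o_s.
#include <math_constants.h>

__global__ void merge_splits(const float* __restrict__ partial_o,   // [num_splits, num_rows, d]
                             const float* __restrict__ partial_lse, // [num_splits, num_rows]
                             float* __restrict__ out,               // [num_rows, d]
                             int num_splits, int num_rows, int d) {
  int row = blockIdx.x;  // one CTA per (request, head) output row
  if (row >= num_rows) return;

  // First pass (redundantly per thread): running max and normalizer over splits.
  float m = -CUDART_INF_F, denom = 0.f;
  for (int s = 0; s < num_splits; ++s) {
    float lse = partial_lse[(size_t)s * num_rows + row];
    float new_m = fmaxf(m, lse);
    denom = denom * expf(m - new_m) + expf(lse - new_m);
    m = new_m;
  }

  // Second pass: weighted sum of partial outputs, parallel over the head dimension.
  for (int col = threadIdx.x; col < d; col += blockDim.x) {
    float acc = 0.f;
    for (int s = 0; s < num_splits; ++s) {
      float w = expf(partial_lse[(size_t)s * num_rows + row] - m) / denom;
      acc += w * partial_o[((size_t)s * num_rows + row) * d + col];
    }
    out[(size_t)row * d + col] = acc;
  }
}

// Example launch: merge_splits<<<num_rows, 128>>>(po, plse, out, num_splits, num_rows, d);
```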
Hello, I noticed the significant speed improvement in the latest test results, but the test script throws errors when running with the new version of FlashInfer. Do modifications need to be made to the test script?
@yanghailong-git can you report the error message?
When running this script https://gist.github.com/abcdabcd987/b215c5f00f4b5e8399b95d7933bcf475 with version v0.2.2.post1, I encountered the error below. How should I resolve this? Thanks.
Can you post the full error message as text instead? Some key information was clipped in your screenshot.
The detailed error is as follows:
@yanghailong-git #904 should fix it.
This PR implements #892 . Per the benchmark, the 2WG pipeline (FlashMLA's implementation) is faster than our current 3WG pipeline design on Hopper. While it remains under investigation where the gap comes from, we should implement the 2WG (and, in the future, 4WG) pipeline in FlashInfer to make sure our implementation does not get worse performance than FlashMLA.

## Performance

Before this PR:
```
Config: batch_size=64,  seq_len=1024, num_heads=64   Memory bandwidth: 1547.23 GB/s  FLOPs: 167.29 TFLOPs
Config: batch_size=64,  seq_len=1024, num_heads=128  Memory bandwidth: 1483.82 GB/s  FLOPs: 290.23 TFLOPs
Config: batch_size=128, seq_len=1024, num_heads=64   Memory bandwidth: 2238.72 GB/s  FLOPs: 242.06 TFLOPs
Config: batch_size=128, seq_len=1024, num_heads=128  Memory bandwidth: 1612.66 GB/s  FLOPs: 315.43 TFLOPs
Config: batch_size=768, seq_len=1024, num_heads=64   Memory bandwidth: 2821.32 GB/s  FLOPs: 305.05 TFLOPs
Config: batch_size=768, seq_len=1024, num_heads=128  Memory bandwidth: 1767.63 GB/s  FLOPs: 345.74 TFLOPs
Config: batch_size=64,  seq_len=2048, num_heads=64   Memory bandwidth: 1960.50 GB/s  FLOPs: 223.79 TFLOPs
Config: batch_size=64,  seq_len=2048, num_heads=128  Memory bandwidth: 1533.88 GB/s  FLOPs: 331.70 TFLOPs
Config: batch_size=128, seq_len=2048, num_heads=64   Memory bandwidth: 2546.83 GB/s  FLOPs: 290.72 TFLOPs
Config: batch_size=128, seq_len=2048, num_heads=128  Memory bandwidth: 1629.73 GB/s  FLOPs: 352.43 TFLOPs
Config: batch_size=768, seq_len=2048, num_heads=64   Memory bandwidth: 2820.22 GB/s  FLOPs: 321.93 TFLOPs
Config: batch_size=768, seq_len=2048, num_heads=128  Memory bandwidth: 1657.89 GB/s  FLOPs: 358.52 TFLOPs
Config: batch_size=64,  seq_len=8192, num_heads=64   Memory bandwidth: 2682.98 GB/s  FLOPs: 319.63 TFLOPs
Config: batch_size=64,  seq_len=8192, num_heads=128  Memory bandwidth: 1600.79 GB/s  FLOPs: 375.94 TFLOPs
Config: batch_size=128, seq_len=8192, num_heads=64   Memory bandwidth: 2803.48 GB/s  FLOPs: 333.98 TFLOPs
Config: batch_size=128, seq_len=8192, num_heads=128  Memory bandwidth: 1584.79 GB/s  FLOPs: 372.18 TFLOPs
Config: batch_size=768, seq_len=8192, num_heads=64   Memory bandwidth: 2768.36 GB/s  FLOPs: 329.80 TFLOPs
Config: batch_size=768, seq_len=8192, num_heads=128  Memory bandwidth: 1565.82 GB/s  FLOPs: 367.73 TFLOPs
```

After this PR:
```
Config: batch_size=64,  seq_len=1024, num_heads=64   Memory bandwidth: 1509.87 GB/s  FLOPs: 163.25 TFLOPs
Config: batch_size=64,  seq_len=1024, num_heads=128  Memory bandwidth: 1766.19 GB/s  FLOPs: 345.46 TFLOPs
Config: batch_size=128, seq_len=1024, num_heads=64   Memory bandwidth: 2307.97 GB/s  FLOPs: 249.55 TFLOPs
Config: batch_size=128, seq_len=1024, num_heads=128  Memory bandwidth: 1975.24 GB/s  FLOPs: 386.35 TFLOPs
Config: batch_size=768, seq_len=1024, num_heads=64   Memory bandwidth: 2871.63 GB/s  FLOPs: 310.49 TFLOPs
Config: batch_size=768, seq_len=1024, num_heads=128  Memory bandwidth: 2225.07 GB/s  FLOPs: 435.21 TFLOPs
Config: batch_size=64,  seq_len=2048, num_heads=64   Memory bandwidth: 1948.15 GB/s  FLOPs: 222.38 TFLOPs
Config: batch_size=64,  seq_len=2048, num_heads=128  Memory bandwidth: 1973.36 GB/s  FLOPs: 426.74 TFLOPs
Config: batch_size=128, seq_len=2048, num_heads=64   Memory bandwidth: 2625.63 GB/s  FLOPs: 299.72 TFLOPs
Config: batch_size=128, seq_len=2048, num_heads=128  Memory bandwidth: 2121.92 GB/s  FLOPs: 458.86 TFLOPs
Config: batch_size=768, seq_len=2048, num_heads=64   Memory bandwidth: 2996.11 GB/s  FLOPs: 342.01 TFLOPs
Config: batch_size=768, seq_len=2048, num_heads=128  Memory bandwidth: 2146.40 GB/s  FLOPs: 464.16 TFLOPs
Config: batch_size=64,  seq_len=8192, num_heads=64   Memory bandwidth: 2717.28 GB/s  FLOPs: 323.71 TFLOPs
Config: batch_size=64,  seq_len=8192, num_heads=128  Memory bandwidth: 2129.24 GB/s  FLOPs: 500.04 TFLOPs
Config: batch_size=128, seq_len=8192, num_heads=64   Memory bandwidth: 3002.75 GB/s  FLOPs: 357.72 TFLOPs
Config: batch_size=128, seq_len=8192, num_heads=128  Memory bandwidth: 2101.93 GB/s  FLOPs: 493.63 TFLOPs
Config: batch_size=768, seq_len=8192, num_heads=64   Memory bandwidth: 3083.42 GB/s  FLOPs: 367.33 TFLOPs
Config: batch_size=768, seq_len=8192, num_heads=128  Memory bandwidth: 2064.96 GB/s  FLOPs: 484.95 TFLOPs
```

## Note

1. The profiler is broken (we changed the pipeline structure; it will be added back in later PRs).
2. There is still room for improvement in the pipeline design, e.g. we can prefetch the next tile's first kv-cache load, which could further improve performance; we leave this for future work.
   ![pipeline diagram](https://github.com/user-attachments/assets/e84b1d55-3361-48a1-b339-97837cb97bfb)
3. Synchronization is still sub-optimal: we insert a `__syncthreads()` in each iteration to guarantee correctness. Can we further improve this?
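To make note 3 concrete, here is a heavily simplified, hypothetical CUDA sketch of a 2-warpgroup producer/consumer pipeline with a double-buffered shared-memory tile and a single `__syncthreads()` per iteration. It is not the actual FlashInfer/FlashMLA kernel (the real kernel uses Hopper warpgroup MMAs, TMA, and mbarriers, none of which appear here); all names and the toy reduction are invented for illustration.

```cuda
// Hypothetical 2-warpgroup (2WG) producer/consumer sketch (not the real kernel).
// Warpgroup 0 loads the next KV tile into shared memory while warpgroup 1
// computes on the current one; one __syncthreads() per iteration keeps the
// double buffer consistent.
constexpr int kTile = 1024;     // elements per KV tile (toy size)
constexpr int kThreads = 256;   // 2 warpgroups x 128 threads

__global__ void two_wg_pipeline(const float* __restrict__ kv, float* __restrict__ out,
                                int num_tiles) {
  __shared__ float smem[2][kTile];      // double buffer
  int wg = threadIdx.x / 128;           // 0 = producer, 1 = consumer
  int lane = threadIdx.x % 128;
  float acc = 0.f;

  // Prologue: producer fills buffer 0 with tile 0.
  if (wg == 0) {
    for (int i = lane; i < kTile; i += 128) smem[0][i] = kv[i];
  }
  __syncthreads();

  for (int t = 0; t < num_tiles; ++t) {
    int cur = t & 1, nxt = (t + 1) & 1;
    if (wg == 0 && t + 1 < num_tiles) {
      // Producer: prefetch the next tile into the other buffer.
      for (int i = lane; i < kTile; i += 128)
        smem[nxt][i] = kv[(size_t)(t + 1) * kTile + i];
    } else if (wg == 1) {
      // Consumer: "compute" on the current tile (a toy reduction here).
      for (int i = lane; i < kTile; i += 128) acc += smem[cur][i];
    }
    // One barrier per iteration: the consumer finishes with `cur` before the
    // producer overwrites it two iterations later, and the producer's write to
    // `nxt` becomes visible before the consumer reads it next iteration.
    __syncthreads();
  }

  if (wg == 1) out[lane] = acc;  // partial sums; a real kernel would reduce further
}
```

With only the per-iteration barrier, the producer's load of tile t+1 overlaps the consumer's compute on tile t, which is the overlap the 2WG design buys; finer-grained synchronization (e.g. mbarriers) would let the producer run further ahead, which is what note 3 hints at.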
Done in #952 . There are some slight discrepancies between the current flashinfer MLA implementation (we support any page size and any query length) and FlashMLA's, but the pipeline structure is the same now.
as titled
ref https://github.com/deepseek-ai/FlashMLA