perf: Use 2WG pipeline design for MLA implementation on Hopper #952


Merged: 20 commits merged into flashinfer-ai:main on Mar 29, 2025

Conversation

@yzh119 (Collaborator) commented Mar 17, 2025

This PR implements #892.

Per our benchmarks, the 2WG pipeline (FlashMLA's design) is faster than our current 3WG pipeline design on Hopper. While it is still under investigation where the gap comes from, we should implement the 2WG (and, in the future, 4WG) pipeline in FlashInfer to make sure our implementation does not fall behind FlashMLA.

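To make the terminology concrete, below is a minimal, hypothetical sketch of a warpgroup-specialized mainloop on Hopper: the block runs two warpgroups (2WG) that share the work of one KV tile per iteration, with block-wide barriers separating the load and compute stages. The helper names and the exact work split are assumptions for illustration; the real FlashInfer/FlashMLA kernels use TMA copies, WGMMA, and online-softmax state that are omitted here.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// One warpgroup is 4 warps (128 threads) on Hopper; a 2WG block has 256 threads.
constexpr int kWarpgroupSize = 128;
constexpr int kNumWarpgroups = 2;

__global__ void mla_2wg_skeleton(const __half* __restrict__ kv_cache,
                                 __half* __restrict__ out, int num_kv_tiles) {
  const int wg_idx = threadIdx.x / kWarpgroupSize;  // 0 or 1

  extern __shared__ __half smem_kv[];  // staging buffer for the current KV tile

  for (int tile = 0; tile < num_kv_tiles; ++tile) {
    // Stage 1: both warpgroups cooperatively stage this KV tile into shared
    // memory (the real kernel issues asynchronous TMA copies instead).
    // load_kv_tile(smem_kv, kv_cache, tile, wg_idx);   // hypothetical helper

    __syncthreads();  // the tile must be fully resident before compute starts

    // Stage 2: each warpgroup computes its slice of q*k^T, the online-softmax
    // update, and p*v for this tile (WGMMA in the real kernel).
    // compute_attention_slice(out, smem_kv, wg_idx);   // hypothetical helper

    __syncthreads();  // the per-iteration barrier mentioned in the notes below
  }
}
```
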
Performance

Before this PR:

```
Config: batch_size=64, seq_len=1024, num_heads=64
Memory bandwidth: 1547.23 GB/s
FLOPs: 167.29 TFLOPs
Config: batch_size=64, seq_len=1024, num_heads=128
Memory bandwidth: 1483.82 GB/s
FLOPs: 290.23 TFLOPs
Config: batch_size=128, seq_len=1024, num_heads=64
Memory bandwidth: 2238.72 GB/s
FLOPs: 242.06 TFLOPs
Config: batch_size=128, seq_len=1024, num_heads=128
Memory bandwidth: 1612.66 GB/s
FLOPs: 315.43 TFLOPs
Config: batch_size=768, seq_len=1024, num_heads=64
Memory bandwidth: 2821.32 GB/s
FLOPs: 305.05 TFLOPs
Config: batch_size=768, seq_len=1024, num_heads=128
Memory bandwidth: 1767.63 GB/s
FLOPs: 345.74 TFLOPs
Config: batch_size=64, seq_len=2048, num_heads=64
Memory bandwidth: 1960.50 GB/s
FLOPs: 223.79 TFLOPs
Config: batch_size=64, seq_len=2048, num_heads=128
Memory bandwidth: 1533.88 GB/s
FLOPs: 331.70 TFLOPs
Config: batch_size=128, seq_len=2048, num_heads=64
Memory bandwidth: 2546.83 GB/s
FLOPs: 290.72 TFLOPs
Config: batch_size=128, seq_len=2048, num_heads=128
Memory bandwidth: 1629.73 GB/s
FLOPs: 352.43 TFLOPs
Config: batch_size=768, seq_len=2048, num_heads=64
Memory bandwidth: 2820.22 GB/s
FLOPs: 321.93 TFLOPs
Config: batch_size=768, seq_len=2048, num_heads=128
Memory bandwidth: 1657.89 GB/s
FLOPs: 358.52 TFLOPs
Config: batch_size=64, seq_len=8192, num_heads=64
Memory bandwidth: 2682.98 GB/s
FLOPs: 319.63 TFLOPs
Config: batch_size=64, seq_len=8192, num_heads=128
Memory bandwidth: 1600.79 GB/s
FLOPs: 375.94 TFLOPs
Config: batch_size=128, seq_len=8192, num_heads=64
Memory bandwidth: 2803.48 GB/s
FLOPs: 333.98 TFLOPs
Config: batch_size=128, seq_len=8192, num_heads=128
Memory bandwidth: 1584.79 GB/s
FLOPs: 372.18 TFLOPs
Config: batch_size=768, seq_len=8192, num_heads=64
Memory bandwidth: 2768.36 GB/s
FLOPs: 329.80 TFLOPs
Config: batch_size=768, seq_len=8192, num_heads=128
Memory bandwidth: 1565.82 GB/s
FLOPs: 367.73 TFLOPs
```

After this PR:

```
Config: batch_size=64, seq_len=1024, num_heads=64
Memory bandwidth: 1509.87 GB/s
FLOPs: 163.25 TFLOPs
Config: batch_size=64, seq_len=1024, num_heads=128
Memory bandwidth: 1766.19 GB/s
FLOPs: 345.46 TFLOPs
Config: batch_size=128, seq_len=1024, num_heads=64
Memory bandwidth: 2307.97 GB/s
FLOPs: 249.55 TFLOPs
Config: batch_size=128, seq_len=1024, num_heads=128
Memory bandwidth: 1975.24 GB/s
FLOPs: 386.35 TFLOPs
Config: batch_size=768, seq_len=1024, num_heads=64
Memory bandwidth: 2871.63 GB/s
FLOPs: 310.49 TFLOPs
Config: batch_size=768, seq_len=1024, num_heads=128
Memory bandwidth: 2225.07 GB/s
FLOPs: 435.21 TFLOPs
Config: batch_size=64, seq_len=2048, num_heads=64
Memory bandwidth: 1948.15 GB/s
FLOPs: 222.38 TFLOPs
Config: batch_size=64, seq_len=2048, num_heads=128
Memory bandwidth: 1973.36 GB/s
FLOPs: 426.74 TFLOPs
Config: batch_size=128, seq_len=2048, num_heads=64
Memory bandwidth: 2625.63 GB/s
FLOPs: 299.72 TFLOPs
Config: batch_size=128, seq_len=2048, num_heads=128
Memory bandwidth: 2121.92 GB/s
FLOPs: 458.86 TFLOPs
Config: batch_size=768, seq_len=2048, num_heads=64
Memory bandwidth: 2996.11 GB/s
FLOPs: 342.01 TFLOPs
Config: batch_size=768, seq_len=2048, num_heads=128
Memory bandwidth: 2146.40 GB/s
FLOPs: 464.16 TFLOPs
Config: batch_size=64, seq_len=8192, num_heads=64
Memory bandwidth: 2717.28 GB/s
FLOPs: 323.71 TFLOPs
Config: batch_size=64, seq_len=8192, num_heads=128
Memory bandwidth: 2129.24 GB/s
FLOPs: 500.04 TFLOPs
Config: batch_size=128, seq_len=8192, num_heads=64
Memory bandwidth: 3002.75 GB/s
FLOPs: 357.72 TFLOPs
Config: batch_size=128, seq_len=8192, num_heads=128
Memory bandwidth: 2101.93 GB/s
FLOPs: 493.63 TFLOPs
Config: batch_size=768, seq_len=8192, num_heads=64
Memory bandwidth: 3083.42 GB/s
FLOPs: 367.33 TFLOPs
Config: batch_size=768, seq_len=8192, num_heads=128
Memory bandwidth: 2064.96 GB/s
FLOPs: 484.95 TFLOPs
```

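For reference, the bandwidth and FLOPs figures above can be roughly reproduced from each config alone, assuming fp16 storage and DeepSeek-style MLA head dimensions (head_dim_ckv = 512, head_dim_kpe = 64); the exact accounting in the benchmark script may differ. A back-of-the-envelope model:

```cuda
#include <cstdio>

int main() {
  // Example config from the tables above; the elapsed time is hypothetical.
  const double batch = 768, seq_len = 8192, num_heads = 64;
  const double d_ckv = 512, d_kpe = 64;     // assumed MLA head dimensions
  const double elapsed_s = 2.35e-3;         // hypothetical measured kernel time

  // Bytes moved (2 bytes/element): compressed KV cache, queries, outputs.
  const double bytes = 2.0 * (batch * seq_len * (d_ckv + d_kpe)      // KV cache
                              + batch * num_heads * (d_ckv + d_kpe)  // q
                              + batch * num_heads * d_ckv);          // o
  // FLOPs (2 per multiply-add): q*k^T over d_ckv + d_kpe, then p*v over d_ckv.
  const double flops = 2.0 * batch * num_heads * seq_len * ((d_ckv + d_kpe) + d_ckv);

  // With the 2.35 ms guess this prints ~3.1 TB/s and ~373 TFLOPs, in the same
  // ballpark as the batch_size=768, seq_len=8192, num_heads=64 row above.
  printf("Memory bandwidth: %.2f GB/s\n", bytes / elapsed_s / 1e9);
  printf("FLOPs: %.2f TFLOPs\n", flops / elapsed_s / 1e12);
  return 0;
}
```
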
Note

  1. The profiler is broken (we changed the pipeline structure and will add it back in later PRs).
  2. There is still room for improvement in the pipeline design: e.g., we could prefetch the next tile's first KV-cache load, which should further improve performance. We leave this for future work.
  3. Synchronization is still sub-optimal: we insert a `__syncthreads()` in each iteration to guarantee correctness. Can we further improve this? See the sketch after this list.

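As a rough illustration of notes 2 and 3, the sketch below software-pipelines the KV loads with `cuda::pipeline`: the copy of tile `i+1` is issued while tile `i` is being consumed, and the block-wide `__syncthreads()` is replaced by stage-scoped acquire/release waits. The buffer layout and names are assumptions for illustration, not the kernel's actual code (which would more likely use TMA and mbarriers).

```cuda
#include <cooperative_groups.h>
#include <cuda/pipeline>

__global__ void mla_prefetch_skeleton(const float4* __restrict__ kv_cache,
                                      int num_kv_tiles, int tile_elems) {
  // Two shared-memory stages: [0, tile_elems) and [tile_elems, 2 * tile_elems).
  extern __shared__ float4 smem[];
  auto block = cooperative_groups::this_thread_block();

  __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 2> state;
  auto pipe = cuda::make_pipeline(block, &state);

  // Prefetch tile 0 before entering the mainloop.
  pipe.producer_acquire();
  cuda::memcpy_async(block, smem, kv_cache, sizeof(float4) * tile_elems, pipe);
  pipe.producer_commit();

  for (int tile = 0; tile < num_kv_tiles; ++tile) {
    // Issue the copy of tile + 1 into the other shared-memory stage so that it
    // overlaps with the compute on the current tile.
    if (tile + 1 < num_kv_tiles) {
      pipe.producer_acquire();
      cuda::memcpy_async(block, smem + ((tile + 1) % 2) * tile_elems,
                         kv_cache + (size_t)(tile + 1) * tile_elems,
                         sizeof(float4) * tile_elems, pipe);
      pipe.producer_commit();
    }

    // Wait only for the tile we are about to consume, compute on it, then
    // release its stage so the next copy can reuse the buffer.
    pipe.consumer_wait();
    // compute_attention_on(smem + (tile % 2) * tile_elems);  // hypothetical helper
    pipe.consumer_release();
  }
}
```

Whether this actually beats the current per-iteration `__syncthreads()` would need to be measured; the note above leaves it as future work.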
@yzh119 changed the title from "[WIP] Use 2WG pipeline design for MLA implementation on Hopper" to "perf: Use 2WG pipeline design for MLA implementation on Hopper" on Mar 26, 2025
@yzh119 marked this pull request as ready for review on March 26, 2025 08:59
@yzh119 merged commit 60d37b7 into flashinfer-ai:main on Mar 29, 2025
2 checks passed
yzh119 added a commit that referenced this pull request Mar 31, 2025
Follow-up of #952.

cc @abcdabcd987 

## Before this PR
```
Config: batch_size=64, seq_len=1024, num_heads=64
Memory bandwidth: 1509.87 GB/s
FLOPs: 163.25 TFLOPs
Config: batch_size=64, seq_len=1024, num_heads=128
Memory bandwidth: 1766.19 GB/s
FLOPs: 345.46 TFLOPs
Config: batch_size=128, seq_len=1024, num_heads=64
Memory bandwidth: 2307.97 GB/s
FLOPs: 249.55 TFLOPs
Config: batch_size=128, seq_len=1024, num_heads=128
Memory bandwidth: 1975.24 GB/s
FLOPs: 386.35 TFLOPs
Config: batch_size=768, seq_len=1024, num_heads=64
Memory bandwidth: 2871.63 GB/s
FLOPs: 310.49 TFLOPs
Config: batch_size=768, seq_len=1024, num_heads=128
Memory bandwidth: 2225.07 GB/s
FLOPs: 435.21 TFLOPs
Config: batch_size=64, seq_len=2048, num_heads=64
Memory bandwidth: 1948.15 GB/s
FLOPs: 222.38 TFLOPs
Config: batch_size=64, seq_len=2048, num_heads=128
Memory bandwidth: 1973.36 GB/s
FLOPs: 426.74 TFLOPs
Config: batch_size=128, seq_len=2048, num_heads=64
Memory bandwidth: 2625.63 GB/s
FLOPs: 299.72 TFLOPs
Config: batch_size=128, seq_len=2048, num_heads=128
Memory bandwidth: 2121.92 GB/s
FLOPs: 458.86 TFLOPs
Config: batch_size=768, seq_len=2048, num_heads=64
Memory bandwidth: 2996.11 GB/s
FLOPs: 342.01 TFLOPs
Config: batch_size=768, seq_len=2048, num_heads=128
Memory bandwidth: 2146.40 GB/s
FLOPs: 464.16 TFLOPs
Config: batch_size=64, seq_len=8192, num_heads=64
Memory bandwidth: 2717.28 GB/s
FLOPs: 323.71 TFLOPs
Config: batch_size=64, seq_len=8192, num_heads=128
Memory bandwidth: 2129.24 GB/s
FLOPs: 500.04 TFLOPs
Config: batch_size=128, seq_len=8192, num_heads=64
Memory bandwidth: 3002.75 GB/s
FLOPs: 357.72 TFLOPs
Config: batch_size=128, seq_len=8192, num_heads=128
Memory bandwidth: 2101.93 GB/s
FLOPs: 493.63 TFLOPs
Config: batch_size=768, seq_len=8192, num_heads=64
Memory bandwidth: 3083.42 GB/s
FLOPs: 367.33 TFLOPs
Config: batch_size=768, seq_len=8192, num_heads=128
Memory bandwidth: 2064.96 GB/s
FLOPs: 484.95 TFLOPs
```

## After this PR
```
Config: batch_size=64, seq_len=1024, num_heads=64
Memory bandwidth: 1596.98 GB/s
FLOPs: 172.67 TFLOPs
Config: batch_size=64, seq_len=1024, num_heads=128
Memory bandwidth: 1685.22 GB/s
FLOPs: 329.62 TFLOPs
Config: batch_size=128, seq_len=1024, num_heads=64
Memory bandwidth: 2280.49 GB/s
FLOPs: 246.58 TFLOPs
Config: batch_size=128, seq_len=1024, num_heads=128
Memory bandwidth: 1917.53 GB/s
FLOPs: 375.06 TFLOPs
Config: batch_size=768, seq_len=1024, num_heads=64
Memory bandwidth: 2869.03 GB/s
FLOPs: 310.21 TFLOPs
Config: batch_size=768, seq_len=1024, num_heads=128
Memory bandwidth: 2208.35 GB/s
FLOPs: 431.94 TFLOPs
Config: batch_size=64, seq_len=2048, num_heads=64
Memory bandwidth: 2047.44 GB/s
FLOPs: 233.72 TFLOPs
Config: batch_size=64, seq_len=2048, num_heads=128
Memory bandwidth: 1936.08 GB/s
FLOPs: 418.67 TFLOPs
Config: batch_size=128, seq_len=2048, num_heads=64
Memory bandwidth: 2617.48 GB/s
FLOPs: 298.79 TFLOPs
Config: batch_size=128, seq_len=2048, num_heads=128
Memory bandwidth: 2105.97 GB/s
FLOPs: 455.41 TFLOPs
Config: batch_size=768, seq_len=2048, num_heads=64
Memory bandwidth: 2999.55 GB/s
FLOPs: 342.40 TFLOPs
Config: batch_size=768, seq_len=2048, num_heads=128
Memory bandwidth: 2181.54 GB/s
FLOPs: 471.75 TFLOPs
Config: batch_size=64, seq_len=8192, num_heads=64
Memory bandwidth: 2780.86 GB/s
FLOPs: 331.29 TFLOPs
Config: batch_size=64, seq_len=8192, num_heads=128
Memory bandwidth: 2176.12 GB/s
FLOPs: 511.05 TFLOPs
Config: batch_size=128, seq_len=8192, num_heads=64
Memory bandwidth: 3031.58 GB/s
FLOPs: 361.15 TFLOPs
Config: batch_size=128, seq_len=8192, num_heads=128
Memory bandwidth: 2165.73 GB/s
FLOPs: 508.61 TFLOPs
Config: batch_size=768, seq_len=8192, num_heads=64
Memory bandwidth: 3126.37 GB/s
FLOPs: 372.45 TFLOPs
Config: batch_size=768, seq_len=8192, num_heads=128
Memory bandwidth: 2142.42 GB/s
FLOPs: 503.14 TFLOPs
```
MasterJH5574 pushed a commit that referenced this pull request Apr 10, 2025
Follow-up of #952: this PR adds the instrumentation code to profile the MLA Hopper implementation (fixes #995).