fp8 support #54

endurehero · 2025-02-28T14:37:17Z

Functionality

This PR adds FP8 WGMMA support built on FlashMLA's async pipeline design. The V transpose (TransV) draws on the SmemTranspose64x64 implementation in FlashAttention-3 (FA3).
Currently, Q/K/V support only symmetric per-tensor quantization. Because the values of P never exceed 1, P is quantized with a direct f32tofp8_cast.
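
To make the quantization scheme above concrete, here is a minimal pure-Python sketch of symmetric per-tensor quantization to FP8 e4m3, plus the scale-free cast used for values already in [0, 1]. It is an illustration under stated assumptions, not the kernel code from this PR: `round_to_e4m3`, `quantize_per_tensor`, and `dequantize` are hypothetical helper names, and the cast is simulated as a saturating round-to-nearest-even to the e4m3fn value grid (maximum finite value 448).

```python
import math

E4M3_MAX = 448.0  # largest finite value of the FP8 e4m3fn format


def round_to_e4m3(x: float) -> float:
    """Simulate a saturating round-to-nearest-even cast of x to e4m3 (assumed behavior)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    _, exp = math.frexp(mag)  # mag = m * 2**exp with 0.5 <= m < 1
    # 4 significant bits (1 implicit + 3 stored); subnormals share the step 2**-9.
    quantum = 2.0 ** (max(exp, -5) - 4)
    rounded = round(mag / quantum) * quantum  # Python's round() is half-to-even
    return sign * min(rounded, E4M3_MAX)


def quantize_per_tensor(values):
    """Symmetric per-tensor quantization: a single scale for the whole tensor."""
    amax = max(abs(v) for v in values)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Map quantized values back to the original range."""
    return [v * scale for v in quantized]


# Q/K/V path: one scale maps the largest magnitude onto the FP8 maximum.
q_vals = [0.02, -1.5, 3.0, 0.75]
q8, scale = quantize_per_tensor(q_vals)
recovered = dequantize(q8, scale)

# P path: softmax probabilities already lie in [0, 1], so no scale is applied.
p_vals = [0.0, 0.1, 0.5, 1.0]
p8 = [round_to_e4m3(p) for p in p_vals]
```

With only 3 mantissa bits, each normal value is reproduced to within a relative error of 2^-4, which is why a single outlier that inflates the per-tensor scale can push small values toward the coarse end of the grid.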

Performance

cuda driver version: 535.183.06
nvcc version: 12.8
torch version: 2.6

On the H20, MLA typically has high arithmetic intensity and is compute-bound, so Model FLOPs Utilization (MFU) is used as the performance metric.

On the H800, MLA is typically memory-bound, so Memory Bandwidth Utilization (MBU) is used to evaluate the kernel. There is still a lot of room for optimization on the H800, and I look forward to working on it together.
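
For readers unfamiliar with the two metrics, the following sketch shows how MFU and MBU can be computed from a measured kernel time. It is a generic accounting, not the benchmark code of this PR: the function names are hypothetical, the head dimensions 576 and 512 are assumed MLA shapes, and the timing and peak figures are placeholders that must be replaced with real measurements and datasheet values.

```python
def attention_flops(batch, s_q, s_k, heads, d_qk, d_v):
    """FLOPs of Q @ K^T plus P @ V, counting one multiply-add as 2 FLOPs."""
    return 2 * batch * s_q * s_k * heads * (d_qk + d_v)


def mfu(flops, seconds, peak_flops_per_s):
    """Fraction of the GPU's peak compute throughput that was achieved."""
    return flops / seconds / peak_flops_per_s


def mbu(bytes_moved, seconds, peak_bytes_per_s):
    """Fraction of the GPU's peak memory bandwidth that was achieved."""
    return bytes_moved / seconds / peak_bytes_per_s


# Assumed decode shapes (hypothetical): 128 sequences, 1 query token each,
# 4096 cached tokens, 128 query heads, d_qk = 576, d_v = 512.
flops = attention_flops(batch=128, s_q=1, s_k=4096, heads=128, d_qk=576, d_v=512)

# Placeholder values only; these are not measurements from this PR.
elapsed_s = 1.0e-3
peak_flops_per_s = 1.0e15
utilization = mfu(flops, elapsed_s, peak_flops_per_s)
```

MFU is the informative number when the kernel is compute-bound, and MBU when it is memory-bound; reporting the wrong one makes a well-tuned kernel look idle.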

Reproduction

python3 ./tests/test_flash_mla.py --dtype e4m3

csrc/fp8_transpose_v.h

sijiac · 2025-03-01T04:37:03Z

Awesome, would you mind adding a compile flag to save build time when FP8 is not needed? Thanks.

endurehero · 2025-03-01T07:08:15Z

> Awesome, would you mind adding a compile flag to save build time when FP8 is not needed? Thanks.

Of course. Already done.

beginlner · 2025-03-01T10:14:09Z

Great work! However, I can’t merge this PR at the moment because, based on our tests, per-sequence kvcache scaling significantly reduces accuracy for MLA.

endurehero · 2025-03-01T10:29:50Z

> Great work! However, I can’t merge this PR at the moment because, based on our tests, per-sequence kvcache scaling significantly reduces accuracy for MLA.

What about PerPageBlock granularity? I can easily adapt it.

beginlner · 2025-03-01T10:34:47Z

> What about PerPageBlock granularity? I can easily adapt it.

We think PerPageBlock is not enough either. kv_rope (64) needs to be bf16.

endurehero · 2025-03-01T11:07:27Z

> > What about PerPageBlock granularity? I can easily adapt it.
>
> We think PerPageBlock is not enough either. kv_rope (64) needs to be bf16.

Got it!

moses3017 · 2025-04-28T00:18:08Z

> > What about PerPageBlock granularity? I can easily adapt it.
>
> We think PerPageBlock is not enough either. kv_rope (64) needs to be bf16.

How about Qnope and Knope using 8-bit quantization, while Qrope and Krope maintain 16-bit data types?

beginlner · 2025-05-21T04:37:03Z

It's acceptable for Qnope and Knope to use per-(1 token × 128 channel) 8-bit quantization, while Qrope and Krope retain 16-bit precision.
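
A minimal sketch of the per-(1 token × 128 channel) layout described above: each token's nope channels are split into groups of 128, each group gets its own scale, and the rope channels pass through untouched. This is an illustration, not code from the repository: `quantize_token_groups` is a hypothetical name, the 512-channel nope width is an assumption (only the 64-channel rope width appears in this thread), and the final 8-bit cast is omitted.

```python
E4M3_MAX = 448.0  # largest finite value of the FP8 e4m3fn format


def quantize_token_groups(token, nope_dim=512, group=128):
    """Scale one token's nope channels per 128-channel group; leave rope as-is."""
    nope, rope = token[:nope_dim], token[nope_dim:]
    scaled, scales = [], []
    for start in range(0, nope_dim, group):
        chunk = nope[start:start + group]
        amax = max(abs(v) for v in chunk)
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        # Values now fit the e4m3 range; the actual 8-bit cast is omitted here.
        scaled.extend(v / scale for v in chunk)
    return scaled, scales, rope  # rope stays in 16-bit precision in practice


# One token with 512 nope + 64 rope channels and an outlier in the first group.
token = [0.01 * ((i % 7) - 3) for i in range(576)]
token[5] = 50.0
scaled, scales, rope = quantize_token_groups(token)
```

Because each group carries its own scale, the outlier at channel 5 coarsens only the first 128 channels of that token; the other three groups keep a scale matched to their own small values.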

shinezyy · 2025-05-21T11:38:47Z

> It's acceptable for Qnope and Knope to use per-(1 token × 128 channel) 8-bit quantization, while Qrope and Krope retain 16-bit precision.

The outliers in the RoPE cache are also discussed in this paper: https://arxiv.org/pdf/2502.01563

Can we add a Hadamard transform right after RoPE to distribute the outliers across multiple head dims? (https://arxiv.org/abs/2404.00456)
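
To illustrate the idea in this comment, the sketch below applies an orthonormal Hadamard transform to a 64-dim vector that has a single outlier channel. It is a toy example, not a proposal for the kernel: `hadamard_transform` and `dot` are hypothetical helpers, and the input vectors are made up. Because the transform is orthogonal, applying it to both the query and the key leaves their dot product unchanged while spreading the outlier's energy over all channels.

```python
import math


def hadamard_transform(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b  # butterfly step
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [v * norm for v in y]


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(u * v for u, v in zip(a, b))


# A 64-dim RoPE key with one large outlier channel, and an arbitrary query.
k_rope = [0.0] * 64
k_rope[3] = 16.0
q_rope = [0.25 * ((i % 5) - 2) for i in range(64)]

k_rot = hadamard_transform(k_rope)
q_rot = hadamard_transform(q_rope)

peak_before = max(abs(v) for v in k_rope)
peak_after = max(abs(v) for v in k_rot)
```

The smaller peak magnitude after the rotation means a quantization scale derived from the maximum wastes less of the 8-bit range, at the cost of an extra transform on both Q and K.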

TheTinyTeddy · 2025-06-03T11:41:15Z

> It's acceptable for Qnope and Knope to use per-(1 token × 128 channel) 8-bit quantization, while Qrope and Krope retain 16-bit precision.

What precision should S×V use: BF16×BF16, BF16×FP8, or FP8×FP8 with per-(1 token × 128 channel) scaling?

chenhongmin.will added 28 commits February 24, 2025 21:12

dae0690 init fp8
d833dbd enable fp8
b67a18f update gmem
fed0499 fp8 shared mem
7409203 enable fp8 compile
c50d29d fix compile
dfe8ffc enable fp8 api
8704188 add fp8 ut
ef644a5 update ut
4b314cd update fp8 api
f6fab1b change to use per_tensor
29de9e0 debug mode
59f6917 fix Vt illegal
6a4eb63 add transv barrier
6dcea49 add TransV
dbd8c30 fix sV
1757a6d try fix
d1689ab use mm1's Aregs instead of mma0's Cregs
855c985 use 64x64 transpose_v
1df91af fix compile
0337732 reorg
061af5f use fa'3 transv
fd1e662 fix mma0
bfe38ab fix combine
4e055a6 reorg ut
8b93985 enable scale
c7143a7 Merge branch 'main' into will_fp8_mr
9887a55 update readme

endurehero closed this Feb 28, 2025

endurehero changed the title ~~support fp8~~ fp8 support Feb 28, 2025

9028983 update ut

endurehero reopened this Feb 28, 2025

endurehero mentioned this pull request Feb 28, 2025

FP8 Support #56

Open

tridao reviewed Feb 28, 2025

csrc/fp8_transpose_v.h (review thread resolved)

6199b0b update desc

endurehero force-pushed the will_fp8_mr branch from 9c088fe to 6199b0b on February 28, 2025 23:54

7fafcd2 add env

endurehero force-pushed the will_fp8_mr branch from 1eddaa3 to 7fafcd2 on March 1, 2025 07:00

beginlner closed this Mar 11, 2025

josephrocca mentioned this pull request Jun 5, 2025

[Bug]: FlashMLA V1 with FP8 KV cache not yet supported! vllm-project/vllm#18887

Open

1 task
