fp8 support #54
Awesome. Would you mind adding a compile flag to save time when FP8 is not needed? Thanks.
Of course. Already done.
Great work! However, I can't merge this PR at the moment because, based on our tests, per-sequence kvcache scaling significantly reduces accuracy for MLA.
What about PerPageBlock granularity? I can easily adapt to it.
We think PerPageBlock is not enough either. kv_rope (64) needs to be bf16.
Got it!
How about Qnope and Knope using 8-bit quantization, while Qrope and Krope keep 16-bit data types?
It's acceptable for Qnope and Knope to use per-(1 token × 128 channel) 8-bit quantization, while Qrope and Krope retain 16-bit precision.
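To make the per-(1 token × 128 channel) granularity concrete, here is a minimal pure-Python sketch, not the kernel's code. It assumes FP8 means E4M3 (largest finite value 448), a 512-channel nope vector per token, and a scale of amax / 448 per group; `round_to_e4m3`, `quantize_groups`, and `dequantize_groups` are hypothetical helper names.

```python
import math

E4M3_MAX = 448.0  # largest finite value of FP8 E4M3 (assumed format)

def round_to_e4m3(x: float) -> float:
    """Simulate a saturating round-to-nearest cast of x to FP8 E4M3."""
    if x == 0.0:
        return 0.0
    mag = min(abs(x), E4M3_MAX)
    # floor(log2(mag)), clamped to the smallest normal exponent (-6)
    exp = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits give 8 steps per binade
    return math.copysign(min(round(mag / step) * step, E4M3_MAX), x)

def quantize_groups(row, group_size=128):
    """Quantize one token's channels with one scale per channel group."""
    codes, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        amax = max(abs(v) for v in group)
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        codes.extend(round_to_e4m3(v / scale) for v in group)
    return codes, scales

def dequantize_groups(codes, scales, group_size=128):
    """Map FP8 codes back to real values using each group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

# One token with 512 channels (assumed nope width), i.e. 4 groups of 128.
row = [math.sin(i) for i in range(512)]
row[5] = 100.0  # an outlier only inflates the scale of its own group
codes, scales = quantize_groups(row)
restored = dequantize_groups(codes, scales)
```

With a per-sequence or per-tensor scale, the single outlier would coarsen the quantization grid for every channel; here only the first 128-channel group pays for it.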
The outliers in the RoPE cache are also discussed in this paper: https://arxiv.org/pdf/2502.01563. Can we add a Hadamard transform right after RoPE to distribute the outliers across multiple head dims? (https://arxiv.org/abs/2404.00456)
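As a rough illustration of that idea (a sketch only, not something this PR implements), the orthonormal fast Walsh-Hadamard transform below spreads a single large channel over all 64 dims. Because the transform is orthogonal, applying it to both Q and K after RoPE leaves their dot product unchanged. `hadamard_transform` is a hypothetical helper, and the input values are made up.

```python
import math

def hadamard_transform(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(vec)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    out = list(vec)
    h = 1
    while h < n:
        # Butterfly step: combine pairs of elements that are h apart.
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]

# A 64-dim rope vector with one outlier channel (illustrative values).
k = [0.1] * 64
k[7] = 50.0
q = [math.cos(i) for i in range(64)]

k_rot = hadamard_transform(k)
q_rot = hadamard_transform(q)

max_before = max(abs(v) for v in k)      # dominated by the outlier
max_after = max(abs(v) for v in k_rot)   # outlier energy is spread out
score_before = sum(a * b for a, b in zip(q, k))
score_after = sum(a * b for a, b in zip(q_rot, k_rot))
```

The smaller dynamic range after rotation is what would make the rope part friendlier to 8-bit quantization; whether the accuracy is sufficient for MLA would still need to be measured.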
What precision should S×V use? BF16×BF16, BF16×FP8, or FP8×FP8 per-(1 token × 128 channel)?
Functionality
Adds FP8 WGMMA support built on FlashMLA's async pipeline design. The TransV part draws on the SmemTranspose64x64 implementation in FA3.
Currently, Q/K/V only support symmetric PerTensor quantization. Since the maximum value of P does not exceed 1, P is quantized directly with f32tofp8_cast.
Performance
On the H20, MLA typically has high arithmetic intensity and is compute-bound, so Model FLOPs Utilization (MFU) is used as the performance metric.

On the H800, MLA is typically memory-bound, so Memory Bandwidth Utilization (MBU) is used to evaluate the kernel's performance. There is still a lot of room for optimization on the H800. I look forward to working on it together.
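For reference, MFU and MBU can be estimated from a measured kernel time roughly as sketched below. The cost model is an assumption, not taken from this PR: it counts the QK^T and PV matmuls with head dims 576 and 512, one shared KV head, and the same element size for Q, KV cache, and output. The peak numbers in the usage lines are placeholders, not real H20 or H800 specifications. `mla_decode_cost` and `utilization` are hypothetical names.

```python
def mla_decode_cost(batch, s_q, s_k, h_q, h_kv=1, d_qk=576, d_v=512,
                    bytes_per_elem=1):
    """Return (flops, bytes_moved) for one MLA decode call (assumed cost model)."""
    # Two matmuls per query head: Q*K^T over d_qk and P*V over d_v.
    flops = 2 * batch * s_q * h_q * s_k * (d_qk + d_v)
    # Read the KV cache and Q, write O; every tensor uses bytes_per_elem here.
    elems = (batch * s_k * h_kv * d_qk
             + batch * s_q * h_q * d_qk
             + batch * s_q * h_q * d_v)
    return flops, elems * bytes_per_elem

def utilization(flops, bytes_moved, seconds, peak_flops, peak_bytes_per_s):
    """Return (mfu, mbu) as fractions of the given hardware peaks."""
    mfu = flops / seconds / peak_flops
    mbu = bytes_moved / seconds / peak_bytes_per_s
    return mfu, mbu

flops, bytes_moved = mla_decode_cost(batch=1, s_q=1, s_k=1024, h_q=128)
intensity = flops / bytes_moved  # FLOPs per byte; high values favor MFU

# Placeholder peaks and timing, for illustration only.
mfu, mbu = utilization(flops, bytes_moved, seconds=1e-5,
                       peak_flops=1e15, peak_bytes_per_s=1e12)
```

Comparing `intensity` with a GPU's ratio of peak FLOP/s to peak bytes/s indicates which of the two metrics is the meaningful one on that device.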

Reproduction