[Bug] MLA kernel fails the tests in tests/test_deepseek_mla.py #949

Closed
Atream opened this issue Mar 17, 2025 · 7 comments
Atream (Contributor) commented Mar 17, 2025

The MLA kernel fails the tests in tests/test_deepseek_mla.py. On the current main branch (commit 27906fd) the unit tests do not pass, and the output in the integrated system is also abnormal. After reverting to the previous commit 061db55, everything works fine.

Mismatched elements: 8318200 / 8388608 (99.2%)
Greatest absolute difference: 1.974609375 at index (0, 44, 501) (up to 0.001 allowed)
Greatest relative difference: inf at index (0, 18, 121) (up to 0.001 allowed)

Environment

RTX 4090, CUDA 12.4, torch 2.5.1
Failures occur in test_batch_mla_varlen_page_attention and test_batch_mla_page_attention with BFloat16.
To test on the 4090, I removed the `if not is_sm90a_supported(torch.device("cuda"))` check.
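
For context, a minimal way to inspect the local GPU's compute capability (a sketch using only standard torch calls; the RTX 4090 is sm89, while the guard above checks for sm90a):

```python
import torch

def compute_capability(device: torch.device = torch.device("cuda")) -> tuple:
    # Returns (major, minor): (8, 9) for an RTX 4090 (sm89), (9, 0) for Hopper (sm90/sm90a).
    return torch.cuda.get_device_capability(device)

if __name__ == "__main__":
    major, minor = compute_capability()
    print(f"sm{major}{minor}")  # prints "sm89" on an RTX 4090
```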

yzh119 (Collaborator) commented Mar 17, 2025

Thanks for reporting the issue. I'll fix it soon and add MLA unittests to CI.

yzh119 (Collaborator) commented Mar 17, 2025

Hi @Atream, I can't reproduce the issue. Can you show me the exact test case (the batch_size/kv_len/qo_len/etc. in test_batch_mla_page_attention) that generates the wrong outputs:

> Mismatched elements: 8318200 / 8388608 (99.2%)
> Greatest absolute difference: 1.974609375 at index (0, 44, 501) (up to 0.001 allowed)
> Greatest relative difference: inf at index (0, 18, 121) (up to 0.001 allowed)

> To test on the 4090, I removed the `if not is_sm90a_supported(torch.device("cuda"))` check.

The 4090 does not support fa3, which relies on wgmma instructions that are only available on sm90a (the 4090 is sm89); you can try the fa2 backend in this case.
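
A minimal sketch of the backend selection this suggests (assuming `is_sm90a_supported` is importable from `flashinfer.utils`, as in the test file; the fallback logic itself is only an illustration, not library code):

```python
import torch
from flashinfer.utils import is_sm90a_supported  # assumed import path, as used in the tests

def pick_mla_backend(device: torch.device = torch.device("cuda")) -> str:
    # fa3 relies on Hopper-only wgmma instructions (sm90a);
    # on sm89 parts such as the RTX 4090, fall back to fa2.
    return "fa3" if is_sm90a_supported(device) else "fa2"

print(pick_mla_backend())  # expected to print "fa2" on an RTX 4090
```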

Atream (Contributor, Author) commented Mar 17, 2025

> Hi @Atream, I can't reproduce the issue. Can you show me the exact test case (the batch_size/kv_len/qo_len/etc. in test_batch_mla_page_attention) that generates the wrong outputs:
>
> > Mismatched elements: 8318200 / 8388608 (99.2%)
> > Greatest absolute difference: 1.974609375 at index (0, 44, 501) (up to 0.001 allowed)
> > Greatest relative difference: inf at index (0, 18, 121) (up to 0.001 allowed)
>
> > To test on the 4090, I removed the `if not is_sm90a_supported(torch.device("cuda"))` check.
>
> The 4090 does not support fa3, which relies on wgmma instructions that are only available on sm90a (the 4090 is sm89); you can try the fa2 backend in this case.

I ran this:

test_batch_mla_page_attention(1, 1024, 128, 128, False, 1, "fa2", True, torch.bfloat16)

yzh119 (Collaborator) commented Mar 17, 2025

This should have been fixed in #951; you can check the unittest status at https://ci.tlcpack.ai/blue/organizations/jenkins/flashinfer-ci/detail/PR-951/2/pipeline (GPU-G5-Test-4).

Atream (Contributor, Author) commented Mar 17, 2025

I tested in my environment.

test_batch_mla_page_attention(1, 1024, 128, 128, True, 1, "fa2", True, torch.bfloat16)

Mismatched elements: 33698 / 8388608 (0.4%)
Greatest absolute difference: 0.0078125 at index (0, 123, 162) (up to 0.001 allowed)
Greatest relative difference: 0.048828125 at index (30, 32, 363) (up to 0.001 allowed)
test_batch_mla_varlen_page_attention(1, 65, 65, 65, 1, 128, True, 64, "fa2", torch.bfloat16)
Mismatched elements: 7082 / 65536 (10.8%)
Greatest absolute difference: 0.015625 at index (0, 108, 276) (up to 0.001 allowed)
Greatest relative difference: 18.75 at index (0, 119, 158) (up to 0.001 allowed)

yzh119 (Collaborator) commented Mar 17, 2025

Hi @Atream, that's expected because the original atol and rtol were designed for fp16. bf16 inherently has larger errors (as studied in https://arxiv.org/abs/2405.02803); usually we can tolerate a 2e-2 difference in bf16 unittests.

For bf16 unittests, we need to increase the atol and rtol accordingly.
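
A minimal sketch of dtype-dependent tolerances (the 2e-2 figure comes from the comment above; the helper name and the use of torch.testing.assert_close are illustrative, not necessarily how the test file implements the check):

```python
import torch

def assert_close_by_dtype(out: torch.Tensor, ref: torch.Tensor) -> None:
    # bf16 has fewer mantissa bits than fp16, so comparisons need looser tolerances.
    if out.dtype == torch.bfloat16:
        rtol, atol = 2e-2, 2e-2   # tolerance suggested for bf16 in the comment above
    else:
        rtol, atol = 1e-3, 1e-3   # the fp16-oriented tolerances seen in the report above
    torch.testing.assert_close(out.float(), ref.float(), rtol=rtol, atol=atol)
```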

The end-to-end evaluation after #951 should be normal.

Atream (Contributor, Author) commented Mar 17, 2025

It works fine. Thank you for your quick fix.

Atream closed this as completed Mar 17, 2025
yzh119 added a commit that referenced this issue Mar 17, 2025
The sm86/sm89 version of the MLA kernel was not tested after change #942; this PR fixes the issue.

This PR also makes the following changes:
1. adds the MLA unittests to CI (on an a10g node);
2. shrinks the MLA unittests so that CI can finish in a reasonable time;
3. changes `is_sm90a_supported(torch.device("cuda"))` to `backend == "fa3" and not is_sm90a_supported(torch.device("cuda")):` for non-Hopper GPUs, as pointed out by @Atream (see the sketch below).
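
A minimal sketch of what item 3 amounts to inside a parametrized test (pytest-style; the helper name is illustrative, and `is_sm90a_supported` is assumed importable from `flashinfer.utils`):

```python
import pytest
import torch
from flashinfer.utils import is_sm90a_supported  # assumed import path

def maybe_skip_backend(backend: str) -> None:
    # The old guard skipped every test whenever sm90a was unavailable, which also
    # disabled fa2 runs on sm86/sm89. Only the fa3 backend actually needs sm90a
    # (wgmma), so the new guard lets fa2 keep running on non-Hopper GPUs.
    if backend == "fa3" and not is_sm90a_supported(torch.device("cuda")):
        pytest.skip("fa3 MLA kernels require sm90a (Hopper wgmma)")
```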
yyihuang pushed a commit to yyihuang/flashinfer that referenced this issue Mar 17, 2025