Hi, I was trying to benchmark the MLA attention decoding kernel with the following script, which adds CUDA timing events around `BatchDecodeMlaWithPagedKVCacheWrapper.run` in `test_mla_decode_kernel`: https://gist.github.com/YLGH/e8ebd7577d12f6c7963bcbae95e3b781
However, I'm seeing very low numbers, such as an effective memory throughput of ~50 GiB/s for batch_size=32, kv_len=16k, page_size=16. I suspect I'm misusing the API somehow, but I couldn't find any examples of `BatchDecodeMlaWithPagedKVCacheWrapper` being used outside of test code, so I'm not sure whether I'm calling it correctly. I also couldn't get the nvbench suite to compile.
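Roughly, the timing and throughput math look like this (a condensed sketch, not the full gist; the head dims, fp16 storage, and variable names are illustrative assumptions, and the actual wrapper setup and `run` call are in the gist linked above):

```python
import torch

def time_cuda(fn, warmup=10, iters=100):
    # Average milliseconds per call, measured with CUDA events on the
    # current stream. `fn` is assumed to take no arguments.
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

batch_size, kv_len, page_size = 32, 16 * 1024, 16
# Assumed MLA head dims (DeepSeek-style 512 compressed-KV + 64 RoPE)
# and fp16 storage; the real configuration is in the gist.
head_dim_ckv, head_dim_kpe, bytes_per_elem = 512, 64, 2

# ms = time_cuda(lambda: wrapper.run(q_nope, q_pe, ckv_cache, kpe_cache))
ms = 1.0  # placeholder so the throughput math below runs standalone

# "Effective memory throughput" = bytes of paged KV cache the kernel must
# read per call, divided by the measured time.
kv_bytes = batch_size * kv_len * (head_dim_ckv + head_dim_kpe) * bytes_per_elem
print(f"{kv_bytes / (ms * 1e-3) / 2**30:.1f} GiB/s")
```

For scale: under the dims assumed above, one call touches ~0.56 GiB of KV cache, so 50 GiB/s corresponds to roughly 11 ms per call, which seems far too slow for this workload.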