Hi, I was trying to benchmark the MLA attention decoding kernel with the following script, which adds CUDA timing events around `BatchDecodeMlaWithPagedKVCacheWrapper.run` in `test_mla_decode_kernel`: https://gist.github.com/YLGH/e8ebd7577d12f6c7963bcbae95e3b781
However, I'm seeing very low numbers, such as an effective memory throughput of ~50 GiB/s for batch_size=32, kv_len=16k, page_size=16. I suspect I'm misusing the API somehow, but I couldn't find any examples of `BatchDecodeMlaWithPagedKVCacheWrapper` being used outside of test code, so I'm not sure whether I'm calling it correctly. I also couldn't get the nvbench suite to compile.
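Roughly, the timing and throughput math look like this (a condensed sketch, not the full gist; the head dims, fp16 storage, and variable names are illustrative assumptions, and the actual wrapper setup and `run` call are in the gist linked above):

```python
import torch

def time_cuda(fn, warmup=10, iters=100):
    # Average milliseconds per call, measured with CUDA events on the
    # current stream. `fn` is assumed to take no arguments.
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

batch_size, kv_len, page_size = 32, 16 * 1024, 16
# Assumed MLA head dims (DeepSeek-style 512 compressed-KV + 64 RoPE)
# and fp16 storage; the real configuration is in the gist.
head_dim_ckv, head_dim_kpe, bytes_per_elem = 512, 64, 2

# ms = time_cuda(lambda: wrapper.run(q_nope, q_pe, ckv_cache, kpe_cache))
ms = 1.0  # placeholder so the throughput math below runs standalone

# "Effective memory throughput" = bytes of paged KV cache the kernel must
# read per call, divided by the measured time.
kv_bytes = batch_size * kv_len * (head_dim_ckv + head_dim_kpe) * bytes_per_elem
print(f"{kv_bytes / (ms * 1e-3) / 2**30:.1f} GiB/s")
```

For scale: under the dims assumed above, one call touches ~0.56 GiB of KV cache, so 50 GiB/s corresponds to roughly 11 ms per call, which seems far too slow for this workload.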