performance difference based on number of tokens to process #6777
-
Good evening. I was trying to learn a bit about the llama.cpp library and stumbled upon this. Here is a small code change based on examples/simple: okuvshynov@63cd5b5, which adds N mock tokens on every llama_decode call. All tests were run on an M2 Ultra with the mistral-7b-v0.1.Q8_0 model. My question is: why is the difference between N=0 and N=1 so dramatic on the GPU? Is there an optimization done specifically for this scenario, since I assume it's quite common? Or did I just misconfigure something?
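For reference, here is a minimal sketch of what such a change might look like. This is not the linked commit: the field names follow the llama_batch struct from llama.h and may differ between library versions, and placing the mock tokens in a second sequence (which assumes the context was created with room for more than one sequence) is just one way to keep them from affecting the real generation.

```cpp
#include "llama.h"

// Pad every llama_decode call with n_mock extra tokens so the effective
// batch size becomes 1 + n_mock. The mock tokens go into a separate
// sequence (seq_id 1) so they do not touch the KV cache of the real one.
static int decode_with_padding(llama_context * ctx, llama_token tok,
                               llama_pos pos, int n_mock) {
    llama_batch batch = llama_batch_init(1 + n_mock, /*embd*/ 0, /*n_seq_max*/ 1);

    // the real token; request logits only for it
    batch.token[0]     = tok;
    batch.pos[0]       = pos;
    batch.n_seq_id[0]  = 1;
    batch.seq_id[0][0] = 0;
    batch.logits[0]    = true;

    // n_mock mock tokens at the same position, no logits needed
    for (int i = 0; i < n_mock; ++i) {
        batch.token[1 + i]     = tok;
        batch.pos[1 + i]       = pos;
        batch.n_seq_id[1 + i]  = 1;
        batch.seq_id[1 + i][0] = 1;
        batch.logits[1 + i]    = false;
    }
    batch.n_tokens = 1 + n_mock;

    const int ret = llama_decode(ctx, batch);
    llama_batch_free(batch);
    return ret;
}
```

With n_mock = 0 this is a plain single-token decode (batch size 1); with n_mock = 1 the batch size becomes 2, which is exactly the step where I see the big slowdown.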
Replies: 1 comment 2 replies
-
The Metal backend uses 2 types of kernels to perform matrix multiplication:

- mat-vec (matrix-vector) kernels
- mat-mat (matrix-matrix) kernels

The former is very efficient for batch size 1 (BS=1) and gets worse as the BS increases.
The latter is inefficient for small BS, but becomes very efficient for large BS.
There is a break-even point at a certain BS where one kernel becomes more efficient than the other:
https://github.com/ggerganov/llama.cpp/blob/e8d35f47cb8cb4002fca02e18aaa1cb9fa21d6f1/ggml-metal.m#L1422-L1447
I don't know how to determine that break-even point, so currently we always use mat-vec for BS=1 and mat-mat for all other BS which is certainly not optimal, but I don't know how to improve this.
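For readers who don't want to follow the link, the dispatch boils down to something like the sketch below. The helper names are hypothetical placeholders, and the real condition in ggml-metal.m also looks at tensor types, shapes and alignment, not just the batch size:

```cpp
#include <cstdint>

// ne11 is the second dimension of src1, i.e. the batch size of the multiply.
enum class mm_kernel { mat_vec, mat_mat };

static mm_kernel pick_mm_kernel(int64_t ne11) {
    if (ne11 == 1) {
        // BS = 1: matrix-vector kernel; each weight is read once and applied
        // to a single activation vector, so there is little wasted work
        return mm_kernel::mat_vec;
    }
    // BS > 1: tiled matrix-matrix kernel; its setup overhead is amortized
    // across the batch, which only pays off above some break-even BS
    return mm_kernel::mat_mat;
}
```

This is why going from 1 to 2 tokens per llama_decode switches kernels and can show a disproportionate jump in per-call time.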