performance difference based on number of tokens to process #6777
-
Good evening. I was trying to learn a bit about the llama.cpp library and stumbled upon this. Here is a small code change based on examples/simple: okuvshynov@63cd5b5, which adds N mock tokens on every llama_decode call. All tests were run on an M2 Ultra with the mistral-7b-v0.1.Q8_0 model. My question is: why is the difference between N=0 and N=1 so dramatic on the GPU? Is there an optimization done specifically for this scenario, since I assume it's quite common? Or did I just misconfigure something?
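For reference, here is a minimal sketch of what such a change might look like. This is not the linked commit: the field names follow the llama_batch struct from llama.h and may differ between library versions, and placing the mock tokens in a second sequence (which assumes the context was created with room for more than one sequence) is just one way to keep them from affecting the real generation.

```cpp
#include "llama.h"

// Pad every llama_decode call with n_mock extra tokens so the effective
// batch size becomes 1 + n_mock. The mock tokens go into a separate
// sequence (seq_id 1) so they do not touch the KV cache of the real one.
static int decode_with_padding(llama_context * ctx, llama_token tok,
                               llama_pos pos, int n_mock) {
    llama_batch batch = llama_batch_init(1 + n_mock, /*embd*/ 0, /*n_seq_max*/ 1);

    // the real token; request logits only for it
    batch.token[0]     = tok;
    batch.pos[0]       = pos;
    batch.n_seq_id[0]  = 1;
    batch.seq_id[0][0] = 0;
    batch.logits[0]    = true;

    // n_mock mock tokens at the same position, no logits needed
    for (int i = 0; i < n_mock; ++i) {
        batch.token[1 + i]     = tok;
        batch.pos[1 + i]       = pos;
        batch.n_seq_id[1 + i]  = 1;
        batch.seq_id[1 + i][0] = 1;
        batch.logits[1 + i]    = false;
    }
    batch.n_tokens = 1 + n_mock;

    const int ret = llama_decode(ctx, batch);
    llama_batch_free(batch);
    return ret;
}
```

With n_mock = 0 this is a plain single-token decode (batch size 1); with n_mock = 1 the batch size becomes 2, which is exactly the step where I see the big slowdown.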
Replies: 1 comment 2 replies
-
The Metal backend uses 2 types of kernels to perform matrix multiplication:

- mat-vec (matrix-vector) kernels
- mat-mat (matrix-matrix) kernels

The former is very efficient for batch size 1 (BS=1) and gets worse as the BS increases.
The latter is inefficient for small BS, but becomes very efficient for large BS.
There is a break-even point at a certain BS where one kernel becomes more efficient than the other:
https://github.com/ggerganov/llama.cpp/blob/e8d35f47cb8cb4002fca02e18aaa1cb9fa21d6f1/ggml-metal.m#L1422-L1447
I don't know how to determine that break-even point, so currently we always use mat-vec for BS=1 and mat-mat for all other BS which is certainly not optimal, but I don't know how to improve this.
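For readers who don't want to follow the link, the dispatch boils down to something like the sketch below. The helper names are hypothetical placeholders, and the real condition in ggml-metal.m also looks at tensor types, shapes and alignment, not just the batch size:

```cpp
#include <cstdint>

// ne11 is the second dimension of src1, i.e. the batch size of the multiply.
enum class mm_kernel { mat_vec, mat_mat };

static mm_kernel pick_mm_kernel(int64_t ne11) {
    if (ne11 == 1) {
        // BS = 1: matrix-vector kernel; each weight is read once and applied
        // to a single activation vector, so there is little wasted work
        return mm_kernel::mat_vec;
    }
    // BS > 1: tiled matrix-matrix kernel; its setup overhead is amortized
    // across the batch, which only pays off above some break-even BS
    return mm_kernel::mat_mat;
}
```

This is why going from 1 to 2 tokens per llama_decode switches kernels and can show a disproportionate jump in per-call time.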