Performance difference based on number of tokens to process #6777

Answered by ggerganov
okuvshynov asked this question in Q&A
The Metal backend uses 2 types of kernels to perform matrix multiplication:

  • mat-vec
  • mat-mat

The former is very efficient at batch size 1 (BS=1) and gets worse as the BS increases.
The latter is inefficient for small BS, but becomes very efficient for large BS.

There is a break-even point at a certain BS where one kernel becomes more efficient than the other:

https://github.com/ggerganov/llama.cpp/blob/e8d35f47cb8cb4002fca02e18aaa1cb9fa21d6f1/ggml-metal.m#L1422-L1447

I don't know how to determine that break-even point, so currently we always use mat-vec for BS=1 and mat-mat for all other BS. This is certainly not optimal, but I don't know how to improve it.

Answer selected by okuvshynov