AFAIK, we already support batching but do not have a benchmark for it yet. We should review the implementation to see if we can make any improvements.
Discussion
Resources
Each request (1 prompt per request) sent to the server is prepared and added to a task queue. A background process gathers prompts from the task queue, builds a batch, processes it, and then pushes the results to an output queue.
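A minimal sketch of that producer/consumer flow (illustrative Python only; the actual cortex/llama.cpp implementation is C++, and the batch size and timeout here are assumptions):

```python
# Sketch of the queue-based batching flow described above
# (illustrative, not the actual cortex/llama.cpp implementation).
import queue
import threading

MAX_BATCH_SIZE = 8      # assumed batch limit
BATCH_TIMEOUT_S = 0.05  # how long the worker waits to fill a batch

task_queue = queue.Queue()  # incoming prompts, one entry per request
output_queues = {}          # request_id -> per-request result queue

def handle_request(request_id, prompt):
    """Called per incoming request: enqueue the prompt and wait for its result."""
    output_queues[request_id] = queue.Queue()
    task_queue.put((request_id, prompt))
    return output_queues[request_id].get()  # blocks until the batch worker finishes

def batch_worker(run_batch):
    """Background worker: drain the task queue, build a batch, run it, push results."""
    while True:
        batch = [task_queue.get()]  # block until at least one prompt is available
        try:
            while len(batch) < MAX_BATCH_SIZE:
                batch.append(task_queue.get(timeout=BATCH_TIMEOUT_S))
        except queue.Empty:
            pass
        ids, prompts = zip(*batch)
        for request_id, completion in zip(ids, run_batch(list(prompts))):
            output_queues[request_id].put(completion)

# threading.Thread(target=batch_worker, args=(my_batched_generate,), daemon=True).start()
```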
=> The current implementation of cortex llama.cpp can support batching, but we need to adjust some params to sync with the latest llama.cpp implementation and add documentation to Readme.md. A benchmark script that runs a batch test to verify the implementation is also needed.
Result when running the script on a 3090 (Linux):
{'message': 'Model already loaded'}
Finished in 27.825968503952026 s
Total token: 6108
Throughput when run parallel: 219.50718441776795 tokens/s
############################
Finished in 38.07835125923157 s
Total token: 4966
Throughput when run in sequence: 130.4153104264477 tokens/s
###########################
--- 70.19260907173157 seconds ---
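For reference, a minimal sketch of how such a parallel-vs-sequential throughput comparison could be structured (the endpoint URL, payload fields, and prompt set below are assumptions; this is not necessarily the exact script that produced the numbers above):

```python
# Sketch of a parallel-vs-sequential throughput benchmark (illustrative only;
# endpoint URL and payload fields are assumptions).
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:3928/v1/chat/completions"       # assumed server address
PROMPTS = ["Write a short story about a robot."] * 16   # assumed workload

def run_one(prompt):
    resp = requests.post(URL, json={
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    })
    return resp.json()["usage"]["completion_tokens"]

def measure(label, runner):
    start = time.time()
    tokens = sum(runner())
    elapsed = time.time() - start
    print(f"Finished in {elapsed} s")
    print(f"Total token: {tokens}")
    print(f"Throughput when run {label}: {tokens / elapsed} tokens/s")

# Parallel: all prompts in flight at once, so the server can batch them.
with ThreadPoolExecutor(max_workers=len(PROMPTS)) as pool:
    measure("parallel", lambda: list(pool.map(run_one, PROMPTS)))

# Sequential: one prompt at a time, no batching opportunity.
measure("in sequence", lambda: [run_one(p) for p in PROMPTS])
```

The parallel run keeps many prompts in flight so the server can group them into a single batch, which is what the higher tokens/s figure above reflects.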