Add priority based scheduling #25
/label tide/merge-method-squash
The y-axis label for the middle graphs looks like it indicates throughput. I guess you meant average per-token latency in minutes?
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ahg-g, liu-cong
* Add priority based scheduling
* Use the least kv cache for sheddable requests when there is capacity
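For readers skimming the commit messages, the policy they describe can be sketched roughly as below. This is an illustrative sketch only, not the code in this PR: the `Pod` type, the `schedule` function, and both threshold constants are hypothetical names and values, and the handling of critical requests (never shed, routed to the pod with the lowest KV cache utilization) is an assumption. Only the sheddable rule, "use the least kv cache when there is capacity", comes from the commit message.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical capacity thresholds, chosen only for illustration.
KV_CACHE_THRESHOLD = 0.8  # max KV cache utilization (fraction) to count as "has capacity"
QUEUE_THRESHOLD = 5       # max queued requests to count as "has capacity"


@dataclass(frozen=True)
class Pod:
    """Hypothetical view of a model server pod's load metrics."""
    name: str
    kv_cache_utilization: float  # fraction in [0, 1]
    queue_length: int


def has_capacity(pod: Pod) -> bool:
    """A pod has capacity when both metrics are at or under the thresholds."""
    return (pod.kv_cache_utilization <= KV_CACHE_THRESHOLD
            and pod.queue_length <= QUEUE_THRESHOLD)


def schedule(pods: Sequence[Pod], critical: bool) -> Optional[Pod]:
    """Pick a pod for a request, or return None to shed it.

    Assumption: critical requests are never shed and may use any pod.
    Sheddable requests only consider pods that still have capacity and
    are dropped when no such pod exists.
    """
    if critical:
        candidates = list(pods)
    else:
        candidates = [p for p in pods if has_capacity(p)]
    if not candidates:
        return None  # nothing to route to: shed the request
    # Among the candidates, prefer the pod with the least KV cache in use.
    return min(candidates, key=lambda p: p.kv_cache_utilization)
```

With this shape, a saturated pool still serves critical traffic while sheddable traffic is dropped, which is the behavior the critical vs. sheddable benchmark compares.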
Benchmark results for critical vs. sheddable requests. Please find more details of the benchmark in this doc:
https://docs.google.com/document/d/11ALHEF-9yOaLdbHbDjBoTY6fzejoEKiSYHzWpWqe8ZY/edit?tab=t.0#bookmark=id.gx5ki34wov48
Critical requests:

Sheddable requests:

Success ratio for sheddable requests: