Draft a revised README.md #374
Conversation
✅ Deploy Preview for gateway-api-inference-extension ready!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: ahg-g, smarterclayton. The full list of commands accepted by this bot can be found here. The pull request process is described here.
This is a great update, thanks @smarterclayton! /lgtm
Force-pushed from 7ddd14f to 0953ded
Force-pushed from ba1b2fc to c8c7f1f
Force-pushed from c8c7f1f to 7975dc4
README.md (Outdated)
This extension is intended to provide value to multiplexed LLM services on a shared pool of compute. See the [proposal](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/012-llm-instance-gateway) for more info.
The inference gateway:
* Improves the tail latency and throughput of LLM completion requests against Kubernetes-hosted model servers using an extensible request scheduling algorithm that is aware of kv-cache utilization, request weight, and priority, avoiding evictions or queueing as load increases
what do we mean by request weight here?
Do we consider token size?
no, we don't; we consider the current kv-cache utilization and queue length at the model servers
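To make the discussion concrete, here is a toy sketch of scheduling on those two signals (current kv-cache utilization and queue length). This is not the project's actual scheduler; the `Endpoint` fields, the linear scoring, and the `queue_weight` value are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    kv_cache_utilization: float  # fraction of KV cache in use, 0.0 to 1.0
    queue_length: int            # requests currently queued at the model server

def score(ep: Endpoint, queue_weight: float = 0.1) -> float:
    # Lower is better: prefer servers with free KV cache and short queues.
    # The weight balancing the two signals is a made-up illustration.
    return ep.kv_cache_utilization + queue_weight * ep.queue_length

def pick_endpoint(endpoints: list[Endpoint]) -> Endpoint:
    # Route the request to the lowest-scoring (least loaded) server.
    return min(endpoints, key=score)

pods = [
    Endpoint("pod-a", kv_cache_utilization=0.9, queue_length=2),
    Endpoint("pod-b", kv_cache_utilization=0.3, queue_length=1),
    Endpoint("pod-c", kv_cache_utilization=0.4, queue_length=8),
]
print(pick_endpoint(pods).name)  # pod-b: low cache pressure and a short queue
```

Note that token size does not appear anywhere in the scoring, matching the answer above: only server-side load signals drive the decision.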
Clarify the point of the project, and use the vernacular of "inference gateway" vs "ai gateway" to more succinctly explain what the distinction is. Move the website up more prominently, and describe in more detail what the immediate requirements are. Create a stub roadmap section. Add a medium complexity architecture SVG to the readme
Force-pushed from 7975dc4 to fe96ba8
All changes are made, I think this is ok to merge barring any final review.
/hold cancel
/lgtm
I intended this to start a discussion, but I'd like to see the readme more immediately answer what the purpose of the project is. I agree the discussion around "inference gateway" has past history, but I also don't think "gateway-api-inference-extension" is landing when I talk about this with potential users, and I want to try some variations.
@ahg-g @danehans @kfswain @robscott