Commit c8c7f1f

Draft a revised README.md
Clarify the point of the project, and use the vernacular of "inference gateway" vs. "AI gateway" to explain the distinction more succinctly. Move the website link up more prominently, describe the immediate requirements in more detail, create a stub roadmap section, and add a medium-complexity architecture SVG to the README.
1 parent a78c768 commit c8c7f1f

File tree

2 files changed: +24, -8 lines changed


README.md (+23, -8)
@@ -1,24 +1,39 @@
# Gateway API Inference Extension

-The Gateway API Inference Extension came out of [wg-serving](https://github.com/kubernetes/community/tree/master/wg-serving) and is sponsored by [SIG Network](https://github.com/kubernetes/community/blob/master/sig-network/README.md#gateway-api-inference-extension). This repo contains: the load balancing algorithm, [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter) code, CRDs, and controllers of the extension.
+This extension upgrades an [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter)-capable proxy or gateway, such as Envoy Gateway, kGateway, or the GKE Gateway, into an **inference gateway** that supports inference platform teams self-hosting large language models on Kubernetes. This integration makes it easy to expose your local [OpenAI-compatible chat completion endpoints](https://platform.openai.com/docs/api-reference/chat) to other workloads on or off cluster and control access to them, or to integrate your self-hosted models alongside model-as-a-service providers in a higher-level **AI gateway** such as LiteLLM, Solo AI Gateway, or Apigee.
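Because the pool is exposed through standard OpenAI-compatible chat completion endpoints, any OpenAI-style client can reach the self-hosted models through the gateway. A minimal sketch in Python, assuming a hypothetical gateway address and client model name (both are placeholders, not values defined by this project):

```python
# Illustrative only: gateway host and model name are placeholder values.
import requests

resp = requests.post(
    "http://inference-gateway.example.com/v1/chat/completions",  # assumed gateway address
    json={
        "model": "food-review",  # client-facing model name the gateway routes on
        "messages": [{"role": "user", "content": "Summarize this review in one sentence."}],
        "max_tokens": 128,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```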

-This extension is intented to provide value to multiplexed LLM services on a shared pool of compute. See the [proposal](https://github.com/kubernetes-sigs/wg-serving/tree/main/proposals/012-llm-instance-gateway) for more info.
+The inference gateway:
+
+* Improves the tail latency and throughput of LLM completion requests against Kubernetes-hosted model servers, using an extensible request scheduling algorithm that is both kv-cache aware and request weight and priority aware, avoiding evictions or queueing as load increases
+* Provides [Kubernetes-native declarative APIs](https://gateway-api-inference-extension.sigs.k8s.io/concepts/api-overview/) to route client model names to use-case-specific LoRA adapters and to control incremental rollout of new adapter versions, A/B traffic splitting, and safe blue-green base model and model server upgrades
+* Adds end-to-end observability around service objective attainment
+* Ensures operational guardrails like priority and fairness across different client model names, allowing a platform team to safely serve many different GenAI workloads on the same pool of shared foundation model servers for higher utilization and fewer required accelerators
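The A/B traffic splitting mentioned in the declarative-APIs bullet comes down to weighted routing between adapter versions behind a single client-facing model name. The sketch below is not the project's API, only a conceptual illustration of the proportional-routing semantics; adapter names and weights are made up, and the real behavior is configured through the Kubernetes-native APIs linked above:

```python
# Conceptual sketch only: weight-proportional selection between two LoRA adapter
# versions that share one client-facing model name. Adapter names and weights
# are hypothetical; the project configures this declaratively, not in client code.
import random

def pick_target(targets: dict[str, int]) -> str:
    """Pick an adapter name with probability proportional to its weight."""
    names = list(targets)
    return random.choices(names, weights=[targets[n] for n in names], k=1)[0]

# e.g. keep 90% of "food-review" traffic on the v1 adapter, canary 10% on v2
targets = {"food-review-v1": 90, "food-review-v2": 10}
counts = {name: 0 for name in targets}
for _ in range(10_000):
    counts[pick_target(targets)] += 1
print(counts)  # roughly {'food-review-v1': 9000, 'food-review-v2': 1000}
```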
+
+![Architecture Diagram](./docs/inference-gateway-architecture.svg)
+
+It currently requires a version of vLLM that supports the metrics needed to predict traffic load, as defined in the [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-endpoint-picker-protocol). Support for Jetspeed, nVidia Triton, text-generation-inference, and SGLang is coming soon.
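For context on the vLLM requirement above: the extension needs per-pod load signals such as queue depth and KV-cache utilization, which model servers expose as Prometheus metrics. A rough sketch of scraping those signals, assuming vLLM's metric names (which may differ by version) and a placeholder pod address; the model server protocol linked above is the authoritative reference:

```python
# Rough sketch: read queue depth and KV-cache utilization from a model server's
# Prometheus /metrics endpoint. Metric names follow vLLM's exporter and may
# differ by version; the pod address is a placeholder.
import re
import requests

METRICS = ("vllm:num_requests_waiting", "vllm:gpu_cache_usage_perc")

def scrape(pod_addr: str) -> dict[str, float]:
    text = requests.get(f"http://{pod_addr}/metrics", timeout=5).text
    values = {}
    for name in METRICS:
        # Prometheus text format: `name{labels} value` or `name value`
        pattern = re.compile(r"^" + re.escape(name) + r"(?:\{[^}]*\})?\s+(\S+)$", re.MULTILINE)
        match = pattern.search(text)
        if match:
            values[name] = float(match.group(1))
    return values

print(scrape("10.0.0.12:8000"))  # e.g. {'vllm:num_requests_waiting': 3.0, 'vllm:gpu_cache_usage_perc': 0.41}
```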

## Status

-This project is currently in development.
+This project is [alpha (0.1 release)](https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/tag/v0.1.0). It should not be used in production yet.

## Getting Started

-Follow this [README](./pkg/README.md) to get the inference-extension up and running on your cluster!
+Follow our [Getting Started Guide](./pkg/README.md) to get the inference-extension up and running on your cluster!

-## End-to-End Tests
+See our website at https://gateway-api-inference-extension.sigs.k8s.io/ for detailed API documentation on leveraging our Kubernetes-native declarative APIs.

-Follow this [README](./test/e2e/README.md) to learn more about running the inference-extension end-to-end test suite on your cluster.
+## Status

-## Website
+This project is [alpha (0.1 release)](https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/tag/v0.1.0). It should not be used in production yet.

-Detailed documentation is available on our website: https://gateway-api-inference-extension.sigs.k8s.io/
+## Roadmap
+
+Coming soon!
+
+## End-to-End Tests
+
+Follow this [README](./test/e2e/README.md) to learn more about running the inference-extension end-to-end test suite on your cluster.

## Contributing


docs/inference-gateway-architecture.svg (+1)

0 commit comments
