# Gateway API Inference Extension

This extension upgrades an [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter)-capable proxy or gateway - such as Envoy Gateway, kGateway, or the GKE Gateway - into an **inference gateway**, supporting inference platform teams self-hosting large language models on Kubernetes. This integration makes it easy to expose your local [OpenAI-compatible chat completion endpoints](https://platform.openai.com/docs/api-reference/chat) to other workloads on or off cluster, to control access to them, and to integrate your self-hosted models alongside model-as-a-service providers in a higher-level **AI Gateway** such as LiteLLM, Solo AI Gateway, or Apigee.
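Concretely, once the extension is installed, model traffic flows through a standard Gateway API `HTTPRoute` whose backend is an `InferencePool` of model server pods rather than a plain `Service`. A minimal sketch of that wiring - the gateway and pool names here are illustrative, not part of the API:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway           # your ext-proc-capable Gateway (illustrative name)
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /v1/chat/completions   # the OpenAI-compatible completion endpoint
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool             # this project's CRD, in place of a Service
      name: vllm-llama-pool           # illustrative pool name
```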

The inference gateway improves the tail latency and throughput of LLM completion requests against Kubernetes-hosted model servers by using an extensible request scheduling algorithm that is aware of kv-cache utilization and of each request's weight and priority, avoiding evictions or queueing as load increases. It provides [Kubernetes-native declarative APIs](https://gateway-api-inference-extension.sigs.k8s.io/concepts/api-overview/) to route client model names to use-case-specific LoRA adapters and to control incremental rollout of new adapter versions, A/B traffic splitting, and safe blue-green upgrades of base models and model servers. By adding deep service observability and operational guardrails such as priority and fairness across client model names, the inference gateway lets a platform team safely serve many different GenAI workloads on a shared pool of foundation model servers, for higher utilization and fewer required accelerators.
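
As a sketch of what those declarative APIs look like in the alpha release: an `InferenceModel` maps a client-facing model name onto the pool that serves it and, optionally, onto weighted LoRA adapter versions for incremental rollout. Field names follow the `v1alpha1` API and may change; the resource and adapter names below are illustrative:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: chatbot
spec:
  modelName: chatbot          # model name clients put in their requests
  criticality: Critical       # scheduling priority for this workload
  poolRef:
    name: vllm-llama-pool     # InferencePool serving the base model
  targetModels:               # LoRA adapters behind the client-facing name
  - name: chatbot-lora-v1
    weight: 90                # 90/10 split for incremental rollout
  - name: chatbot-lora-v2
    weight: 10
```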

It currently requires a version of vLLM that supports the necessary metrics to predict traffic load, as defined in the [model server protocol](https://docs.google.com/document/d/18VRJ2ufZmAwBZ2jArfvGjQGaWtsQtAP6_yF2Xn6zcms/edit?tab=t.0#heading=h.i6dojwsuaskj). Support for Jetstream, NVIDIA Triton, text-generation-inference, and SGLang is coming soon.
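
The model server replicas that satisfy this protocol are grouped for scheduling with an `InferencePool`. A minimal sketch, assuming vLLM pods labeled `app: vllm-llama` listening on port 8000 (labels, port, and name are illustrative; fields follow the `v1alpha1` API and may change):

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  name: vllm-llama-pool
spec:
  selector:
    app: vllm-llama        # label selector for the vLLM pods
  targetPortNumber: 8000   # port the model servers listen on
```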

## Status

This project is currently in [alpha](https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/tag/v0.1.0). It should not be used in production.

## Getting Started

Follow our [Getting Started Guide](./pkg/README.md) to get the inference-extension up and running on your cluster!

See our website at https://gateway-api-inference-extension.sigs.k8s.io/ for detailed API documentation on leveraging our Kubernetes-native declarative APIs.

## Roadmap

Coming soon!

## End-to-End Tests

Follow this [README](./test/e2e/README.md) to learn more about running the inference-extension end-to-end test suite on your cluster.

## Contributing