Commit 7ddd14f

Update README.md
1 parent 2d325e8 commit 7ddd14f

File tree

1 file changed: +9 −7 lines changed

README.md

+9-7
@@ -1,18 +1,20 @@
 # Gateway API Inference Extension

-The Gateway API Inference Extension - also known as an inference gateway - improves the tail latency and throughput of LLM completion requests in the OpenAI protocol against Kubernetes-hosted model servers. It provides Kubernetes-native declarative APIs to route client model names to use-case specific LoRA adapters and control incremental rollout of new adapter versions, A/B traffic splitting, and safe blue-green base model and model server upgrades. By adding operational guardrails like priority and fairness to different client model names, the inference gateway allows a platform team to safely serve many different GenAI workloads on the same pool of shared foundation model servers for higher utilization and fewer required accelerators.
+This extension upgrades an [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter)-capable cluster gateway - such as Envoy Gateway, kGateway, or the GKE Gateway - to become an **inference gateway** - supporting inference platform teams self-hosting large language models on Kubernetes. This integration makes it easy to expose and control access to your local [OpenAI-compatible chat completion endpoints](https://platform.openai.com/docs/api-reference/chat) to other workloads on or off cluster, or to integrate your self-hosted models alongside model-as-a-service providers in a higher level **AI Gateway** like LiteLLM, Solo AI Gateway, or Apigee.

-The inference gateway is intended for inference platform teams serving self-hosted large language models on Kubernetes. It requires a version of vLLM that supports the necessary metrics to predict traffic. It extends a cluster-local gateway supporting [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter) - such as Envoy Gateway, kGateway, or the GKE Gateway - with a request scheduling algorithm that is both kv-cache and request weight and priority aware, avoiding evictions or queueing when model servers are highly loaded. The HttpRoute that accepts OpenAI-compatible requests and serves model responses can then be configured as a model provider underneath a higher level AI-Gateway like LiteLLM, Solo AI Gateway, or Apigee, allowing you to integrate local serving with model-as-a-service consumption.
+The inference gateway improves the tail latency and throughput of LLM completion requests against Kubernetes-hosted model servers using an extensible request scheduling algorithm that is both kv-cache and request weight and priority aware, avoiding evictions or queueing as load increases. It provides Kubernetes-native declarative APIs to route client model names to use-case specific LoRA adapters and control incremental rollout of new adapter versions, A/B traffic splitting, and safe blue-green base model and model server upgrades. By adding operational guardrails like priority and fairness to different client model names, the inference gateway allows a platform team to safely serve many different GenAI workloads on the same pool of shared foundation model servers for higher utilization and fewer required accelerators.

-See our website at https://gateway-api-inference-extension.sigs.k8s.io/ for detailed API documentation.
+It currently requires a version of vLLM that supports the necessary metrics to predict traffic load, which is defined in the [model server protocol](https://docs.google.com/document/d/18VRJ2ufZmAwBZ2jArfvGjQGaWtsQtAP6_yF2Xn6zcms/edit?tab=t.0#heading=h.i6dojwsuaskj). Support for Jetspeed, nVidia Triton, text-generation-inference, and SGLang is coming soon.

-## Status
+## Getting Started

-This project is currently under development and we have released our first [alpha 0.1 release](https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/tag/v0.1.0). It should not be used in production.
+Follow our [Getting Started Guide](./pkg/README.md) to get the inference-extension up and running on your cluster!

-## Getting Started
+See our website at https://gateway-api-inference-extension.sigs.k8s.io/ for detailed API documentation on leveraging our Kubernetes-native declarative APIs.
+
+## Status

-Follow this [README](./pkg/README.md) to get the inference-extension up and running on your cluster!
+This project is [alpha (0.1 release)](https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/tag/v0.1.0). It should not be used in production yet.

 ## Roadmap

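The new wording centers on exposing self-hosted, OpenAI-compatible endpoints through an ext-proc-capable gateway. As a rough illustration of the Kubernetes-native wiring this implies, here is a minimal sketch of an HTTPRoute backed by an InferencePool. The resource names, labels, and port are hypothetical, and the InferencePool field names are assumptions about the alpha API; the project website has the authoritative schema.

```yaml
# Sketch only: resource names, labels, and the port are illustrative, and the
# InferencePool fields are assumptions about the alpha API -- consult the
# project documentation for the authoritative schema.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  name: base-model-pool              # hypothetical pool of model server replicas
spec:
  selector:
    app: vllm-llama2-7b              # labels on the vLLM Pods serving the base model
  targetPortNumber: 8000             # port exposing /v1/chat/completions
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway          # the ext-proc-capable Gateway (Envoy Gateway, kGateway, GKE Gateway, ...)
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: base-model-pool          # requests are scheduled across the pool by the extension
```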

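The routing and rollout behavior described in the diff (client model names mapped to use-case specific LoRA adapters, with weighted traffic splitting between adapter versions) would be expressed with an InferenceModel-style object. Again a hedged sketch: the model and adapter names are invented, and the field names and weight semantics are assumptions rather than the documented API.

```yaml
# Sketch only: model and adapter names are invented, and the InferenceModel
# fields and weight semantics are assumptions about the alpha API.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: tweet-summary
spec:
  modelName: tweet-summary           # client-facing model name in OpenAI requests
  criticality: Critical              # priority guardrail relative to other workloads on the pool
  poolRef:
    name: base-model-pool            # shared pool of foundation model servers
  targetModels:                      # incremental rollout / A-B split across adapter versions
  - name: tweet-summary-v1           # LoRA adapter registered on the model servers
    weight: 90
  - name: tweet-summary-v2
    weight: 10
```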