The inference gateway is intended for inference platform teams serving self-hosted large language models on Kubernetes. It requires a version of vLLM that exposes the metrics needed to predict traffic. It extends a cluster-local gateway that supports [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter) - such as Envoy Gateway, kGateway, or the GKE Gateway - with a request scheduling algorithm that is aware of both kv-cache utilization and request weight and priority, avoiding evictions or queueing when model servers are heavily loaded. The HTTPRoute that accepts OpenAI-compatible requests and serves model responses can then be configured as a model provider underneath a higher-level AI gateway such as LiteLLM, Solo AI Gateway, or Apigee, allowing you to integrate local serving with model-as-a-service consumption.
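
As a rough sketch of how that wiring can look, the HTTPRoute below forwards OpenAI-compatible traffic from an ext-proc capable Gateway to a pool of vLLM model servers. The resource names here are placeholders, and the backend group/kind shown for the pool is an assumption that may differ from the API version you deploy:

```yaml
# Sketch only: "inference-gateway", "llm-route", and "my-model-pool" are
# placeholder names, and the InferencePool API group may vary by release.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
    - name: inference-gateway      # the ext-proc capable Gateway described above
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /               # OpenAI-compatible paths, e.g. /v1/completions
      backendRefs:
        - group: inference.networking.x-k8s.io   # assumed group for the extension's CRDs
          kind: InferencePool                    # pool of vLLM servers the scheduler picks from
          name: my-model-pool
```

A higher-level AI gateway can then treat this route's endpoint as just another OpenAI-compatible model provider.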