The inference gateway is intended for inference platform teams serving self-hosted large language models on Kubernetes. It requires a version of vLLM that exposes the metrics needed to predict traffic. It extends a cluster-local gateway that supports [ext-proc](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter) - such as Envoy Gateway, kGateway, or the GKE Gateway - with a request scheduling algorithm that is aware of both kv-cache utilization and request weight and priority, avoiding evictions or queueing when model servers are heavily loaded. The HTTPRoute that accepts OpenAI-compatible requests and serves model responses can then be configured as a model provider underneath a higher-level AI gateway such as LiteLLM, Solo AI Gateway, or Apigee, allowing you to integrate local serving with model-as-a-service consumption.
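
As a rough sketch of how that wiring can look, the HTTPRoute below forwards OpenAI-compatible traffic from an ext-proc capable Gateway to a pool of vLLM model servers. The resource names here are placeholders, and the backend group/kind shown for the pool is an assumption that may differ from the API version you deploy:

```yaml
# Sketch only: "inference-gateway", "llm-route", and "my-model-pool" are
# placeholder names, and the InferencePool API group may vary by release.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
    - name: inference-gateway      # the ext-proc capable Gateway described above
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /               # OpenAI-compatible paths, e.g. /v1/completions
      backendRefs:
        - group: inference.networking.x-k8s.io   # assumed group for the extension's CRDs
          kind: InferencePool                    # pool of vLLM servers the scheduler picks from
          name: my-model-pool
```

A higher-level AI gateway can then treat this route's endpoint as just another OpenAI-compatible model provider.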