diff --git a/site-src/api-types/inferencemodel.md b/site-src/api-types/inferencemodel.md
index 12a351b0..54fe5739 100644
--- a/site-src/api-types/inferencemodel.md
+++ b/site-src/api-types/inferencemodel.md
@@ -7,8 +7,13 @@

 ## Background

-TODO
+An InferenceModel allows the Inference Workload Owner to define:
+
+- Which Model/LoRA adapter(s) to consume.
+  - The mapping from a client-facing model name to the target model name in the InferencePool.
+  - Traffic splitting between adapters _in the same InferencePool_, so that new LoRA adapter versions can be rolled out easily.
+- The Criticality of the requests to the InferenceModel.

 ## Spec

-TODO
\ No newline at end of file
+The full spec of the InferenceModel is defined [here](/reference/spec/#inferencemodel).
\ No newline at end of file
diff --git a/site-src/api-types/inferencepool.md b/site-src/api-types/inferencepool.md
index fdae7e05..baa604b6 100644
--- a/site-src/api-types/inferencepool.md
+++ b/site-src/api-types/inferencepool.md
@@ -7,12 +7,28 @@

 ## Background

-InferencePool is
+The InferencePool resource is a logical grouping of compute resources (e.g. Pods) that run model servers. An InferencePool deploys its own routing extension and offers administrative configuration to the Platform Admin.
+
+The InferencePool is expected to:
+
+ - Enforce fair consumption of resources across competing workloads
+ - Efficiently route requests across shared compute (as demonstrated by the PoC)
+
+The InferencePool is _not_ expected to:
+
+ - Enforce that any common set of adapters or base models is available on the Pods
+ - Manage Deployments of Pods within the Pool
+ - Manage the lifecycle of Pods within the Pool
+
+Additionally, any Pod that seeks to join an InferencePool must support a protocol, defined by this project, so that the Pool has adequate information to intelligently route requests.
+
+`InferencePool` has some overlap with `Service`, as shown here:

 [Image: Comparing InferencePool with Service]

+The InferencePool is _not_ intended to mask the Service object; it exposes only the bare minimum required, allowing the Platform Admin to focus less on networking and more on Pool management.

 ## Spec

-TODO
\ No newline at end of file
+The full spec of the InferencePool is defined [here](/reference/spec/#inferencepool).
\ No newline at end of file
diff --git a/site-src/concepts/api-overview.md b/site-src/concepts/api-overview.md
index f7b50d8b..94e76251 100644
--- a/site-src/concepts/api-overview.md
+++ b/site-src/concepts/api-overview.md
@@ -1,3 +1,16 @@
 # API Overview

-TODO
\ No newline at end of file
+## Background
+The Gateway API Inference Extension project is an extension of the Kubernetes Gateway API for serving Generative AI models on Kubernetes. Gateway API Inference Extension facilitates standardization of APIs for Kubernetes cluster operators and developers running generative AI inference, while allowing flexibility for underlying gateway implementations (such as Envoy Proxy) to iterate on mechanisms for optimized serving of models.
+
+[Image: Overview of API integration]
+
+## API Resources
+
+### InferencePool
+
+InferencePool represents a set of Inference-focused Pods and an extension that will be used to route to them. Within the broader Gateway API resource model, this resource is considered a "backend". In practice, that means that you'd replace a Kubernetes Service with an InferencePool. This resource has some similarities to Service (a way to select Pods and specify a port), but has some unique capabilities. When combined with InferenceModel, you can configure a routing extension as well as inference-specific routing optimizations. For more information on this resource, refer to our [InferencePool documentation](/api-types/inferencepool.md) or go directly to the [InferencePool spec](/reference/spec/#inferencepool).
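+
+As a concrete sketch, a minimal InferencePool manifest might look like the following. The resource names, group/version, and fields shown here are illustrative rather than authoritative; consult the [InferencePool spec](/reference/spec/#inferencepool) for the exact schema.
+
+```yaml
+# Illustrative sketch only; see the InferencePool spec reference for the authoritative schema.
+apiVersion: inference.networking.x-k8s.io/v1alpha1
+kind: InferencePool
+metadata:
+  name: llama-pool
+spec:
+  # Select the model server Pods that make up this pool.
+  selector:
+    app: llama-server
+  # Port on which the model servers accept inference requests.
+  targetPortNumber: 8000
+  # Endpoint-picker extension used for inference-aware routing.
+  extensionRef:
+    name: llama-pool-endpoint-picker
+```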
+
+### InferenceModel
+
+An InferenceModel represents a model or adapter, and the configuration associated with that model. This resource enables you to configure the relative criticality of a model, and allows you to seamlessly translate the requested model name to one or more backend model names. Multiple InferenceModels can be attached to an InferencePool. For more information on this resource, refer to our [InferenceModel documentation](/api-types/inferencemodel.md) or go directly to the [InferenceModel spec](/reference/spec/#inferencemodel).
diff --git a/site-src/concepts/roles-and-personas.md b/site-src/concepts/roles-and-personas.md
index 4a344e1f..b11f43eb 100644
--- a/site-src/concepts/roles-and-personas.md
+++ b/site-src/concepts/roles-and-personas.md
@@ -1,3 +1,26 @@
 # Roles and Personas

-TODO
\ No newline at end of file
+Before diving into the details of the API, descriptions of the personas these APIs were designed for will help convey the thought process behind the API design.
+
+## Inference Platform Admin
+
+The Inference Platform Admin creates and manages the infrastructure necessary to run LLM workloads, including handling Ops for:
+
+ - Hardware
+ - Model Server
+ - Base Model
+ - Resource Allocation for Workloads
+ - Gateway configuration
+ - etc.
+
+## Inference Workload Owner
+
+An Inference Workload Owner persona owns and manages one or many Generative AI Workloads (LLM-focused *currently*). This includes:
+
+- Defining criticality
+- Managing fine-tunes
+  - LoRA Adapters
+  - System Prompts
+  - Prompt Cache
+  - etc.
+- Managing rollout of adapters
diff --git a/site-src/images/inference-overview.svg b/site-src/images/inference-overview.svg
new file mode 100644
index 00000000..a82c09e2
--- /dev/null
+++ b/site-src/images/inference-overview.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/site-src/index.md b/site-src/index.md
index d8dfb773..04d1fadb 100644
--- a/site-src/index.md
+++ b/site-src/index.md
@@ -11,28 +11,24 @@
 they are expected to manage:

 [Image: Gateway API Inference Extension Resource Model]

+## Key Features
+Gateway API Inference Extension, along with a reference implementation in Envoy Proxy, provides the following key features:
+
+- **Model-aware routing**: Instead of simply routing based on the path of the request, Gateway API Inference Extension allows you to route to models based on the model name. This is enabled by support for GenAI Inference API specifications (such as the OpenAI API) in gateway implementations such as Envoy Proxy. This model-aware routing also extends to Low-Rank Adaptation (LoRA) fine-tuned models.
+
+- **Serving priority**: Gateway API Inference Extension allows you to specify the serving priority of your models. For example, you can specify that your models for online inference of chat tasks (which are more latency sensitive) have a higher [*Criticality*](/reference/spec/#criticality) than models for latency-tolerant tasks such as summarization.
+
+- **Model rollouts**: Gateway API Inference Extension allows you to incrementally roll out new model versions by defining traffic splits based on model names (see the sketch after this list).
+
+- **Extensibility for Inference Services**: Gateway API Inference Extension defines an extensibility pattern that lets additional inference services provide bespoke routing capabilities when out-of-the-box solutions do not fit your needs.
+
+- **Customizable Load Balancing for Inference**: Gateway API Inference Extension defines a pattern for customizable load balancing and request routing that is optimized for inference. It also provides a reference implementation of model endpoint picking that leverages metrics emitted from the model servers. This endpoint-picking mechanism can be used in lieu of traditional load balancing mechanisms. Model server-aware load balancing ("smart" load balancing, as it's sometimes referred to in this repo) has been proven to reduce serving latency and improve utilization of accelerators in your clusters.
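+
+As a sketch of how model rollouts and serving priority come together, an InferenceModel that gradually shifts traffic between two LoRA adapter versions of the same client-facing model might look like the following. The resource names, group/version, and fields shown here are illustrative rather than authoritative; consult the [InferenceModel spec](/reference/spec/#inferencemodel) for the exact schema.
+
+```yaml
+# Illustrative sketch only; see the InferenceModel spec reference for the authoritative schema.
+apiVersion: inference.networking.x-k8s.io/v1alpha1
+kind: InferenceModel
+metadata:
+  name: chatbot
+spec:
+  # Client-facing model name used in inference requests.
+  modelName: chatbot
+  # Relative serving priority of requests for this model.
+  criticality: Critical
+  # InferencePool whose model servers back this model.
+  poolRef:
+    name: llama-pool
+  # Split traffic across two LoRA adapter versions to roll out v2 gradually.
+  targetModels:
+  - name: chatbot-lora-v1
+    weight: 90
+  - name: chatbot-lora-v2
+    weight: 10
+```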
+
 ## API Resources

-### InferencePool
-
-InferencePool represents a set of Inference-focused Pods and an extension that
-will be used to route to them. Within the broader Gateway API resource model,
-this resource is considered a "backend". In practice, that means that you'd
-replace a Kubernetes Service with an InferencePool. This resource has some
-similarities to Service (a way to select Pods and specify a port), but will
-expand to have some inference-specific capabilities. When combined with
-InferenceModel, you can configure a routing extension as well as
-inference-specific routing optimizations. For more information on this resource,
-refer to our [InferencePool documentation](/api-types/inferencepool).
-
-### InferenceModel
-
-An InferenceModel represents a model or adapter, and its associated
-configuration. This resource enables you to configure the relative criticality
-of a model, and allows you to seamlessly translate the requested model name to
-one or more backend model names. Multiple InferenceModels can be attached to an
-InferencePool. For more information on this resource, refer to our
-[InferenceModel documentation](/api-types/inferencemodel).
+Head to our [API overview](/concepts/api-overview/#api-overview) to start exploring our APIs!

 ## Composable Layers