Populating api-types & concepts #254

Merged · 7 commits · Jan 31, 2025

9 changes: 7 additions & 2 deletions site-src/api-types/inferencemodel.md
@@ -7,8 +7,13 @@

## Background

TODO
An InferenceModel allows the Inference Workload Owner to define:

- Which Model/LoRA adapter(s) to consume.
- Mapping from a client-facing model name to the target model name in the InferencePool.
- Traffic splitting between adapters _in the same InferencePool_, so that new LoRA adapter versions can be rolled out easily.
- Criticality of the requests to the InferenceModel.

## Spec

TODO
The full spec of the InferenceModel is defined [here](/reference/spec/#inferencemodel).
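
To make the above concrete, here is a minimal illustrative sketch of an InferenceModel manifest. The API group/version, field names (`modelName`, `criticality`, `poolRef`, `targetModels`), and all resource and model names below are assumptions for illustration only and should be checked against the spec linked above:

```yaml
# Illustrative sketch only — field names and API version are assumed,
# not authoritative; consult the InferenceModel spec for the real schema.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chatbot
spec:
  modelName: chatbot          # client-facing model name requested by callers
  criticality: Critical       # relative criticality of requests to this model
  poolRef:
    name: my-inference-pool   # the InferencePool that serves this model
  targetModels:               # traffic split across LoRA adapter versions
    - name: chatbot-lora-v1
      weight: 90
    - name: chatbot-lora-v2   # new adapter version being rolled out
      weight: 10
```

In this sketch, requests for the client-facing name `chatbot` are split 90/10 between two LoRA adapter versions served from the same InferencePool, illustrating the rollout pattern described above.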
20 changes: 18 additions & 2 deletions site-src/api-types/inferencepool.md
@@ -7,12 +7,28 @@

## Background

InferencePool is
The InferencePool resource is a logical grouping of compute resources (e.g. Pods) that run model servers. The InferencePool deploys its own routing and offers administrative configuration to the Platform Admin.

It is expected for the InferencePool to:

- Enforce fair consumption of resources across competing workloads
- Efficiently route requests across shared compute (as demonstrated by the PoC)

It is _not_ expected for the InferencePool to:

- Enforce that any common set of adapters or base models is available on the Pods
- Manage Deployments of Pods within the Pool
- Manage the lifecycle of Pods within the Pool

Additionally, any Pod that seeks to join an InferencePool would need to support a protocol, defined by this project, to ensure the Pool has adequate information to intelligently route requests.

`InferencePool` has some overlap with `Service`, as shown here:

<!-- Source: https://docs.google.com/presentation/d/11HEYCgFi-aya7FS91JvAfllHiIlvfgcp7qpi_Azjk4E/edit#slide=id.g292839eca6d_1_0 -->
<img src="/images/inferencepool-vs-service.png" alt="Comparing InferencePool with Service" class="center" width="550" />

The InferencePool is _not_ intended to be a mask of the Service object; it exposes only the bare minimum required, allowing the Platform Admin to focus less on networking and more on Pool management.

## Spec

TODO
The full spec of the InferencePool is defined [here](/reference/spec/#inferencepool).
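
As a rough illustration of the shape of this resource, here is a minimal sketch of an InferencePool manifest. The API group/version, field names (`selector`, `targetPortNumber`, `extensionRef`), and all names and values are assumptions for illustration and should be verified against the spec linked above:

```yaml
# Illustrative sketch only — field names and API version are assumed,
# not authoritative; consult the InferencePool spec for the real schema.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: my-inference-pool
spec:
  selector:                    # label selector for the model server Pods
    app: vllm-llama3
  targetPortNumber: 8000       # port the model servers listen on
  extensionRef:
    name: my-endpoint-picker   # extension implementing the routing protocol
```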
15 changes: 14 additions & 1 deletion site-src/concepts/api-overview.md
@@ -1,3 +1,16 @@
# API Overview

TODO
## Background

The Gateway API Inference Extension project is an extension of the Kubernetes Gateway API for serving Generative AI models on Kubernetes. Gateway API Inference Extension facilitates standardization of APIs for Kubernetes cluster operators and developers running generative AI inference, while allowing flexibility for underlying gateway implementations (such as Envoy Proxy) to iterate on mechanisms for optimized serving of models.

<img src="/images/inference-overview.svg" alt="Overview of API integration" class="center" width="1000" />

## API Resources

### InferencePool

InferencePool represents a set of Inference-focused Pods and an extension that will be used to route to them. Within the broader Gateway API resource model, this resource is considered a "backend". In practice, that means that you'd replace a Kubernetes Service with an InferencePool. This resource has some similarities to Service (a way to select Pods and specify a port), but has some unique capabilities. When combined with an InferenceModel, you can configure a routing extension as well as inference-specific routing optimizations. For more information on this resource, refer to our [InferencePool documentation](/api-types/inferencepool.md) or go directly to the [InferencePool spec](/reference/spec/#inferencepool).

### InferenceModel

An InferenceModel represents a model or adapter, and configuration associated with that model. This resource enables you to configure the relative criticality of a model, and allows you to seamlessly translate the requested model name to one or more backend model names. Multiple InferenceModels can be attached to an InferencePool. For more information on this resource, refer to our [InferenceModel documentation](/api-types/inferencemodel.md) or go directly to the [InferenceModel spec](/reference/spec/#inferencemodel).
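
To show how these resources fit into the broader Gateway API resource model, here is a hedged sketch of an HTTPRoute that uses an InferencePool as its backend in place of a Service. The route, gateway, and pool names are placeholders, and whether a given Gateway implementation accepts an InferencePool backendRef depends on that implementation:

```yaml
# Illustrative sketch only — resource names are placeholders.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
    - name: inference-gateway        # the Gateway handling inference traffic
  rules:
    - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool        # used where a Service would normally go
          name: my-inference-pool
```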
25 changes: 24 additions & 1 deletion site-src/concepts/roles-and-personas.md
@@ -1,3 +1,26 @@
# Roles and Personas

TODO
Before diving into the details of the API, descriptions of the personas these APIs were designed for will help convey the thought process behind the API design.

## Inference Platform Admin

The Inference Platform Admin creates and manages the infrastructure necessary to run LLM workloads, including handling Ops for:

- Hardware
- Model Server
- Base Model
- Resource Allocation for Workloads
- Gateway configuration
- etc.

## Inference Workload Owner

An Inference Workload Owner persona owns and manages one or many Generative AI Workloads (LLM-focused *currently*). This includes:

- Defining criticality
- Managing fine-tunes
- LoRA Adapters
- System Prompts
- Prompt Cache
- etc.
- Managing rollout of adapters
1 change: 1 addition & 0 deletions site-src/images/inference-overview.svg
36 changes: 16 additions & 20 deletions site-src/index.md
@@ -11,28 +11,24 @@ they are expected to manage:
<!-- Source: https://docs.google.com/presentation/d/11HEYCgFi-aya7FS91JvAfllHiIlvfgcp7qpi_Azjk4E/edit#slide=id.g292839eca6d_1_0 -->
<img src="/images/resource-model.png" alt="Gateway API Inference Extension Resource Model" class="center" width="550" />

## Key Features
Gateway API Inference Extension, along with a reference implementation in Envoy Proxy, provides the following key features:

- **Model-aware routing**: Instead of simply routing based on the path of the request, Gateway API Inference Extension allows you to route to models based on model names. This is enabled by support for GenAI Inference API specifications (such as the OpenAI API) in gateway implementations such as Envoy Proxy. This model-aware routing also extends to Low-Rank Adaptation (LoRA) fine-tuned models.

- **Serving priority**: Gateway API Inference Extension allows you to specify the serving priority of your models. For example, you can specify that your models for online inference of chat tasks (which are more latency sensitive) have a higher [*Criticality*](/reference/spec/#criticality) than models for latency-tolerant tasks such as summarization.

- **Model rollouts**: Gateway API Inference Extension allows you to incrementally roll out new model versions by defining traffic splits based on model names.

- **Extensibility for Inference Services**: Gateway API Inference Extension defines an extensibility pattern for additional Inference services to create bespoke routing capabilities should out-of-the-box solutions not fit your needs.

- **Customizable Load Balancing for Inference**: Gateway API Inference Extension defines a pattern for customizable load balancing and request routing that is optimized for Inference. Gateway API Inference Extension provides a reference implementation of model endpoint picking that leverages metrics emitted from the model servers. This endpoint-picking mechanism can be used in lieu of traditional load balancing mechanisms. Model server-aware load balancing ("smart" load balancing, as it's sometimes referred to in this repo) has been proven to reduce serving latency and improve utilization of accelerators in your clusters.

## API Resources

### InferencePool

InferencePool represents a set of Inference-focused Pods and an extension that
will be used to route to them. Within the broader Gateway API resource model,
this resource is considered a "backend". In practice, that means that you'd
replace a Kubernetes Service with an InferencePool. This resource has some
similarities to Service (a way to select Pods and specify a port), but will
expand to have some inference-specific capabilities. When combined with
InferenceModel, you can configure a routing extension as well as
inference-specific routing optimizations. For more information on this resource,
refer to our [InferencePool documentation](/api-types/inferencepool).

### InferenceModel

An InferenceModel represents a model or adapter, and its associated
configuration. This resource enables you to configure the relative criticality
of a model, and allows you to seamlessly translate the requested model name to
one or more backend model names. Multiple InferenceModels can be attached to an
InferencePool. For more information on this resource, refer to our
[InferenceModel documentation](/api-types/inferencemodel).
Head to our [API overview](/concepts/api-overview/#api-overview) to start exploring our APIs!

## Composable Layers
