
Commit f9d2ef9

Populating api-types & concepts (#254)
* Populating api-types & concepts
* feedback updates
* feedback pass
* formatting adjustments
* restructuring site, and editing wording
* swapping to svg cause crisp lines make brain feel good
* fixing name
1 parent 7639e6f commit f9d2ef9

File tree: 6 files changed, +80 −26 lines

Diff for: site-src/api-types/inferencemodel.md

+7 −2

@@ -7,8 +7,13 @@
 
 ## Background
 
-TODO
+An InferenceModel allows the Inference Workload Owner to define:
+
+- Which Model/LoRA adapter(s) to consume.
+- Mapping from a client-facing model name to the target model name in the InferencePool.
+- Traffic splitting between adapters _in the same InferencePool_, allowing new LoRA adapter versions to be rolled out easily.
+- Criticality of the requests to the InferenceModel.
 
 ## Spec
 
-TODO
+The full spec of the InferenceModel is defined [here](/reference/spec/#inferencemodel).
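To make the new Background section concrete, here is a minimal sketch of what an InferenceModel manifest along these lines might look like. The apiVersion, field names, and values are illustrative assumptions based on the description above, not copied from the commit; the linked spec is authoritative.

```yaml
# Hypothetical InferenceModel sketch. Field names and apiVersion are assumptions
# for illustration; consult /reference/spec/#inferencemodel for the real schema.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: chatbot
spec:
  modelName: chatbot              # client-facing model name used in requests
  criticality: Critical           # relative criticality of this workload
  poolRef:
    name: llm-pool                # InferencePool that serves this model
  targetModels:                   # traffic split across LoRA adapter versions
    - name: chatbot-lora-v1
      weight: 90
    - name: chatbot-lora-v2
      weight: 10
```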

Diff for: site-src/api-types/inferencepool.md

+18 −2

@@ -7,12 +7,28 @@
 
 ## Background
 
-InferencePool is
+The InferencePool resource is a logical grouping of compute resources, e.g. Pods, that run model servers. The InferencePool deploys its own routing and offers administrative configuration to the Platform Admin.
+
+The InferencePool is expected to:
+
+- Enforce fair consumption of resources across competing workloads
+- Efficiently route requests across shared compute (as demonstrated by the PoC)
+
+The InferencePool is _not_ expected to:
+
+- Enforce that a common set of adapters or base models is available on the Pods
+- Manage Deployments of Pods within the Pool
+- Manage the lifecycle of Pods within the Pool
+
+Additionally, any Pod that seeks to join an InferencePool must support a protocol, defined by this project, that gives the Pool enough information to intelligently route requests.
+
+`InferencePool` has some small overlap with `Service`, displayed here:
 
 <!-- Source: https://docs.google.com/presentation/d/11HEYCgFi-aya7FS91JvAfllHiIlvfgcp7qpi_Azjk4E/edit#slide=id.g292839eca6d_1_0 -->
 <img src="/images/inferencepool-vs-service.png" alt="Comparing InferencePool with Service" class="center" width="550" />
 
+The InferencePool is _not_ intended to be a mask of the Service object; it exposes only the bare minimum required so the Platform Admin can focus less on networking and more on Pool management.
 
 ## Spec
 
-TODO
+The full spec of the InferencePool is defined [here](/reference/spec/#inferencepool).
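A minimal sketch of what such an InferencePool might look like follows. The apiVersion and field names (selector, targetPortNumber, extensionRef) are assumptions made for illustration; refer to the linked spec for the actual schema.

```yaml
# Hypothetical InferencePool sketch; apiVersion and field names are assumptions.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  name: llm-pool
spec:
  selector:                          # labels identifying the model server Pods
    app: vllm-llama3-8b
  targetPortNumber: 8000             # port the model servers listen on
  extensionRef:
    name: llm-pool-endpoint-picker   # extension that implements the routing protocol
```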

Diff for: site-src/concepts/api-overview.md

+14 −1

@@ -1,3 +1,16 @@
 # API Overview
 
-TODO
+## Background
+The Gateway API Inference Extension project is an extension of the Kubernetes Gateway API for serving Generative AI models on Kubernetes. Gateway API Inference Extension facilitates standardization of APIs for Kubernetes cluster operators and developers running generative AI inference, while allowing flexibility for underlying gateway implementations (such as Envoy Proxy) to iterate on mechanisms for optimized serving of models.
+
+<img src="/images/inference-overview.svg" alt="Overview of API integration" class="center" width="1000" />
+
+## API Resources
+
+### InferencePool
+
+InferencePool represents a set of Inference-focused Pods and an extension that will be used to route to them. Within the broader Gateway API resource model, this resource is considered a "backend". In practice, that means you'd replace a Kubernetes Service with an InferencePool. This resource has some similarities to Service (a way to select Pods and specify a port), but has some unique capabilities. With InferenceModel, you can configure a routing extension as well as inference-specific routing optimizations. For more information on this resource, refer to our [InferencePool documentation](/api-types/inferencepool.md) or go directly to the [InferencePool spec](/reference/spec/#inferencepool).
+
+### InferenceModel
+
+An InferenceModel represents a model or adapter, and the configuration associated with that model. This resource enables you to configure the relative criticality of a model, and allows you to seamlessly translate the requested model name to one or more backend model names. Multiple InferenceModels can be attached to an InferencePool. For more information on this resource, refer to our [InferenceModel documentation](/api-types/inferencemodel.md) or go directly to the [InferenceModel spec](/reference/spec/#inferencemodel).
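Because the overview positions InferencePool as a Gateway API "backend" that takes the place of a Service, a route targeting it might look roughly like the sketch below. The group/kind values and resource names are assumptions for illustration, not taken from the commit.

```yaml
# Hypothetical sketch: an HTTPRoute sending traffic to an InferencePool
# instead of a Service. Names and group/kind values are illustrative.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
    - name: inference-gateway          # an existing Gateway (assumed to exist)
  rules:
    - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool          # used where a Service would normally go
          name: llm-pool
```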

Diff for: site-src/concepts/roles-and-personas.md

+24 −1

@@ -1,3 +1,26 @@
 # Roles and Personas
 
-TODO
+Before diving into the details of the API, descriptions of the personas these APIs were designed for help convey the thought process behind the API design.
+
+## Inference Platform Admin
+
+The Inference Platform Admin creates and manages the infrastructure necessary to run LLM workloads, including handling Ops for:
+
+- Hardware
+- Model Server
+- Base Model
+- Resource Allocation for Workloads
+- Gateway configuration
+- etc.
+
+## Inference Workload Owner
+
+An Inference Workload Owner persona owns and manages one or many Generative AI Workloads (LLM focused *currently*). This includes:
+
+- Defining criticality
+- Managing fine-tunes
+  - LoRA Adapters
+  - System Prompts
+  - Prompt Cache
+  - etc.
+- Managing rollout of adapters

Diff for: site-src/images/inference-overview.svg

+1

Diff for: site-src/index.md

+16 −20

@@ -11,28 +11,24 @@ they are expected to manage:
 <!-- Source: https://docs.google.com/presentation/d/11HEYCgFi-aya7FS91JvAfllHiIlvfgcp7qpi_Azjk4E/edit#slide=id.g292839eca6d_1_0 -->
 <img src="/images/resource-model.png" alt="Gateway API Inference Extension Resource Model" class="center" width="550" />
 
+## Key Features
+Gateway API Inference Extension, along with a reference implementation in Envoy Proxy, provides the following key features:
+
+- **Model-aware routing**: Instead of simply routing based on the path of the request, Gateway API Inference Extension allows you to route to models based on model names. This is enabled by support for GenAI inference API specifications (such as the OpenAI API) in gateway implementations such as Envoy Proxy. This model-aware routing also extends to Low-Rank Adaptation (LoRA) fine-tuned models.
+
+- **Serving priority**: Gateway API Inference Extension allows you to specify the serving priority of your models. For example, you can specify that your models for online inference of chat tasks (which are more latency sensitive) have a higher [*Criticality*](/reference/spec/#criticality) than models for latency-tolerant tasks such as summarization.
+
+- **Model rollouts**: Gateway API Inference Extension allows you to incrementally roll out new model versions through traffic-splitting definitions based on model names.
+
+- **Extensibility for Inference Services**: Gateway API Inference Extension defines an extensibility pattern so that additional inference services can create bespoke routing capabilities should out-of-the-box solutions not fit your needs.
+
+- **Customizable Load Balancing for Inference**: Gateway API Inference Extension defines a pattern for customizable load balancing and request routing that is optimized for inference. It provides a reference implementation of model endpoint picking that leverages metrics emitted from the model servers, which can be used in lieu of traditional load balancing mechanisms. Model Server-aware load balancing ("smart" load balancing, as it's sometimes referred to in this repo) has been proven to reduce serving latency and improve utilization of accelerators in your clusters.
+
 ## API Resources
 
-### InferencePool
-
-InferencePool represents a set of Inference-focused Pods and an extension that
-will be used to route to them. Within the broader Gateway API resource model,
-this resource is considered a "backend". In practice, that means that you'd
-replace a Kubernetes Service with an InferencePool. This resource has some
-similarities to Service (a way to select Pods and specify a port), but will
-expand to have some inference-specific capabilities. When combined with
-InferenceModel, you can configure a routing extension as well as
-inference-specific routing optimizations. For more information on this resource,
-refer to our [InferencePool documentation](/api-types/inferencepool).
-
-### InferenceModel
-
-An InferenceModel represents a model or adapter, and its associated
-configuration. This resource enables you to configure the relative criticality
-of a model, and allows you to seamlessly translate the requested model name to
-one or more backend model names. Multiple InferenceModels can be attached to an
-InferencePool. For more information on this resource, refer to our
-[InferenceModel documentation](/api-types/inferencemodel).
+Head to our [API overview](/concepts/api-overview/#api-overview) to start exploring our APIs!
 
 ## Composable Layers
 
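As a companion to the Serving priority feature described above, the sketch below shows two InferenceModels attached to the same pool with different criticality levels. The apiVersion, field names, and criticality values are assumptions for illustration; the [InferenceModel spec](/reference/spec/#inferencemodel) is authoritative.

```yaml
# Hypothetical sketch: two workloads sharing one pool with different priorities.
# Field names and criticality values are illustrative assumptions.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: chat
spec:
  modelName: chat
  criticality: Critical        # latency-sensitive online chat traffic
  poolRef:
    name: llm-pool
---
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: summarizer
spec:
  modelName: summarizer
  criticality: Sheddable       # latency-tolerant work that can be shed under load
  poolRef:
    name: llm-pool
```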
