
Populating api-types & concepts #254


Merged 7 commits on Jan 31, 2025
`site-src/api-types/inferencemodel.md` (7 additions, 2 deletions)
@@ -7,8 +7,13 @@

## Background

An InferenceModel allows the Inference Workload Owner to define:
- Which Model/LoRA adapter(s) to consume.
- Mapping from a client-facing model name to the target model name in the InferencePool.
- Traffic splitting between adapters _in the same InferencePool_, allowing new LoRA adapter versions to be rolled out easily.
- Criticality of the requests to the InferenceModel.
- The InferencePools this InferenceModel is relevant to.

## Spec

The full spec of the InferenceModel is defined [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/api/v1alpha1/inferencemodel_types.go).
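
For illustration, a minimal InferenceModel manifest might look like the sketch below. This is not normative: the API group and field names (`modelName`, `criticality`, `poolRef`, `targetModels`) are assumptions based on the v1alpha1 Go types linked above, and the linked spec is authoritative.

```yaml
# Illustrative sketch only -- field names assumed from the v1alpha1 types;
# consult the linked inferencemodel_types.go for the authoritative spec.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: tweet-summarizer
spec:
  # Client-facing model name that incoming requests reference.
  modelName: tweet-summary
  # Relative importance of this workload's requests.
  criticality: Critical
  # The InferencePool whose Pods serve this model.
  poolRef:
    name: base-model-pool
  # Traffic split between LoRA adapter versions in the same pool,
  # e.g. to roll out a new adapter version gradually.
  targetModels:
  - name: tweet-summary-lora-v1
    weight: 90
  - name: tweet-summary-lora-v2
    weight: 10
```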
`site-src/api-types/inferencepool.md` (16 additions, 2 deletions)
@@ -7,12 +7,26 @@

## Background

At its core, the InferencePool is a logical grouping of compute, expressed in the form of Pods (typically model servers), akin to a K8s Service. The InferencePool would deploy its own routing and offer administrative configuration to the Platform Admin.

It is expected for the InferencePool to:
- Enforce fair consumption of resources across competing workloads
- Efficiently route requests across shared compute (as demonstrated by the PoC)

It is _not_ expected for the InferencePool to:
- Enforce that a common set of adapters or base models is available on the Pods
- Manage Deployments of Pods within the Pool
- Manage the Pod lifecycle of Pods within the Pool

Additionally, any Pod that seeks to join an InferencePool would need to support a protocol, defined by this project, to ensure the Pool has adequate information to intelligently route requests.

The InferencePool has some small overlap with the `Service` spec, shown here:

<!-- Source: https://docs.google.com/presentation/d/11HEYCgFi-aya7FS91JvAfllHiIlvfgcp7qpi_Azjk4E/edit#slide=id.g292839eca6d_1_0 -->
<img src="/images/inferencepool-vs-service.png" alt="Comparing InferencePool with Service" class="center" width="550" />

The InferencePool is _not_ intended to be a mask of the Service object; it simply exposes the absolute bare minimum required, allowing the Platform Admin to focus less on networking and more on Pool management.

## Spec

The full spec of the InferencePool is defined [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/api/v1alpha1/inferencepool_types.go).
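
As a rough illustration of the shape of the resource (not normative; the API group and field names such as `selector` and `targetPortNumber` are assumptions based on the linked v1alpha1 Go types), an InferencePool manifest could look something like:

```yaml
# Illustrative sketch only -- field names assumed from the v1alpha1 types;
# consult the linked inferencepool_types.go for the authoritative spec.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  name: base-model-pool
spec:
  # Label selector for the model-server Pods that make up the pool,
  # similar in spirit to a Service selector.
  selector:
    app: vllm-llama2-7b
  # Port the model servers listen on.
  targetPortNumber: 8000
```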
`site-src/concepts/api-overview.md` (10 additions, 1 deletion)
@@ -1,3 +1,12 @@
# API Overview

## Background
The API design is based on these axioms:
> **Contributor:** My suggestion is to focus more on the relationship between the two apis and links to the apis docs we have above and less about how we ended up here (i.e., the design) since this is more of a user-facing documentation?
>
> **Collaborator (Author):** Fair points, took a stab here. PTAL


- Pools of shared compute should be *discrete* for scheduling to properly work
- Pod-level scheduling should not be handled by a high-level gateway
- Simple services should be simple to define (or are implicitly defined via reasonable defaults)
- This solution should be composable with other Gateway solutions and flexible to fit customer needs
- The MVP heavily assumes requests are made using the OpenAI spec, but is open to extension in the future
- The Gateway should route in a way that does not generate a queue of requests at the model server level
- Model serving differs from web serving in critical ways. One of these is the existence of multiple models for the same service, which can materially impact behavior depending on the model served, as opposed to a web service, which has mechanisms to render implementation changes invisible to the end user
`site-src/concepts/roles-and-personas.md` (22 additions, 1 deletion)
@@ -1,3 +1,24 @@
# Roles and Personas

Before diving into the details of the API, descriptions of the personas these APIs were designed for will help convey the thought process behind the API design.

## Inference Platform Admin

The Inference Platform Admin creates and manages the infrastructure necessary to run LLM workloads, including handling Ops for:
- Hardware
- Model Server
- Base Model
- Resource Allocation for Workloads
- Gateway configuration
- etc.

## Inference Workload Owner

An Inference Workload Owner persona owns and manages one or more Generative AI Workloads (LLM-focused *currently*). This includes:
- Defining criticality
- Managing fine-tunes
    - LoRA Adapters
    - System Prompts
    - Prompt Cache
    - etc.
- Managing rollout of adapters