
Populating api-types & concepts #254


Merged 7 commits on Jan 31, 2025
`site-src/api-types/inferencemodel.md` (7 additions, 2 deletions)
@@ -7,8 +7,13 @@

## Background

An InferenceModel allows the Inference Workload Owner to define:
- Which Model/LoRA adapter(s) to consume.
- Mapping from a client-facing model name to the target model name in the InferencePool.
- Traffic splitting between adapters _in the same InferencePool_, allowing new LoRA adapter versions to be rolled out easily.
- Criticality of the requests to the InferenceModel.
- The InferencePools this InferenceModel is relevant to.

## Spec

The full spec of the InferenceModel is defined [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/api/v1alpha1/inferencemodel_types.go).
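
For illustration, a minimal InferenceModel manifest might look like the sketch below. This is not normative: the API group and field names (`modelName`, `criticality`, `poolRef`, `targetModels`) are assumptions based on the v1alpha1 Go types linked above, and the linked spec is authoritative.

```yaml
# Illustrative sketch only -- field names assumed from the v1alpha1 types;
# consult the linked inferencemodel_types.go for the authoritative spec.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: tweet-summarizer
spec:
  # Client-facing model name that incoming requests reference.
  modelName: tweet-summary
  # Relative importance of this workload's requests.
  criticality: Critical
  # The InferencePool whose Pods serve this model.
  poolRef:
    name: base-model-pool
  # Traffic split between LoRA adapter versions in the same pool,
  # e.g. to roll out a new adapter version gradually.
  targetModels:
  - name: tweet-summary-lora-v1
    weight: 90
  - name: tweet-summary-lora-v2
    weight: 10
```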
`site-src/api-types/inferencepool.md` (16 additions, 2 deletions)
@@ -7,12 +7,26 @@

## Background

At its core, the InferencePool is a logical grouping of compute, expressed in the form of Pods (typically model servers), akin to a K8s Service. The InferencePool would deploy its own routing and offer administrative configuration to the Platform Admin.

It is expected for the InferencePool to:
- Enforce fair consumption of resources across competing workloads
- Efficiently route requests across shared compute (as demonstrated by the PoC)

It is _not_ expected for the InferencePool to:
- Enforce that a common set of adapters or base models is available on the Pods
- Manage Deployments of Pods within the Pool
- Manage the Pod lifecycle of Pods within the Pool

Additionally, any Pod that seeks to join an InferencePool would need to support a protocol, defined by this project, to ensure the Pool has adequate information to intelligently route requests.

The InferencePool has some small overlap with the `Service` spec, shown here:

<!-- Source: https://docs.google.com/presentation/d/11HEYCgFi-aya7FS91JvAfllHiIlvfgcp7qpi_Azjk4E/edit#slide=id.g292839eca6d_1_0 -->
<img src="/images/inferencepool-vs-service.png" alt="Comparing InferencePool with Service" class="center" width="550" />

The InferencePool is _not_ intended to be a mask of the Service object; it simply exposes the absolute bare minimum required, allowing the Platform Admin to focus less on networking and more on Pool management.

## Spec

The full spec of the InferencePool is defined [here](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/api/v1alpha1/inferencepool_types.go).
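
As a rough illustration of the shape of the resource (not normative; the API group and field names such as `selector` and `targetPortNumber` are assumptions based on the linked v1alpha1 Go types), an InferencePool manifest could look something like:

```yaml
# Illustrative sketch only -- field names assumed from the v1alpha1 types;
# consult the linked inferencepool_types.go for the authoritative spec.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  name: base-model-pool
spec:
  # Label selector for the model-server Pods that make up the pool,
  # similar in spirit to a Service selector.
  selector:
    app: vllm-llama2-7b
  # Port the model servers listen on.
  targetPortNumber: 8000
```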
`site-src/concepts/api-overview.md` (10 additions, 1 deletion)
@@ -1,3 +1,12 @@
# API Overview

## Background
The API design is based on these axioms:
> **Contributor:** My suggestion is to focus more on the relationship between the two apis and links to the apis docs we have above and less about how we ended up here (i.e., the design) since this is more of a user-facing documentation?
>
> **Collaborator (Author):** Fair points, took a stab here. PTAL


- Pools of shared compute should be *discrete* for scheduling to properly work
- Pod-level scheduling should not be handled by a high-level gateway
- Simple services should be simple to define (or are implicitly defined via reasonable defaults)
- This solution should be composable with other Gateway solutions and flexible to fit customer needs
- The MVP heavily assumes requests are made using the OpenAI spec, but is open to extension in the future
- The Gateway should route in a way that does not generate a queue of requests at the model server level
- Model serving differs from web serving in critical ways. One of these is the existence of multiple models for the same service, which can materially impact behavior depending on the model served, as opposed to a web service, which has mechanisms to render implementation changes invisible to the end user
`site-src/concepts/roles-and-personas.md` (22 additions, 1 deletion)
@@ -1,3 +1,24 @@
# Roles and Personas

Before diving into the details of the API, descriptions of the personas these APIs were designed for will help convey the thought process behind the API design.

## Inference Platform Admin

The Inference Platform Admin creates and manages the infrastructure necessary to run LLM workloads, including handling Ops for:
- Hardware
- Model Server
- Base Model
- Resource Allocation for Workloads
- Gateway configuration
- etc.

## Inference Workload Owner

An Inference Workload Owner persona owns and manages one or more Generative AI Workloads (LLM-focused *currently*). This includes:
- Defining criticality
- Managing fine-tunes
    - LoRA Adapters
    - System Prompts
    - Prompt Cache
    - etc.
- Managing rollout of adapters