Commit b75f937

Addressed comments
1 parent 0c8fdf3 commit b75f937

File tree

2 files changed: +34 −128 lines changed


docs/proposals/002-api-proposal/glossary.md (−94)

This file was deleted.

docs/proposals/002-api-proposal/proposal.md (+34 −34)
@@ -60,7 +60,7 @@ The Inference Platform Admin creates and manages the infrastructure necessary to
 
 #### Inference Workload Owner
 
-A Inference Workload Owner persona owns and manages 1 or many Generative AI Workloads (LLM focused *currently*). This includes:
+An Inference Workload Owner persona owns and manages 1 or many Generative AI Workloads (LLM focused *currently*). This includes:
 - Defining importance
 - Managing fine-tunes
   - LoRA Adapters
@@ -95,11 +95,11 @@ It is _not_ expected for the InferencePool to:
 - Manage Deployments of Pods within the Pool
 - Manage Pod lifecycle of pods within the pool
 
-Additionally, any Pod that seeks to join a InferencePool would need to support a protocol, defined by LLM Instance Gateway, to ensure the Pool has adequate information to intelligently route requests.
+Additionally, any Pod that seeks to join an InferencePool would need to support a protocol, defined by this project, to ensure the Pool has adequate information to intelligently route requests.
 
 ### InferenceModel
 
-A InferenceModel allows the Inference Workload Owner to define:
+An InferenceModel allows the Inference Workload Owner to define:
 - Which LoRA adapter(s) to consume
   - InferenceModel allows for traffic splitting between adapters _in the same InferencePool_ to allow for new LoRA adapter versions to be easily rolled out
 - SLO objectives for the InferenceModel
@@ -115,7 +115,7 @@ A InferenceModel allows the Inference Workload Owner to define:
 // leverages a large language model from a model frontend, drives the lifecycle
 // and rollout of new versions of those models, and defines the specific
 // performance and latency goals for the model. These workloads are
-// expected to operate within a InferencePool sharing compute capacity with other
+// expected to operate within an InferencePool sharing compute capacity with other
 // InferenceModels, defined by the Inference Platform Admin. We allow a user who
 // has multiple InferenceModels across multiple pools (with the same config) to
 // specify the configuration exactly once, and deploy to many pools
@@ -150,8 +150,8 @@ type InferenceModelSpec struct {
 	// If not specified, the target model name is defaulted to the modelName parameter.
 	// modelName is often in reference to a LoRA adapter.
 	TargetModels []TargetModel
-	// Reference to the backend pools that the model registers to.
-	PoolRef []corev1.ObjectReference
+	// Reference to the InferencePool that the model registers to. It must exist in the same namespace.
+	PoolRef InferencePoolReference
 }
 
 // Defines how important it is to serve the model compared to other models.
@@ -180,6 +180,10 @@ type TargetModel struct {
 	// sent to this target model when multiple versions of the model are specified.
 	Weight int
 }
+
+// InferencePoolReference is the name of the InferencePool.
+type InferencePoolReference string
+
 ```
 
 **InferencePool**
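
The name-only, same-namespace `PoolRef` added above keeps pool resolution to a single Get for whichever controller reconciles InferenceModels. Below is a minimal sketch of that lookup; it is illustrative only (not part of the diff), assumes controller-runtime, and passes the pool as a generic `client.Object` since the concrete type lives in the proposal's API package.

```go
package example

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// InferencePoolReference mirrors the type added above: just the pool's name.
type InferencePoolReference string

// fetchReferencedPool resolves a model's poolRef. Because the reference is
// name-only and the referenced InferencePool must exist in the same namespace
// as the InferenceModel, the lookup key is simply (model namespace, poolRef).
func fetchReferencedPool(ctx context.Context, c client.Client, modelNamespace string, ref InferencePoolReference, pool client.Object) error {
	key := types.NamespacedName{Namespace: modelNamespace, Name: string(ref)}
	if err := c.Get(ctx, key, pool); err != nil {
		return fmt.Errorf("resolving poolRef %q: %w", ref, err)
	}
	return nil
}
```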
@@ -191,7 +195,7 @@ type TargetModel struct {
 // (External Processing). When a new LSP object is created, a new ext proc
 // deployment is created. InferencePools require at minimum a single InferenceModel to
 // be subscribed to them to accept traffic, any traffic with a model not
-// definied within a InferenceModel will be rejected.
+// defined within an InferenceModel will be rejected.
 type InferencePool struct {
 	metav1.ObjectMeta
 	metav1.TypeMeta
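
To make the rejection rule in the comment above concrete: the routing layer only needs to check the requested model name against the names registered through the pool's InferenceModels. A minimal sketch follows; it is illustrative only, and the type and method names are hypothetical rather than taken from the proposal.

```go
package example

import "fmt"

// poolModels holds the model names registered to a pool through its
// InferenceModels (including any target model names they expose).
type poolModels map[string]struct{}

// admit enforces the rule above: a request naming a model that no
// InferenceModel in the pool defines is rejected.
func (p poolModels) admit(requestedModel string) error {
	if _, ok := p[requestedModel]; !ok {
		return fmt.Errorf("model %q is not defined by any InferenceModel in this pool", requestedModel)
	}
	return nil
}
```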
@@ -211,44 +215,41 @@ type InferencePoolSpec struct {
 ### Yaml Examples
 
 #### InferencePool(s)
-Here we create 2 LSPs that subscribe to services to collect the appropriate pods
+Here we create a pool that selects the appropriate pods
 ```yaml
 apiVersion: inference.x-k8s.io/v1alpha1
 kind: InferencePool
 metadata:
-  name: llama-2-pool
-  services:
-  - llama-2-vllm
----
-apiVersion: inference.x-k8s.io/v1alpha1
-kind: InferencePool
-metadata:
-  name: gemini-pool
-  services:
-  - gemini-jetstream-tpu-v5e
-  - gemini-vllm-a100
+  name: base-model-pool
+  modelServerSelector:
+  - app: llm-server
 ```
 
 #### InferenceModel
 
-Here we consume both pools with a single InferenceModel, while also specifying 2 InferenceModels. Where `sql-code-assist` is both the name of the ModelInferenceModel, and the name of the LoRA adapter on the model server. And `npc-bot` has a layer of indirection for those names, as well as a specified objective. Both `sql-code-assist` and `npc-bot` have available LoRA adapters on both InferencePools and routing to each InferencePool happens earlier(at the K8s Gateway). So traffic splitting between separate pools happens at the K8s Gateway.
+Here we consume the pool with two InferenceModels. Where `sql-code-assist` is both the name of the model and the name of the LoRA adapter on the model server. And `npc-bot` has a layer of indirection for those names, as well as a specified criticality. Both `sql-code-assist` and `npc-bot` have available LoRA adapters on the InferencePool and routing to each InferencePool happens earlier (at the K8s Gateway).
 ```yaml
 apiVersion: inference.x-k8s.io/v1alpha1
 kind: InferenceModel
 metadata:
-  name: my-llm-service
+  name: sql-code-assist
+spec:
+  modelName: sql-code-assist
+  poolRef: base-model-pool
+---
+apiVersion: inference.x-k8s.io/v1alpha1
+kind: InferenceModel
+metadata:
+  name: npc-bot
 spec:
-  InferenceModels:
-  - modelName: sql-code-assist
-  - modelName: npc-bot
-    targetModels:
-      targetModelName: npc-bot-v1
-      weight: 50
-      targetModelName: npc-bot-v2
-      weight: 50
-  poolRef:
-  - name: llama-2-pool
-  - name: gemini-pool
+  modelName: npc-bot
+  criticality: Critical
+  targetModels:
+    targetModelName: npc-bot-v1
+    weight: 50
+    targetModelName: npc-bot-v2
+    weight: 50
+  poolRef: base-model-pool
 ```
 
 ### Diagrams
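
For intuition on how the `targetModels` weights in the example could be applied, a weight-proportional pick looks like the sketch below. This is illustrative only (the proposal does not prescribe an implementation), and the field names only approximate the proposal's `TargetModel`. With `npc-bot-v1` and `npc-bot-v2` both weighted 50, each receives roughly half of the `npc-bot` traffic.

```go
package example

import "math/rand"

// TargetModel approximates the proposal's shape for this sketch: a target
// model name plus the percentage of traffic it should receive.
type TargetModel struct {
	Name   string
	Weight int
}

// pickTargetModel chooses a target model with probability proportional to
// its weight.
func pickTargetModel(targets []TargetModel, rng *rand.Rand) string {
	total := 0
	for _, t := range targets {
		total += t.Weight
	}
	if len(targets) == 0 || total <= 0 {
		return ""
	}
	n := rng.Intn(total)
	for _, t := range targets {
		n -= t.Weight
		if n < 0 {
			return t.Name
		}
	}
	return targets[len(targets)-1].Name
}
```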
@@ -291,7 +292,7 @@ Our alternatives hinge on some key decisions:
 
 #### InferenceModel as a backend ref
 
-We toyed with the idea of allowing an InferenceModel be the target of an HTTPRouteRules backend ref. However, doing so would require the Kubernetes Gateway to be able to interpret body level parameters (assuming OpenAI protocol continues to require the model param in the body), and require that the HTTPRoute also specify the backend the InferenceModel is intended to run on. Since we our primary proposal already specifies the backend, packing this functionality would require substantial work on the Kubernetes Gateway, while not providing much flexibility.
+We toyed with the idea of allowing an InferenceModel be the target of an HTTPRouteRules backend ref. However, doing so would require the Kubernetes Gateway to be able to interpret body level parameters (assuming OpenAI protocol continues to require the model param in the body), and require that the HTTPRoute also specify the backend the InferenceModel is intended to run on. Since our primary proposal already specifies the backend, packing this functionality would require substantial work on the Kubernetes Gateway, while not providing much flexibility.
 
 #### LLMRoute
 
@@ -305,7 +306,6 @@ Our original idea was to define all InferenceModel config at the Kubernetes Gate
 - **What is a LSP attempting to define?**
   - InferencePool groups resources that should be shared over the InferenceModels that are affiliated with the pool
   - Best practice would also suggest keeping the same base model for all ModelServers in the pool, but that is not enforced
-- **Can a InferenceModel reference multiple LSPs?**
 - **How is this deployed?**
   - We will follow [common patterns](https://gateway.envoyproxy.io/docs/tasks/quickstart/#installation) to install the CRDs & Controllers
 - **Are all controllers necessary for this solution going to be provided by Instance Gateway(this repo)?**
