docs/proposals/002-api-proposal/proposal.md (+34 −34)
@@ -60,7 +60,7 @@ The Inference Platform Admin creates and manages the infrastructure necessary to
 #### Inference Workload Owner

-A Inference Workload Owner persona owns and manages 1 or many Generative AI Workloads (LLM focused *currently*). This includes:
+An Inference Workload Owner persona owns and manages 1 or many Generative AI Workloads (LLM focused *currently*). This includes:
 - Defining importance
 - Managing fine-tunes
   - LoRA Adapters
@@ -95,11 +95,11 @@ It is _not_ expected for the InferencePool to:
 - Manage Deployments of Pods within the Pool
 - Manage Pod lifecycle of pods within the pool

-Additionally, any Pod that seeks to join a InferencePool would need to support a protocol, defined by LLM Instance Gateway, to ensure the Pool has adequate information to intelligently route requests.
+Additionally, any Pod that seeks to join an InferencePool would need to support a protocol, defined by this project, to ensure the Pool has adequate information to intelligently route requests.

 ### InferenceModel

-A InferenceModel allows the Inference Workload Owner to define:
+An InferenceModel allows the Inference Workload Owner to define:
 - Which LoRA adapter(s) to consume
   - InferenceModel allows for traffic splitting between adapters _in the same InferencePool_ to allow for new LoRA adapter versions to be easily rolled out
 - SLO objectives for the InferenceModel
@@ -115,7 +115,7 @@ A InferenceModel allows the Inference Workload Owner to define:
 // leverages a large language model from a model frontend, drives the lifecycle
 // and rollout of new versions of those models, and defines the specific
 // performance and latency goals for the model. These workloads are
-// expected to operate within a InferencePool sharing compute capacity with other
+// expected to operate within an InferencePool sharing compute capacity with other
 // InferenceModels, defined by the Inference Platform Admin. We allow a user who
 // has multiple InferenceModels across multiple pools (with the same config) to
 // specify the configuration exactly once, and deploy to many pools
@@ -150,8 +150,8 @@ type InferenceModelSpec struct {
   // If not specified, the target model name is defaulted to the modelName parameter.
   // modelName is often in reference to a LoRA adapter.
   TargetModels []TargetModel
-  // Reference to the backend pools that the model registers to.
-  PoolRef []corev1.ObjectReference
+  // Reference to the InferencePool that the model registers to. It must exist in the same namespace.
+  PoolRef InferencePoolReference
 }

 // Defines how important it is to serve the model compared to other models.
@@ -180,6 +180,10 @@ type TargetModel struct {
   // sent to this target model when multiple versions of the model are specified.
   Weight int
 }
+
+// InferencePoolReference is the name of the InferencePool.
+type InferencePoolReference string
+
 ```

 **InferencePool**
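Read together, the two hunks above narrow `PoolRef` from a list of `corev1.ObjectReference` values to a single named reference in the same namespace. The Go sketch below shows the shape those fields take after this change; it is assembled from the diff context only, so the package name and any field comments not visible above are assumptions, not the authoritative API.

```go
// Sketch assembled from the hunks above; not the authoritative API surface.
package api

// InferencePoolReference is the name of the InferencePool.
type InferencePoolReference string

// TargetModel points at a specific model version (often a LoRA adapter) and the
// percentage of traffic it should receive.
type TargetModel struct {
	TargetModelName string // assumed field name, mirroring the yaml examples later in this diff
	Weight          int
}

// InferenceModelSpec after this change: TargetModels stays a list, while PoolRef
// becomes a single same-namespace reference instead of []corev1.ObjectReference.
type InferenceModelSpec struct {
	// Optional model versions backing the model name. If not specified, the
	// target model name is defaulted to the modelName parameter.
	TargetModels []TargetModel

	// Reference to the InferencePool that the model registers to.
	// It must exist in the same namespace.
	PoolRef InferencePoolReference
}
```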
@@ -191,7 +195,7 @@ type TargetModel struct {
 // (External Processing). When a new LSP object is created, a new ext proc
 // deployment is created. InferencePools require at minimum a single InferenceModel to
 // be subscribed to them to accept traffic, any traffic with a model not
-// definied within a InferenceModel will be rejected.
+// defined within an InferenceModel will be rejected.
 type InferencePool struct {
   metav1.ObjectMeta
   metav1.TypeMeta
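The comment corrected in this hunk also states the admission rule: a pool only accepts traffic for models that some subscribed InferenceModel defines. A minimal sketch of that check follows; the function and map names are hypothetical, and the project's actual ext-proc logic is not shown in this diff.

```go
package main

import "fmt"

// admitRequest rejects any request whose "model" parameter is not defined by an
// InferenceModel subscribed to the pool. Illustrative only; names are hypothetical.
func admitRequest(requestedModel string, registeredModels map[string]bool) error {
	if !registeredModels[requestedModel] {
		return fmt.Errorf("model %q is not defined by any InferenceModel in this pool", requestedModel)
	}
	return nil
}

func main() {
	registered := map[string]bool{"sql-code-assist": true, "npc-bot": true}
	fmt.Println(admitRequest("npc-bot", registered))       // <nil> — accepted
	fmt.Println(admitRequest("unknown-model", registered)) // rejected with an error
}
```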
@@ -211,44 +215,41 @@ type InferencePoolSpec struct {
 ### Yaml Examples

 #### InferencePool(s)
-Here we create 2 LSPs that subscribe to services to collect the appropriate pods
+Here we create a pool that selects the appropriate pods
 ```yaml
 apiVersion: inference.x-k8s.io/v1alpha1
 kind: InferencePool
 metadata:
-  name: llama-2-pool
-  services:
-  - llama-2-vllm
----
-apiVersion: inference.x-k8s.io/v1alpha1
-kind: InferencePool
-metadata:
-  name: gemini-pool
-  services:
-  - gemini-jetstream-tpu-v5e
-  - gemini-vllm-a100
+  name: base-model-pool
+  modelServerSelector:
+  - app: llm-server
 ```

 #### InferenceModel

-Here we consume both pools with a single InferenceModel, while also specifying 2 InferenceModels. Where `sql-code-assist` is both the name of the ModelInferenceModel, and the name of the LoRA adapter on the model server. And `npc-bot` has a layer of indirection for those names, as well as a specified objective. Both `sql-code-assist` and `npc-bot` have available LoRA adapters on both InferencePools and routing to each InferencePool happens earlier(at the K8s Gateway). So traffic splitting between separate pools happens at the K8s Gateway.
+Here we consume the pool with two InferenceModels. `sql-code-assist` is both the name of the model and the name of the LoRA adapter on the model server, while `npc-bot` adds a layer of indirection for those names and specifies a criticality. Both `sql-code-assist` and `npc-bot` have LoRA adapters available on the InferencePool, and routing to the InferencePool happens earlier (at the K8s Gateway).
 ```yaml
 apiVersion: inference.x-k8s.io/v1alpha1
 kind: InferenceModel
 metadata:
-  name: my-llm-service
+  name: sql-code-assist
+spec:
+  modelName: sql-code-assist
+  poolRef: base-model-pool
+---
+apiVersion: inference.x-k8s.io/v1alpha1
+kind: InferenceModel
+metadata:
+  name: npc-bot
 spec:
-  InferenceModels:
-  - modelName: sql-code-assist
-  - modelName: npc-bot
-    targetModels:
-      targetModelName: npc-bot-v1
-      weight: 50
-      targetModelName: npc-bot-v2
-      weight: 50
-    poolRef:
-    - name: llama-2-pool
-    - name: gemini-pool
+  modelName: npc-bot
+  criticality: Critical
+  targetModels:
+  - targetModelName: npc-bot-v1
+    weight: 50
+  - targetModelName: npc-bot-v2
+    weight: 50
+  poolRef: base-model-pool
 ```

 ### Diagrams
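The `weight: 50` entries in the `npc-bot` example above imply a proportional split of traffic across target model versions. Purely as an illustration of that semantic (this is not the project's routing code, and the names are only mirrored from the yaml), a weighted selection could look like:

```go
package main

import (
	"fmt"
	"math/rand"
)

// TargetModel mirrors the shape used in the yaml example: a target model name plus a traffic weight.
type TargetModel struct {
	TargetModelName string
	Weight          int
}

// pickTargetModel chooses a target model in proportion to its weight, so two
// entries weighted 50/50 each receive roughly half of the requests.
func pickTargetModel(targets []TargetModel) TargetModel {
	total := 0
	for _, t := range targets {
		total += t.Weight
	}
	n := rand.Intn(total)
	for _, t := range targets {
		if n < t.Weight {
			return t
		}
		n -= t.Weight
	}
	return targets[len(targets)-1]
}

func main() {
	targets := []TargetModel{
		{TargetModelName: "npc-bot-v1", Weight: 50},
		{TargetModelName: "npc-bot-v2", Weight: 50},
	}
	fmt.Println(pickTargetModel(targets).TargetModelName)
}
```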
@@ -291,7 +292,7 @@ Our alternatives hinge on some key decisions:
 #### InferenceModel as a backend ref

-We toyed with the idea of allowing an InferenceModel be the target of an HTTPRouteRules backend ref. However, doing so would require the Kubernetes Gateway to be able to interpret body level parameters (assuming OpenAI protocol continues to require the model param in the body), and require that the HTTPRoute also specify the backend the InferenceModel is intended to run on. Since we our primary proposal already specifies the backend, packing this functionality would require substantial work on the Kubernetes Gateway, while not providing much flexibility.
+We toyed with the idea of allowing an InferenceModel to be the target of an HTTPRouteRules backend ref. However, doing so would require the Kubernetes Gateway to be able to interpret body-level parameters (assuming the OpenAI protocol continues to require the model param in the body), and would require that the HTTPRoute also specify the backend the InferenceModel is intended to run on. Since our primary proposal already specifies the backend, packing in this functionality would require substantial work on the Kubernetes Gateway while providing little additional flexibility.

 #### LLMRoute
@@ -305,7 +306,6 @@ Our original idea was to define all InferenceModel config at the Kubernetes Gate
 - **What is a LSP attempting to define?**
   - InferencePool groups resources that should be shared over the InferenceModels that are affiliated with the pool
   - Best practice would also suggest keeping the same base model for all ModelServers in the pool, but that is not enforced
-- **Can a InferenceModel reference multiple LSPs?**
 - **How is this deployed?**
   - We will follow [common patterns](https://gateway.envoyproxy.io/docs/tasks/quickstart/#installation) to install the CRDs & Controllers
 - **Are all controllers necessary for this solution going to be provided by Instance Gateway(this repo)?**