`docs/proposals/002-api-proposal/proposal.md` (+54 −86)
@@ -28,13 +28,12 @@
## Summary

-This proposal presents 2 new CRD objects to express the needs of the LLM Instance Gateway. **InferencePool** and **InferenceModel**. The InferencePool is the logical grouping of compute, owned by the Inference Platform Admin persona. While the InferenceModel defines the serving objectives of a specific model or LoRA adapter, and is owned by the Inference Workload Owner.
+This proposal presents 2 new CRD objects to express the needs of the Gateway API Inference Extension: **InferencePool** and **InferenceModel**. The InferencePool is the logical grouping of compute, owned by the Inference Platform Admin persona, while the InferenceModel defines the serving objectives of a specific model or LoRA adapter and is owned by the Inference Workload Owner.

-**NOTE: Some routing terms are defined in the [glossary](./glossary.md) file, to more deeply describe how we will handle behaviors like priority and fairness**

## Goals

-- Drive concensus on direction of LLM Instance Gateway Solution
+- Drive consensus on direction of Gateway API Inference Extension Solution
- Documentation of API decisions for posterity

## Non-Goals
@@ -61,7 +60,7 @@ The Inference Platform Admin creates and manages the infrastructure necessary to
#### Inference Workload Owner

An Inference Workload Owner persona owns and manages 1 or many Generative AI Workloads (LLM focused *currently*). This includes:
-- Defining importance
+- Defining criticality
- Managing fine-tunes
  - LoRA Adapters
  - System Prompts
@@ -100,17 +99,44 @@ Additionally, any Pod that seeks to join an InferencePool would need to support
### InferenceModel

An InferenceModel allows the Inference Workload Owner to define:
-- Which LoRA adapter(s) to consume
-- InferenceModel allows for traffic splitting between adapters _in the same InferencePool_ to allow for new LoRA adapter versions to be easily rolled out
-- SLO objectives for the InferenceModel
-- The Pools this InferenceModel is relevant to
+- Which Model/LoRA adapter(s) to consume.
+- Mapping from a client-facing model name to the target model name in the InferencePool.
+- InferenceModel allows for traffic splitting between adapters _in the same InferencePool_ to allow for new LoRA adapter versions to be easily rolled out.
+- Criticality of the requests to the InferenceModel.
+- The InferencePools this InferenceModel is relevant to.
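To make the list above concrete, here is a rough sketch in Go of how a client-facing model name, a criticality, a pool reference, and a weighted set of target adapters fit together. The type and field names are illustrative assumptions, not the CRD fields defined in the Spec section below.

```golang
// Illustrative sketch only; names are assumptions, not the proposal's CRD fields.
package main

import "fmt"

// targetModel is one backing model or LoRA adapter and its share of traffic.
type targetModel struct {
	Name   string // name the model servers in the InferencePool understand
	Weight int    // relative traffic weight, useful during rollouts
}

// inferenceModel captures what an InferenceModel expresses: a client-facing
// model name mapped onto weighted targets within a single InferencePool,
// plus the criticality of requests using that name.
type inferenceModel struct {
	ModelName   string        // client-facing name used in requests
	Criticality string        // e.g. "Critical" or "Standard" (illustrative values)
	PoolRef     string        // the InferencePool this mapping applies to
	Targets     []targetModel // traffic split across adapters in that pool
}

func main() {
	m := inferenceModel{
		ModelName:   "name-generator",
		Criticality: "Standard",
		PoolRef:     "examplePool",
		Targets: []targetModel{
			{Name: "name-generator-v3", Weight: 20},
			{Name: "name-generator-v2", Weight: 80},
		},
	}
	fmt.Printf("%s maps to %d target(s) in pool %s\n", m.ModelName, len(m.Targets), m.PoolRef)
}
```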
### Spec

+**InferencePool**
+```golang
+// The InferencePool is a construct for pooling compute (often model servers) to
+// serve large models, that have the ability to share capacity across multiple
+// services (such as through prompt engineering, LoRA adapters, etc).
+// InferencePools have a dependency on a Gateway that is compatible with ext-proc
+// (External Processing). When a new InferencePool object is created, a new ext proc
+// deployment is created. InferencePools require at minimum a single InferenceModel to
+// be subscribed to them to accept traffic, any traffic with a model not
+// defined within an InferenceModel will be rejected.
+type InferencePool struct {
+	metav1.ObjectMeta
+	metav1.TypeMeta
+
+	Spec InferencePoolSpec
+}
+
+type InferencePoolSpec struct {
+	// ModelServerSelector uses label selection to watch model server pods
+	// that should be included in the InferencePool. ModelServers should not
+	// be with any other Service or InferencePool, that behavior is not supported
-This diagram lightly follows the example request for a model `name-generator`.
-The flow can be described as:
-- The request comes in to our routing solution (Ext-Proc)
-- ExtProc looks up the InferenceModels affiliated with this pool `examplePool`
-- `name-generator` is currently undergoing a change of LoRA adapters from `name-generator-v3` (20% traffic split) to `name-generator-v2` (80% traffic split)
-- `name-generator-v2` is selected as the LoRA adapter, and replaces `name-generator` in the body of the request (mutated by ext-proc)
-- the request is then efficiently scheduled onto one of the valid Pods
-- Prometheus metrics are sent back to the LSP, aggregated and re-emitted via sidecar (following the metric standardization)
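The traffic-split step above reduces to a weighted choice between target adapters, after which the chosen name replaces the client-facing one in the request body. Below is a minimal sketch of that selection using the 20/80 split from the example; the function and type names are illustrative, not the project's ext-proc implementation.

```golang
// Illustrative sketch only: weighted selection of a target adapter, as the
// ext-proc extension might do for the name-generator example above.
package main

import (
	"fmt"
	"math/rand"
)

type weightedTarget struct {
	Name   string
	Weight int
}

// pickTarget returns a target name with probability proportional to its weight.
func pickTarget(targets []weightedTarget) string {
	total := 0
	for _, t := range targets {
		total += t.Weight
	}
	n := rand.Intn(total) // random point in the cumulative weight range
	for _, t := range targets {
		if n < t.Weight {
			return t.Name
		}
		n -= t.Weight
	}
	return targets[len(targets)-1].Name // fallback, not reached for positive weights
}

func main() {
	targets := []weightedTarget{
		{Name: "name-generator-v3", Weight: 20}, // old adapter, draining
		{Name: "name-generator-v2", Weight: 80}, // new adapter, ramping up
	}
	// The chosen name replaces "name-generator" in the request body before the
	// request is scheduled onto one of the Pods in the pool.
	fmt.Println("model rewritten to:", pickTarget(targets))
}
```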
-How Multiple InferencePools might integrate together:

-

-Here we see that we can have:
-- Multiple Routes pointing to the same pool
-- Routes splitting traffic across multiple pools

-The functionality of the Kubernetes Gateway is unchanged with this proposal, allowing seamless integration with the InferencePool.
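Because the Kubernetes Gateway is unchanged, sending traffic to pools is expressed through ordinary backend references. Below is a sketch, using the upstream Gateway API Go types, of an HTTPRoute splitting traffic evenly across two pools; the API group string and pool names are assumptions for illustration.

```golang
// Illustrative sketch only: an HTTPRoute whose backendRefs point at two
// InferencePools with equal weights. Group and pool names are assumed.
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	gatewayv1 "sigs.k8s.io/gateway-api/apis/v1"
)

func ptr[T any](v T) *T { return &v }

func poolRef(name string) gatewayv1.HTTPBackendRef {
	return gatewayv1.HTTPBackendRef{
		BackendRef: gatewayv1.BackendRef{
			BackendObjectReference: gatewayv1.BackendObjectReference{
				Group: ptr(gatewayv1.Group("inference.networking.x-k8s.io")), // assumed group
				Kind:  ptr(gatewayv1.Kind("InferencePool")),
				Name:  gatewayv1.ObjectName(name),
			},
			Weight: ptr(int32(50)), // even split across the two pools
		},
	}
}

func main() {
	route := gatewayv1.HTTPRoute{
		ObjectMeta: metav1.ObjectMeta{Name: "llm-route"},
		Spec: gatewayv1.HTTPRouteSpec{
			Rules: []gatewayv1.HTTPRouteRule{{
				BackendRefs: []gatewayv1.HTTPBackendRef{
					poolRef("example-pool-a"),
					poolRef("example-pool-b"),
				},
			}},
		},
	}
	fmt.Println("route", route.Name, "splits traffic across",
		len(route.Spec.Rules[0].BackendRefs), "pools")
}
```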
### Alternatives
@@ -303,23 +282,12 @@ Our original idea was to define all InferenceModel config at the Kubernetes Gate
  - Feasibly done - No extension of HttpRoute. Just works, as InferencePool operates like a service.
  - Complexity is only expressed during transition states (model version upgrade)
  - Keeps Pools self contained - multiple K8s gateways can direct traffic to the same pool without needing to re-express Pool-level behavior
-- **What is a LSP attempting to define?**
+- **What is an InferencePool attempting to define?**
  - InferencePool groups resources that should be shared over the InferenceModels that are affiliated with the pool
  - Best practice would also suggest keeping the same base model for all ModelServers in the pool, but that is not enforced
- **How is this deployed?**
  - We will follow [common patterns](https://gateway.envoyproxy.io/docs/tasks/quickstart/#installation) to install the CRDs & Controllers
-- **Are all controllers necessary for this solution going to be provided by Instance Gateway (this repo)?**
+- **Are all controllers necessary for this solution going to be provided by this project?**
  - Yes
-## Open Questions

-- Reasonable defaults (how do we behave in the absence of user-specified values in optional fields)
-- Should services be required? Or can a customer simply create a pool, and direct requests to the pool, and expect even fairness/priority across the different LoRA adapters that are requested?
-  - If so? How should we handle the mix between explicit and implicit services? Are implicit InferenceModels just default everything? (and inherently lower prio).
-  - NOTE: Current thinking is this is yes we should allow non-use case defined requests, but is a security risk if on by default. So pools should opt-in
-- Configuration control
-  - How many routing decisions should we make on behalf of the user vs allow for configuration?
-  - Do we decide that SLO adherence is stricter than Fairness adherence? Do we allow for configuration of such tooling? (would be expressed in the InferencePool API)