Commit de28b16

Addressing comments round 2
1 parent b75f937 commit de28b16

File tree

1 file changed: docs/proposals/002-api-proposal/proposal.md (+47 -78)
@@ -28,9 +28,8 @@
 
 ## Summary
 
-This proposal presents 2 new CRD objects to express the needs of the LLM Instance Gateway. **InferencePool** and **InferenceModel**. The InferencePool is the logical grouping of compute, owned by the Inference Platform Admin persona. While the InferenceModel defines the serving objectives of a specific model or LoRA adapter, and is owned by the Inference Workload Owner.
+This proposal presents 2 new CRD objects to express the needs of the Gateway API Inference Extension: **InferencePool** and **InferenceModel**. The InferencePool is the logical grouping of compute, owned by the Inference Platform Admin persona, while the InferenceModel defines the serving objectives of a specific model or LoRA adapter and is owned by the Inference Workload Owner.
 
-**NOTE: Some routing terms are defined in the [glossary](./glossary.md) file, to more deeply describe how we will handle behaviors like priority and fairness**
 
 ## Goals
 
@@ -61,7 +60,7 @@ The Inference Platform Admin creates and manages the infrastructure necessary to
 #### Inference Workload Owner
 
 An Inference Workload Owner persona owns and manages 1 or many Generative AI Workloads (LLM focused *currently*). This includes:
-- Defining importance
+- Defining criticality
 - Managing fine-tunes
   - LoRA Adapters
   - System Prompts
@@ -100,17 +99,44 @@ Additionally, any Pod that seeks to join an InferencePool would need to support
 ### InferenceModel
 
 An InferenceModel allows the Inference Workload Owner to define:
-- Which LoRA adapter(s) to consume
-  - InferenceModel allows for traffic splitting between adapters _in the same InferencePool_ to allow for new LoRA adapter versions to be easily rolled out
-- SLO objectives for the InferenceModel
-- The Pools this InferenceModel is relevant to
+- Which Model/LoRA adapter(s) to consume.
+- Mapping from a client facing model name to the target model name in the InferencePool.
+  - InferenceModel allows for traffic splitting between adapters _in the same InferencePool_ to allow for new LoRA adapter versions to be easily rolled out.
+- Criticality of the requests to the InferenceModel.
+- The InferencePools this InferenceModel is relevant to.
 
 ### Spec
 
+**InferencePool**
+```golang
+// The InferencePool is a construct for pooling compute (often model servers) to
+// serve large models, that have the ability to share capacity across multiple
+// services (such as through prompt engineering, LoRA adapters, etc).
+// InferencePools have a dependency on a Gateway that is compatible with ext-proc
+// (External Processing). When a new InferencePool object is created, a new ext proc
+// deployment is created. InferencePools require at minimum a single InferenceModel to
+// be subscribed to them to accept traffic, any traffic with a model not
+// defined within an InferenceModel will be rejected.
+type InferencePool struct {
+  metav1.ObjectMeta
+  metav1.TypeMeta
+
+  Spec InferencePoolSpec
+}
+
+type InferencePoolSpec struct {
+  // ModelServerSelector uses label selection to watch model server pods
+  // that should be included in the InferencePool. ModelServers should not
+  // be shared with any other Service or InferencePool, that behavior is not supported
+  // and will result in sub-optimal utilization.
+  ModelServerSelector map[string]string `json:"modelServerSelector,omitempty"`
+}
+```
+
 **InferenceModel**
 ```golang
 // InferenceModel represents a set of Models/Adapters that are multiplexed onto one
-// or more backend pools. This resource is managed by the "Inference Workload Owner"
+// or more InferencePools. This resource is managed by the "Inference Workload Owner"
 // persona. The Inference Workload Owner persona is: a team that trains, verifies, and
 // leverages a large language model from a model frontend, drives the lifecycle
 // and rollout of new versions of those models, and defines the specific
@@ -120,11 +146,7 @@ An InferenceModel allows the Inference Workload Owner to define:
 // has multiple InferenceModels across multiple pools (with the same config) to
 // specify the configuration exactly once, and deploy to many pools
 // simultaneously. Enabling a simpler config and single source of truth
-// for a given user. InferenceModel names are unique for a given InferencePool,
-// if the name is reused, an error will be shown on the status of a
-// InferenceModel that attempted to reuse. The oldest InferenceModel, based on
-// creation timestamp, will be selected to remain valid. In the event of a race
-// condition, one will be selected at random.
+// for a given user. InferenceModel ModelNames are unique for a given InferencePool.
 type InferenceModel struct {
   metav1.ObjectMeta
   metav1.TypeMeta
@@ -151,7 +173,7 @@ type InferenceModelSpec struct {
   // modelName is often in reference to a LoRA adapter.
   TargetModels []TargetModel
   // Reference to the InferencePool that the model registers to. It must exist in the same namespace.
-  PoolRef InferencePoolReference
+  PoolReference *LocalObjectReference
 }
 
 // Defines how important it is to serve the model compared to other models.
@@ -181,35 +203,20 @@ type TargetModel struct {
   Weight int
 }
 
-// InferencePoolReference is the name of the InferencePool.
-type InferencePoolReference string
+// LocalObjectReference identifies an API object within the namespace of the
+// referrer.
+type LocalObjectReference struct {
+  // Group is the group of the referent. For example, "gateway.networking.k8s.io".
+  // When unspecified or empty string, core API group is inferred.
+  Group Group
 
-```
+  // Kind is kind of the referent. For example "HTTPRoute" or "Service".
+  Kind Kind
 
-**InferencePool**
-```golang
-// The InferencePool is a construct for pooling compute (often model servers) to
-// serve large models, that have the ability to share capacity across multiple
-// services (such as through prompt engineering, LoRA adapters, etc).
-// InferencePools have a dependency on a Gateway that is compatible with ext-proc
-// (External Processing). When a new LSP object is created, a new ext proc
-// deployment is created. InferencePools require at minimum a single InferenceModel to
-// be subscribed to them to accept traffic, any traffic with a model not
-// defined within an InferenceModel will be rejected.
-type InferencePool struct {
-  metav1.ObjectMeta
-  metav1.TypeMeta
-
-  Spec InferencePoolSpec
+  // Name is the name of the referent.
+  Name ObjectName
 }
 
-type InferencePoolSpec struct {
-  // ModelServerSelector uses label selection to watch model server pods
-  // that should be included in the InferencePool. ModelServers should not
-  // be with any other Service or InferencePool, that behavior is not supported
-  // and will result in sub-optimal utilization.
-  ModelServerSelector map[string]string `json:"modelServerSelector,omitempty"`
-}
 ```
 
 ### Yaml Examples
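
For orientation alongside these hunks: the diff above adds an `InferencePool` keyed on `modelServerSelector` and replaces the string-typed pool reference with a structured `LocalObjectReference`. Below is a minimal sketch of how such a pool might be declared. It is not taken from the commit: the `apiVersion` and label values are hypothetical; only `modelServerSelector` (from its json tag above) and the pool name `base-model-pool` (from the YAML example context in the next hunk) come from the diff.

```yaml
# Hypothetical illustration only, not part of the commit.
# apiVersion is a placeholder; modelServerSelector mirrors the json tag in the
# InferencePoolSpec above; base-model-pool matches the YAML example context below.
apiVersion: inference.example.com/v1alpha1
kind: InferencePool
metadata:
  name: base-model-pool
spec:
  modelServerSelector:
    app: base-model-server   # assumed label on the model server Pods
```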
@@ -252,33 +259,6 @@ spec:
   poolRef: base-model-pool
 ```
 
-### Diagrams
-
-Much of this is better explained visually:
-
-Below is a detailed view of the InferencePool
-
-![InferencePool](./images/lsp.svg)
-
-This diagram lightly follows the example request for a model `name-generator`.
-The flow can be described as:
-- The request comes in to our routing solution(Ext-Proc)
-- ExtProc looks up the InferenceModels affiliated with this pool `examplePool`
-- `name-generator` is currently undergoing a change of LoRA adapters from `name-generator-v3` (20% traffic split) to `name-generator-v2` (80% traffic split)
-- `name-generator-v2` is selected as the LoRA adapter, and replaces `name-generator` in the body of the request (mutated by ext-proc)
-- the request is then efficiently scheduled onto one of the valid Pods
-- Prometheus metrics are sent back to the LSP, aggregated and re-emitted via sidecar (following the metric standardization)
-
-How Multiple InferencePools might integrate together:
-
-![K8s Gateway with InferencePools](./images/gw_w_lsp.svg)
-
-Here we see that we can have:
-- Multiple Routes pointing to the same pool
-- Routes splitting traffic across multiple pools
-
-The functionality of the Kubernetes Gateway is unchanged with this proposal, allowing seamless integration with the InferencePool.
-
 
 ### Alternatives
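
Before the Alternatives discussion continues, a side note on the walkthrough removed above: it describes `name-generator` rolling from `name-generator-v3` (20% of traffic) to `name-generator-v2` (80%) within `examplePool`. A rough sketch of how that split might be expressed with the `TargetModels` weights defined earlier follows; the `apiVersion` and the field spellings (`modelName`, `targetModels`, `weight`, `name`) are assumed serializations of the Go fields, not text from this commit (only `poolRef` appears in the document's own YAML example).

```yaml
# Illustrative sketch only, not from the commit. Field names are assumed
# camelCase serializations of the Go fields; model and pool names come from the
# removed diagram walkthrough (name-generator, examplePool, 80/20 split).
apiVersion: inference.example.com/v1alpha1   # placeholder group/version
kind: InferenceModel
metadata:
  name: name-generator
spec:
  modelName: name-generator       # assumed client-facing model name field
  poolRef: examplePool            # the InferencePool this model registers to
  targetModels:
  - name: name-generator-v2       # new LoRA adapter version receiving most traffic
    weight: 80
  - name: name-generator-v3       # previous version being drained
    weight: 20
```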

@@ -303,7 +283,7 @@ Our original idea was to define all InferenceModel config at the Kubernetes Gate
   - Feasibly done - No extension of HttpRoute. Just works, as InferencePool operates like a service.
   - Complexity is only expressed during transition states (model version upgrade)
   - Keeps Pools self contained - multiple K8s gateways can direct traffic to the same pool without needing to re-express Pool-level behavior
-- **What is a LSP attempting to define?**
+- **What is an InferencePool attempting to define?**
   - InferencePool groups resources that should be shared over the InferenceModels that are affiliated with the pool
   - Best practice would also suggest keeping the same base model for all ModelServers in the pool, but that is not enforced
 - **How is this deployed?**
@@ -312,14 +292,3 @@ Our original idea was to define all InferenceModel config at the Kubernetes Gate
   - Yes
 
 
-
-
-## Open Questions
-
-- Reasonable defaults (how do we behave in the absence of user-specified values in optional fields)
-- Should services be required? Or can a customer simply create a pool, and direct requests to the pool, and expect even fairness/priority across the different LoRA adapters that are requested?
-  - If so? How should we handle the mix between explicit and implicit services? Are implicit InferenceModels just default everything? (and inherently lower prio).
-    - NOTE: Current thinking is this is yes we should allow non-use case defined requests, but is a security risk if on by default. So pools should opt-in
-- Configuration control
-  - How many routing decisions should we make on behalf of the user vs allow for configuration?
-  - Do we decide that SLO adherence is stricter than Fairness adherence? Do we allow for configuration of such tooling? (would be expressed in the InferencePool API)
