Commit 0aae37b

Addressing comments round 2
1 parent b75f937 commit 0aae37b

1 file changed: +54 -86 lines changed

docs/proposals/002-api-proposal/proposal.md

@@ -28,13 +28,12 @@
 
 ## Summary
 
-This proposal presents 2 new CRD objects to express the needs of the LLM Instance Gateway. **InferencePool** and **InferenceModel**. The InferencePool is the logical grouping of compute, owned by the Inference Platform Admin persona. While the InferenceModel defines the serving objectives of a specific model or LoRA adapter, and is owned by the Inference Workload Owner.
+This proposal presents 2 new CRD objects to express the needs of the Gateway API Inference Extension: **InferencePool** and **InferenceModel**. The InferencePool is the logical grouping of compute, owned by the Inference Platform Admin persona, while the InferenceModel defines the serving objectives of a specific model or LoRA adapter and is owned by the Inference Workload Owner.
 
-**NOTE: Some routing terms are defined in the [glossary](./glossary.md) file, to more deeply describe how we will handle behaviors like priority and fairness**
 
 ## Goals
 
-- Drive concensus on direction of LLM Instance Gateway Solution
+- Drive consensus on the direction of the Gateway API Inference Extension solution
 - Documentation of API decisions for posterity
 
 ## Non-Goals
@@ -61,7 +60,7 @@ The Inference Platform Admin creates and manages the infrastructure necessary to
 #### Inference Workload Owner
 
 An Inference Workload Owner persona owns and manages 1 or many Generative AI Workloads (LLM focused *currently*). This includes:
-- Defining importance
+- Defining criticality
 - Managing fine-tunes
   - LoRA Adapters
   - System Prompts
@@ -100,17 +99,44 @@ Additionally, any Pod that seeks to join an InferencePool would need to support
 ### InferenceModel
 
 An InferenceModel allows the Inference Workload Owner to define:
-- Which LoRA adapter(s) to consume
-  - InferenceModel allows for traffic splitting between adapters _in the same InferencePool_ to allow for new LoRA adapter versions to be easily rolled out
-- SLO objectives for the InferenceModel
-- The Pools this InferenceModel is relevant to
+- Which Model/LoRA adapter(s) to consume.
+  - Mapping from a client-facing model name to the target model name in the InferencePool.
+  - InferenceModel allows for traffic splitting between adapters _in the same InferencePool_ to allow for new LoRA adapter versions to be easily rolled out.
+- Criticality of the requests to the InferenceModel.
+- The InferencePools this InferenceModel is relevant to.
 
 ### Spec
 
+**InferencePool**
+```golang
+// The InferencePool is a construct for pooling compute (often model servers) to
+// serve large models that have the ability to share capacity across multiple
+// services (such as through prompt engineering, LoRA adapters, etc).
+// InferencePools have a dependency on a Gateway that is compatible with ext-proc
+// (External Processing). When a new InferencePool object is created, a new ext-proc
+// deployment is created. InferencePools require at minimum a single InferenceModel to
+// be subscribed to them to accept traffic; any traffic with a model not
+// defined within an InferenceModel will be rejected.
+type InferencePool struct {
+  metav1.ObjectMeta
+  metav1.TypeMeta
+
+  Spec InferencePoolSpec
+}
+
+type InferencePoolSpec struct {
+  // ModelServerSelector uses label selection to watch model server pods
+  // that should be included in the InferencePool. ModelServers should not
+  // be shared with any other Service or InferencePool; that behavior is not supported
+  // and will result in sub-optimal utilization.
+  ModelServerSelector map[string]string `json:"modelServerSelector,omitempty"`
+}
+```
+
 **InferenceModel**
 ```golang
 // InferenceModel represents a set of Models/Adapters that are multiplexed onto one
-// or more backend pools. This resource is managed by the "Inference Workload Owner"
+// or more InferencePools. This resource is managed by the "Inference Workload Owner"
 // persona. The Inference Workload Owner persona is: a team that trains, verifies, and
 // leverages a large language model from a model frontend, drives the lifecycle
 // and rollout of new versions of those models, and defines the specific
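
For orientation, here is a rough sketch of how an InferencePool manifest might look given the ModelServerSelector field introduced above; the apiVersion/group and the label values are illustrative assumptions, not text from this proposal:

```yaml
# Hypothetical InferencePool manifest based on the InferencePoolSpec above.
# The apiVersion/group and label values are assumptions, not defined by this commit.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferencePool
metadata:
  name: base-model-pool
spec:
  modelServerSelector:
    app: my-model-server   # label selector for the model server Pods in this pool
```
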
@@ -120,11 +146,7 @@ An InferenceModel allows the Inference Workload Owner to define:
 // has multiple InferenceModels across multiple pools (with the same config) to
 // specify the configuration exactly once, and deploy to many pools
 // simultaneously. Enabling a simpler config and single source of truth
-// for a given user. InferenceModel names are unique for a given InferencePool,
-// if the name is reused, an error will be shown on the status of a
-// InferenceModel that attempted to reuse. The oldest InferenceModel, based on
-// creation timestamp, will be selected to remain valid. In the event of a race
-// condition, one will be selected at random.
+// for a given user. InferenceModel ModelNames are unique for a given InferencePool.
 type InferenceModel struct {
   metav1.ObjectMeta
   metav1.TypeMeta
@@ -147,11 +169,11 @@ type InferenceModelSpec struct {
   Criticality *Criticality
   // Optional.
   // Allow multiple versions of a model for traffic splitting.
-  // If not specified, the target model name is defaulted to the modelName parameter.
-  // modelName is often in reference to a LoRA adapter.
+  // If not specified, the target model name is defaulted to the ModelName parameter.
+  // ModelName is often in reference to a LoRA adapter.
   TargetModels []TargetModel
   // Reference to the InferencePool that the model registers to. It must exist in the same namespace.
-  PoolRef InferencePoolReference
+  PoolReference *LocalObjectReference
 }
 
 // Defines how important it is to serve the model compared to other models.
@@ -168,11 +190,11 @@ const (
 
 // TargetModel represents a deployed model or a LoRA adapter. The
 // Name field is expected to match the name of the LoRA adapter
-// (or base model) as it is registered within the model server. Inference
-// Gateway assumes that the model exists on the model server and is the
+// (or base model) as it is registered within the model server. This
+// assumes that the model exists on the model server and it is the
 // responsibility of the user to validate a correct match. Should a model fail
-// to exist at request time, the error is processed by the Instance Gateway,
-// and then emitted on the appropriate InferenceModel object.
+// to exist at request time, the error is processed by the extension,
+// and then emitted on the appropriate InferenceModel object status.
 type TargetModel struct {
   // The name of the adapter as expected by the ModelServer.
   Name string
@@ -181,35 +203,19 @@ type TargetModel struct {
   Weight int
 }
 
-// InferencePoolReference is the name of the InferencePool.
-type InferencePoolReference string
+// LocalObjectReference identifies an API object within the namespace of the
+// referrer.
+type LocalObjectReference struct {
+  // Group is the group of the referent.
+  Group Group
 
-```
+  // Kind is kind of the referent. For example "InferencePool".
+  Kind Kind
 
-**InferencePool**
-```golang
-// The InferencePool is a construct for pooling compute (often model servers) to
-// serve large models, that have the ability to share capacity across multiple
-// services (such as through prompt engineering, LoRA adapters, etc).
-// InferencePools have a dependency on a Gateway that is compatible with ext-proc
-// (External Processing). When a new LSP object is created, a new ext proc
-// deployment is created. InferencePools require at minimum a single InferenceModel to
-// be subscribed to them to accept traffic, any traffic with a model not
-// defined within an InferenceModel will be rejected.
-type InferencePool struct {
-  metav1.ObjectMeta
-  metav1.TypeMeta
-
-  Spec InferencePoolSpec
+  // Name is the name of the referent.
+  Name ObjectName
 }
 
-type InferencePoolSpec struct {
-  // ModelServerSelector uses label selection to watch model server pods
-  // that should be included in the InferencePool. ModelServers should not
-  // be with any other Service or InferencePool, that behavior is not supported
-  // and will result in sub-optimal utilization.
-  ModelServerSelector map[string]string `json:"modelServerSelector,omitempty"`
-}
 ```
 
 ### Yaml Examples
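
Since this commit replaces the string-typed PoolRef with a PoolReference of type *LocalObjectReference and keys uniqueness on ModelName, the following sketch shows how an InferenceModel manifest might express those fields; the apiVersion, the serialized field names, and the adapter names and weights (taken from the name-generator example used elsewhere in the proposal) are illustrative assumptions:

```yaml
# Hypothetical InferenceModel manifest; apiVersion and exact serialized field
# names are assumptions based on the Go types in this diff.
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: InferenceModel
metadata:
  name: name-generator
spec:
  modelName: name-generator          # client-facing model name
  criticality: Critical              # one of the Criticality values (assumed spelling)
  targetModels:                      # traffic split across LoRA adapter versions
    - name: name-generator-v2
      weight: 80
    - name: name-generator-v3
      weight: 20
  poolReference:                     # LocalObjectReference to the owning InferencePool
    group: inference.networking.x-k8s.io
    kind: InferencePool
    name: base-model-pool
```
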
@@ -252,33 +258,6 @@ spec:
   poolRef: base-model-pool
 ```
 
-### Diagrams
-
-Much of this is better explained visually:
-
-Below is a detailed view of the InferencePool
-
-![InferencePool](./images/lsp.svg)
-
-This diagram lightly follows the example request for a model `name-generator`.
-The flow can be described as:
-- The request comes in to our routing solution(Ext-Proc)
-- ExtProc looks up the InferenceModels affiliated with this pool `examplePool`
-- `name-generator` is currently undergoing a change of LoRA adapters from `name-generator-v3` (20% traffic split) to `name-generator-v2` (80% traffic split)
-- `name-generator-v2` is selected as the LoRA adapter, and replaces `name-generator` in the body of the request (mutated by ext-proc)
-- the request is then efficiently scheduled onto one of the valid Pods
-- Prometheus metrics are sent back to the LSP, aggregated and re-emitted via sidecar (following the metric standardization)
-
-How Multiple InferencePools might integrate together:
-
-![K8s Gateway with InferencePools](./images/gw_w_lsp.svg)
-
-Here we see that we can have:
-- Multiple Routes pointing to the same pool
-- Routes splitting traffic across multiple pools
-
-The functionality of the Kubernetes Gateway is unchanged with this proposal, allowing seamless integration with the InferencePool.
-
 
 ### Alternatives
 
@@ -303,23 +282,12 @@ Our original idea was to define all InferenceModel config at the Kubernetes Gate
   - Feasibly done - No extension of HttpRoute. Just works, as InferencePool operates like a service.
   - Complexity is only expressed during transition states (model version upgrade)
   - Keeps Pools self contained - multiple K8s gateways can direct traffic to the same pool without needing to re-express Pool-level behavior
-- **What is a LSP attempting to define?**
+- **What is an InferencePool attempting to define?**
   - InferencePool groups resources that should be shared over the InferenceModels that are affiliated with the pool
   - Best practice would also suggest keeping the same base model for all ModelServers in the pool, but that is not enforced
 - **How is this deployed?**
   - We will follow [common patterns](https://gateway.envoyproxy.io/docs/tasks/quickstart/#installation) to install the CRDs & Controllers
-- **Are all controllers necessary for this solution going to be provided by Instance Gateway(this repo)?**
+- **Are all controllers necessary for this solution going to be provided by this project?**
   - Yes
 
 
-
-
-## Open Questions
-
-- Reasonable defaults (how do we behave in the absence of user-specified values in optional fields)
-- Should services be required? Or can a customer simply create a pool, and direct requests to the pool, and expect even fairness/priority across the different LoRA adapters that are requested?
-  - If so? How should we handle the mix between explicit and implicit services? Are implicit InferenceModels just default everything? (and inherently lower prio).
-  - NOTE: Current thinking is this is yes we should allow non-use case defined requests, but is a security risk if on by default. So pools should opt-in
-- Configuration control
-  - How many routing decisions should we make on behalf of the user vs allow for configuration?
-  - Do we decide that SLO adherence is stricter than Fairness adherence? Do we allow for configuration of such tooling? (would be expressed in the InferencePool API)
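
As a closing illustration of the FAQ point that an InferencePool "operates like a service" with no extension of HttpRoute, here is a hedged sketch of an HTTPRoute whose backendRef targets a pool; the route and gateway names are hypothetical, and the InferencePool API group is an assumption:

```yaml
# Hypothetical HTTPRoute pointing at an InferencePool backend; names and the
# InferencePool API group are assumptions for illustration only.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
    - name: inference-gateway        # an ext-proc capable Gateway (hypothetical name)
  rules:
    - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: base-model-pool
```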
