-
Notifications
You must be signed in to change notification settings - Fork 84
[WIP] Scheduler Subsystem interface #845
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,34 @@ | ||
#names are egregiously long, but attempting to descibe custom logic within a name | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. there are two different points to address here:
bottom line - I would leave configuration file out of scope at this point. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I actually disagree here. Your other comment, makes the point that a SchedulingProfile is more of a struct, and I think that actually emphasizes the need to have a config file. B/C if a profile is a struct, it needs a way for a user to express how that struct is populated. If we don't provide config, we demand that anyone developing their algo rebuild their binary every single time they make a change, rather than update a config file and redeploy the EPP. I think the sooner much of this can be expressed outside of a new build, the better the experience, & the higher the rate of adoption.
This is a great design question to bring up & something we should hash out. Should each profile have it's own separate set of configurable knobs (example: kv_cache_util). I could see that having merit. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
just to clarify - of course I agree that a configuration file is needed. I was trying to make a point that same as a big PR is much more convenient to discuss when broken into smaller PRs, I think these are two completely separate topics and we could benefit from separating the discussions. one topic is having a multi cycle scheduler and another topic is how to express the scheduler configuration in a file including the out of tree plugins handling. each of these topics is big on its own and can be solved separately. |
||
profileSelection: disagg-token-length | ||
schedulingResult: log-shadowbox-label-pd-result | ||
profiles: | ||
prefill: | ||
preschedule: | ||
- decode-prefix-cache-check | ||
filter: | ||
- is-prefill | ||
- has-required-accelerator | ||
score: | ||
- prefix-cache: 3 | ||
- latency-scorer: 2 | ||
selection: | ||
- best-score | ||
postschedule: | ||
- log-full-scores | ||
decode: | ||
filter: | ||
- is-decode | ||
score: | ||
- prefix-cache: 3 | ||
- kv-cache-util: 5 | ||
selection: | ||
- random-top-3 | ||
shadowbox-decode: | ||
filter: | ||
- is-decode | ||
- is-tpu | ||
score: | ||
- prefix-cache-v2: 4 | ||
- kv-cache-util: 1 | ||
selection: | ||
- random-top-3 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,123 @@ | ||
/* | ||
Copyright 2025 The Kubernetes Authors. | ||
|
||
Licensed under the Apache License, Version 2.0 (the "License"); | ||
you may not use this file except in compliance with the License. | ||
You may obtain a copy of the License at | ||
|
||
http://www.apache.org/licenses/LICENSE-2.0 | ||
|
||
Unless required by applicable law or agreed to in writing, software | ||
distributed under the License is distributed on an "AS IS" BASIS, | ||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
See the License for the specific language governing permissions and | ||
limitations under the License. | ||
*/ | ||
|
||
package framework | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This file feels like we’re trying to reinvent the wheel. many of these interfaces already exist while here there are a lot of missing things (that were updated already in existing interfaces). for example - no request headers, the return values are wrong, etc. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is an attempt to have a reference to the refactoring currently being done to make the scheduler pluggable. We are currently doing that over the individual PRs in an adhoc manner, this is an attempt to organize that a bit. We can move to a doc, we already have one that we can utilize, but eventually this should make it back to the repo as a canonical proposal. What I am roughly hoping we align on before making further refactoring is:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. yup, I think we’re aligned on the final goal. just not sure that iterating over go interfaces is better than discussing on a doc. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Not really, this is an iteration on what we have currently proposed that scrubs out opinionated decisions made (such as usage of K8s objects) which is not general.
Both convey the idea of generalization & how it would be approached. The interface file is intentionally in a proposal to indicate that it's simply for the sake of discussion.
the return values are different
A scheduler subsystem should not care about a higher level protocol. For instance, a scheduler should be able to be implemented & ran on a gRPC protocol that prefers to use metadata over headers. A neutral data structure disconnects any dependency on HTTP.
there are a lot of things that changed Agreed that we could continue discussion in a doc. This is WIP so people could take a look while I further populate the markdown doc with comments, constraints, & thoughts. |
||
|
||
import ( | ||
"context" | ||
"sync" | ||
|
||
scheduling "sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/types" | ||
) | ||
|
||
// Plugin is the parent type for all the scheduling framework plugins. | ||
type Plugin interface { | ||
Name() string | ||
} | ||
Comment on lines
+27
to
+29
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. just a thought - |
||
|
||
type Endpoint interface { | ||
GetState() EndpointState | ||
GetScore() float32 | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. score is NOT a property of an endpoint. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is you prefer a different name, that is fine, but ultimately each endpoint in a Scheduling Cycle should have a score associated with it, however that is done. The scope of this field would be contained to the scheduler subsystem. So while I agree this is not a field on a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. today we have in a cycle a map[endpoint]float64 which lives in the scope of a cycle and is expressing the score of an endpoint during the cycle. I don’t see a need for a struct/interface with SetScore/GetScore |
||
SetScore(val float32) | ||
} | ||
|
||
type EndpointState struct { | ||
// only need to use a sync.Map if we do not plan on snapshotting data. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. snapshotting data sounds like a good thing, otherwise we might have inconsistencies between different points of the same cycle or between different cycles. |
||
storage sync.Map | ||
} | ||
|
||
type SchedulingResult struct { | ||
results map[string][]Endpoint | ||
} | ||
Comment on lines
+42
to
+44
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is this type trying to encapsulate the multi-cycle results and map profileName -> it's result? this makes it clearer that each Profile cycle has it's own result. additionally, a cycle result might have more than []Endpoint as a return value e.g., if we have equally scored endpoints and additional endpoints with very close score - we might want to have bestScoreEndpoints and backupEndpoints, and return them all to the PostSchedule to let the plugin decide which endpoint to use. each cycle is making a local optimized endpoint selection. when receiving all cycles selection, we may want to do global optimization. for example in P/D (cycle for P and cycle for D) - |
||
|
||
type Scheduler interface { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I completely disagree on the Scheduler functions. to break it down more, when calling to scheduler - what I'm expecting to see here is something along these lines: Schedule(ctx context.Context, *schedulingtypes.Request) (result []*schedulingtypes.Result, err error) and internally in the scheduler implementation I expect to see something like the following: func (s *Scheduler) Schedule(ctx context.Context, req *types.Request) ([]*types.Result, error) {
profiles := s.profilePicker.Pick(req, s.availableProfiles) // selecting profiles is a function of the request
// Snapshot pod metrics from the datastore to:
// 1. Reduce concurrent access to the datastore.
// 2. Ensure consistent data during the scheduling operation of a request between all scheduling cycles.
sCtx := types.NewSchedulingContext(ctx, req, types.ToSchedulerPodMetrics(s.datastore.PodGetAll()))
result := []*types.Result{} // assuming the Result has a profileName field inside
for name, profile := range profiles {
// run the selected profiles and collect results
// runProfileCycle internally runs the plugins according to their order
cycleResult, err := s.runSchedulerProfileCycle(sCtx, profile)
if err != nil {
return result, fmt.Errorf("failed to run all required scheduling profiles - %w", err)
}
result = append(result, cycleResult)
}
return result, nil
} There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree that here we just need a Schedule function. I am also not sure we necessarily need to make the I also recommend to define a type named |
||
Plugin | ||
Comment on lines
+46
to
+47
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why Scheduler is a plugin? |
||
// ProfileSelection selects scheduling profiles through the implemented | ||
// logic, and returns a subset of the registered scheduling profiles. | ||
ProfileSelection() map[string]SchedulingProfile | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This call should receive the request or some request properties in order to select the profiles (decision on which profiles should be based on the request props) |
||
|
||
// SchedulingProfiles lists all of the scheduling profiles registered | ||
// with the scheduler. | ||
SchedulingProfiles() map[string]SchedulingProfile | ||
|
||
// SchedulingResult takes the output of the result(s) of the scheduling cycle(s) | ||
// and makes sense of the data to be consumed by request control. | ||
// For example: suppose you have 2 profiles ShadowBoxing Profile & Production Profile. | ||
// SchedulingResult would know to simply log the result of ShadowBoxing | ||
// profile, and do nothing else with it. | ||
SchedulingResult(map[string][]Endpoint) SchedulingResult | ||
} | ||
|
||
// SchedulingProfile is an interface to used to describe a profile that will | ||
// run for a given scheduling cycle. | ||
type SchedulingProfile interface { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don’t think this should be an interface. IMO today’s SchedulingConfig should be renamed to SchedulingProfile. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Fair enough, this interface is written more like a struct |
||
Plugin | ||
// PreSchedulePlugins are optional, and will be ran at the start of a | ||
// scheduling cycle. This should be scoped to any foundational work needed | ||
// that is custom to this scheduling profile. | ||
PreSchedulePlugins() []PreSchedule | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I recommend we rename pre/post schedule to pre/post cycle. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Seems reasonable enough |
||
// Filters lists all Filter plugins associated with this Profile. Filters | ||
// are optional. | ||
Filters() []Filter | ||
// Scorers lists all Score plugins associated with this Profile. At | ||
// least 1 scorer must be registered for a profile to be valid. | ||
Scorers() map[Scorer]int | ||
// Selection returns the function that picks the endpoint(s). | ||
Selection() Picker | ||
// PostSchedulePlugins lists all Filter plugins associated with this | ||
// Profile. PostSchedulePlugins are ran after every scheduling cycle, | ||
// and are optional. | ||
PostSchedulePlugins() []PostSchedule | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This was removed from the diagram.
I was looking on Prefix implementation (as it is the only example that uses PreCycle atm) and noticed that:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So I'm glad you brought this up. I thought Pre/Post schedule was a requirement you brought over. I've struggled to see a case for them. The best I can see for Same thing with I support removing them. @ahg-g any opinion here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I recommend to not add any extensions that we don't have use cases for yet. We can always add new extensions, but we can't remove them once they are established. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. to summarize, I think we all agree pre/post should be removed. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am thinking how we want to set the selected endpoint in the request response. One approach is to do that in a I am leaning towards the former to give us more flexibility in changing that behavior: e.g., for p/d, we will initially have to set one as the canonical endpoint that the proxy should forward the request to and the other as a header. Perhaps this can also help us implement fallbacks as well? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed, the former seems more appropriate to me as result aggregation is opinionated & system based. If we set a standard we inject opinions, reducing the extensability.
We can update this fairly easily. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. +1 on PostSchedule |
||
} | ||
|
||
// Preschedule will be ran at the start of a scheduling cycle. This should be | ||
// scoped to any foundational work needed that is custom to this scheduling | ||
// profile. | ||
type PreSchedule interface { | ||
Plugin | ||
PreSchedule(ctx context.Context, state scheduling.CycleState, endpoints []Endpoint) | ||
} | ||
|
||
// Filter runs before any scoring, and remove endpoints that are not fit for | ||
// selection. The framework will return an error to the client if the endpoints | ||
// are filtered to zero. | ||
type Filter interface { | ||
Plugin | ||
Filter(ctx context.Context, state scheduling.CycleState, endpoints []Endpoint) []Endpoint | ||
} | ||
|
||
// Scorer applies a score to each remaining endpoint provided. Scorers SHOULD | ||
// keep their score values in a normalized range: [0-1]. Any weighting should | ||
// be added at the SchedulingProfile configuration level. | ||
type Scorer interface { | ||
Plugin | ||
Score(ctx context.Context, state scheduling.CycleState, endpoints []Endpoint) []Endpoint | ||
} | ||
|
||
// Picker selects the endpoint(s) from the provided list of scored endpoints. | ||
// Picker MUST return, one endpoint at minimum. | ||
type Picker interface { | ||
Plugin | ||
Selection(ctx context.Context, state scheduling.CycleState, endpoints []Endpoint) []Endpoint | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. function names are typically verbs or verb phrases, because they do something. |
||
} | ||
Comment on lines
+94
to
+115
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. it seems like this definition is aiming to unify the Filter/Scorer/Picker interfaces (leaving Pre/Post out of discussions since we agreed they are not needed). additionally, none of the above interfaces receive the request or request properties which is a requirement. |
||
|
||
// PostSchedule runs per-scheduling cycle, and is part of a scheduling profile. | ||
// PostSchedule performs any remaining work needed for the scheduling cycle. | ||
// PostSchedule is not expected to change any values of the parameters. | ||
type PostSchedule interface { | ||
Plugin | ||
PostSchedule(ctx context.Context, state scheduling.CycleState, selectedEndpoints []Endpoint) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wrong image