[WIP] Scheduler Subsystem interface #845

kfswain · 2025-05-15T21:30:18Z

This PR extends the EPP Architecture proposal to provide a concrete definition for the Scheduler Subsystem.

Currently updating the readme, but cutting the PR to discuss the interface

k8s-ci-robot · 2025-05-15T21:30:25Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kfswain

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [kfswain]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

netlify · 2025-05-15T21:30:27Z

✅ Deploy Preview for gateway-api-inference-extension ready!

Name	Link
🔨 Latest commit	`05f37ad`
🔍 Latest deploy log	https://app.netlify.com/projects/gateway-api-inference-extension/deploys/682664718cfe85000842814e
😎 Deploy Preview	https://deploy-preview-845--gateway-api-inference-extension.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

nirrozenbaum · 2025-05-16T10:28:45Z

docs/proposals/0683-epp-architecture-proposal/images/examples/example.yaml

@@ -0,0 +1,34 @@
+#names are egregiously long, but attempting to descibe custom logic within a name


there are two different points to address here:

extend the existing scheduler to work with multiple scheduling cycles and define the extension points. At this stage scheduler should be extensible via code.

After point 1 is settled, the next step is to configure it using a configuration file. there’s more than that though, like how do we load out of tree plugins (e.g., like kube scheduler using shared obj?). Anyway, I wouldn’t go into configuration file as this is a very advanced stage and current example seems like it might be missing fields (e.g., parameters for each plugin). we first need to define how it works, what are extension points and their names, under which layer each extension falls, etc.

bottom line - I would leave configuration file out of scope at this point.

I actually disagree here. Your other comment, makes the point that a SchedulingProfile is more of a struct, and I think that actually emphasizes the need to have a config file. B/C if a profile is a struct, it needs a way for a user to express how that struct is populated.

If we don't provide config, we demand that anyone developing their algo rebuild their binary every single time they make a change, rather than update a config file and redeploy the EPP. I think the sooner much of this can be expressed outside of a new build, the better the experience, & the higher the rate of adoption.

parameters for each plugin

This is a great design question to bring up & something we should hash out. Should each profile have it's own separate set of configurable knobs (example: kv_cache_util). I could see that having merit.

I think the sooner much of this can be expressed outside of a new build, the better the experience, & the higher the rate of adoption.

just to clarify - of course I agree that a configuration file is needed.

I was trying to make a point that same as a big PR is much more convenient to discuss when broken into smaller PRs, I think these are two completely separate topics and we could benefit from separating the discussions.

one topic is having a multi cycle scheduler and another topic is how to express the scheduler configuration in a file including the out of tree plugins handling. each of these topics is big on its own and can be solved separately.

nirrozenbaum · 2025-05-16T10:33:48Z

docs/proposals/0683-epp-architecture-proposal/interfaces/interface.go

+
+// SchedulingProfile is an interface to used to describe a profile that will
+// run for a given scheduling cycle.
+type SchedulingProfile interface {


I don’t think this should be an interface.
interface is useful for different implementations that have different logic. all profiles have the same fields and same logic (all have known list of plugins: pre, filters, .. ). different profiles are just different instances of the same SchedulingProfile struct.

IMO today’s SchedulingConfig should be renamed to SchedulingProfile.

IMO today’s SchedulingConfig should be renamed to SchedulingProfile.

Fair enough, this interface is written more like a struct

nirrozenbaum · 2025-05-16T10:35:48Z

docs/proposals/0683-epp-architecture-proposal/interfaces/interface.go

+
+type Endpoint interface {
+	GetState() EndpointState
+	GetScore() float32


score is NOT a property of an endpoint.
it’s a result of scorers invocation in one cycle. this shouldn’t be kept as a field in endpoint (we’ve been in this discussion before in the latest scheduler changes)

Is you prefer a different name, that is fine, but ultimately each endpoint in a Scheduling Cycle should have a score associated with it, however that is done. The scope of this field would be contained to the scheduler subsystem. So while I agree this is not a field on a ModelServer struct, with a lifecycle that is longer than a single scheduling cycle. An object with the same lifecycle as the scheduling cycle it is in seems reasonable to have a score field attached to it.

today we have in a cycle a map[endpoint]float64 which lives in the scope of a cycle and is expressing the score of an endpoint during the cycle.

I don’t see a need for a struct/interface with SetScore/GetScore

nirrozenbaum · 2025-05-16T10:42:03Z

docs/proposals/0683-epp-architecture-proposal/interfaces/interface.go

+limitations under the License.
+*/
+
+package framework


This file feels like we’re trying to reinvent the wheel. many of these interfaces already exist while here there are a lot of missing things (that were updated already in existing interfaces). for example - no request headers, the return values are wrong, etc.
I think for the purpose of discussion, it would be much better to discuss on text and get an agreement on where are the extension points, the responsibility of each extension, etc.
I think this file goes too much into details and not serving the purpose.

This is an attempt to have a reference to the refactoring currently being done to make the scheduler pluggable. We are currently doing that over the individual PRs in an adhoc manner, this is an attempt to organize that a bit. We can move to a doc, we already have one that we can utilize, but eventually this should make it back to the repo as a canonical proposal.

What I am roughly hoping we align on before making further refactoring is:

Define the concepts of scheduling cycle, scheduling profile, endpoint picking profile and endpoint picking cycle.

The list of extension points and the interface to invoke each of them

The data types (CycleState, EndpointState) and their scope and lifecycle

yup, I think we’re aligned on the final goal. just not sure that iterating over go interfaces is better than discussing on a doc.

This file feels like we’re trying to reinvent the wheel.

Not really, this is an iteration on what we have currently proposed that scrubs out opinionated decisions made (such as usage of K8s objects) which is not general.

just not sure that iterating over go interfaces is better than discussing on a doc.

Both convey the idea of generalization & how it would be approached. The interface file is intentionally in a proposal to indicate that it's simply for the sake of discussion.

the return values are wrong

the return values are different

no request headers

A scheduler subsystem should not care about a higher level protocol. For instance, a scheduler should be able to be implemented & ran on a gRPC protocol that prefers to use metadata over headers. A neutral data structure disconnects any dependency on HTTP.

there are a lot of missing things

there are a lot of things that changed

Agreed that we could continue discussion in a doc. This is WIP so people could take a look while I further populate the markdown doc with comments, constraints, & thoughts.
My problem with a doc is that it is not always considered canonical. Whereas a proposal, with an agreed upon PR with discrete approvals, reinforces the credibility of the decisions made. I'll be adding quite a bit more thoughts to the markdown doc before I pull the WIP tag if that is sufficient?

nirrozenbaum · 2025-05-17T06:14:47Z

docs/proposals/0683-epp-architecture-proposal/interfaces/interface.go

+	// PreSchedulePlugins are optional, and will be ran at the start of a
+	// scheduling cycle. This should be scoped to any foundational work needed
+	// that is custom to this scheduling profile.
+	PreSchedulePlugins() []PreSchedule


I recommend we rename pre/post schedule to pre/post cycle.
current namings may become very confusing once the schedule becomes multi-cycle.
In my view these names represent the following:
PreSchedule == ProfileSelection
PostSchedule == results aggregation

Seems reasonable enough

nirrozenbaum · 2025-05-18T06:56:57Z

docs/proposals/0683-epp-architecture-proposal/README.md


 The Scheduler will define a strong interface API, so that new scheduling algos may be plugged & dark-launched to test in production traffic without impacting said traffic. Extension is expected to adhere to the [Scheduler Subsystem definition](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/603)

+
+<img src="./images/epp_arch.svg" alt="Scheduling Algorithm" width="1000" />


wrong image

nirrozenbaum · 2025-05-19T07:12:34Z

docs/proposals/0683-epp-architecture-proposal/interfaces/interface.go

+	Plugin
+	// ProfileSelection selects scheduling profiles through the implemented
+	// logic, and returns a subset of the registered scheduling profiles.
+	ProfileSelection() map[string]SchedulingProfile


This call should receive the request or some request properties in order to select the profiles (decision on which profiles should be based on the request props)

nirrozenbaum · 2025-05-19T07:15:09Z

docs/proposals/0683-epp-architecture-proposal/interfaces/interface.go

+type Scheduler interface {
+	Plugin


why Scheduler is a plugin?

nirrozenbaum · 2025-05-19T15:19:02Z

docs/proposals/0683-epp-architecture-proposal/interfaces/interface.go

+	// PostSchedulePlugins lists all Filter plugins associated with this
+	// Profile. PostSchedulePlugins are ran after every scheduling cycle,
+	// and are optional.
+	PostSchedulePlugins() []PostSchedule


This was removed from the diagram.
two questions regarding this:

I do wonder if we actually need Pre/Post Cycle. what is the use case for using them? can we do without them?

if we do keep PreCycle, I think we need also PostCycle in the diagram for symmetry.

I was looking on Prefix implementation (as it is the only example that uses PreCycle atm) and noticed that:

PreCycle is used to call matchLongestPrefix which internally updates servers with their longest prefix match.

this in theory may calculate longest prefix for servers that will be filtered by the filters. that means we might do redundant calculations as those servers will not be candidates for selection.

it sounds to me like we could call matchLongestPrefix at the beginning of the Score function and calculate prefix only for the relevant servers.

bottom line, the only use case where we currently use PreCycle is implemented in an insufficient way (calculating for all servers instead of just the ones that will remain relevant after filtering). intuitively, I would argue that we can remove Pre/PostCycle completely as I was not convinced (yet) that those are needed.

So I'm glad you brought this up.

I thought Pre/Post schedule was a requirement you brought over. I've struggled to see a case for them.

The best I can see for Pre- is some sort of per-cycle digestion of data that will be used in multiple filters/scorers. But that could still technically happen in the ProfileSelection implementation.

Same thing with Post that could be some per-cycle` metrics emission or something. But that could likely happen in the aggregration step also.

I support removing them. @ahg-g any opinion here?

I recommend to not add any extensions that we don't have use cases for yet. We can always add new extensions, but we can't remove them once they are established.

nirrozenbaum · 2025-05-20T04:34:21Z

docs/proposals/0683-epp-architecture-proposal/interfaces/interface.go

+type Plugin interface {
+	Name() string
+}


just a thought -
having a Name function is enough to be a plugin. this looks a big strange (any struct with Name function is a plugin?).
plugin should have some kind of Run function.

nirrozenbaum · 2025-05-20T04:46:48Z

docs/proposals/0683-epp-architecture-proposal/interfaces/interface.go

+	results map[string][]Endpoint
+}
+
+type Scheduler interface {


I completely disagree on the Scheduler functions.
I think the Scheduler should have a single (public) function which is Schedule. all the others that are specified here are internal functions of the scheduler and not the public interface.

to break it down more, when calling to scheduler -
do you expect that the caller (e.g., director) will call ProfileSelection, then director goes iteratively on the profiles and for each it invokes the plugins in the correct order?
I don't think that was your intention.

what I'm expecting to see here is something along these lines:

Schedule(ctx context.Context, *schedulingtypes.Request) (result []*schedulingtypes.Result, err error)

and internally in the scheduler implementation I expect to see something like the following:

func (s *Scheduler) Schedule(ctx context.Context, req *types.Request) ([]*types.Result, error) { profiles := s.profilePicker.Pick(req, s.availableProfiles) // selecting profiles is a function of the request // Snapshot pod metrics from the datastore to: // 1. Reduce concurrent access to the datastore. // 2. Ensure consistent data during the scheduling operation of a request between all scheduling cycles. sCtx := types.NewSchedulingContext(ctx, req, types.ToSchedulerPodMetrics(s.datastore.PodGetAll())) result := []*types.Result{} // assuming the Result has a profileName field inside for name, profile := range profiles { // run the selected profiles and collect results // runProfileCycle internally runs the plugins according to their order cycleResult, err := s.runSchedulerProfileCycle(sCtx, profile) if err != nil { return result, fmt.Errorf("failed to run all required scheduling profiles - %w", err) } result = append(result, cycleResult) } return result, nil }

nirrozenbaum · 2025-05-20T04:51:17Z

docs/proposals/0683-epp-architecture-proposal/interfaces/interface.go

+}
+
+type EndpointState struct {
+	// only need to use a sync.Map if we do not plan on snapshotting data.


snapshotting data sounds like a good thing, otherwise we might have inconsistencies between different points of the same cycle or between different cycles.

nirrozenbaum · 2025-05-20T04:59:14Z

docs/proposals/0683-epp-architecture-proposal/interfaces/interface.go

+type SchedulingResult struct {
+	results map[string][]Endpoint
+}


is this type trying to encapsulate the multi-cycle results and map profileName -> it's result?
I would define a SchedulingResult per cycle and not per the complete schedule function.
then the return value of Schedule call can be either []*SchedulingResult (e.g., if the result includes the profile name) or alternatively map[string]*SchedulingResult (map from profile name to the result of the profile cycle invocation).

this makes it clearer that each Profile cycle has it's own result.

additionally, a cycle result might have more than []Endpoint as a return value e.g., if we have equally scored endpoints and additional endpoints with very close score - we might want to have bestScoreEndpoints and backupEndpoints, and return them all to the PostSchedule to let the plugin decide which endpoint to use.

each cycle is making a local optimized endpoint selection. when receiving all cycles selection, we may want to do global optimization. for example in P/D (cycle for P and cycle for D) -
we may want to select not the best but the second best endpoint since it puts P and D on the same region (this is just a simple example to emphasize the idea).

nirrozenbaum · 2025-05-20T05:08:31Z

docs/proposals/0683-epp-architecture-proposal/interfaces/interface.go

+// Filter runs before any scoring, and remove endpoints that are not fit for
+// selection. The framework will return an error to the client if the endpoints
+// are filtered to zero.
+type Filter interface {
+	Plugin
+	Filter(ctx context.Context, state scheduling.CycleState, endpoints []Endpoint) []Endpoint
+}
+
+// Scorer applies a score to each remaining endpoint provided. Scorers SHOULD
+// keep their score values in a normalized range: [0-1]. Any weighting should
+// be added at the SchedulingProfile configuration level.
+type Scorer interface {
+	Plugin
+	Score(ctx context.Context, state scheduling.CycleState, endpoints []Endpoint) []Endpoint
+}
+
+// Picker selects the endpoint(s) from the provided list of scored endpoints.
+// Picker MUST return, one endpoint at minimum.
+type Picker interface {
+	Plugin
+	Selection(ctx context.Context, state scheduling.CycleState, endpoints []Endpoint) []Endpoint
+}


it seems like this definition is aiming to unify the Filter/Scorer/Picker interfaces (leaving Pre/Post out of discussions since we agreed they are not needed).
I think this is wrong, as every plugin has it's own unique functionality and the types should be different accordingly.
for example, this has caused to add Get/SetScore functions to Endpoint (although Score is not a property of Endpoint as wrote in other comment).

additionally, none of the above interfaces receive the request or request properties which is a requirement.
Some filters/scorers may depend on some request properties, like the target model or the prompt (e.g., scorer that scores endpoints if the endpoint ActiveModels has the request target model in it).

nirrozenbaum · 2025-05-20T07:23:44Z

docs/proposals/0683-epp-architecture-proposal/interfaces/interface.go

+// Picker MUST return, one endpoint at minimum.
+type Picker interface {
+	Plugin
+	Selection(ctx context.Context, state scheduling.CycleState, endpoints []Endpoint) []Endpoint


function names are typically verbs or verb phrases, because they do something.
Pick/Select conveys an action: choosing the best endpoint.
Selection sounds like a data type or result, not an action. It implies an object, not behavior.
(e.g., Selection object could represents the result of a selection process).
so I suggest -
if we call the interface Picker - function should be called Pick.
if we call it Selector the function name should be called Select

k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 15, 2025

k8s-ci-robot requested review from ahg-g and nirrozenbaum May 15, 2025 21:30

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 15, 2025

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 15, 2025

initial sketching of interfacing

05f37ad

kfswain force-pushed the epp-framework-work branch from e126d00 to 05f37ad Compare May 15, 2025 22:02

nirrozenbaum reviewed May 16, 2025

View reviewed changes

nirrozenbaum reviewed May 17, 2025

View reviewed changes

nirrozenbaum reviewed May 18, 2025

View reviewed changes

nirrozenbaum reviewed May 19, 2025

View reviewed changes

nirrozenbaum reviewed May 20, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Scheduler Subsystem interface #845

[WIP] Scheduler Subsystem interface #845

kfswain commented May 15, 2025

k8s-ci-robot commented May 15, 2025

netlify bot commented May 15, 2025 •

edited

Loading

nirrozenbaum May 16, 2025 •

edited

Loading

kfswain May 18, 2025 •

edited

Loading

nirrozenbaum May 19, 2025

nirrozenbaum May 16, 2025 •

edited

Loading

kfswain May 18, 2025

nirrozenbaum May 16, 2025 •

edited

Loading

kfswain May 18, 2025

nirrozenbaum May 19, 2025

nirrozenbaum May 16, 2025

ahg-g May 16, 2025 •

edited

Loading

nirrozenbaum May 16, 2025

kfswain May 18, 2025 •

edited

Loading

nirrozenbaum May 17, 2025 •

edited

Loading

kfswain May 18, 2025

nirrozenbaum May 18, 2025

nirrozenbaum May 19, 2025

nirrozenbaum May 19, 2025

nirrozenbaum May 19, 2025

kfswain May 19, 2025

ahg-g May 19, 2025

nirrozenbaum May 20, 2025

nirrozenbaum May 20, 2025

nirrozenbaum May 20, 2025

nirrozenbaum May 20, 2025

nirrozenbaum May 20, 2025

nirrozenbaum May 20, 2025 •

edited

Loading

		@@ -0,0 +1,34 @@
		#names are egregiously long, but attempting to descibe custom logic within a name


		The Scheduler will define a strong interface API, so that new scheduling algos may be plugged & dark-launched to test in production traffic without impacting said traffic. Extension is expected to adhere to the [Scheduler Subsystem definition](https://github.com/kubernetes-sigs/gateway-api-inference-extension/pull/603)


		<img src="./images/epp_arch.svg" alt="Scheduling Algorithm" width="1000" />

[WIP] Scheduler Subsystem interface #845

Are you sure you want to change the base?

[WIP] Scheduler Subsystem interface #845

Conversation

kfswain commented May 15, 2025

k8s-ci-robot commented May 15, 2025

netlify bot commented May 15, 2025 • edited Loading

✅ Deploy Preview for gateway-api-inference-extension ready!

nirrozenbaum May 16, 2025 • edited Loading

Choose a reason for hiding this comment

kfswain May 18, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nirrozenbaum May 16, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nirrozenbaum May 16, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ahg-g May 16, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kfswain May 18, 2025 • edited Loading

Choose a reason for hiding this comment

nirrozenbaum May 17, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

nirrozenbaum May 20, 2025 • edited Loading

Choose a reason for hiding this comment

netlify bot commented May 15, 2025 •

edited

Loading

nirrozenbaum May 16, 2025 •

edited

Loading

kfswain May 18, 2025 •

edited

Loading

nirrozenbaum May 16, 2025 •

edited

Loading

nirrozenbaum May 16, 2025 •

edited

Loading

ahg-g May 16, 2025 •

edited

Loading

kfswain May 18, 2025 •

edited

Loading

nirrozenbaum May 17, 2025 •

edited

Loading

nirrozenbaum May 20, 2025 •

edited

Loading