Refactor: Externalize Scheduler's saturation logic and criticality-based service differentiation #805


Open · wants to merge 1 commit into main

Conversation

@LukeAVanDrie LukeAVanDrie commented May 8, 2025

This commit refactors the request processing pipeline, externalizing saturation detection and criticality-based service differentiation from the Scheduler. These responsibilities are now primarily managed by the RequestControl.Director.

This change is a preparatory step for the introduction of a new Flow Controller component, which will eventually absorb these admission control duties.

Diff base is: #808 (split out for easier reviewing)
Related to: #674

Key changes include:

  • Introduced PreDispatch method to RequestControl.Director. It utilizes the SaturationDetector for admission control of non-critical requests and handles request criticality to determine if saturation checks are bypassed.
  • The saturation detection logic for dropping non-critical requests is intentionally preserved within the Director at this stage. This allows the option to bypass the future Flow Controller component during its maturation, ensuring the existing saturation and sheddable request behavior can be maintained as a fallback.
  • Updated main.go to instantiate the SaturationDetector, wiring it into the request handling flow.
  • Updated director_test.go to align with the new component responsibilities, adding additional coverage where necessary.

Missing from this PR:

  • Simplifying the Scheduler to focus solely on preference-based filtering and pod selection for requests that have already been admitted by the Director.
  • Removing the SheddableRequestFilter and the distinct critical/sheddable filter paths from the Scheduler's internal logic so that the Scheduler only applies a single, unified preference filter chain to all incoming requests.

I did not include the above in this PR due to high activity in those files; I will send a follow-up PR to address it. In the meantime, the saturation check happens twice: once in the Director and again, redundantly, in the Scheduler. This wastes compute but has no effect on behavior.

This refactoring leads to a cleaner architecture, making the Scheduler a more focused component and centralizing initial admission control logic, while paving the way for the future Flow Controller.

This is aligned with the direction in 0683-epp-architecture-proposal and is a no-op in terms of EPP behavior.


netlify bot commented May 8, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit fd52325
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/682768d9b48ae40008a40ba8
😎 Deploy Preview https://deploy-preview-805--gateway-api-inference-extension.netlify.app

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: LukeAVanDrie
Once this PR has been reviewed and has the lgtm label, please assign kfswain for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 8, 2025
@k8s-ci-robot k8s-ci-robot requested review from liu-cong and robscott May 8, 2025 20:26
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label May 8, 2025
@k8s-ci-robot

Hi @LukeAVanDrie. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 8, 2025
@ahg-g

ahg-g commented May 8, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 8, 2025
@LukeAVanDrie

LukeAVanDrie commented May 8, 2025

This change should be no-op. @liu-cong, I will leave it up to your discretion whether this needs proper regression testing.

@LukeAVanDrie

LukeAVanDrie commented May 8, 2025

I split out the addition of the saturation detector subdirectory into a separate PR (#808) to be submitted before this one. It is simply unused until this PR lands and wires it up.

@LukeAVanDrie LukeAVanDrie force-pushed the saturation-detector branch from 112b943 to 48cc9a0 Compare May 8, 2025 20:51
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 8, 2025
@LukeAVanDrie LukeAVanDrie force-pushed the saturation-detector branch 3 times, most recently from a3d9090 to 9d273fa Compare May 9, 2025 02:49
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 9, 2025
@LukeAVanDrie LukeAVanDrie force-pushed the saturation-detector branch from 9d273fa to 83486ac Compare May 9, 2025 03:26
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 10, 2025
@LukeAVanDrie LukeAVanDrie force-pushed the saturation-detector branch from 83486ac to 4a7de3f Compare May 13, 2025 02:11
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels May 13, 2025
@LukeAVanDrie LukeAVanDrie force-pushed the saturation-detector branch from 4a7de3f to 44a11af Compare May 16, 2025 00:53
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 16, 2025
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 16, 2025
@LukeAVanDrie LukeAVanDrie force-pushed the saturation-detector branch 2 times, most recently from 1081d6a to 5f348a9 Compare May 16, 2025 01:25
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 16, 2025
This commit refactors the request processing pipeline, externalizing
saturation detection and criticality-based service differentiation
from the Scheduler. These responsibilities are now primarily managed by
the RequestControl.Director.

This change is a preparatory step for the introduction of a new
Flow Controller component, which will eventually absorb these admission
control duties.

Key changes include:

- Introduced `PreDispatch` method to `RequestControl.Director`. It
  utilizes the `SaturationDetector` for admission control of
  non-critical requests and handles request criticality to determine if
  saturation checks are bypassed.
- The saturation detection logic for dropping non-critical requests
  is intentionally preserved within the `Director` at this stage.
  This allows the option to bypass the future Flow Controller
  component during its maturation, ensuring the existing saturation
  and sheddable request behavior can be maintained as a fallback.
- Updated `main.go` to instantiate the `SaturationDetector`, wiring it
  into the request handling flow.
- Updated `director_test.go` to align with the new component
  responsibilities, adding additional coverage where necessary.

This refactoring leads to a cleaner architecture, making the `Scheduler`
a more focused component and centralizing initial admission control
logic while paving the way for the future Flow Controller.

This is aligned with the direction in `0683-epp-architecture-proposal`
and should be nearly no-op in terms of EPP behavior.
@LukeAVanDrie LukeAVanDrie force-pushed the saturation-detector branch from 5f348a9 to fd52325 Compare May 16, 2025 16:33
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 16, 2025
@@ -207,47 +211,62 @@ func run() error {
}
schedulerConfig := scheduling.NewSchedulerConfig(
[]plugins.PreSchedule{},
[]plugins.Filter{filter.NewSheddableCapacityFilter()},
[]plugins.Filter{},
Contributor Author

@liu-cong I can also do this in the next PR when I actually remove this from the scheduler. Right now, I have only removed it from scheduler v2, not the original decision tree filter. If we want to bundle that together in a single PR, I can revert this line for now.

@@ -351,12 +548,9 @@ func TestRandomWeightedDraw(t *testing.T) {
var seedVal int64 = 420
for _, test := range tests {
t.Run(test.name, func(t *testing.T) {
for range 10000 {
Contributor Author

This was always testing a deterministic seed, so this loop did nothing to verify statistical properties. Removed for now until someone wants to update the tests to actually make assertions for statistical properties on arbitrary seeds.

@@ -414,3 +608,40 @@ func TestGetRandomPod(t *testing.T) {
func pointer(v int32) *int32 {
return &v
}

func TestDirector_HandleResponse(t *testing.T) {
Contributor Author

New test coverage. We had 0 coverage on this method.

}

// mockScheduler is a configurable mock for the Scheduler interface.
type mockScheduler struct {
Contributor Author

I replaced real scheduler instances with a mock in these tests. Consequently, they are no longer "integration"-like tests. Wanted to call that out in case that is a concern. I think using a mock here is more appropriate though.

Contributor

This is very complex; do you really need all those fields, rather than just injecting a scheduling result and an error?

return reqCtx, errutil.Error{Code: errutil.Internal, Msg: "results must be greater than zero"}
// Currently only get a single result. Will refactor to pluggably implement
// the PostSchedule.
if len(results) == 0 || results[0] == nil || results[0].TargetPod == nil || results[0].TargetPod.GetPod() == nil {
Contributor

I am fine with sparse defensive code, but in general I would not recommend it. We cannot afford defensive coding everywhere. The scheduler should be implemented to throw an error if any of this happens.

func (d *Director) PreDispatch(ctx context.Context, reqCtx *handlers.RequestContext, reqCriticality v1alpha2.Criticality) error {
logger := log.FromContext(ctx)
logger.V(logutil.DEBUG).Info("Performing saturation check if request is non-critical.")
if d.saturationDetector == nil {
Contributor

Let's try to avoid these checks and always provide a non-nil detector.

Comment on lines +180 to +186
if reqCriticality != v1alpha2.Critical && d.saturationDetector.IsSaturated(ctx) {
logger.Info("System saturated, dropping non-critical request")
return errutil.Error{
Code: errutil.InferencePoolResourceExhausted,
Msg: "system saturated, non-critical request dropped",
}
}
Contributor

Suggested change
if reqCriticality != v1alpha2.Critical && d.saturationDetector.IsSaturated(ctx) {
logger.Info("System saturated, dropping non-critical request")
return errutil.Error{
Code: errutil.InferencePoolResourceExhausted,
Msg: "system saturated, non-critical request dropped",
}
}
if reqCriticality == v1alpha2.Critical {
return nil
}
if d.saturationDetector.IsSaturated(ctx) {
logger.Info("System saturated, dropping non-critical request")
return errutil.Error{
Code: errutil.InferencePoolResourceExhausted,
Msg: "system saturated, non-critical request dropped",
}
}


// Check saturation directly ONLY for non-critical requests.
if reqCriticality != v1alpha2.Critical && d.saturationDetector.IsSaturated(ctx) {
logger.Info("System saturated, dropping non-critical request")
Contributor

This is already logged by the caller when the error is returned.

Contributor Author

Generally, should I remove all instances of logging in director.go before an error is returned?

Contributor

Generally,

  1. Errors should be handled by the caller (log it, handle it, etc.) so no need to double log.
  2. Prefer the caller to log instead of the helper method where applicable. Some DEBUG/TRACE logs in helper methods are OK.
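A minimal illustration of this guideline (all names here are hypothetical): the helper returns the error without logging it, and only the caller logs, so each failure is reported exactly once.

```go
package main

import (
	"errors"
	"fmt"
)

var errSaturated = errors.New("system saturated, non-critical request dropped")

// admit is the helper: it returns the error and does NOT log it.
func admit(saturated bool) error {
	if saturated {
		return errSaturated
	}
	return nil
}

// handle is the caller: it "logs" (here, formats) the error exactly once.
func handle(saturated bool) string {
	if err := admit(saturated); err != nil {
		return fmt.Sprintf("request rejected: %v", err)
	}
	return "request admitted"
}

func main() {
	fmt.Println(handle(true))
	fmt.Println(handle(false))
}
```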

ctx := ctrl.SetupSignalHandler()
appDatastore := datastore.NewDatastore(ctx, pmf)
Contributor

why appDatastore and appScheduler?

Contributor Author

I did this to avoid collisions with the package names. This is not strictly necessary though.

Contributor

perhaps call them `ds` and `sched`

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
4 participants