scheduling changes for lora affinity load balancing #423
Conversation
Hi @kaushikmitr. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
✅ Deploy Preview for gateway-api-inference-extension ready!
/ok-to-test
I didn't look at the algorithm change yet, left a couple of quick comments.
- "--lora-modules" | ||
- '{"name": "tweet-summary-0", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' | ||
- '{"name": "tweet-summary-1", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' | ||
env: | ||
- name: VLLM_USE_V1 | ||
value: "1" |
The released vLLM version doesn't support our metrics yet, right? If so, then we can't use it now.
Yes, that is why the tests are failing. I will switch back to V0
I don't think that's it; the integration test doesn't use this deployment yaml.
I think the test is failing because this PR introduces some randomness to the selection.
The algorithm is not using the waiting_lora_adapters metric, right?
// The value of 50 is arrived at heuristically based on experiments.
queueingThresholdLoRA = 50
// The value of 128 is arrived at heuristically based on experiments.
queueingThresholdLoRA = 128
I think we should make this configurable perhaps via a flag for now. Different environments will likely need different thresholds.
I would rather leverage this to make it configurable: #16
I don't think we have time to do an API change for the next release. Given we already had to change it for different accelerator types, it's important to have this knob configurable. Exposing it as a flag seems straightforward and gives us time to gather feedback on this before making an API change.
I took a look; IIUC, adding this flag is not straightforward, the way the scheduler is written. If it's needed for the next release, I would rather have it in another PR.
Defining a flag for each parameter is tedious; we can use a versioned configuration file instead. This is called ComponentConfig; ideally we do that for #383.
Here is JobSet's config file as an example: https://github.com/kubernetes-sigs/jobset/tree/main/api/config/v1alpha1
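As a rough idea of what such a versioned configuration could look like, here is a small sketch; every name in it is hypothetical, and nothing like this exists in the repo yet.

```go
// Package config sketches a hypothetical versioned configuration type in the
// spirit of JobSet's configuration API linked above; none of these names
// exist in the repo today.
package config

// EndpointPickerConfiguration would be loaded from a file at startup instead
// of defining one flag per tuning knob.
type EndpointPickerConfiguration struct {
	APIVersion string `yaml:"apiVersion"` // hypothetical group/version string
	Kind       string `yaml:"kind"`       // e.g. "EndpointPickerConfiguration" (hypothetical)

	// QueueingThresholdLoRA is the queue-depth threshold used by the
	// LoRA scheduling filters (128 in this PR).
	QueueingThresholdLoRA int `yaml:"queueingThresholdLoRA"`

	// LoraAffinityThreshold is the probability of preferring a pod that
	// already has the requested adapter loaded (0.999 in this PR).
	LoraAffinityThreshold float64 `yaml:"loraAffinityThreshold"`
}
```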
Because setting an env var is not validated in general (i.e., if you set an env var that the binary doesn't care about, nothing will happen), while flags are stricter.
It is ugly either way, and the lack of validation is bad, but I would rather prioritize making it clear that this is a temporary knob that will likely either not exist in the future or be set via a different mechanism.
I suggest not blocking this PR on adding the env vars, and doing that as a follow-up for this and other algorithm parameters.
Sounds good to me. I am OK to lgtm once the integration test is fixed and a unit test is added for the new filter.
"but we need to evolve it in a way that the algorithm offers self tuning for the most part.": My concern is that we may never have a truly model-server configuration agnostic load balancing algorithm. For example, vLLM exposes settings like max_num_seq, max_lora, max_lora_rank, and max_num_batched_tokens, and the optimal load balancing strategy will depend on these parameters—many of which may not be directly available to the Gateway. While we could choose to scrape them as needed, I’m not sure if that’s the best design choice. I believe there will always be some scheduling configuration parameters that require independent tuning. For now, using environment variables works, but in the long run, we might want to make them configurable as load balancing parameters.
I agree, I am just not confident that the current set of parameters are the ones we will actually keep moving forward. We are very early, and our benchmarking so far is limited to a few use cases, so only when we have more benchmarks across different model sizes/accelerators/datasets, or wider deployments in practice, will we gain confidence about which knobs should be exposed.
queueingThresholdLoRA = 128
// TODO(https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/16) Make this configurable.
// loraAffinityThreshold indicates the probability with which we prefer a pod with LoRA affinity over a pod without but having room to fit more LoRA adapters.
loraAffinityThreshold = 0.999
do you have some insights to show why this is needed and why this value is picked?
I picked it after some trial and error. This value worked well when we had skewed traffic for different adapters; it helped spread out high-QPS adapters while keeping low-QPS adapters less spread out.
I believe we need to update the hermetic_test case "select active lora, low queue", given the new probabilistic behavior. You can set the pods without the requested lora adapter to have no room so they will never be picked.
Yes, the current test might fail sometimes. I tried to make it probabilistic but it's more complicated than I thought. For now I fixed it as you suggested, by setting the pods without the requested LoRA adapter to have no room so they will never be picked.
It is; we are now checking both waiting + running adapters to determine affinity.
/retest
@@ -47,3 +47,5 @@ The model server MUST expose the following LoRA adapter metrics via the same Pro
  requested adapter. Example: `"max_lora": "8"`.
* `running_lora_adapters`: A comma separated list of adapters that are currently loaded in GPU
  memory and ready to serve requests. Example: `"running_lora_adapters": "adapter1, adapter2"`
* `waiting_lora_adapters`: A comma separated list of adapters that are currently loaded in GPU
  memory and ready to serve requests. Example: `"waiting_lora_adapters": "adapter1, adapter2"`
update the docs, this reads exactly the same as the running one
/approve
Thanks a lot, this is great, leaving it to Cong to lgtm.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, kaushikmitr

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
// The value of 50 is arrived at heuristically based on experiments.
queueingThresholdLoRA = 50
// The value of 128 is arrived at heuristically based on experiments.
queueingThresholdLoRA = 128
This is a new "config" CRD? If we think about heterogeneous pools in the future, then we will need different configurations per "segment" of the pool. Though by carefully designing the semantics, such as a default config for the pool that each segment can override, we can still be future-proof. But we need a bit more discussion on this. I will create an issue and raise this in the community meeting.
For now, I still think adding a flag is the fastest way to unblock this.
"adding this flag is not straightforward, the way the scheduler is written": If plumbing is too cumbersome, you can define this as a flag in the current file, and then in main.go just reference it like `_ = scheduler.LoraAffinityThreshold`; this should work. A sketch follows below.
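A minimal, runnable sketch of that flag approach; the flag name and the standalone-program wiring are illustrative, not the PR's code. In the real repo the variable would live in the scheduling package, exported so that main.go can reference it before flag.Parse().

```go
package main

import (
	"flag"
	"fmt"
)

// In the real layout this would sit next to the scheduler constants (e.g.
// pkg/epp/scheduling), exported so that main.go can reference it as
// scheduler.LoraAffinityThreshold before calling flag.Parse().
var LoraAffinityThreshold = flag.Float64(
	"lora-affinity-threshold", 0.999,
	"Probability of preferring a pod with LoRA affinity over a pod that merely has room for another adapter.")

func main() {
	flag.Parse()
	fmt.Println("loraAffinityThreshold =", *LoraAffinityThreshold)
}
```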
/lgtm with a few nits. Feel free to ping me to lgtm again if you want to address them.
@@ -142,6 +142,7 @@ func TestKubeInferenceModelRequest(t *testing.T) {
	KVCacheUsagePercent: 0.2,
	ActiveModels: map[string]int{
		"foo": 1,
		"bar": 1,
nit: Can you explain in the comment on line 121 that, because no pods have room for a new LoRA, the pod with LoRA affinity will always be picked?
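For illustration, the clarifying comment could read roughly like this; the fixture values are from the diff above, and the wording is only a suggestion.

```go
// Every pod in this case already runs its maximum number of adapters, so no
// pod has room to load a new LoRA; the pod that already serves the requested
// adapter is therefore always picked, which keeps this case deterministic
// despite the probabilistic filter.
ActiveModels: map[string]int{
	"foo": 1,
	"bar": 1,
},
```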
logger := logutil.NewTestLogger()

const (
	testModelName = "test-model"
nit: not used?
const (
	testModelName = "test-model"
	testAffinityModel = "test-affinity-model"
	numIterations = 10000
having to run 10k times seems a lot. How long does it take? Any chance we can reduce it?
It takes 0.29 seconds; I had it at 1000 initially, which took 0.16 seconds. 1000 is fine, but given the probability I have is very skewed (99.9%), 10000 iterations assign some pods to both cases (~9990 and ~10).
If the issue is the 99.9% threshold, you can change the percentage in tests to make it easier to test, like 90%
Yup, that's a bit complicated though (for now), as the threshold is defined in the scheduler as a const and the filter template does not take any arbitrary parameter as input. I would have to update the const for the test and reset it back. I would rather test against what's already set in the scheduler; 0.29 sec should not be too bad, and if it is I can change it to 1000 (the test would still work, since it has a 5% tolerance).
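For reference, here is a self-contained sketch of the statistical check being discussed; the random selection is stubbed out as a plain coin flip instead of the real filter call, and the names and the 5% tolerance mirror the discussion above but are otherwise assumptions.

```go
package scheduling

import (
	"math"
	"math/rand"
	"testing"
)

const (
	loraAffinityThreshold = 0.999
	numIterations         = 10000
	tolerance             = 0.05 // the 5% tolerance mentioned above
)

// pickAffinity stands in for one run of the probabilistic filter: true means
// the pod with LoRA affinity was chosen over a pod that merely has room.
func pickAffinity() bool { return rand.Float64() < loraAffinityThreshold }

func TestAffinitySelectionDistribution(t *testing.T) {
	affinityCount := 0
	for i := 0; i < numIterations; i++ {
		if pickAffinity() {
			affinityCount++
		}
	}
	got := float64(affinityCount) / float64(numIterations)
	if math.Abs(got-loraAffinityThreshold) > tolerance {
		t.Errorf("affinity pod picked %.4f of the time, want %.3f within %.2f", got, loraAffinityThreshold, tolerance)
	}
}
```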
}

// Identify if the returned pod is the affinity pod or available pod
if _, exists := result[0].ActiveModels[testAffinityModel]; exists {
nit: I recommend checking the pod name directly; it's more readable and reliable (imagine the ActiveModels map being modified during tests for some reason in the future).
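A name-based variant of the check could read roughly as follows; the field carrying the pod name and the fixture name are assumptions, not the PR's code.

```go
// Check the selected pod by name instead of inspecting ActiveModels
// ("affinity-pod" is a hypothetical fixture name).
if result[0].Name != "affinity-pod" {
	t.Errorf("expected the affinity pod to be selected, got %q", result[0].Name)
}
```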
/lgtm
Thanks @kaushikmitr! This PR introduces a unit test based on probabilities. We chatted offline with @kaushikmitr: with the large number of iterations, no flakiness was found across 100 test runs. So although the test is not deterministic, in practice it should be very reliable.
/unhold
This pull request includes several changes to the deployment configuration, metrics collection, and scheduling logic. The most important changes are updating metrics collection to include waiting adapters and implementing a new pod selection strategy that balances load while considering model affinity.

Scheduling Logic Enhancements:

* `pkg/epp/scheduling/filter.go`: Replaced the `loRAAffinityPredicate` function with a new `loRASoftAffinityPredicate` function that prioritizes pods with existing model affinity while allowing for load balancing through randomization (as long as there is room to fit another adapter in the pod).
* `pkg/epp/scheduling/scheduler.go`: Updated the scheduling configuration to use the new `loRASoftAffinityPredicate` function and increased the `queueingThresholdLoRA` value from 50 to 128. Added a new `loraAffinityThreshold` constant to indicate the probability of preferring pods with model affinity.
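The soft-affinity behavior described above can be sketched as follows. This is a minimal, self-contained illustration rather than the PR's actual code: the PodMetrics type, its field names, and the helper name are assumptions; only the 0.999 threshold and the idea of counting running + waiting adapters as affinity come from the PR.

```go
package main

import (
	"fmt"
	"math/rand"
)

// loraAffinityThreshold mirrors the constant added in the PR: the probability
// of preferring a pod that already serves the requested adapter.
const loraAffinityThreshold = 0.999

// PodMetrics is a stand-in for the endpoint picker's per-pod metrics; the
// real type and field names differ.
type PodMetrics struct {
	Name            string
	ActiveModels    map[string]int // running + waiting adapters on this pod
	MaxActiveModels int            // adapter slots reported by the model server (max_lora)
}

// softAffinityFilter keeps pods that already serve the adapter with
// probability loraAffinityThreshold; otherwise it falls back to pods that
// still have room to load it, so high-QPS adapters can spread out over time.
func softAffinityFilter(model string, pods []*PodMetrics) []*PodMetrics {
	var affinity, available []*PodMetrics
	for _, p := range pods {
		switch {
		case p.ActiveModels[model] > 0:
			affinity = append(affinity, p)
		case len(p.ActiveModels) < p.MaxActiveModels:
			available = append(available, p)
		}
	}
	if len(affinity) > 0 && (len(available) == 0 || rand.Float64() < loraAffinityThreshold) {
		return affinity
	}
	if len(available) > 0 {
		return available
	}
	return affinity // no pod has room; stick with affinity pods (possibly empty)
}

func main() {
	pods := []*PodMetrics{
		{Name: "pod-a", ActiveModels: map[string]int{"tweet-summary-0": 1}, MaxActiveModels: 2},
		{Name: "pod-b", ActiveModels: map[string]int{}, MaxActiveModels: 2},
	}
	for _, p := range softAffinityFilter("tweet-summary-0", pods) {
		fmt.Println(p.Name) // almost always "pod-a"
	}
}
```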
Deployment Configuration Changes:

* `config/manifests/vllm/deployment.yaml`: Added new command-line arguments for `--compilation-config`, `--max-num-seqs`, and `--max-lora-rank`. Added a new environment variable `VLLM_USE_V1`.
Metrics Collection Updates:

* `pkg/epp/backend/vllm/metrics.go`: Added a new metric `LoraRequestInfoWaitingAdaptersMetricName` and updated the `promToPodMetrics` and `getLatestLoraMetric` functions to handle waiting adapters. They also fall back to the previously seen running + waiting adapters if there are no current running or waiting adapters.
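As a rough illustration of how running and waiting adapters could be merged for the affinity check: the label names come from the metrics documentation above, while the helper and its placement are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// mergeAdapters combines the running_lora_adapters and waiting_lora_adapters
// label values (comma-separated lists) into one set that affinity checks can
// consult, so an adapter that is still queued to load counts as present.
func mergeAdapters(running, waiting string) map[string]int {
	adapters := map[string]int{}
	for _, list := range []string{running, waiting} {
		for _, name := range strings.Split(list, ",") {
			if name = strings.TrimSpace(name); name != "" {
				adapters[name]++
			}
		}
	}
	return adapters
}

func main() {
	fmt.Println(mergeAdapters("adapter1, adapter2", "adapter3"))
	// Output: map[adapter1:1 adapter2:1 adapter3:1]
}
```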