
scheduling changes for lora affinity load balancing #423

Merged
merged 13 commits into from
Mar 4, 2025

Conversation

kaushikmitr
Contributor

@kaushikmitr kaushikmitr commented Feb 27, 2025

This pull request includes several changes to the deployment configuration, metrics collection, and scheduling logic. The most important changes include updating metrics collection to include waiting adapters, and implementing a new pod selection strategy that balances load while considering model affinity.

Scheduling Logic Enhancements:

  • pkg/epp/scheduling/filter.go: Replaced the loRAAffinityPredicate function with a new loRASoftAffinityPredicate function that prioritizes pods with existing model affinity while allowing for load balancing through randomization (as long as there is room to fit another adapter in the pod).
  • pkg/epp/scheduling/scheduler.go: Updated the scheduling configuration to use the new loRASoftAffinityPredicate function and increased the queueingThresholdLoRA value from 50 to 128. Added a new loraAffinityThreshold constant to indicate the probability of preferring pods with model affinity.

Deployment Configuration Changes:

Metrics Collection Updates:

  • pkg/epp/backend/vllm/metrics.go: Added a new metric LoraRequestInfoWaitingAdaptersMetricName and updated the promToPodMetrics and getLatestLoraMetric functions to handle waiting adapters. Also falls back to the previously seen running + waiting adapters when the current scrape reports neither.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 27, 2025
@k8s-ci-robot
Contributor

Hi @kaushikmitr. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 27, 2025

netlify bot commented Feb 27, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit b5d7f8f
🔍 Latest deploy log https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67c7403f524362000864c096
😎 Deploy Preview https://deploy-preview-423--gateway-api-inference-extension.netlify.app

@ahg-g
Contributor

ahg-g commented Feb 27, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 27, 2025

@ahg-g ahg-g left a comment


I didn't look at the algorithm change yet, left a couple of quick comments.

- "--lora-modules"
- '{"name": "tweet-summary-0", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}'
- '{"name": "tweet-summary-1", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}'
env:
- name: VLLM_USE_V1
value: "1"

@ahg-g ahg-g Feb 27, 2025


The released vLLM version doesn't support our metrics yet, right? If so, then we can't use it now.

Contributor Author

Yes, that is why the tests are failing. I will switch back to V0.

Contributor

I don't think that's it; the integration test doesn't use this deployment YAML.

I think the test is failing because this PR introduces some randomness to the selection.


@ahg-g ahg-g left a comment


The algorithm is not using the waiting_lora_adapters metric, right?

// The value of 50 was arrived at heuristically based on experiments.
queueingThresholdLoRA = 50
// The value of 128 was arrived at heuristically based on experiments.
queueingThresholdLoRA = 128
Contributor

I think we should make this configurable, perhaps via a flag for now. Different environments will likely need different thresholds.

Contributor Author

I would rather leverage #16 to make this configurable.

Contributor

I don't think we have time to do an API change for the next release. Given we already had to change it for different accelerator types, it's important to have this knob configurable. Exposing it as a flag seems straightforward and gives us time to gather feedback before making an API change.

Contributor Author

I took a look; IIUC, adding this flag is not straightforward given the way the scheduler is written. If it's needed for the next release, I would rather do it in another PR.

Contributor

Defining a flag for each parameter is tedious; we can use a versioned configuration file instead (this is called ComponentConfig). Ideally we do that for #383.

Here is JobSet's config file as an example: https://github.com/kubernetes-sigs/jobset/tree/main/api/config/v1alpha1

Contributor

Because setting an env var is not validated in general (i.e., if you set an env var that the binary doesn't care about, nothing happens), while flags are stricter.

It is ugly either way; lack of validation is bad, but I would rather prioritize signaling that this is a temporary knob that will likely either not exist in the future or be set via a different mechanism.

Contributor

I suggest not blocking this PR on adding the env vars, and doing that as a follow-up for this and other algorithm parameters.

Contributor

Sounds good to me. I am OK to lgtm once the integration test is fixed and a unit test is added for the new filter.


@kaushikmitr kaushikmitr Mar 3, 2025


"but we need to evolve it in a way that the algorithm offers self tuning for the most part.": My concern is that we may never have a truly model-server-configuration-agnostic load balancing algorithm. For example, vLLM exposes settings like max_num_seq, max_lora, max_lora_rank, and max_num_batched_tokens, and the optimal load balancing strategy will depend on these parameters, many of which may not be directly available to the Gateway. While we could scrape them as needed, I'm not sure that's the best design choice. I believe there will always be some scheduling parameters that require independent tuning. For now, using environment variables works, but in the long run we might want to make them configurable as load balancing parameters.

Contributor

I agree; I am just not confident that the current set of parameters are the ones we will actually keep moving forward. We are very early, and our benchmarking so far is limited to a few use cases. Only once we have benchmarks across different model sizes, accelerators, and datasets, or wider real-world deployments, will we have confidence about which knobs should be exposed.

queueingThresholdLoRA = 128
// TODO(https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/16) Make this configurable.
// loraAffinityThreshold indicates the probability with which we prefer a pod with LoRA affinity over a pod without but having room to fit more LoRA adapters.
loraAffinityThreshold = 0.999
Contributor

Do you have data showing why this is needed and why this value was picked?

Contributor Author

I picked it after some trial and error. This value worked well when traffic was skewed across adapters: it helped spread out high-QPS adapters while keeping low-QPS adapters less spread out.

Contributor

I believe we need to update the hermetic_test case "select active lora, low queue" given the new probabilistic behavior. You can set the pods without the requested LoRA adapter to have no room, so they will never be picked.

Contributor Author

Yes, the current test might fail sometimes. I tried to make it probabilistic, but it's more complicated than I thought. For now I fixed it as you suggested, by setting the pods without the requested LoRA adapter to have no room so they will never be picked.

@kaushikmitr
Contributor Author

The algorithm is not using the waiting_lora_adapters metric, right?

It is; we now check both waiting and running adapters to determine affinity.

@kaushikmitr
Contributor Author

/retest

@@ -47,3 +47,5 @@ The model server MUST expose the following LoRA adapter metrics via the same Pro
requested adapter. Example: `"max_lora": "8"`.
* `running_lora_adapters`: A comma separated list of adapters that are currently loaded in GPU
memory and ready to serve requests. Example: `"running_lora_adapters": "adapter1, adapter2"`
* `waiting_lora_adapters`: A comma separated list of adapters that are currently loaded in GPU
memory and ready to serve requests. Example: `"waiting_lora_adapters": "adapter1, adapter2"`
Contributor

Update the docs; this reads exactly the same as the running one.

@ahg-g
Contributor

ahg-g commented Mar 1, 2025

/approve
/hold

Thanks a lot, this is great, leaving it to Cong to lgtm.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 1, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, kaushikmitr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 1, 2025
// The value of 50 was arrived at heuristically based on experiments.
queueingThresholdLoRA = 50
// The value of 128 was arrived at heuristically based on experiments.
queueingThresholdLoRA = 128
Contributor

Is this a new "config" CRD? If we think about heterogeneous pools in the future, we will need different configurations per "segment" of the pool. By carefully designing the semantics (such as a default config for the pool that each segment can override), we can still be future-proof. But we need a bit more discussion on this; I will create an issue and raise it in the community meeting.

For now, I still think adding a flag is the fastest way to unblock this.

adding this flag is not straightforward, the way scheduler is written.
If plumbing is too cumbersome, you can define this as a flag in the current file and then reference it in main.go like _ = scheduler.LoraAffinityThreshold; this should work.

@ahg-g ahg-g mentioned this pull request Mar 1, 2025
33 tasks

@liu-cong liu-cong left a comment


/lgtm with a few nits. Feel free to ping me to lgtm again if you want to address them.

@@ -142,6 +142,7 @@ func TestKubeInferenceModelRequest(t *testing.T) {
KVCacheUsagePercent: 0.2,
ActiveModels: map[string]int{
"foo": 1,
"bar": 1,
Contributor

nit: Can you explain in the comment on line 121 that, because no pods have room for a new LoRA, the pod with LoRA affinity will always be picked?

logger := logutil.NewTestLogger()

const (
testModelName = "test-model"
Contributor

nit: not used?

const (
testModelName = "test-model"
testAffinityModel = "test-affinity-model"
numIterations = 10000
Contributor

Having to run 10k iterations seems like a lot. How long does it take? Any chance we can reduce it?

Contributor Author

It takes 0.29 seconds; I had it at 1000 initially, which took 0.16 seconds. 1000 is fine, but given the probability is very skewed (99.9%), 10000 iterations assigns some picks to both cases (~9990 vs ~10).

Contributor

If the issue is the 99.9% threshold, you can change the percentage in tests to make it easier to test, e.g. 90%.

Contributor Author

Yup, that's a bit complicated for now, though: the threshold is defined in the scheduler as a const, and the filter template does not take arbitrary parameters as input. I would have to update the const for the test and reset it afterwards. I would rather test against what's already set in the scheduler; 0.29 s should not be too bad, and if it is I can change it to 1000 (the test would still work, since it has a 5% tolerance).

}

// Identify if the returned pod is the affinity pod or available pod
if _, exists := result[0].ActiveModels[testAffinityModel]; exists {
Contributor

nit: I recommend checking the pod name directly; it's more readable and reliable (imagine ActiveModels being modified during tests for some reason in the future).

@liu-cong
Contributor

liu-cong commented Mar 4, 2025

/lgtm

Thanks @kaushikmitr!

This PR introduces a unit test based on probabilities. We chatted offline with @kaushikmitr: with the large number of iterations, no flakiness was found across 100 test runs. So although the test is not deterministic, in practice it should be very reliable.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 4, 2025
@liu-cong
Contributor

liu-cong commented Mar 4, 2025

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 4, 2025
@k8s-ci-robot k8s-ci-robot merged commit dfe8d9c into kubernetes-sigs:main Mar 4, 2025
8 checks passed