scheduling changes for lora affinity load balancing #423
Conversation
Hi @kaushikmitr. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
✅ Deploy Preview for gateway-api-inference-extension ready!
/ok-to-test
I didn't look at the algorithm change yet, left a couple of quick comments.
- "--lora-modules" | ||
- '{"name": "tweet-summary-0", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' | ||
- '{"name": "tweet-summary-1", "path": "vineetsharma/qlora-adapter-Llama-2-7b-hf-TweetSumm", "base_model_name": "llama-2"}' | ||
env: | ||
- name: VLLM_USE_V1 | ||
value: "1" |
The released vLLM version doesn't support our metrics yet, right? If so, then we can't use it now.
Yes, that is why the tests are failing. I will switch back to V0
I don't think that's it; the integration test doesn't use this deployment yaml.
I think the test is failing because this PR introduces some randomness to the selection.
The algorithm is not using the waiting_lora_adapters metric, right?
// The value of 50 is arrived at heuristically based on experiments.
queueingThresholdLoRA = 50
// The value of 128 is arrived at heuristically based on experiments.
queueingThresholdLoRA = 128
I think we should make this configurable perhaps via a flag for now. Different environments will likely need different thresholds.
I would rather leverage this to make it configurable: #16
I don't think we have time to do an API change for the next release. Given we already had to change it for different accelerator types, it's important to have this knob configurable. Exposing it as a flag seems straightforward and gives us time to gather feedback on this before making an API change.
I took a look; IIUC, adding this flag is not straightforward, the way the scheduler is written. If it's needed for the next release, I would rather have it in another PR.
Defining a flag for each parameter is tedious; we can use a versioned configuration file instead. This is called ComponentConfig; ideally we do that for #383.
Here is JobSet's config file as an example: https://github.com/kubernetes-sigs/jobset/tree/main/api/config/v1alpha1
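As a rough idea of what such a versioned configuration could look like, here is a small sketch; every name in it is hypothetical, and nothing like this exists in the repo yet.

```go
// Package config sketches a hypothetical versioned configuration type in the
// spirit of JobSet's configuration API linked above; none of these names
// exist in the repo today.
package config

// EndpointPickerConfiguration would be loaded from a file at startup instead
// of defining one flag per tuning knob.
type EndpointPickerConfiguration struct {
	APIVersion string `yaml:"apiVersion"` // hypothetical group/version string
	Kind       string `yaml:"kind"`       // e.g. "EndpointPickerConfiguration" (hypothetical)

	// QueueingThresholdLoRA is the queue-depth threshold used by the
	// LoRA scheduling filters (128 in this PR).
	QueueingThresholdLoRA int `yaml:"queueingThresholdLoRA"`

	// LoraAffinityThreshold is the probability of preferring a pod that
	// already has the requested adapter loaded (0.999 in this PR).
	LoraAffinityThreshold float64 `yaml:"loraAffinityThreshold"`
}
```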
Because setting an env var is not validated in general (i.e., if you set an env var that the binary doesn't care about, nothing will happen), while flags are stricter.
It is ugly either way, and the lack of validation is bad, but I would rather prioritize making it clear that this is a temporary knob that will likely either not exist in the future or be set via a different mechanism.
I suggest not blocking this PR on adding the env vars, and doing that as a follow-up for this and other algorithm parameters.
Sounds good to me. I am OK to lgtm once the integration test is fixed and a unit test is added for the new filter.
"but we need to evolve it in a way that the algorithm offers self tuning for the most part.": My concern is that we may never have a truly model-server configuration agnostic load balancing algorithm. For example, vLLM exposes settings like max_num_seq, max_lora, max_lora_rank, and max_num_batched_tokens, and the optimal load balancing strategy will depend on these parameters—many of which may not be directly available to the Gateway. While we could choose to scrape them as needed, I’m not sure if that’s the best design choice. I believe there will always be some scheduling configuration parameters that require independent tuning. For now, using environment variables works, but in the long run, we might want to make them configurable as load balancing parameters.
I agree, I am just not confident that the current set of parameters are the ones we will actually keep moving forward. We are very early, and our benchmarking so far is limited to a few use cases, so only when we have more benchmarks across different model sizes/accelerators/datasets, or wider deployments in practice, will we gain confidence about which knobs should be exposed.
queueingThresholdLoRA = 128
// TODO(https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/16) Make this configurable.
// loraAffinityThreshold indicates the probability with which we prefer a pod with LoRA affinity over a pod without but having room to fit more LoRA adapters.
loraAffinityThreshold = 0.999
do you have some insights to show why this is needed and why this value is picked?
I picked it after some trial and error. This value worked well when we had skewed traffic for different adapters; it helped spread out high-QPS adapters while keeping low-QPS adapters less spread out.
I believe we need to update the hermetic_test case "select active lora, low queue", given the new probabilistic behavior. You can set the pods without the requested lora adapter to have no room so they will never be picked.
Yes, the current test might fail sometimes. I tried to make it probabilistic but it's more complicated than I thought. For now I fixed it as you suggested, by setting the pods without the requested LoRA adapter to have no room so they will never be picked.
It is; we are now checking both waiting + running adapters to determine affinity.
/retest
@@ -47,3 +47,5 @@ The model server MUST expose the following LoRA adapter metrics via the same Pro
  requested adapter. Example: `"max_lora": "8"`.
* `running_lora_adapters`: A comma separated list of adapters that are currently loaded in GPU
  memory and ready to serve requests. Example: `"running_lora_adapters": "adapter1, adapter2"`
* `waiting_lora_adapters`: A comma separated list of adapters that are currently loaded in GPU
  memory and ready to serve requests. Example: `"waiting_lora_adapters": "adapter1, adapter2"`
update the docs, this reads exactly the same as the running one
/approve
Thanks a lot, this is great, leaving it to Cong to lgtm.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, kaushikmitr

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
// The value of 50 is arrived at heuristically based on experiments.
queueingThresholdLoRA = 50
// The value of 128 is arrived at heuristically based on experiments.
queueingThresholdLoRA = 128
This is a new "config" CRD? If we think about heterogeneous pools in the future, then we will need different configurations per "segment" of the pool. Though by carefully designing the semantics, such as a default config for the pool that each segment can override, we can still be future-proof. But we need a bit more discussion on this. I will create an issue and raise this in the community meeting.
For now, I still think adding a flag is the fastest way to unblock this.
"adding this flag is not straightforward, the way the scheduler is written": If plumbing is too cumbersome, you can define this as a flag in the current file, and then in main.go just reference it like `_ = scheduler.LoraAffinityThreshold`; this should work. A sketch follows below.
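A minimal, runnable sketch of that flag approach; the flag name and the standalone-program wiring are illustrative, not the PR's code. In the real repo the variable would live in the scheduling package, exported so that main.go can reference it before flag.Parse().

```go
package main

import (
	"flag"
	"fmt"
)

// In the real layout this would sit next to the scheduler constants (e.g.
// pkg/epp/scheduling), exported so that main.go can reference it as
// scheduler.LoraAffinityThreshold before calling flag.Parse().
var LoraAffinityThreshold = flag.Float64(
	"lora-affinity-threshold", 0.999,
	"Probability of preferring a pod with LoRA affinity over a pod that merely has room for another adapter.")

func main() {
	flag.Parse()
	fmt.Println("loraAffinityThreshold =", *LoraAffinityThreshold)
}
```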
/lgtm with a few nits. Feel free to ping me to lgtm again if you want to address them.
@@ -142,6 +142,7 @@ func TestKubeInferenceModelRequest(t *testing.T) {
	KVCacheUsagePercent: 0.2,
	ActiveModels: map[string]int{
		"foo": 1,
		"bar": 1,
nit: Can you explain in the comment on line 121 that, because no pods have room for a new LoRA, the pod with LoRA affinity will always be picked?
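For illustration, the clarifying comment could read roughly like this; the fixture values are from the diff above, and the wording is only a suggestion.

```go
// Every pod in this case already runs its maximum number of adapters, so no
// pod has room to load a new LoRA; the pod that already serves the requested
// adapter is therefore always picked, which keeps this case deterministic
// despite the probabilistic filter.
ActiveModels: map[string]int{
	"foo": 1,
	"bar": 1,
},
```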
logger := logutil.NewTestLogger()

const (
	testModelName = "test-model"
nit: not used?
const (
	testModelName = "test-model"
	testAffinityModel = "test-affinity-model"
	numIterations = 10000
having to run 10k times seems a lot. How long does it take? Any chance we can reduce it?
It takes 0.29 seconds; I had it at 1000 initially, which took 0.16 seconds. 1000 is fine, but given the probability I have is very skewed (99.9%), 10000 iterations assign some pods to both cases (~9990 and ~10).
If the issue is the 99.9% threshold, you can change the percentage in tests to make it easier to test, like 90%
Yup, that's a bit complicated though (for now), as the threshold is defined in the scheduler as a const and the filter template does not take any arbitrary parameter as input. I would have to update the const for the test and reset it back. I would rather test against what's already set in the scheduler; 0.29 sec should not be too bad, and if it is I can change it to 1000 (the test would still work, since it has a 5% tolerance).
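For reference, here is a self-contained sketch of the statistical check being discussed; the random selection is stubbed out as a plain coin flip instead of the real filter call, and the names and the 5% tolerance mirror the discussion above but are otherwise assumptions.

```go
package scheduling

import (
	"math"
	"math/rand"
	"testing"
)

const (
	loraAffinityThreshold = 0.999
	numIterations         = 10000
	tolerance             = 0.05 // the 5% tolerance mentioned above
)

// pickAffinity stands in for one run of the probabilistic filter: true means
// the pod with LoRA affinity was chosen over a pod that merely has room.
func pickAffinity() bool { return rand.Float64() < loraAffinityThreshold }

func TestAffinitySelectionDistribution(t *testing.T) {
	affinityCount := 0
	for i := 0; i < numIterations; i++ {
		if pickAffinity() {
			affinityCount++
		}
	}
	got := float64(affinityCount) / float64(numIterations)
	if math.Abs(got-loraAffinityThreshold) > tolerance {
		t.Errorf("affinity pod picked %.4f of the time, want %.3f within %.2f", got, loraAffinityThreshold, tolerance)
	}
}
```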
}

// Identify if the returned pod is the affinity pod or available pod
if _, exists := result[0].ActiveModels[testAffinityModel]; exists {
nit: I recommend checking the pod name directly; it's more readable and reliable (imagine the ActiveModels map being modified during tests for some reason in the future).
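A name-based variant of the check could read roughly as follows; the field carrying the pod name and the fixture name are assumptions, not the PR's code.

```go
// Check the selected pod by name instead of inspecting ActiveModels
// ("affinity-pod" is a hypothetical fixture name).
if result[0].Name != "affinity-pod" {
	t.Errorf("expected the affinity pod to be selected, got %q", result[0].Name)
}
```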
/lgtm
Thanks @kaushikmitr! This PR introduces a unit test based on probabilities. We chatted offline with @kaushikmitr: with the large number of iterations, no flakiness was found across 100 test runs. So although the test is not deterministic, in practice it should be very reliable.
/unhold
This pull request includes several changes to the deployment configuration, metrics collection, and scheduling logic. The most important changes are updating metrics collection to include waiting adapters and implementing a new pod selection strategy that balances load while considering model affinity.

Scheduling Logic Enhancements:

* `pkg/epp/scheduling/filter.go`: Replaced the `loRAAffinityPredicate` function with a new `loRASoftAffinityPredicate` function that prioritizes pods with existing model affinity while allowing for load balancing through randomization (as long as there is room to fit another adapter in the pod).
* `pkg/epp/scheduling/scheduler.go`: Updated the scheduling configuration to use the new `loRASoftAffinityPredicate` function and increased the `queueingThresholdLoRA` value from 50 to 128. Added a new `loraAffinityThreshold` constant to indicate the probability of preferring pods with model affinity.
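The soft-affinity behavior described above can be sketched as follows. This is a minimal, self-contained illustration rather than the PR's actual code: the PodMetrics type, its field names, and the helper name are assumptions; only the 0.999 threshold and the idea of counting running + waiting adapters as affinity come from the PR.

```go
package main

import (
	"fmt"
	"math/rand"
)

// loraAffinityThreshold mirrors the constant added in the PR: the probability
// of preferring a pod that already serves the requested adapter.
const loraAffinityThreshold = 0.999

// PodMetrics is a stand-in for the endpoint picker's per-pod metrics; the
// real type and field names differ.
type PodMetrics struct {
	Name            string
	ActiveModels    map[string]int // running + waiting adapters on this pod
	MaxActiveModels int            // adapter slots reported by the model server (max_lora)
}

// softAffinityFilter keeps pods that already serve the adapter with
// probability loraAffinityThreshold; otherwise it falls back to pods that
// still have room to load it, so high-QPS adapters can spread out over time.
func softAffinityFilter(model string, pods []*PodMetrics) []*PodMetrics {
	var affinity, available []*PodMetrics
	for _, p := range pods {
		switch {
		case p.ActiveModels[model] > 0:
			affinity = append(affinity, p)
		case len(p.ActiveModels) < p.MaxActiveModels:
			available = append(available, p)
		}
	}
	if len(affinity) > 0 && (len(available) == 0 || rand.Float64() < loraAffinityThreshold) {
		return affinity
	}
	if len(available) > 0 {
		return available
	}
	return affinity // no pod has room; stick with affinity pods (possibly empty)
}

func main() {
	pods := []*PodMetrics{
		{Name: "pod-a", ActiveModels: map[string]int{"tweet-summary-0": 1}, MaxActiveModels: 2},
		{Name: "pod-b", ActiveModels: map[string]int{}, MaxActiveModels: 2},
	}
	for _, p := range softAffinityFilter("tweet-summary-0", pods) {
		fmt.Println(p.Name) // almost always "pod-a"
	}
}
```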
Deployment Configuration Changes:

* `config/manifests/vllm/deployment.yaml`: Added new command-line arguments for `--compilation-config`, `--max-num-seqs`, and `--max-lora-rank`. Added a new environment variable `VLLM_USE_V1`.
Metrics Collection Updates:

* `pkg/epp/backend/vllm/metrics.go`: Added a new metric `LoraRequestInfoWaitingAdaptersMetricName` and updated the `promToPodMetrics` and `getLatestLoraMetric` functions to handle waiting adapters. They also fall back to the previously seen running + waiting adapters if there are no current running or waiting adapters.
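As a rough illustration of how running and waiting adapters could be merged for the affinity check: the label names come from the metrics documentation above, while the helper and its placement are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// mergeAdapters combines the running_lora_adapters and waiting_lora_adapters
// label values (comma-separated lists) into one set that affinity checks can
// consult, so an adapter that is still queued to load counts as present.
func mergeAdapters(running, waiting string) map[string]int {
	adapters := map[string]int{}
	for _, list := range []string{running, waiting} {
		for _, name := range strings.Split(list, ",") {
			if name = strings.TrimSpace(name); name != "" {
				adapters[name]++
			}
		}
	}
	return adapters
}

func main() {
	fmt.Println(mergeAdapters("adapter1, adapter2", "adapter3"))
	// Output: map[adapter1:1 adapter2:1 adapter3:1]
}
```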