scheduling changes for lora affinity load balancing #423
nit: not used?
Having to run 10k times seems like a lot. How long does it take? Any chance we can reduce it?
It takes 0.29 seconds; I had it at 1000 initially, which took 0.16 seconds. 1000 is fine, but given the probability I have is very skewed (99.9%), 10000 assigns some pods to both cases (~9990, 100).
If the issue is the 99.9% threshold, you can change the percentage in the tests to make it easier to test, e.g. 90%.
Yup, that's a bit complicated for now, as the threshold is defined in the scheduler as a const and the filter template does not take any arbitrary parameter as input. I would have to update the const for the test and reset it afterwards. I would rather test against what's already set in the scheduler; 0.29 sec should not be too bad, and if it is I can change it to 1000 (the test would still work, since it has a 5% tolerance).
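For context, the distribution check under discussion could look roughly like the sketch below; the threshold value, the 5% tolerance, and the runFilter placeholder are illustrative assumptions, not the PR's actual test code.

```go
package scheduling_test

import (
	"math"
	"math/rand"
	"testing"
)

// runFilter is a placeholder simulating the probabilistic filter: it picks
// the affinity pod with probability 0.999, mirroring the threshold above.
func runFilter(r *rand.Rand) string {
	if r.Float64() < 0.999 {
		return "pod-with-adapter"
	}
	return "pod-without-adapter"
}

func TestLoraAffinityDistribution(t *testing.T) {
	const (
		trials    = 10000 // large enough that the 0.1% branch is exercised
		expected  = 0.999 // probability of picking a pod with the adapter loaded
		tolerance = 0.05  // the 5% tolerance mentioned above
	)

	r := rand.New(rand.NewSource(42)) // fixed seed keeps the test deterministic
	hits := 0
	for i := 0; i < trials; i++ {
		if runFilter(r) == "pod-with-adapter" {
			hits++
		}
	}

	observed := float64(hits) / float64(trials)
	if math.Abs(observed-expected) > expected*tolerance {
		t.Errorf("affinity ratio %.4f not within %.0f%% of expected %.3f",
			observed, tolerance*100, expected)
	}
}
```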
nit: I recommend checking the pod name directly; it's more readable and reliable (imagine ActiveModels being modified during tests for some reason in the future).
I think we should make this configurable, perhaps via a flag for now. Different environments will likely need different thresholds.
I would rather leverage this to make it configurable: #16
I don't think we have time to do an API change for the next release. Given we already had to change it for different accelerator types, it's important to have this knob configurable. Exposing it as a flag seems straightforward and gives us time to gather feedback before making an API change.
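For illustration only, exposing the threshold as a flag could look roughly like the sketch below; the flag name, default value, and wiring are assumptions, not code from this PR.

```go
package main

import "flag"

// Hypothetical flag; in this PR the threshold lives in the scheduler as a const.
var loraAffinityThreshold = flag.Float64(
	"lora-affinity-threshold",
	0.999,
	"probability of routing to a pod that already has the requested LoRA adapter loaded",
)

func main() {
	flag.Parse()
	_ = *loraAffinityThreshold // would be passed into the scheduler/filter here
}
```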
I took a look; IIUC, adding this flag is not straightforward, the way the scheduler is written. If it's needed for the next release, I would rather have it in another PR.
Defining a flag for each parameter is tedious; we can use a versioned configuration file instead. This is called ComponentConfig, and ideally we do that for #383.
Here is JobSet's config file as an example: https://github.com/kubernetes-sigs/jobset/tree/main/api/config/v1alpha1
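As a loose sketch of the ComponentConfig idea (the type names, group/version, and fields below are hypothetical, not an agreed schema):

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// EndpointPickerConfiguration is a hypothetical versioned config type,
// loaded at startup instead of defining one flag per parameter.
type EndpointPickerConfiguration struct {
	metav1.TypeMeta `json:",inline"`

	// Scheduling groups the tunables for the scheduling algorithm.
	Scheduling SchedulingConfiguration `json:"scheduling,omitempty"`
}

// SchedulingConfiguration holds the algorithm knobs discussed in this PR.
type SchedulingConfiguration struct {
	// LoraAffinityThreshold is the probability of preferring pods that
	// already have the requested LoRA adapter loaded.
	LoraAffinityThreshold *float64 `json:"loraAffinityThreshold,omitempty"`
}
```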
Because setting an env var is not validated in general (i.e., if you set an env var that the binary doesn't care about, nothing happens), while with flags it is more strict.
It is ugly either way, and the lack of validation is bad, but I would rather prioritize asserting that this is a temporary knob that will likely either not exist in the future or be set via a different mechanism.
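To make the validation point concrete, a bare env-var read like this hypothetical sketch silently falls back to the default on a misspelled name or a bad value, whereas an unknown flag makes the binary refuse to start:

```go
package scheduling

import (
	"os"
	"strconv"
)

// loraAffinityThresholdFromEnv is a hypothetical env-var reader; the variable
// name and default are assumptions for illustration.
func loraAffinityThresholdFromEnv() float64 {
	const def = 0.999
	raw, ok := os.LookupEnv("LORA_AFFINITY_THRESHOLD")
	if !ok {
		return def // an unset (or misspelled) variable goes unnoticed
	}
	v, err := strconv.ParseFloat(raw, 64)
	if err != nil || v <= 0 || v > 1 {
		return def // invalid values are also silently dropped here
	}
	return v
}
```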
I suggest not blocking this PR on adding the env vars, and doing that as a follow-up for this and other algorithm parameters.
Sounds good to me. I am OK to LGTM once the integration test is fixed and a unit test is added for the new filter.
"but we need to evolve it in a way that the algorithm offers self tuning for the most part.": My concern is that we may never have a truly model-server configuration agnostic load balancing algorithm. For example, vLLM exposes settings like max_num_seq, max_lora, max_lora_rank, and max_num_batched_tokens, and the optimal load balancing strategy will depend on these parameters—many of which may not be directly available to the Gateway. While we could choose to scrape them as needed, I’m not sure if that’s the best design choice. I believe there will always be some scheduling configuration parameters that require independent tuning. For now, using environment variables works, but in the long run, we might want to make them configurable as load balancing parameters.
I agree; I am just not confident that the current set of parameters is the one we will actually have moving forward. We are very early, and our benchmarking so far is limited to a few use cases, so only when we have more benchmarks across different model sizes/accelerators/datasets, or have deployed more widely in practice, will we be confident about which knobs should be exposed.
Do you have some insight into why this is needed and why this value was picked?
I picked it after some trial and error. This value worked well when we had skewed traffic across different adapters; it helped spread out high-QPS adapters while keeping low-QPS adapters less spread out.
I believe we need to update the hermetic_test case "select active lora, low queue", given the new probabilistic behavior. You can set the pods without the requested lora adapter to have no room so they will never be picked.
Yes, the current test might fail sometimes. I tried to make it probabilistic, but it's more complicated than I thought. For now I fixed it as you suggested, by setting the pods without the requested lora adapter to have no room so they will never be picked.
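A self-contained illustration of that fix, using made-up types rather than the project's actual metrics structs: pods without the requested adapter are already at LoRA capacity, so the affinity pod is always chosen.

```go
package scheduling

// fakePodMetrics is an illustrative stand-in for the test's pod metrics.
type fakePodMetrics struct {
	Name           string
	ActiveAdapters map[string]bool // adapters currently loaded on the pod
	MaxActiveLoras int             // capacity for concurrently loaded adapters
}

// saturatedPodWithout builds a pod that does not have the requested adapter
// and has no room to load it, so the filter can never select it.
func saturatedPodWithout(requested string) fakePodMetrics {
	return fakePodMetrics{
		Name:           "pod-without-" + requested,
		ActiveAdapters: map[string]bool{"adapter-a": true, "adapter-b": true},
		MaxActiveLoras: 2, // already full, so the requested adapter cannot be loaded
	}
}
```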
nit: Can you explain in the comment on line 121 that, because no pods have room for a new lora, the pod with lora affinity will always be picked?