Refactor scheduler to make it more readable #645
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: liu-cong. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
✅ Deploy Preview for gateway-api-inference-extension ready!
Force-pushed from a5035ff to df2cfb5
```go
logger.V(logutil.DEBUG).Info(fmt.Sprintf("Scheduling a request. Metrics: %+v", sCtx.podsSnapshot))
// ...
var filter Filter
if req.Critical {
	// ...
```
It's much cleaner this way to show how we handle critical vs. sheddable requests differently than the previous filter did.
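For readers skimming the thread, here is a minimal, self-contained sketch of the pattern the diff shows: pick the filter chain up front based on request criticality. The types and chain names are simplified stand-ins for illustration, not the PR's exact definitions.

```go
package scheduling

// Simplified stand-ins, not the PR's actual types.
type LLMRequest struct {
	Critical            bool
	ResolvedTargetModel string
}

type PodMetrics struct{ Name string }

// Filter narrows the candidate pod set for a request.
type Filter interface {
	Filter(req *LLMRequest, pods []PodMetrics) ([]PodMetrics, error)
}

// selectFilter picks the filter chain up front based on criticality.
func selectFilter(req *LLMRequest, critical, sheddable Filter) Filter {
	if req.Critical {
		// Critical requests must be scheduled, so this chain degrades
		// gracefully instead of dropping the request.
		return critical
	}
	// Sheddable requests may end up with zero candidate pods, in which
	// case the request is dropped.
	return sheddable
}
```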
```go
// before:
if _, exists := pod.GetMetrics().ActiveModels[req.ResolvedTargetModel]; exists {
// after:
if active || waiting {
```
This is where we consider both active and waiting adapters as LoRA affinity. The refactor just makes that super clear.
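A minimal sketch of that check, assuming simplified metric maps in place of the real `backendmetrics` types:

```go
package scheduling

// The two maps mirror the PR's split of ActiveModels vs. WaitingModels;
// types are simplified stand-ins.
type podMetrics struct {
	ActiveModels  map[string]int // adapters currently running on the pod
	WaitingModels map[string]int // adapters queued to run on the pod
}

// hasLoRAAffinity reports whether the pod already knows about the adapter,
// either running it (active) or about to run it (waiting); both count as
// affinity because neither requires loading the adapter from scratch.
func hasLoRAAffinity(pod podMetrics, adapter string) bool {
	_, active := pod.ActiveModels[adapter]
	_, waiting := pod.WaitingModels[adapter]
	return active || waiting
}
```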
```go
// spreading the load of a LoRA adapter across multiple pods, avoiding "pinning" all requests to
// a single pod. This gave good performance in our initial benchmarking results in the scenario
// where # of lora slots > # of lora adapters.
func lowLoRACostPredicate(req *LLMRequest, pod backendmetrics.PodMetrics) bool {
	// ...
```
This is the original LoRA affinity filter. It considered a pod with an active LoRA adapter equally with a pod that has space to load a new LoRA. Later on we introduced the `loRASoftAffinityFilter` below, which generally favors pods with active LoRAs but keeps a small probability of choosing a pod with space to load a new LoRA, to avoid hot spots. The new affinity filter is considered better than the original, and there is no clear reason to keep the original one.
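Roughly, the soft-affinity behavior described here can be sketched like this (the threshold constant and helper names are illustrative assumptions, not the PR's actual code):

```go
package scheduling

import "math/rand"

// Usually stick to pods that already serve the adapter, but occasionally
// pick a pod with free slots so a single pod doesn't become a hot spot.
const loraAffinityThreshold = 0.999 // fraction of requests that keep affinity (assumed value)

func loraSoftAffinitySelect(affinity, available []string) []string {
	// With only one non-empty group there is no choice to make.
	if len(affinity) == 0 {
		return available
	}
	if len(available) == 0 {
		return affinity
	}
	if rand.Float64() < loraAffinityThreshold {
		// Favor pods that already have the adapter loaded.
		return affinity
	}
	// Occasionally spread load to a pod with room for a new adapter.
	return available
}
```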
Ack! Will take a look later today; this looks awesome at a glance. Thanks!
If I read the first chart correctly, the QPS goes all the way up to 3000? 😮
Not the real QPS ... I set the QPS to 300, with 10 adapters, so that's how the 3000 is calculated (300 × 10). The LPG tool isn't able to actually send QPS that high, though. If you look at the …
This refactor does the following:

- Separate `ActiveModels` and `WaitingModels`, which map to the running and waiting adapters metrics. Previously we combined both as `ActiveModels`; this change makes the distinction clear.
- Introduce a `scheduling.Context` object that holds the contextual info during a request scheduling cycle. This has 2 benefits: a) any contextual info can be added to this struct instead of modifying the scheduler interface, making the scheduler easier to extend (we will soon need this for the prefix caching implementation); b) it creates a snapshot of the pod metrics for the scheduling cycle, which reduces concurrent access to the shared datastore and provides a consistent view of the pods and metrics during scheduling. This makes debugging easier.
- Introduce `simpleFilter`, which pairs a user-readable filter name with the filter function, making the filter chain composition much cleaner. (A rough sketch of these two structures follows below.)
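As a rough, hedged sketch of the `scheduling.Context` and `simpleFilter` shapes described above (field and type names are simplified stand-ins; see the diff for the real definitions):

```go
package scheduling

type llmRequest struct{ Critical bool }

type podMetricsSnapshot struct{ Name string }

// Context holds everything one scheduling cycle needs. New per-request info
// (e.g. for prefix caching) can be added here without changing the scheduler
// interface.
type Context struct {
	req          *llmRequest
	podsSnapshot []podMetricsSnapshot // consistent view; no concurrent datastore reads mid-cycle
}

// simpleFilter pairs a human-readable name (useful in logs) with the filter
// function itself, so filter chains compose and read cleanly.
type simpleFilter struct {
	name   string
	filter func(ctx *Context, pods []podMetricsSnapshot) []podMetricsSnapshot
}
```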
Benchmarks
I ran benchmarks using the LPG tool, with 10 LoRA adapters sharing the same traffic and 8 vLLM replicas running on H100 with max-lora=4.
EPP metrics and vLLM metrics, comparing the baseline and the refactor:
Baseline


Refactor

