Refactor scheduler to make it more readable #645

Open

liu-cong wants to merge 1 commit into main from refactor

Conversation

@liu-cong (Contributor) commented Apr 3, 2025

This refactor does the following:

  1. Explicitly define ActiveModels and WaitingModels, which map to the running and waiting adapter metrics respectively. Previously we combined both as ActiveModels; this change makes the distinction clear.
  2. Create a scheduling.Context object that holds the contextual info for a request scheduling cycle. This has 2 benefits: a) any contextual info can be added to this struct instead of modifying the scheduler interface, making the scheduler easier to extend (we will soon need this for the prefix caching implementation); b) it creates a snapshot of the pod metrics for the scheduling cycle, which reduces concurrent access to the shared datastore and provides a consistent view of the pods and metrics during scheduling, making debugging easier.
  3. Create a simpleFilter that pairs a user-readable filter name with a filter function, making filter chain composition much cleaner (see the illustrative Go sketch after this list).
  4. This is the only logic change. Previously we had 2 slightly different LoRA affinity filters: one is considered the "main" LoRA affinity filter, and the other was only used when all pods were significantly queued up (>128 waiting requests). As far as I can tell, there was no clear reason to keep both; the "main" one was simply added later to account for long queues. In this PR I consolidated on the "main" LoRA affinity filter, and benchmark results show the refactored version performs almost identically to HEAD.
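To make the shape of these changes concrete, here is a minimal Go sketch of a per-cycle scheduling context snapshot and a named filter. The type and field names below (podSnapshot, llmRequest, Context, filterFunc, simpleFilter) are illustrative simplifications for this discussion, not the exact types introduced by the PR.

package scheduling

// podSnapshot is an illustrative, immutable copy of one pod's metrics taken at
// the start of a scheduling cycle, so filters see a consistent view and do not
// need to read the shared datastore concurrently.
type podSnapshot struct {
	Name          string
	WaitingQueue  int                 // number of queued requests on the pod
	ActiveModels  map[string]struct{} // adapters currently running on the pod
	WaitingModels map[string]struct{} // adapters waiting to run on the pod
}

// llmRequest is a simplified stand-in for the request type seen by filters.
type llmRequest struct {
	ResolvedTargetModel string
	Critical            bool
}

// Context carries the per-cycle state; new fields (e.g. for prefix caching)
// can be added here without changing the scheduler interface.
type Context struct {
	Req          *llmRequest
	PodsSnapshot []podSnapshot
}

// filterFunc narrows the candidate pods for a request.
type filterFunc func(ctx *Context, pods []podSnapshot) []podSnapshot

// simpleFilter pairs a human-readable name with a filter function, which keeps
// filter chain composition and debug logging readable.
type simpleFilter struct {
	name   string
	filter filterFunc
}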

Benchmarks

I ran benchmarks using the LPG tool, with 10 LoRA adapters sharing the same traffic, and 8 vLLM replicas running on H100 GPUs with max-lora=4.

  • LPG reported data between baseline and refactor (chart omitted)

  • EPP metrics and vLLM metrics between the baseline and refactor (charts omitted for both baseline and refactor)

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: liu-cong
Once this PR has been reviewed and has the lgtm label, please assign ahg-g for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from ahg-g and danehans April 3, 2025 05:27
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 3, 2025
@liu-cong (Contributor, Author) commented Apr 3, 2025

cc @kaushikmitr @kfswain

netlify bot commented Apr 3, 2025

Deploy Preview for gateway-api-inference-extension ready!

Latest commit: 87162b4
Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67f067432baed800080c0429
Deploy Preview: https://deploy-preview-645--gateway-api-inference-extension.netlify.app

@liu-cong liu-cong force-pushed the refactor branch 2 times, most recently from a5035ff to df2cfb5 Compare April 3, 2025 05:33
logger.V(logutil.DEBUG).Info(fmt.Sprintf("Scheduling a request. Metrics: %+v", sCtx.podsSnapshot))

var filter Filter
if req.Critical {
liu-cong (Contributor, Author):

It's much cleaner this way to show how we handle critical vs. sheddable requests differently than with the previous filter.
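As a hedged illustration of that split, here is a sketch reusing the simplified types from the sketch in the PR description; the chain contents and the queueCapacity threshold are assumptions for illustration, not the PR's actual filter chains.

// chainFor picks a different filter chain for critical vs. sheddable requests,
// rather than encoding the branch inside a single filter.
func chainFor(req *llmRequest) []simpleFilter {
	const queueCapacity = 128 // assumed saturation threshold, illustration only

	leastQueuing := simpleFilter{
		name: "leastQueuing",
		filter: func(ctx *Context, pods []podSnapshot) []podSnapshot {
			// Keep only the pods with the shortest waiting queue.
			best, shortest := []podSnapshot{}, int(^uint(0)>>1)
			for _, p := range pods {
				if p.WaitingQueue < shortest {
					best, shortest = []podSnapshot{p}, p.WaitingQueue
				} else if p.WaitingQueue == shortest {
					best = append(best, p)
				}
			}
			return best
		},
	}
	if req.Critical {
		// Critical requests are always scheduled somewhere.
		return []simpleFilter{leastQueuing}
	}
	// Sheddable requests may be dropped (empty result) when every pod is saturated.
	dropIfSaturated := simpleFilter{
		name: "dropIfSaturated",
		filter: func(ctx *Context, pods []podSnapshot) []podSnapshot {
			var out []podSnapshot
			for _, p := range pods {
				if p.WaitingQueue < queueCapacity {
					out = append(out, p)
				}
			}
			return out
		},
	}
	return []simpleFilter{dropIfSaturated, leastQueuing}
}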


if _, exists := pod.GetMetrics().ActiveModels[req.ResolvedTargetModel]; exists {
if active || waiting {
liu-cong (Contributor, Author):

This is where we consider both active and waiting adapters as LoRA affinity. The refactor just makes that explicit.
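A minimal sketch of that check, using the simplified types from the earlier sketch:

// hasLoRAAffinity is an illustrative predicate: a pod counts as affine to the
// requested adapter if the adapter is either already running (active) or
// queued to run (waiting) on that pod.
func hasLoRAAffinity(req *llmRequest, pod podSnapshot) bool {
	_, active := pod.ActiveModels[req.ResolvedTargetModel]
	_, waiting := pod.WaitingModels[req.ResolvedTargetModel]
	return active || waiting
}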

// spreading the load of a LoRA adapter across multiple pods, avoiding "pinning" all requests to
// a single pod. This gave good performance in our initial benchmarking results in the scenario
// where # of lora slots > # of lora adapters.
func lowLoRACostPredicate(req *LLMRequest, pod backendmetrics.PodMetrics) bool {
liu-cong (Contributor, Author):

This is the original LoRA affinity filter. It treated a pod with an active LoRA adapter the same as a pod with space to load a new one.

Later we introduced the loRASoftAffinityFilter below, which generally favors pods with active LoRAs but, with some probability, picks a pod with space to load a new LoRA to avoid hot spots.

The newer affinity filter is considered better than the original, and there is no clear reason to keep both, so this PR consolidates on the newer one.
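For illustration, here is a sketch of the soft-affinity idea described above, reusing the simplified types and the hasLoRAAffinity helper from the earlier sketches; the 0.8 probability and the maxLoRA parameter are assumptions, not the values used by this PR.

import "math/rand"

// loraAffinityThreshold is an assumed probability, for illustration: with this
// probability we keep routing to pods that already have the adapter; otherwise
// we allow a pod with a free slot so a popular adapter is not pinned to one pod.
const loraAffinityThreshold = 0.8

// softLoRAAffinity sketches the soft-affinity behavior described above.
// maxLoRA stands in for the pod's adapter capacity (e.g. vLLM's max-lora).
func softLoRAAffinity(req *llmRequest, pods []podSnapshot, maxLoRA int) []podSnapshot {
	affine, canLoad := []podSnapshot{}, []podSnapshot{}
	for _, p := range pods {
		if hasLoRAAffinity(req, p) {
			affine = append(affine, p)
		} else if len(p.ActiveModels)+len(p.WaitingModels) < maxLoRA {
			canLoad = append(canLoad, p)
		}
	}
	// Mostly prefer affine pods, but occasionally spread to a pod with a free slot.
	if len(affine) > 0 && (len(canLoad) == 0 || rand.Float64() < loraAffinityThreshold) {
		return affine
	}
	if len(canLoad) > 0 {
		return canLoad
	}
	return affine
}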

@kfswain (Collaborator) commented Apr 3, 2025

Ack! Will take a look later today; this looks awesome at a glance. Thanks!

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 4, 2025
@kaushikmitr (Contributor):
If I read the first chart correctly, the QPS goes all the way up to 3000? 😮

@liu-cong (Contributor, Author) commented Apr 4, 2025

If I read the first chart correctly the QPS goes all the way up to 3000? 😮

Not the real QPS. I set the QPS to 300 with 10 adapters, which is how the 3000 is calculated; the LPG tool isn't actually able to send QPS that high. If you look at the InferenceModel metrics on EPP, the real QPS was about 30 requests/s per adapter.
