Improve the filter to return multiple preferred pods instead of one; also fix metrics update bug #17
Conversation
555dbd5 to e3a4c1e
/assign @ahg-g
Nice, what algorithm was used in those graphs? leastKVCache?
@kaushikmitr we should run an experiment for the single use case; we are not showing that in https://docs.google.com/document/d/1lIG9iwpl9-Iaf3iSDfjDwjq5vaT3buBNJ677oBIBKQs/edit?tab=t.0#heading=h.arbqxiuvplvc
go func() {
	for {
		time.Sleep(5 * time.Second)
		klog.Infof("===DEBUG: Current Pods and metrics: %+v", p.AllPodMetrics())
Why not print this as part of the refresh loop above instead of spawning a separate goroutine?
The refresh loop runs much more frequently (every 20ms in my benchmark).
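For context on the trade-off being discussed, here is a minimal sketch (not the actual provider code; the RefreshAll/AllPodMetrics method set and package name are assumptions) of keeping the 20ms refresh loop separate from a much slower debug dump, with the dump gated behind klog.V(4) in line with the log-level convention introduced in this PR:

package provider

import (
	"time"

	"k8s.io/klog/v2"
)

// metricsProvider stands in for the real provider type in this PR; the method
// set is assumed for illustration only.
type metricsProvider interface {
	RefreshAll()
	AllPodMetrics() []interface{}
}

// startLoops runs a fast metrics refresh loop (every 20ms) and a separate,
// much slower debug-dump loop (every 5s), so the debug output does not flood
// the log at the refresh frequency.
func startLoops(p metricsProvider) {
	go func() {
		for {
			time.Sleep(20 * time.Millisecond)
			p.RefreshAll() // frequent metrics scrape/refresh
		}
	}()
	go func() {
		for {
			time.Sleep(5 * time.Second)
			// Per-loop debug logs are gated behind a high verbosity level.
			klog.V(4).Infof("Current pods and metrics: %+v", p.AllPodMetrics())
		}
	}()
}

The alternative would be to log from the refresh loop itself every Nth iteration; the sketch above simply mirrors the separation chosen in the diff.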
			filtered = append(filtered, pod)
		}
	}
	return filtered, nil
}

// leastKVCacheFilterFunc finds the max and min KV cache of all pods, divides the whole range
// (max-min) by the number of pods, and finds the pods that fall into the first range.
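As an illustration of the comment above, here is a minimal sketch of the range-based selection; the PodMetrics type and its KVCacheUsagePercent field are assumptions for illustration, not necessarily the actual types in this repo:

package filter

// PodMetrics is a minimal stand-in for the real pod metrics type; field names
// are assumed for illustration.
type PodMetrics struct {
	Name                string
	KVCacheUsagePercent float64
}

// leastKVCacheSketch keeps every pod whose KV-cache usage falls into the first
// of N equal-width buckets between the observed min and max, where N is the
// number of pods, mirroring the comment above.
func leastKVCacheSketch(pods []*PodMetrics) []*PodMetrics {
	if len(pods) == 0 {
		return nil
	}
	min, max := pods[0].KVCacheUsagePercent, pods[0].KVCacheUsagePercent
	for _, p := range pods {
		if p.KVCacheUsagePercent < min {
			min = p.KVCacheUsagePercent
		}
		if p.KVCacheUsagePercent > max {
			max = p.KVCacheUsagePercent
		}
	}
	// Bucket width; every pod at or below min+width is kept, so multiple
	// preferred pods can be returned rather than a single winner.
	width := (max - min) / float64(len(pods))
	var filtered []*PodMetrics
	for _, p := range pods {
		if p.KVCacheUsagePercent <= min+width {
			filtered = append(filtered, p)
		}
	}
	return filtered
}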
What is the intuition behind setting the range this way compared to finding the least-k elements for example?
Nothing in particular. I will try to compare a few different algos and tune them with different parameters to see the difference. But for now I don't have data to see which one is better.
Updated the comment with the reason behind this. And added a TODO to compare with other strategies.
The current algo is least queuing -> LoRA affinity -> least KV cache. I also benchmarked a single adapter; the current gateway algo performed better as well, though the improvement is smaller.
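To make the chaining idea concrete, here is a hypothetical sketch of composing the three filters so that each stage hands a set of preferred pods to the next; the filterFunc signature, the filter names, and the Request type are illustrative (not the actual code in this PR) and reuse the PodMetrics stand-in from the sketch above:

// Request is a minimal stand-in for the routing request; illustrative only.
type Request struct {
	Model string
}

// filterFunc takes the candidate pods left by the previous stage and returns
// the subset it prefers. The signature is an assumption for illustration.
type filterFunc func(req *Request, pods []*PodMetrics) ([]*PodMetrics, error)

// chain composes filters in order (e.g. least queuing -> LoRA affinity ->
// least KV cache). Because each stage returns a set of preferred pods rather
// than a single pod, the next stage still has candidates to rank.
func chain(filters ...filterFunc) filterFunc {
	return func(req *Request, pods []*PodMetrics) ([]*PodMetrics, error) {
		current := pods
		for _, f := range filters {
			next, err := f(req, current)
			if err != nil {
				return nil, err
			}
			// In this sketch a stage that prefers nothing is skipped rather
			// than emptying the candidate set; the real fallback behavior in
			// the PR may differ.
			if len(next) > 0 {
				current = next
			}
		}
		return current, nil
	}
}

Usage would look something like chain(leastQueuingFilter, loraAffinityFilter, leastKVCacheFilter)(req, allPods), with the individual filter names being assumptions.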
…also fix metric update bug

Returning multiple preferred pods improves the probability of matching the next preference filter.

Also updated log levels:
- Level 1, 2: for non-spammy logs like initialization
- Level 3+: for per-request debug logs
- Level 4+: for per-loop logs like periodically scraping metrics
Nice. I assume most of the gains come from the LoRA affinity; it would be interesting to evaluate this filter alone to set a baseline for kv-cache and queue-length based routing. /approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ahg-g, liu-cong. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
The update also allows picking multiple preferred pods instead of one, which improves efficiency when combining with other filters. If we only return one pod, it is more likely that this pod is not preferred by the other filtering criteria.
Also updated log levels:
- Level 1, 2: for non-spammy logs like initialization
- Level 3+: for per-request debug logs
- Level 4+: for per-loop logs like periodically scraping metrics
I made some silly bugs in the initial implementation which caused some metrics not to be updated. I was also misled by my previous benchmark results, which had lots of errors from the external load balancer. This time I ran the benchmarks within the cluster with no errors, and upgraded the benchmark tooling to https://github.com/GoogleCloudPlatform/ai-on-gke/tree/main/benchmarks/benchmark/tools/profile-generator.
The results show that the gateway consistently outperforms a basic k8s Service in terms of output token throughput and latency (serving 10 adapters on 6 vLLM replicas, with at most 4 adapters loaded in GPU).
I will share a more detailed benchmark analysis in a doc.