[Metrics] Add average kv cache and waiting queue size metrics for inference pool #304
Conversation
Hi @JeffLuoo. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
✅ Deploy Preview for gateway-api-inference-extension ready!
/ok-to-test
[]string{"name"}, | ||
) | ||
|
||
inferencePoolAvgQueueSize = compbasemetrics.NewGaugeVec( |
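For context on the snippet above, a complete gauge definition with k8s.io/component-base/metrics looks roughly like the sketch below. The metric names, help strings, and stability level are illustrative assumptions, not necessarily the exact values in this PR.

```go
package metrics

import (
	compbasemetrics "k8s.io/component-base/metrics"
)

var (
	// Hypothetical names and help text; the PR's actual values may differ.
	inferencePoolAvgKVCache = compbasemetrics.NewGaugeVec(
		&compbasemetrics.GaugeOpts{
			Subsystem:      "inference_pool",
			Name:           "average_kv_cache_utilization",
			Help:           "Average KV cache utilization across the model server pods in the pool.",
			StabilityLevel: compbasemetrics.ALPHA,
		},
		[]string{"name"},
	)

	inferencePoolAvgQueueSize = compbasemetrics.NewGaugeVec(
		&compbasemetrics.GaugeOpts{
			Subsystem:      "inference_pool",
			Name:           "average_queue_size",
			Help:           "Average number of requests waiting in the model server queues.",
			StabilityLevel: compbasemetrics.ALPHA,
		},
		[]string{"name"},
	)
)
```

Registering such gauges with legacyregistry.MustRegister (from k8s.io/component-base/metrics/legacyregistry) in an init function would be the usual pattern, assuming the project follows standard component-base conventions.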
@raywainman we are reporting the average queue length across model servers; what alternatives would you suggest for consuming this with HPA? Can HPA consume a distribution and do the aggregation on its end, so that the user has more flexibility in how to aggregate?
/cc @smarterclayton
HPA can't consume a distribution directly today unless we put a Prometheus adapter in front of the metric and convert it to a plain gauge metric (which is doable). For example, you could express "get the 90th-percentile queue size over the last 5 minutes" that way. Do we anticipate that being useful?
If so, we could maybe emit both: one simple gauge metric reporting the instantaneous average queue size across all model servers, and another metric with a distribution.
@JeffLuoo what do you think?
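To make the "emit both" idea concrete, here is a hedged sketch of what the companion distribution metric could look like next to the gauge, again using k8s.io/component-base/metrics. The metric name and bucket boundaries are assumptions for illustration, not part of this PR.

```go
// Hypothetical companion histogram: a Prometheus adapter could turn a
// quantile of this distribution into a plain gauge that HPA can consume.
var queueSizeDistribution = compbasemetrics.NewHistogramVec(
	&compbasemetrics.HistogramOpts{
		Subsystem:      "inference_pool",
		Name:           "queue_size",
		Help:           "Per-pod waiting queue sizes observed at each metrics probe.",
		Buckets:        []float64{0, 1, 2, 4, 8, 16, 32, 64, 128, 256},
		StabilityLevel: compbasemetrics.ALPHA,
	},
	[]string{"name"},
)
```

Each probe would then Observe every pod's queue size, and a consumer could derive, say, a windowed 90th percentile on top of it.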
In our benchmarking, we scrape gauge metrics for cache utilization and queue size. Let's discuss whether a distribution for queue size is more helpful, or whether other metrics from the model servers are more helpful.
Inference pool metrics are calculated directly from model server metrics (vLLM in the current implementation).
Let's target adding the new metrics (e.g. percentiles) in a follow-up CL so this CL isn't blocked.
That sounds great, made #306 to track
pkg/ext-proc/backend/provider.go (Outdated)
podTotalCount++
if val, ok := p.podMetrics.Load(pod.Name); ok {
	pm := val.(*PodMetrics)
	kvCacheTotal += pm.KVCacheUsagePercent
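Presumably the surrounding loop walks every pod, sums the probed values, and divides by the pod count to set the pool-level gauges. A hedged sketch of that shape is below; the function name, the queueTotal/WaitingQueueSize fields, and the Record* helpers are hypothetical stand-ins, not the exact code in this PR.

```go
// Hedged sketch of the pool-level averaging; only p.podMetrics, PodMetrics,
// podTotalCount, and KVCacheUsagePercent come from the diff above.
func (p *Provider) refreshPoolMetrics(pods []*Pod) {
	var kvCacheTotal float64
	var queueTotal, podTotalCount int
	for _, pod := range pods {
		podTotalCount++
		if val, ok := p.podMetrics.Load(pod.Name); ok {
			pm := val.(*PodMetrics)
			kvCacheTotal += pm.KVCacheUsagePercent
			queueTotal += pm.WaitingQueueSize
		}
	}
	if podTotalCount == 0 {
		return
	}
	// Record* are hypothetical setters for the pool-level gauges defined earlier.
	RecordInferencePoolAvgKVCache(kvCacheTotal / float64(podTotalCount))
	RecordInferencePoolAvgQueueSize(float64(queueTotal) / float64(podTotalCount))
}
```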
Just a high-level thought...
As an optimization, would we ever consider doing this calculation as part of the actual logic in
func leastKVCacheFilterFunc(req *LLMRequest, pods []*backend.PodMetrics) ([]*backend.PodMetrics, error) {
Then we are computing these metrics directly in-line with the endpoint picking logic and could get the absolute freshest value.
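Roughly what that could look like, as a hedged sketch: the filter already receives the freshest per-pod metrics, so it could update the pool gauge as a side effect before picking. The RecordInferencePoolAvgKVCache helper is an assumption for illustration, not code from this PR.

```go
// Hypothetical variant of leastKVCacheFilterFunc that reports the pool-level
// average KV cache utilization from the same snapshot used for picking.
func leastKVCacheFilterFunc(req *LLMRequest, pods []*backend.PodMetrics) ([]*backend.PodMetrics, error) {
	if len(pods) > 0 {
		var total float64
		for _, pm := range pods {
			total += pm.KVCacheUsagePercent
		}
		RecordInferencePoolAvgKVCache(total / float64(len(pods)))
	}
	// ...the existing least-KV-cache selection logic would follow unchanged...
	return pods, nil
}
```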
Actually, one thing to consider here is that the probing will be frequent, and currently the podMetrics map only reflects the latest probed value. We should consider aggregating over a time window to avoid oscillations. @liu-cong @kaushikmitr did we think about this in the context of the endpoint picking algorithm (i.e., using the absolute last value vs aggregation over a window)?
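As an illustration of the window idea (not code from this PR), per-pod values could be smoothed over the last N probes before reporting or endpoint picking; the type and field names below are made up.

```go
// Illustrative-only sliding window over probed metric values.
type metricWindow struct {
	samples []float64 // most recent probed values, oldest first
	size    int       // maximum number of samples to retain
}

// add records a newly probed value, evicting the oldest once the window is full.
func (w *metricWindow) add(v float64) {
	w.samples = append(w.samples, v)
	if len(w.samples) > w.size {
		w.samples = w.samples[1:]
	}
}

// mean returns the windowed average, smoothing probe-to-probe oscillation.
func (w *metricWindow) mean() float64 {
	if len(w.samples) == 0 {
		return 0
	}
	var sum float64
	for _, s := range w.samples {
		sum += s
	}
	return sum / float64(len(w.samples))
}
```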
Overall LGTM, this lays out a good foundation and we can build on this by adding more metrics over time.
/lgtm
Looks great! Thanks for this, really cool to see metrics at the pool level coming out. /lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: JeffLuoo, kfswain
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
No description provided.