
[Metrics] Add average kv cache and waiting queue size metrics for inference pool #304

Merged
merged 1 commit into from
Feb 10, 2025

Conversation

JeffLuoo
Contributor

@JeffLuoo JeffLuoo commented Feb 7, 2025

No description provided.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 7, 2025
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and kfswain February 7, 2025 16:10
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 7, 2025
@k8s-ci-robot
Contributor

Hi @JeffLuoo. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@JeffLuoo
Contributor Author

JeffLuoo commented Feb 7, 2025

cc for review:

@courageJ @liu-cong

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 7, 2025
netlify bot commented Feb 7, 2025

Deploy Preview for gateway-api-inference-extension ready!

🔨 Latest commit: 801e18c
🔍 Latest deploy log: https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67aa1e3c1c99a30008f48ca5
😎 Deploy Preview: https://deploy-preview-304--gateway-api-inference-extension.netlify.app

@ahg-g
Contributor

ahg-g commented Feb 7, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 7, 2025
[]string{"name"},
)

inferencePoolAvgQueueSize = compbasemetrics.NewGaugeVec(
Contributor

@raywainman we are reporting the average queue length across model servers; what alternatives would you suggest for using this with HPA? Can HPA consume a distribution and do the aggregation on its end, so the user has more flexibility in how to aggregate?

/cc @smarterclayton


HPA can't consume a distribution directly today unless we put a Prometheus adapter in front of the metric and convert it to a direct gauge metric (which is doable). For example, you could do something like "get the 90th-percentile queue size over the last 5 minutes" this way. Do we anticipate that being useful?

If so, we could maybe emit both: one simple gauge metric emitting the instantaneous average queue size across all model servers, and another metric with a distribution.

@JeffLuoo what do you think?
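The "emit both" idea discussed above can be sketched in plain Go. This is a minimal illustration, not the PR's implementation: the real metrics are registered through the project's metrics package (compbasemetrics), and all names here (`queueStats`, `Observe`, `AvgQueueSize`) are hypothetical.

```go
package main

import "fmt"

// queueStats sketches emitting both a gauge-style instantaneous average
// and a histogram-style distribution of waiting-queue sizes.
// All names here are hypothetical; the real implementation would back
// these with registered Prometheus metrics.
type queueStats struct {
	buckets []float64      // histogram upper bounds, e.g. 1, 5, 10, 50
	counts  []int          // observations per bucket (last slot is +Inf)
	latest  map[string]int // last observed queue size per model server
}

func newQueueStats(buckets []float64) *queueStats {
	return &queueStats{
		buckets: buckets,
		counts:  make([]int, len(buckets)+1), // +1 for the +Inf bucket
		latest:  map[string]int{},
	}
}

// Observe records one scrape of a model server's waiting-queue size,
// updating both the per-server latest value and the distribution.
func (q *queueStats) Observe(server string, size int) {
	q.latest[server] = size
	for i, ub := range q.buckets {
		if float64(size) <= ub {
			q.counts[i]++
			return
		}
	}
	q.counts[len(q.buckets)]++ // falls into the +Inf bucket
}

// AvgQueueSize is the gauge-style value: the instantaneous average of
// the latest sample from every model server.
func (q *queueStats) AvgQueueSize() float64 {
	if len(q.latest) == 0 {
		return 0
	}
	total := 0
	for _, s := range q.latest {
		total += s
	}
	return float64(total) / float64(len(q.latest))
}

func main() {
	q := newQueueStats([]float64{1, 5, 10, 50})
	q.Observe("vllm-0", 2)
	q.Observe("vllm-1", 8)
	fmt.Println(q.AvgQueueSize()) // prints 5
}
```

With a gauge like this, HPA can target the average directly today, while the distribution leaves room for percentile-based targets behind an adapter later.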

Contributor Author

In our benchmarking, we scrape gauge metrics for cache utilization and queue size. Let's discuss whether a distribution for queue size, or other metrics from the model servers, would be more helpful.

Inference pool metrics are calculated directly from model server metrics (vLLM in the current implementation).

Contributor Author

@JeffLuoo JeffLuoo Feb 10, 2025

Let's target adding the new metrics (e.g. percentiles) in a follow-up CL to unblock this one.

Collaborator

That sounds great; I made #306 to track it.

podTotalCount++
if val, ok := p.podMetrics.Load(pod.Name); ok {
pm := val.(*PodMetrics)
kvCacheTotal += pm.KVCacheUsagePercent


Just a high-level thought...

As an optimization, would we ever consider doing this calculation as part of the actual logic in leastKVCacheFilterFunc(req *LLMRequest, pods []*backend.PodMetrics)? Then we would be computing these metrics directly in line with the endpoint-picking logic and could get the absolute freshest value.

Contributor

Actually, one thing to consider here is that the probing will be frequent, and currently the podMetrics map only reflects the latest probed value. We should consider aggregating over a time window to avoid oscillations. @liu-cong @kaushikmitr, did we think about this in the context of the endpoint-picking algorithm (i.e., using the absolute last value vs. aggregating over a window)?
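The windowed aggregation suggested above could look like the following, a minimal sketch assuming we smooth the last N probes in a ring buffer rather than keeping only the latest value (all names hypothetical):

```go
package main

import "fmt"

// windowedMetric keeps the last `capacity` probe values in a ring
// buffer and reports their mean, smoothing the oscillations that a
// single latest-value sample can show.
type windowedMetric struct {
	capacity int
	values   []float64
	next     int  // ring-buffer write position
	full     bool // true once the buffer has wrapped at least once
}

func newWindowedMetric(capacity int) *windowedMetric {
	return &windowedMetric{capacity: capacity, values: make([]float64, capacity)}
}

// Record stores one probe, overwriting the oldest sample once the
// window is full.
func (w *windowedMetric) Record(v float64) {
	w.values[w.next] = v
	w.next = (w.next + 1) % w.capacity
	if w.next == 0 {
		w.full = true
	}
}

// Mean averages over however many samples exist, up to the window size.
func (w *windowedMetric) Mean() float64 {
	n := w.next
	if w.full {
		n = w.capacity
	}
	if n == 0 {
		return 0
	}
	sum := 0.0
	for _, v := range w.values[:n] {
		sum += v
	}
	return sum / float64(n)
}

func main() {
	w := newWindowedMetric(4)
	for _, q := range []float64{10, 0, 10, 0, 8} { // oscillating probes
		w.Record(q)
	}
	fmt.Println(w.Mean()) // mean of the last 4 probes (0, 10, 0, 8): prints 4.5
}
```

One window per pod (or per metric) would slot into the probing loop where the map is updated today; the endpoint picker could then read the smoothed mean instead of the raw latest sample.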

@JeffLuoo JeffLuoo force-pushed the inference-pool-metrics branch from b167c1e to 2f552c1 Compare February 7, 2025 19:41
@JeffLuoo JeffLuoo requested review from raywainman and ahg-g February 7, 2025 19:42
@raywainman raywainman left a comment

Overall LGTM, this lays out a good foundation and we can build on this by adding more metrics over time.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 7, 2025
@JeffLuoo JeffLuoo force-pushed the inference-pool-metrics branch from 2f552c1 to 801e18c Compare February 10, 2025 15:41
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2025
@JeffLuoo JeffLuoo requested a review from raywainman February 10, 2025 15:42
@kfswain
Collaborator

kfswain commented Feb 10, 2025

Looks great! Thanks for this, really cool to see metrics at the pool level coming out.

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JeffLuoo, kfswain

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 10, 2025
@k8s-ci-robot k8s-ci-robot merged commit 7149624 into kubernetes-sigs:main Feb 10, 2025
8 checks passed