fix inference extension not correctly scrape pod metrics #366

Merged
merged 1 commit into from
Feb 19, 2025

Conversation

Kuromesi
Contributor

resolve #365

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 19, 2025
@k8s-ci-robot
Contributor

Hi @Kuromesi. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 19, 2025

netlify bot commented Feb 19, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 4e0f9d7
🔍 Latest deploy log https://app.netlify.com/sites/gateway-api-inference-extension/deploys/67b5f99f65b0fd0008438f4a
😎 Deploy Preview https://deploy-preview-366--gateway-api-inference-extension.netlify.app

@@ -45,7 +45,7 @@ func (p *PodMetricsClientImpl) FetchMetrics(
 	// Currently the metrics endpoint is hard-coded, which works with vLLM.
 	// TODO(https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/16): Consume this from InferencePool config.
-	url := fmt.Sprintf("http://%s/metrics", existing.Address)
+	url := existing.BuildScrapeEndpoint()
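For readers without the full diff, here is a minimal sketch of what a helper like `BuildScrapeEndpoint` could look like once the port is no longer part of the address. The struct below is a trimmed-down stand-in, and the field names `ScrapePort` and `ScrapePath` are assumptions for illustration, not necessarily the names used in this PR.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// PodMetrics is a hypothetical, trimmed-down stand-in for the real type.
// ScrapePort and ScrapePath are assumed field names, not taken from the PR.
type PodMetrics struct {
	Address    string // pod IP or hostname, without a port
	ScrapePort int32  // port the model server exposes metrics on
	ScrapePath string // for example "/metrics"
}

// BuildScrapeEndpoint assembles the metrics URL from the pod's own fields,
// so the fetch loop does not need to look anything up per scrape.
func (pm PodMetrics) BuildScrapeEndpoint() string {
	// JoinHostPort also brackets IPv6 literals correctly.
	hostPort := net.JoinHostPort(pm.Address, strconv.Itoa(int(pm.ScrapePort)))
	return fmt.Sprintf("http://%s%s", hostPort, pm.ScrapePath)
}

func main() {
	pm := PodMetrics{Address: "10.0.0.12", ScrapePort: 8000, ScrapePath: "/metrics"}
	fmt.Println(pm.BuildScrapeEndpoint())
}
```

The old line formatted `http://%s/metrics` from the address alone, which drops the port whenever the address no longer carries one.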
Contributor

Can we just pass in the port number to the function and have the caller get the port number from the pool directly instead of adding the port to every pod?

Contributor Author

Yes, but I did it this way for the following reasons:

  • The scrape port and path should not change during the lifetime of a pod, so I think it is reasonable to store this info in PodMetrics.
  • If we pass the port to FetchMetrics, the caller may need to call datastore.PoolGet() to get the targetPort, which would try to acquire the pool lock every 50 milliseconds. I think this may affect performance (though I have not tested it).

I can rewrite the code if you think these considerations are not reasonable. 🤔
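For comparison, the alternative suggested by the reviewer (the caller reads the port from the pool and passes it in) might look roughly like the sketch below. `Pool`, `Datastore`, and `scrapeURL` are hypothetical stand-ins; only the names `PoolGet` and `targetPort` come from the discussion, and the real signatures may differ.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Pool and Datastore are hypothetical stand-ins for the real types.
type Pool struct {
	TargetPort int32
}

type Datastore struct {
	mu   sync.RWMutex
	pool *Pool
}

// PoolGet returns a copy of the pool while holding the read lock.
// In this design it would run once per scrape interval.
func (d *Datastore) PoolGet() (Pool, error) {
	d.mu.RLock()
	defer d.mu.RUnlock()
	if d.pool == nil {
		return Pool{}, errors.New("pool not initialized")
	}
	return *d.pool, nil
}

// scrapeURL builds the endpoint from a port supplied by the caller
// instead of one stored on each pod.
func scrapeURL(podIP string, port int32) string {
	return fmt.Sprintf("http://%s:%d/metrics", podIP, port)
}

func main() {
	ds := &Datastore{pool: &Pool{TargetPort: 8000}}
	pool, err := ds.PoolGet()
	if err != nil {
		fmt.Println("skip scrape:", err)
		return
	}
	fmt.Println(scrapeURL("10.0.0.12", pool.TargetPort))
}
```

The trade-off discussed here is between one lock acquisition per scrape (this sketch) and a little extra state on every pod (the approach taken in the PR).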

Contributor

ok, thanks for looking into it! btw #363 will merge first, so we will need to rebase this PR unfortunately since the ext-proc/backend/types.go will move to ext-proc/datastore/types.go.

Contributor Author

Got it, thanks for your comments!

Contributor

The PR merged now, pls rebase when you get a chance.

Contributor Author

done!

Contributor

btw the lock is a read lock, so it shouldn't add any overhead, but this is fine too.
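To illustrate the read-lock point: with Go's `sync.RWMutex`, several readers can hold the lock at the same time, so concurrent `RLock` calls do not serialize each other as long as no writer is involved. The sketch below is a standalone demonstration, not code from this repository, and `readersOverlap` is a hypothetical helper name. Each reader waits, while still holding the read lock, until every other reader has also acquired it. With an exclusive lock this would deadlock.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// readersOverlap starts n goroutines that all hold the read lock at once
// and returns how many of them acquired it.
func readersOverlap(n int) int32 {
	var mu sync.RWMutex
	var arrived, done sync.WaitGroup
	var held int32

	arrived.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			mu.RLock()
			defer mu.RUnlock()
			atomic.AddInt32(&held, 1)
			arrived.Done()
			// Block here, still holding the read lock, until all
			// other readers have acquired it too.
			arrived.Wait()
		}()
	}
	done.Wait()
	return atomic.LoadInt32(&held)
}

func main() {
	fmt.Println("readers holding the lock together:", readersOverlap(8))
}
```

A read lock is not entirely free (it still performs atomic operations), but readers only block when a writer holds or is waiting for the lock.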

@ahg-g
Contributor

ahg-g commented Feb 19, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 19, 2025
@kfswain
Collaborator

kfswain commented Feb 19, 2025

This looks great! I like abstracting the scrape path as well!
Thanks for the catch also. Will LGTM when we get the rebase

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 19, 2025
@kfswain
Collaborator

kfswain commented Feb 19, 2025

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 19, 2025
Contributor

@ahg-g ahg-g left a comment

/approve

Thanks again, this highlights the limitations of our e2e testing! I had verified the metrics in the ; but the change that moved the port out of the address was made in the second-to-last commit of that PR, so it wasn't manually verified :(

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, Kuromesi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 19, 2025
@k8s-ci-robot k8s-ci-robot merged commit 0f67df5 into kubernetes-sigs:main Feb 19, 2025
8 checks passed
rramkumar1 pushed a commit to rramkumar1/gateway-api-inference-extension that referenced this pull request Mar 3, 2025
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

inference extension not correctly scrape pod metrics
4 participants