fix inference extension not correctly scrape pod metrics #366
Hi @Kuromesi. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test.
@@ -45,7 +45,7 @@ func (p *PodMetricsClientImpl) FetchMetrics(
 	// Currently the metrics endpoint is hard-coded, which works with vLLM.
 	// TODO(https://github.com/kubernetes-sigs/gateway-api-inference-extension/issues/16): Consume this from InferencePool config.
-	url := fmt.Sprintf("http://%s/metrics", existing.Address)
+	url := existing.BuildScrapeEndpoint()
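For readers following the hunk, here is a minimal sketch of what a helper like `BuildScrapeEndpoint` might look like. The `Pod` type layout and the `ScrapePort` and `ScrapePath` field names are assumptions made for illustration; they are not taken from the PR's code.

```go
package main

import "fmt"

// Pod is a hypothetical stand-in for the per-pod record touched by this PR.
// ScrapePort and ScrapePath are assumed field names.
type Pod struct {
	Address    string // pod IP, without a port
	ScrapePort int32  // port the model server exposes metrics on
	ScrapePath string // path of the metrics endpoint, e.g. "/metrics"
}

// BuildScrapeEndpoint assembles the metrics URL from the stored parts,
// so the caller no longer hard-codes the port or the path.
func (p Pod) BuildScrapeEndpoint() string {
	return fmt.Sprintf("http://%s:%d%s", p.Address, p.ScrapePort, p.ScrapePath)
}

func main() {
	p := Pod{Address: "10.0.0.12", ScrapePort: 8000, ScrapePath: "/metrics"}
	fmt.Println(p.BuildScrapeEndpoint())
}
```

With the address holding only the IP, the old `fmt.Sprintf("http://%s/metrics", ...)` form would scrape the default HTTP port, which is the kind of breakage this hunk addresses.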
Can we just pass in the port number to the function and have the caller get the port number from the pool directly instead of adding the port to every pod?
Yes, but I did it this way for the following reasons:
- I think the scrape port and path should not change during the lifetime of a pod, so it is reasonable to store this info in PodMetrics.
- If we pass the port to FetchMetrics, we may need to call datastore.PoolGet() to get the targetPort. That would try to acquire the pool lock every 50 milliseconds, which I think may affect performance (but I have not tested this).
I can rewrite the code if you think these considerations are not reasonable. 🤔
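For comparison, the alternative raised in review (the caller reads the port from the pool and passes it in) could be sketched roughly as follows. The `Pool` type, its `TargetPort` field, and the `scrapeURL` helper are hypothetical names used only to illustrate the shape of that design.

```go
package main

import "fmt"

// Pool is a hypothetical stand-in for the pool configuration that holds the target port.
type Pool struct {
	TargetPort int32
}

// scrapeURL builds the metrics URL from a pod address and a port supplied by the caller,
// so the pod record itself does not need to carry the port.
func scrapeURL(address string, port int32) string {
	return fmt.Sprintf("http://%s:%d/metrics", address, port)
}

func main() {
	pool := Pool{TargetPort: 8000}
	// The caller looks up the port once per refresh and reuses it for every pod.
	for _, addr := range []string{"10.0.0.12", "10.0.0.13"} {
		fmt.Println(scrapeURL(addr, pool.TargetPort))
	}
}
```

The trade-off is the one described above: this keeps the pod record smaller, at the cost of a pool lookup on each refresh.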
OK, thanks for looking into it! By the way, #363 will merge first, so unfortunately we will need to rebase this PR, since ext-proc/backend/types.go will move to ext-proc/datastore/types.go.
Got it, thanks for your comments!
That PR has merged now; please rebase when you get a chance.
done!
btw the lock is a read lock, so it shouldn't add any overhead, but this is fine too.
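As a rough illustration of the read-lock point, here is a sketch using Go's `sync.RWMutex`. The `datastore` type and its methods are simplified, hypothetical stand-ins and not the project's actual datastore; only the locking pattern is the point.

```go
package main

import (
	"fmt"
	"sync"
)

// datastore is a hypothetical, simplified stand-in for the real datastore.
type datastore struct {
	mu         sync.RWMutex
	targetPort int32
}

// TargetPort takes the read lock, so concurrent readers do not block one another.
func (d *datastore) TargetPort() int32 {
	d.mu.RLock()
	defer d.mu.RUnlock()
	return d.targetPort
}

// SetTargetPort takes the write lock, which excludes readers and other writers.
func (d *datastore) SetTargetPort(p int32) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.targetPort = p
}

func main() {
	ds := &datastore{}
	ds.SetTargetPort(8000)

	// Several goroutines read at once; each one only holds the read lock.
	var wg sync.WaitGroup
	results := make([]int32, 4)
	for i := range results {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = ds.TargetPort()
		}(i)
	}
	wg.Wait()
	fmt.Println(results)
}
```

Readers only wait while a writer holds the lock, which should be rare if the pool is updated infrequently.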
/ok-to-test
This looks great! I like abstracting the scrape path as well!
Signed-off-by: Kuromesi <[email protected]>
/lgtm
/approve
Thanks again, this highlights the limitation of our e2e testing! I had verified the metrics in the ; but this change that moved the port out of the address was done in a second to last commit in that PR and so wasn't manually verified :(
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: ahg-g, Kuromesi
…sigs#366) Signed-off-by: Kuromesi <[email protected]>
resolve #365