✨ Add 100/1000s buckets for prometheus workqueue histograms #2638
Conversation
Welcome @seankhliao!

Hi @seankhliao. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Controllers making many external requests in large clusters may have normal operating latency on the order of ~100s. Add buckets that cover that range, allowing metric backends to interpolate percentile estimates properly.

For #2625
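For readers skimming the thread, here is a minimal sketch of the kind of change being discussed, assuming the workqueue histograms are defined via prometheus.NewHistogramVec the way controller-runtime's workqueue metrics provider does it. Metric names, help strings, and the exact bucket options are illustrative, not the literal diff:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Queue latency: how long an item waits in the workqueue before being requested.
	// ExponentialBuckets(10e-9, 10, 12) yields 10ns, 100ns, ..., 10s, 100s, 1000s,
	// i.e. the previous 10ns–10s range plus the two new upper buckets.
	queueLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Subsystem: "workqueue",
		Name:      "queue_duration_seconds",
		Help:      "How long in seconds an item stays in workqueue before being requested",
		Buckets:   prometheus.ExponentialBuckets(10e-9, 10, 12),
	}, []string{"name"})

	// Work duration: how long processing an item takes once it has been picked up.
	workDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Subsystem: "workqueue",
		Name:      "work_duration_seconds",
		Help:      "How long in seconds processing an item from workqueue takes",
		Buckets:   prometheus.ExponentialBuckets(10e-9, 10, 12),
	}, []string{"name"})
)
```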
Force-pushed from 8175828 to f4fe233
/ok-to-test
/retitle ✨ Add 100/1000s buckets for prometheus workqueue histograms
It can't merge without approval, but I want the maintainers to be able to take a look at this to get consensus around this solution.

/hold
Not really, no. This would be indicative of doing long-running and/or blocking operations in your Reconcile, which you shouldn't do; use
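For context, a hedged sketch of the pattern this comment appears to allude to: rather than blocking inside Reconcile on slow external work, return early and let the workqueue requeue the object. WidgetReconciler and externalWorkFinished are hypothetical names used only for illustration:

```go
package example

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

type WidgetReconciler struct {
	// client, external API handles, etc. omitted for brevity
}

func (r *WidgetReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Check on slow external work instead of waiting for it inline.
	done, err := r.externalWorkFinished(ctx, req)
	if err != nil {
		return ctrl.Result{}, err
	}
	if !done {
		// Come back later rather than blocking this worker goroutine.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	return ctrl.Result{}, nil
}

// externalWorkFinished is a hypothetical helper that polls the external system.
func (r *WidgetReconciler) externalWorkFinished(ctx context.Context, req ctrl.Request) (bool, error) {
	return true, nil
}
```

Returning RequeueAfter keeps the worker goroutine free, so reconcile durations stay short even when the external system is slow.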
Or you have a poorly written vendor controller, or we just have a lot of work and we're starved of parallelism. Either way, I'd like to be able to observe how poorly it's performing beyond just 10s.
Not opposed to this, @alvaroaleman? Seems a small improvement?
In the past I used controller_runtime_reconcile_time_seconds (controller-runtime/pkg/internal/controller/metrics/metrics.go, lines 51 to 56 in d10ae95).
(Might be helpful retroactively in any case.) In general, no objection to adding additional buckets. I have no idea what the ideal buckets are, to be honest. For what it's worth, upstream still only has buckets up to 10s: https://github.com/kubernetes/component-base/blob/master/metrics/prometheus/workqueue/metrics.go#L55-L69
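As a point of reference, the linked upstream histograms top out at 10s because they use a ten-step exponential spread. The snippet below is an illustrative sketch (assuming prometheus.ExponentialBuckets is the mechanism in play) that prints what those boundaries expand to and what two extra steps would add:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// 10 buckets starting at 10ns, growing 10x each step:
	// 10ns, 100ns, 1µs, 10µs, 100µs, 1ms, 10ms, 100ms, 1s, 10s
	fmt.Println(prometheus.ExponentialBuckets(10e-9, 10, 10))

	// Extending the count to 12 adds the 100s and 1000s buckets proposed here.
	fmt.Println(prometheus.ExponentialBuckets(10e-9, 10, 12))
}
```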
For folks who are not familiar with metrics, it can look pretty confusing that the buckets are capped at 10s (it makes it look as if durations are never slower than that).
I have no idea what is ideal either. Initially, in the issue, I wasn't trying to encourage more buckets, as that may or may not impact performance; however, an upper bound beyond 10s seemed necessary.

However, to stay consistent with upstream, would this be beneficial to add and not use the other metrics as you proposed?
The controller_runtime_reconcile_time_seconds metric only roughly matches one of the workqueue metrics, so it definitely makes sense for folks to also use the workqueue metrics. They are not 100% the same data; it's just that (I think) one of them is very close. Performance-wise it's probably fine. I guess it's just ~20% more data for the two histogram metrics. As the number of workqueues in a controller is usually not that high, I don't think it should become a problem. (Not 100% sure if I understood "add and not use the other metrics as you proposed?" correctly)
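To make the "~20% more data" estimate concrete (a rough back-of-the-envelope, assuming ten existing buckets per histogram): going from 10 to 12 buckets adds 2 `_bucket` series per workqueue for each of the two histogram metrics, and 12/10 = 1.2, i.e. about 20% more bucket series; the `_sum`, `_count`, and implicit `+Inf` series are unchanged.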
If we're concerned about data volumes, I think dropping the first 2 or 3 buckets might be ok? |
I discussed it with Alvaro. We're fine with the change for now. A larger issue we have is that all the metric configuration is done in

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer, seankhliao