higher buckets for workqueue histograms #2625
/kind support
/kind feature

Would increasing the bucket size be beneficial to provide that info if that limit is never reached? I am not certain why these sizes were picked, but 1000s seems a bit long and arbitrary for something that may never hit that limit. I'm interested in understanding why these upper bounds would be helpful beyond the need to capture high latency in reconciliation requests.
You'd want the last defined bucket to be past the upper bound of what can be considered "reasonable" values, so you can estimate where the top end of the values you observe is (monitoring systems work by interpolating between bucket bounds to estimate percentiles). The +Inf bucket captures everything, but isn't useful for interpolation since it's infinity. We regularly see "normal" queue latency on the order of 1-2 minutes for our larger clusters with controllers that need to make many external calls (not a great situation to be in), so placing the upper bound at the next step up (1000s) seems reasonable (and hopefully nothing normally operates at 1000s latency).

Aside: if Prometheus ever stabilizes their native histograms (true exponential buckets), then we wouldn't need to choose these bucket bounds so carefully.
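To illustrate the interpolation point above, here is a rough Go sketch of how a percentile estimate is derived from cumulative buckets. This is not Prometheus's actual `histogram_quantile` implementation, just an approximation of the same idea, and the bucket values in `main` are made up:

```go
package main

import (
	"fmt"
	"math"
)

// bucket mirrors one cumulative histogram bucket: the count of observations
// whose value is <= upperBound.
type bucket struct {
	upperBound float64 // +Inf for the final bucket
	count      float64 // cumulative count of observations
}

// estimateQuantile roughly mimics how a monitoring system interpolates a
// percentile from cumulative buckets: find the bucket the target rank falls
// into and interpolate linearly between its lower and upper bound. If the
// rank lands in the +Inf bucket there is no finite upper bound to
// interpolate toward, so the best available answer is the last finite bound.
func estimateQuantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	lowerBound, lowerCount := 0.0, 0.0
	for i, b := range buckets {
		if rank <= b.count {
			if math.IsInf(b.upperBound, +1) || i == len(buckets)-1 {
				return lowerBound // estimate is capped at the last finite bound
			}
			return lowerBound + (b.upperBound-lowerBound)*(rank-lowerCount)/(b.count-lowerCount)
		}
		lowerBound, lowerCount = b.upperBound, b.count
	}
	return lowerBound
}

func main() {
	// Made-up distribution: 60% of observations are slower than the 10s
	// bucket, e.g. a stuck controller. With 10s as the last finite bound,
	// the p99 estimate is still reported as only 10s.
	buckets := []bucket{
		{upperBound: 1, count: 10},
		{upperBound: 10, count: 40},
		{upperBound: math.Inf(+1), count: 100},
	}
	fmt.Println(estimateQuantile(0.99, buckets)) // prints 10
}
```

This is why the last finite bucket needs to sit above the worst latency you still consider plausible: everything beyond it collapses into +Inf and can no longer be distinguished.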
I don't believe this will have any other implications or break anything from before, so I feel like this is a reasonable request. @sbueringer WDYT?

With this proposal to increase the buckets, pending buy-in from the maintainers, do you want to do the PR? @seankhliao
Controllers making many external requests in large clusters may have normal operating latency on the order of ~100s. Add buckets that cover the range, allowing metric backends to interpolate percentile estimates properly. Fixes kubernetes-sigs#2625
sure, I've sent #2638
/close

Resolved by #2638
@troy0820: Closing this issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Thank you for this change :) Our buckets were pinned at 10s and weren't useful enough on their own for us to create alerts.
The workqueue subsystem exposes several metrics, including the `workqueue_queue_duration_seconds` and `workqueue_work_duration_seconds` histograms. These currently have exponential buckets going from 1ns -> 10s.

controller-runtime/pkg/metrics/workqueue.go, lines 53 to 65 at 1b80b96

I think these should go up a further 2 steps to include 100s and 1000s as buckets (and maybe drop the lowest 2 buckets if we're concerned about the number), especially for `workqueue_queue_duration_seconds`, as a stuck or poorly implemented controller can have quite high latency in processing reconciliation requests.