higher buckets for workqueue histograms #2625
/kind support
/kind feature

Would increasing the bucket size be beneficial to provide that info if that limit is never reached? I am not certain why these sizes were picked, but 1000s seems a bit long and arbitrary for something that may never hit that limit. I'm interested in understanding why these upper bounds would be helpful beyond the need to capture high latency in reconciliation requests.
You'd want the last defined bucket to be past the upper bound of what can be considered "reasonable" values, so you can estimate where the top end of the values you observe is (monitoring systems work by interpolating between bucket bounds to estimate percentiles). The +Inf bucket captures everything, but isn't useful for interpolation since it's infinity. We regularly see "normal" queue latency on the order of 1-2 minutes for our larger clusters with controllers that need to make many external calls (not a great situation to be in), so placing the upper bound at the next step up (1000s) seems reasonable (and hopefully nothing normally operates at 1000s latency).

Aside: if Prometheus ever stabilizes their native histograms (true exponential buckets), then we wouldn't need to choose these bucket bounds so carefully.
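To illustrate the interpolation point above, here is a rough Go sketch of how a percentile estimate is derived from cumulative buckets. This is not Prometheus's actual `histogram_quantile` implementation, just an approximation of the same idea, and the bucket values in `main` are made up:

```go
package main

import (
	"fmt"
	"math"
)

// bucket mirrors one cumulative histogram bucket: the count of observations
// whose value is <= upperBound.
type bucket struct {
	upperBound float64 // +Inf for the final bucket
	count      float64 // cumulative count of observations
}

// estimateQuantile roughly mimics how a monitoring system interpolates a
// percentile from cumulative buckets: find the bucket the target rank falls
// into and interpolate linearly between its lower and upper bound. If the
// rank lands in the +Inf bucket there is no finite upper bound to
// interpolate toward, so the best available answer is the last finite bound.
func estimateQuantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	lowerBound, lowerCount := 0.0, 0.0
	for i, b := range buckets {
		if rank <= b.count {
			if math.IsInf(b.upperBound, +1) || i == len(buckets)-1 {
				return lowerBound // estimate is capped at the last finite bound
			}
			return lowerBound + (b.upperBound-lowerBound)*(rank-lowerCount)/(b.count-lowerCount)
		}
		lowerBound, lowerCount = b.upperBound, b.count
	}
	return lowerBound
}

func main() {
	// Made-up distribution: 60% of observations are slower than the 10s
	// bucket, e.g. a stuck controller. With 10s as the last finite bound,
	// the p99 estimate is still reported as only 10s.
	buckets := []bucket{
		{upperBound: 1, count: 10},
		{upperBound: 10, count: 40},
		{upperBound: math.Inf(+1), count: 100},
	}
	fmt.Println(estimateQuantile(0.99, buckets)) // prints 10
}
```

This is why the last finite bucket needs to sit above the worst latency you still consider plausible: everything beyond it collapses into +Inf and can no longer be distinguished.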
I don't believe this will have any other implications or break anything from before, so I feel like this is a reasonable request. @sbueringer WDYT?

With this proposal to increase the buckets, pending buy-in from the maintainers, do you want to do the PR? @seankhliao
Controllers making many external requests in large clusters may have normal operating latency on the order of ~100s. Add buckets that cover the range, allowing metric backends to interpolate percentile estimates properly. Fixes kubernetes-sigs#2625
sure, I've sent #2638
/close

Resolved by #2638
@troy0820: Closing this issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Thank you for this change :) Our buckets were pinned at 10s and weren't useful enough on their own for us to create alerts.
The workqueue subsystem exposes several metrics, including the `workqueue_queue_duration_seconds` and `workqueue_work_duration_seconds` histograms. These currently have exponential buckets going from 1ns -> 10s.

controller-runtime/pkg/metrics/workqueue.go, lines 53 to 65 at 1b80b96

I think these should go up a further 2 steps to include 100s and 1000s as buckets (and maybe drop the lowest 2 buckets if we're concerned about the number), especially for `workqueue_queue_duration_seconds`, as a stuck or poorly implemented controller can have quite high latency in processing reconciliation requests.