✨ Add 100/1000s buckets for prometheus workqueue histograms #2638
Conversation
Welcome @seankhliao!

Hi @seankhliao. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Controllers making many external requests in large clusters may have normal operating latency on the order of ~100s. Add buckets that cover that range, allowing metric backends to interpolate percentile estimates properly.

For #2625
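For readers skimming the thread, here is a minimal sketch of the kind of change being discussed, assuming the workqueue histograms are defined via prometheus.NewHistogramVec the way controller-runtime's workqueue metrics provider does it. Metric names, help strings, and the exact bucket options are illustrative, not the literal diff:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Queue latency: how long an item waits in the workqueue before being requested.
	// ExponentialBuckets(10e-9, 10, 12) yields 10ns, 100ns, ..., 10s, 100s, 1000s,
	// i.e. the previous 10ns–10s range plus the two new upper buckets.
	queueLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Subsystem: "workqueue",
		Name:      "queue_duration_seconds",
		Help:      "How long in seconds an item stays in workqueue before being requested",
		Buckets:   prometheus.ExponentialBuckets(10e-9, 10, 12),
	}, []string{"name"})

	// Work duration: how long processing an item takes once it has been picked up.
	workDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Subsystem: "workqueue",
		Name:      "work_duration_seconds",
		Help:      "How long in seconds processing an item from workqueue takes",
		Buckets:   prometheus.ExponentialBuckets(10e-9, 10, 12),
	}, []string{"name"})
)
```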
Force-pushed from 8175828 to f4fe233
/ok-to-test
/retitle ✨ Add 100/1000s buckets for prometheus workqueue histograms
It can't merge without approval, but I want the maintainers to be able to take a look at this to get consensus around this solution.

/hold
Not really, no. This would be indicative of doing long-running and/or blocking operations in your Reconcile, which you shouldn't do; use
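For context, a hedged sketch of the pattern this comment appears to allude to: rather than blocking inside Reconcile on slow external work, return early and let the workqueue requeue the object. WidgetReconciler and externalWorkFinished are hypothetical names used only for illustration:

```go
package example

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

type WidgetReconciler struct {
	// client, external API handles, etc. omitted for brevity
}

func (r *WidgetReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Check on slow external work instead of waiting for it inline.
	done, err := r.externalWorkFinished(ctx, req)
	if err != nil {
		return ctrl.Result{}, err
	}
	if !done {
		// Come back later rather than blocking this worker goroutine.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	return ctrl.Result{}, nil
}

// externalWorkFinished is a hypothetical helper that polls the external system.
func (r *WidgetReconciler) externalWorkFinished(ctx context.Context, req ctrl.Request) (bool, error) {
	return true, nil
}
```

Returning RequeueAfter keeps the worker goroutine free, so reconcile durations stay short even when the external system is slow.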
Or you have a poorly written vendor controller, or we just have a lot of work and we're starved of parallelism. Either way, I'd like to be able to observe how poorly it's performing beyond just 10s.
Not opposed to this, @alvaroaleman? Seems a small improvement?
In the past I used controller_runtime_reconcile_time_seconds (controller-runtime/pkg/internal/controller/metrics/metrics.go, lines 51 to 56 in d10ae95).
(Might be helpful retroactively in any case.) In general, no objection to adding additional buckets. I have no idea what the ideal buckets are, to be honest. For what it's worth, upstream still only has buckets up to 10s: https://github.com/kubernetes/component-base/blob/master/metrics/prometheus/workqueue/metrics.go#L55-L69
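As a point of reference, the linked upstream histograms top out at 10s because they use a ten-step exponential spread. The snippet below is an illustrative sketch (assuming prometheus.ExponentialBuckets is the mechanism in play) that prints what those boundaries expand to and what two extra steps would add:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// 10 buckets starting at 10ns, growing 10x each step:
	// 10ns, 100ns, 1µs, 10µs, 100µs, 1ms, 10ms, 100ms, 1s, 10s
	fmt.Println(prometheus.ExponentialBuckets(10e-9, 10, 10))

	// Extending the count to 12 adds the 100s and 1000s buckets proposed here.
	fmt.Println(prometheus.ExponentialBuckets(10e-9, 10, 12))
}
```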
For folks who are not familiar with metrics, it can look pretty confusing that the buckets are capped at 10s (it makes it look as if durations are never slower than that).
I have no idea what is ideal either. Initially, in the issue, I wasn't trying to encourage more buckets, as that may or may not impact performance; however, an upper bound beyond 10s seemed necessary.

However, to stay consistent with upstream, would this be beneficial to add and not use the other metrics as you proposed?
The controller_runtime_reconcile_time_seconds metric only roughly matches one of the workqueue metrics, so it definitely makes sense for folks to also use the workqueue metrics. They are not 100% the same data; it's just that (I think) one of them is very close. Performance-wise it's probably fine. I guess it's just ~20% more data for the two histogram metrics. As the number of workqueues in a controller is usually not that high, I don't think it should become a problem. (Not 100% sure if I understood "add and not use the other metrics as you proposed?" correctly)
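To make the "~20% more data" estimate concrete (a rough back-of-the-envelope, assuming ten existing buckets per histogram): going from 10 to 12 buckets adds 2 `_bucket` series per workqueue for each of the two histogram metrics, and 12/10 = 1.2, i.e. about 20% more bucket series; the `_sum`, `_count`, and implicit `+Inf` series are unchanged.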
If we're concerned about data volumes, I think dropping the first 2 or 3 buckets might be ok? |
I discussed it with Alvaro. We're fine with the change for now. A larger issue we have is that all the metric configuration is done in

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer, seankhliao