Dynamic controller scaling #2576

Open

vincepri opened this issue Nov 7, 2023 · 10 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@vincepri
Member

vincepri commented Nov 7, 2023

Currently we only allow specifying a fixed number of workers for each controller.

After attending the talk at KubeCon on how to scale Cluster API to 2k clusters (link tba), it would be good to allow controller-runtime to spin workers up and down dynamically based on the number of objects in the queue and on the 90th percentile of the overall reconcile duration.
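
For context, the knob that is fixed today is controller.Options.MaxConcurrentReconciles, set once when the controller is built. A minimal sketch of the current state (the watched resource and the reconciler are placeholders); the idea above is to make this number vary at runtime:

```go
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func setupController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&appsv1.Deployment{}). // placeholder resource
		WithOptions(controller.Options{
			// Fixed at construction time today; the proposal is to adjust
			// this dynamically based on queue depth and reconcile latency.
			MaxConcurrentReconciles: 8,
		}).
		Complete(r)
}
```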

@sqbi1024

sqbi1024 commented Nov 8, 2023

Can you describe it in more detail?

@halfcrazy
Contributor

halfcrazy commented Nov 13, 2023

I think there are two tasks:

  1. Change the reconciler's worker numbers at runtime.
  2. Implement a built-in backpressure/auto-scaling mechanism based on metrics [1].
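
A rough, purely illustrative sketch of what (1) could look like, assuming a resizable worker-pool abstraction (none of this is existing controller-runtime API; the work function is a placeholder, and (2) would be the policy that decides when to call Resize):

```go
package example

import (
	"context"
	"sync"
)

// resizablePool runs a variable number of identical workers; Resize can be
// called at runtime, e.g. by a metrics-driven policy.
type resizablePool struct {
	mu      sync.Mutex
	cancels []context.CancelFunc // one cancel func per running worker
	work    func(ctx context.Context)
}

// Resize starts or stops workers until exactly n are running.
func (p *resizablePool) Resize(ctx context.Context, n int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for len(p.cancels) < n { // scale up
		wctx, cancel := context.WithCancel(ctx)
		p.cancels = append(p.cancels, cancel)
		go p.work(wctx)
	}
	for len(p.cancels) > n { // scale down
		last := len(p.cancels) - 1
		p.cancels[last]()
		p.cancels = p.cancels[:last]
	}
}
```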

@troy0820
Member

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 13, 2023
@timebertt
Contributor

I'm skeptical whether changing the number of workers during runtime is a good idea.
I always considered the number of workers to be the "size" of a controller – somewhat related to resource requests/limits.
Increasing the number of workers without increasing its requests/limits might cause the process to be throttled, i.e., it might not help in increasing the controller's capacity/throughput.

Instead, I suggest looking into horizontally scaling controllers including some form of sharding.
I explored the idea in this project: https://github.com/timebertt/kubernetes-controller-sharding
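
Purely as an illustration of the direction (the shard label key and the mechanism that assigns objects to shards are assumptions here; the project above describes a complete design), a horizontally scaled replica could restrict its cache, and therefore its controllers, to its own shard with a label selector:

```go
package example

import (
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// newShardedManager builds a manager whose cache only sees objects
// labeled for the given shard.
func newShardedManager(shard string) (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			// Hypothetical shard label; something external must assign it.
			DefaultLabelSelector: labels.SelectorFromSet(labels.Set{
				"sharding.example.com/shard": shard,
			}),
		},
	})
}
```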

@vincepri
Member Author

> I'm skeptical whether changing the number of workers during runtime is a good idea.

Like any other change we usually propose, this would be opt-in.

> Instead, I suggest looking into horizontally scaling controllers including some form of sharding.

Controller Runtime is focused on a single-controller scenario acting as a leader for the time being, but this is probably good to document outside of this project.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 25, 2024
@vincepri
Member Author

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 26, 2024
@shubhM13

shubhM13 commented Mar 4, 2025

Hi @vincepri,

We’re seeing similar issues with the Spark Operator: event spikes overwhelm the controller, causing high latency and timeouts for our time-sensitive batch workloads.

Are you proposing dynamically adjusting MaxConcurrentReconciles based on queue depth and reconciliation latency, or modifying controller thread scaling more broadly? I'd love to understand the approach and potentially contribute in this area.
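
For concreteness, here is a sketch of the kind of policy I have in mind (all names and the formula are illustrative assumptions, not an agreed design): choose a worker count from the current queue length and the observed p90 reconcile duration, clamped to a configured range.

```go
package example

import (
	"sort"
	"time"
)

// p90 returns the 90th percentile of the recorded reconcile durations.
func p90(durations []time.Duration) time.Duration {
	if len(durations) == 0 {
		return 0
	}
	ds := append([]time.Duration(nil), durations...)
	sort.Slice(ds, func(i, j int) bool { return ds[i] < ds[j] })
	return ds[len(ds)*9/10]
}

// desiredWorkers is one illustrative policy: enough workers to drain the
// current queue within drainTarget given the p90 reconcile duration,
// clamped to [minWorkers, maxWorkers]. The actual policy is the open
// design question in this issue.
func desiredWorkers(queueLen int, p90Latency, drainTarget time.Duration, minWorkers, maxWorkers int) int {
	n := minWorkers
	if p90Latency > 0 && drainTarget > 0 {
		n = int(float64(queueLen) * float64(p90Latency) / float64(drainTarget))
	}
	if n < minWorkers {
		n = minWorkers
	}
	if n > maxWorkers {
		n = maxWorkers
	}
	return n
}
```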

@sbueringer
Member

I would just use #2374

@shubhM13

Thanks @sbueringer - to use that feature, is it just a boolean we set when initializing the controller? Or are we also expected to define priority levels and handle priority assignment for events?
