Dynamic controller scaling #2576

Open

vincepri opened this issue Nov 7, 2023 · 10 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@vincepri
Member

vincepri commented Nov 7, 2023

Currently we only allow specifying a fixed number of workers for each controller.

After attending the talk at KubeCon on how to scale Cluster API to 2k clusters (link tba), it would be good to allow controller-runtime to spin workers up and down dynamically based on the number of objects in the queue and on the 90th percentile of the overall reconcile duration.
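
For context, the knob that is fixed today is controller.Options.MaxConcurrentReconciles, set once when the controller is built. A minimal sketch of the current state (the watched resource and the reconciler are placeholders); the idea above is to make this number vary at runtime:

```go
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

func setupController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&appsv1.Deployment{}). // placeholder resource
		WithOptions(controller.Options{
			// Fixed at construction time today; the proposal is to adjust
			// this dynamically based on queue depth and reconcile latency.
			MaxConcurrentReconciles: 8,
		}).
		Complete(r)
}
```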

@sqbi1024

sqbi1024 commented Nov 8, 2023

Can you describe it in more detail?

@halfcrazy
Contributor

halfcrazy commented Nov 13, 2023

I think there are two tasks:

  1. Change the reconciler's worker numbers at runtime.
  2. Implement a built-in backpressure/auto-scaling mechanism based on metrics [1].
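
A rough, purely illustrative sketch of what (1) could look like, assuming a resizable worker-pool abstraction (none of this is existing controller-runtime API; the work function is a placeholder, and (2) would be the policy that decides when to call Resize):

```go
package example

import (
	"context"
	"sync"
)

// resizablePool runs a variable number of identical workers; Resize can be
// called at runtime, e.g. by a metrics-driven policy.
type resizablePool struct {
	mu      sync.Mutex
	cancels []context.CancelFunc // one cancel func per running worker
	work    func(ctx context.Context)
}

// Resize starts or stops workers until exactly n are running.
func (p *resizablePool) Resize(ctx context.Context, n int) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for len(p.cancels) < n { // scale up
		wctx, cancel := context.WithCancel(ctx)
		p.cancels = append(p.cancels, cancel)
		go p.work(wctx)
	}
	for len(p.cancels) > n { // scale down
		last := len(p.cancels) - 1
		p.cancels[last]()
		p.cancels = p.cancels[:last]
	}
}
```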

@troy0820
Member

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Nov 13, 2023
@timebertt
Contributor

I'm skeptical whether changing the number of workers during runtime is a good idea.
I always considered the number of workers to be the "size" of a controller – somewhat related to resource requests/limits.
Increasing the number of workers without increasing its requests/limits might cause the process to be throttled, i.e., it might not help in increasing the controller's capacity/throughput.

Instead, I suggest looking into horizontally scaling controllers including some form of sharding.
I explored the idea in this project: https://github.com/timebertt/kubernetes-controller-sharding
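
Purely as an illustration of the direction (the shard label key and the mechanism that assigns objects to shards are assumptions here; the project above describes a complete design), a horizontally scaled replica could restrict its cache, and therefore its controllers, to its own shard with a label selector:

```go
package example

import (
	"k8s.io/apimachinery/pkg/labels"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// newShardedManager builds a manager whose cache only sees objects
// labeled for the given shard.
func newShardedManager(shard string) (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Cache: cache.Options{
			// Hypothetical shard label; something external must assign it.
			DefaultLabelSelector: labels.SelectorFromSet(labels.Set{
				"sharding.example.com/shard": shard,
			}),
		},
	})
}
```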

@vincepri
Member Author

> I'm skeptical whether changing the number of workers during runtime is a good idea.

Like any other change we usually propose, this would be opt-in.

> Instead, I suggest looking into horizontally scaling controllers including some form of sharding.

Controller Runtime is focused on a single-controller scenario acting as a leader for the time being, but this is probably good to document outside of this project.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 25, 2024
@vincepri
Member Author

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 26, 2024
@shubhM13

shubhM13 commented Mar 4, 2025

Hi @vincepri,

We’re seeing similar issues with the Spark Operator: event spikes overwhelm the controller, causing high latency and timeouts for our time-sensitive batch workloads.

Are you proposing dynamically adjusting MaxConcurrentReconciles based on queue depth and reconciliation latency, or modifying controller thread scaling more broadly? I'd love to understand the approach and potentially contribute in this area.
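
For concreteness, here is a sketch of the kind of policy I have in mind (all names and the formula are illustrative assumptions, not an agreed design): choose a worker count from the current queue length and the observed p90 reconcile duration, clamped to a configured range.

```go
package example

import (
	"sort"
	"time"
)

// p90 returns the 90th percentile of the recorded reconcile durations.
func p90(durations []time.Duration) time.Duration {
	if len(durations) == 0 {
		return 0
	}
	ds := append([]time.Duration(nil), durations...)
	sort.Slice(ds, func(i, j int) bool { return ds[i] < ds[j] })
	return ds[len(ds)*9/10]
}

// desiredWorkers is one illustrative policy: enough workers to drain the
// current queue within drainTarget given the p90 reconcile duration,
// clamped to [minWorkers, maxWorkers]. The actual policy is the open
// design question in this issue.
func desiredWorkers(queueLen int, p90Latency, drainTarget time.Duration, minWorkers, maxWorkers int) int {
	n := minWorkers
	if p90Latency > 0 && drainTarget > 0 {
		n = int(float64(queueLen) * float64(p90Latency) / float64(drainTarget))
	}
	if n < minWorkers {
		n = minWorkers
	}
	if n > maxWorkers {
		n = maxWorkers
	}
	return n
}
```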

@sbueringer
Member

I would just use #2374

@shubhM13

Thanks @sbueringer - to use that feature, is it just a boolean we set when initializing the controller? Or are we also expected to define priority levels and handle priority assignment for events?
