Discussion on operations allowed while a managed cluster is being upgraded #5222

Closed
ykakarap opened this issue Sep 9, 2021 · 13 comments
Assignees
Labels
area/clusterclass Issues or PRs related to clusterclass kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@ykakarap
Contributor

ykakarap commented Sep 9, 2021

This issue is meant to be an ongoing discussion/brainstorming on the list of operations that are allowed/disallowed during a cluster upgrade.

A non-exhaustive list of operations to consider:

  • Scaling up/down Control Plane
  • Modifying Control Plane (changing infrastructure templates)
  • Scaling up/down MachineDeployments
  • Modifying MachineDeployments (changing the bootstrap or infrastructure templates)
  • Creating new MachineDeployments

/kind design
/area topology

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. area/topology labels Sep 9, 2021
@ykakarap
Contributor Author

ykakarap commented Sep 9, 2021

This issue supersedes #5183.

@sbueringer
Member

@ykakarap Do you mean during a control-plane upgrade?

@ykakarap
Contributor Author

ykakarap commented Sep 9, 2021

@sbueringer Mostly, yes. However, let's also use this issue to look into any considerations that apply while MachineDeployments are upgrading.

@vincepri
Member

vincepri commented Sep 9, 2021

Let's run some tests around the above points and figure out if this is something we need to tackle today or later.

/milestone v0.4

@fabriziopandini
Member

/milestone v1.0

@ykakarap
Contributor Author

/assign
to run some tests around the above points.

@fabriziopandini
Member

fabriziopandini commented Oct 14, 2021

I have made some investigations on this point (with a focus on KCP, but I think the same applies to other control plane providers), and the TL;DR is:

  1. The system already seems equipped to manage concurrent changes to the control plane, to workers, or to both.
  2. There is a possible risk of creating machines with a Kubernetes version greater than the control plane's Kubernetes version (in both managed topologies and unmanaged clusters).
  3. In managed topologies there is room to limit the user's ability to perform concurrent operations, thus making operations more straightforward/predictable, but I'm not sure this is worth doing.

In more detail:

  1. Both KCP and MD are designed to reconcile toward the desired state, regardless of the diff between the current and desired states. This makes them resilient to multiple changes to the same object. Also, when KCP and an MD change concurrently, the only dependency between the two is a stable API endpoint, which KCP is designed to ensure.

  2. As of today, nothing prevents the user from creating machines with a Kubernetes version greater than the control plane's Kubernetes version; this risk also exists with ClusterClass, given that during the first phase of an upgrade topology.version is greater than the control plane's minimum version (and a new MD uses topology.version). IMO there are two options to explore here (not mutually exclusive):

    1. Make the topology controller delay creation of a new MD when this condition is detected.
    2. Make the Machine controller (or, more probably, CABPK) detect this condition and keep machine creation on hold. However, this requires surfacing the control plane version at the Cluster level.
  3. I'm personally -1 on implementing this option. We can try to prevent multiple changes to the same object or concurrent operations between KCP and MDs, but the truth is we can't prevent them from happening due to external factors like MHC, autoscalers, or changes to ClusterClasses. Thus I'm leaning toward not introducing limitations that might impact users while giving us a false sense of security that concurrent operations cannot happen.

@vincepri @ykakarap @sbueringer opinions?
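The skew condition described in point 2 boils down to a version comparison: a worker machine must never be created with a Kubernetes version greater than the control plane's minimum version. A minimal, self-contained sketch of that check (all names are illustrative; Cluster API itself uses a full semver library rather than this hand-rolled parser):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion is a hypothetical helper that splits a "vMAJOR.MINOR.PATCH"
// string into numeric parts.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("unexpected version format: %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("non-numeric component in %q: %v", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// machineVersionAllowed reports whether a machine at machineVersion may be
// created given the control plane's current minimum version: workers must
// never run a newer Kubernetes version than the control plane.
func machineVersionAllowed(machineVersion, cpMinVersion string) (bool, error) {
	mv, err := parseVersion(machineVersion)
	if err != nil {
		return false, err
	}
	cp, err := parseVersion(cpMinVersion)
	if err != nil {
		return false, err
	}
	for i := 0; i < 3; i++ {
		if mv[i] != cp[i] {
			return mv[i] < cp[i], nil
		}
	}
	return true, nil // equal versions are fine
}

func main() {
	// During the first phase of a topology upgrade, topology.version can be
	// ahead of the control plane's minimum version:
	ok, _ := machineVersionAllowed("v1.22.0", "v1.21.3")
	fmt.Println(ok) // false: creating this machine should be delayed
	ok, _ = machineVersionAllowed("v1.21.3", "v1.21.3")
	fmt.Println(ok) // true
}
```

Both options 2.i and 2.ii reduce to applying this predicate at different points: the topology controller before creating the MD, or the bootstrap/Machine controller before provisioning the machine.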

@kfox1111

The Elasticsearch operator has the same issue with ES/Kibana: Kibana can't be a greater version than ES. In their operator, they chose option 2.ii. With a 'describe kibana' you can see an event stating it's holding off on the requested Kibana upgrade until ES reaches at least the same version. It's been pretty nice to use.

@sbueringer
Member

In general, 2.ii sounds like the cleanest solution; it seems to me like it would work in all cases. That doesn't necessarily mean we have to implement it right now.
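The "hold and surface an event" behavior described for the Elasticsearch operator maps naturally onto a requeuing reconciler. A minimal sketch of option 2.ii under stated assumptions: the types below are stand-ins for controller-runtime and Cluster API types (none of these names are the real APIs), and it assumes the control plane's minimum version has been surfaced at the Cluster level:

```go
package main

import (
	"fmt"
	"time"
)

// Version is an illustrative stand-in for a parsed Kubernetes version.
type Version struct{ Major, Minor int }

func (v Version) NewerThan(o Version) bool {
	return v.Major > o.Major || (v.Major == o.Major && v.Minor > o.Minor)
}

// Result mimics controller-runtime's reconcile.Result shape.
type Result struct {
	RequeueAfter time.Duration
}

type Machine struct {
	Name    string
	Version Version // desired Kubernetes version for this machine
}

type ControlPlane struct {
	MinVersion Version // lowest Kubernetes version across control plane machines
}

// reconcileBootstrap sketches option 2.ii: the bootstrap provider (e.g. CABPK)
// holds off generating bootstrap data while the machine's desired version is
// ahead of the control plane, requeuing (and reporting why) until the skew
// resolves, instead of failing the reconcile.
func reconcileBootstrap(m Machine, cp ControlPlane) (Result, string) {
	if m.Version.NewerThan(cp.MinVersion) {
		msg := fmt.Sprintf("%s: holding bootstrap until control plane reaches v%d.%d (currently v%d.%d)",
			m.Name, m.Version.Major, m.Version.Minor, cp.MinVersion.Major, cp.MinVersion.Minor)
		return Result{RequeueAfter: 30 * time.Second}, msg
	}
	return Result{}, m.Name + ": bootstrap data generated"
}

func main() {
	res, msg := reconcileBootstrap(
		Machine{Name: "md-0-abc", Version: Version{1, 22}},
		ControlPlane{MinVersion: Version{1, 21}},
	)
	fmt.Println(res.RequeueAfter, msg)
}
```

In a real controller the message would be recorded as a condition or event on the Machine (analogous to the Kibana event mentioned above), so users can see why provisioning is paused.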

@vincepri vincepri modified the milestones: v1.0, v1.1 Oct 22, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 20, 2022
@fabriziopandini
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 20, 2022
@fabriziopandini fabriziopandini modified the milestones: v1.1, v1.2 Feb 3, 2022
@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini fabriziopandini removed this from the v1.2 milestone Jul 29, 2022
@fabriziopandini fabriziopandini removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini
Member

/close
I'm working with @chrischdi to open an issue about enforcing validation on version fields, which will solve this issue as well.

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

/close
I'm working with @chrischdi to open an issue about enforcing validation on version fields, which will solve this issue as well.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@killianmuldoon killianmuldoon added the area/clusterclass Issues or PRs related to clusterclass label May 4, 2023

No branches or pull requests

8 participants