Discussion on operations allowed while a managed cluster is being upgraded #5222

Closed
ykakarap opened this issue Sep 9, 2021 · 13 comments
Assignees
Labels
area/clusterclass Issues or PRs related to clusterclass kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@ykakarap
Contributor

ykakarap commented Sep 9, 2021

This issue is meant to be an ongoing discussion/brainstorming on the list of operations that are allowed/disallowed during a cluster upgrade.

A non-exhaustive list of operations to consider:

  • Scaling up/down Control Plane
  • Modifying Control Plane (changing infrastructure templates)
  • Scaling up/down MachineDeployments
  • Modifying MachineDeployments (changing the bootstrap or infrastructure templates)
  • Creating new MachineDeployments

/kind design
/area topology

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. area/topology labels Sep 9, 2021
@ykakarap
Contributor Author

ykakarap commented Sep 9, 2021

This issue supersedes #5183.

@sbueringer
Member

@ykakarap Do you mean during a control-plane upgrade?

@ykakarap
Contributor Author

ykakarap commented Sep 9, 2021

@sbueringer Mostly, yes. However, let's also use this issue to look into any considerations that apply while MachineDeployments are upgrading.

@vincepri
Member

vincepri commented Sep 9, 2021

Let's run some tests around the above points and figure out if this is something we need to tackle today or later.

/milestone v0.4

@fabriziopandini
Member

/milestone v1.0

@ykakarap
Contributor Author

/assign
to run some tests around the above points.

@fabriziopandini
Member

fabriziopandini commented Oct 14, 2021

I have made some investigations on this point (with a focus on KCP, but I think the same applies to other control plane providers), and the TL;DR is:

  1. The system already seems equipped to manage concurrent changes to the control plane, to workers, or to both.
  2. There is a possible risk of creating machines with a Kubernetes version greater than the control plane's Kubernetes version (in both managed topologies and unmanaged clusters).
  3. In managed topologies there is room to limit the user's ability to perform concurrent operations, thus making operations more straightforward/predictable, but I'm not sure this is worth doing.

In more detail:

  1. Both KCP and MD are designed to reconcile toward the desired state, regardless of the diff between the current and desired states. This makes them resilient to multiple changes to the same object. Also, when KCP and an MD change concurrently, the only dependency between the two is a stable API endpoint, which KCP is designed to ensure.

  2. As of today, nothing prevents the user from creating machines with a Kubernetes version greater than the control plane's Kubernetes version; this risk also exists with ClusterClass, given that during the first phase of an upgrade topology.version is greater than the control plane's minimum version (and a new MD uses topology.version). IMO there are two options to explore here (not mutually exclusive):

    1. Make the topology controller delay creation of a new MD when this condition is detected.
    2. Make the Machine controller (or, more probably, CABPK) detect this condition and keep machine creation on hold. However, this requires surfacing the control plane version at the Cluster level.
  3. I'm personally -1 on implementing this option. We can try to prevent multiple changes to the same object or concurrent operations between KCP and MDs, but the truth is we can't prevent them from happening due to external factors like MHC, autoscalers, or changes to ClusterClasses. Thus I'm leaning toward not introducing limitations that might impact users while giving us a false sense of security that concurrent operations cannot happen.

@vincepri @ykakarap @sbueringer opinions?
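The skew condition described in point 2 boils down to a version comparison: a worker machine must never be created with a Kubernetes version greater than the control plane's minimum version. A minimal, self-contained sketch of that check (all names are illustrative; Cluster API itself uses a full semver library rather than this hand-rolled parser):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion is a hypothetical helper that splits a "vMAJOR.MINOR.PATCH"
// string into numeric parts.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("unexpected version format: %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return out, fmt.Errorf("non-numeric component in %q: %v", v, err)
		}
		out[i] = n
	}
	return out, nil
}

// machineVersionAllowed reports whether a machine at machineVersion may be
// created given the control plane's current minimum version: workers must
// never run a newer Kubernetes version than the control plane.
func machineVersionAllowed(machineVersion, cpMinVersion string) (bool, error) {
	mv, err := parseVersion(machineVersion)
	if err != nil {
		return false, err
	}
	cp, err := parseVersion(cpMinVersion)
	if err != nil {
		return false, err
	}
	for i := 0; i < 3; i++ {
		if mv[i] != cp[i] {
			return mv[i] < cp[i], nil
		}
	}
	return true, nil // equal versions are fine
}

func main() {
	// During the first phase of a topology upgrade, topology.version can be
	// ahead of the control plane's minimum version:
	ok, _ := machineVersionAllowed("v1.22.0", "v1.21.3")
	fmt.Println(ok) // false: creating this machine should be delayed
	ok, _ = machineVersionAllowed("v1.21.3", "v1.21.3")
	fmt.Println(ok) // true
}
```

Both options 2.i and 2.ii reduce to applying this predicate at different points: the topology controller before creating the MD, or the bootstrap/Machine controller before provisioning the machine.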

@kfox1111

The Elasticsearch operator has the same issue with ES/Kibana: Kibana can't be a greater version than ES. In their operator, they chose option 2.ii. With a 'describe kibana' you can see an event stating it's holding off on the requested Kibana upgrade until ES reaches at least the same version. It's been pretty nice to use.

@sbueringer
Member

In general, 2.ii sounds like the cleanest solution; it seems to me like it would work in all cases. That doesn't necessarily mean we have to implement it right now.
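The "hold and surface an event" behavior described for the Elasticsearch operator maps naturally onto a requeuing reconciler. A minimal sketch of option 2.ii under stated assumptions: the types below are stand-ins for controller-runtime and Cluster API types (none of these names are the real APIs), and it assumes the control plane's minimum version has been surfaced at the Cluster level:

```go
package main

import (
	"fmt"
	"time"
)

// Version is an illustrative stand-in for a parsed Kubernetes version.
type Version struct{ Major, Minor int }

func (v Version) NewerThan(o Version) bool {
	return v.Major > o.Major || (v.Major == o.Major && v.Minor > o.Minor)
}

// Result mimics controller-runtime's reconcile.Result shape.
type Result struct {
	RequeueAfter time.Duration
}

type Machine struct {
	Name    string
	Version Version // desired Kubernetes version for this machine
}

type ControlPlane struct {
	MinVersion Version // lowest Kubernetes version across control plane machines
}

// reconcileBootstrap sketches option 2.ii: the bootstrap provider (e.g. CABPK)
// holds off generating bootstrap data while the machine's desired version is
// ahead of the control plane, requeuing (and reporting why) until the skew
// resolves, instead of failing the reconcile.
func reconcileBootstrap(m Machine, cp ControlPlane) (Result, string) {
	if m.Version.NewerThan(cp.MinVersion) {
		msg := fmt.Sprintf("%s: holding bootstrap until control plane reaches v%d.%d (currently v%d.%d)",
			m.Name, m.Version.Major, m.Version.Minor, cp.MinVersion.Major, cp.MinVersion.Minor)
		return Result{RequeueAfter: 30 * time.Second}, msg
	}
	return Result{}, m.Name + ": bootstrap data generated"
}

func main() {
	res, msg := reconcileBootstrap(
		Machine{Name: "md-0-abc", Version: Version{1, 22}},
		ControlPlane{MinVersion: Version{1, 21}},
	)
	fmt.Println(res.RequeueAfter, msg)
}
```

In a real controller the message would be recorded as a condition or event on the Machine (analogous to the Kibana event mentioned above), so users can see why provisioning is paused.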

@vincepri vincepri modified the milestones: v1.0, v1.1 Oct 22, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 20, 2022
@fabriziopandini
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 20, 2022
@fabriziopandini fabriziopandini modified the milestones: v1.1, v1.2 Feb 3, 2022
@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini fabriziopandini removed this from the v1.2 milestone Jul 29, 2022
@fabriziopandini fabriziopandini removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini
Member

/close
I'm working with @chrischdi to open an issue about enforcing validation on version fields, which will solve this issue as well.

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

/close
I'm working with @chrischdi to open an issue about enforcing validation on version fields, which will solve this issue as well.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@killianmuldoon killianmuldoon added the area/clusterclass Issues or PRs related to clusterclass label May 4, 2023

No branches or pull requests

8 participants