@@ -117,34 +117,17 @@ use.
  This allows us to *inform* the admin for removals that are more than one minor
  version away and *block* upgrades for removals which are imminent.

- ### MCO - Enforce OpenShift's defined host component version skew policies
-
- The MCO, will set Upgradeable=False whenever any MachineConfigPool has one more
- more nodes present which fall outside of a defined list of constraints. For
- instance, if OpenShift has a defined Kubelet Version Skew of N-1, the node
- constraints enforced by the MCO defined in OCP 4.7 (Kube 1.20) would be as follows:
-
- ```yaml
- node.status.nodeInfo.kubeletVersion:
- - v1.20
- ```
-
- If the policy were to change allowing for a version skew of N-2, v1.19 would be
- added to the list of acceptable matches. As a result a cluster which had been
- upgraded from 4.6 to 4.7 would allow a subsequent upgrade to 4.8 as long as all
- kubelets were either v1.19 or v1.20. The 4.8 MCO would then evaluate the Upgradeable
- condition based on its constraints, if v1.19 weren't allowed it would then
- inhibit upgrades to 4.9. This means the MCO must set Upgradeable=False until it
- has confirmed constraints have been met.
-
- ```yaml
- node.status.nodeInfo.kubeletVersion:
- - v1.20
- - v1.19
- ```
-
- The MCO is not responsible for defining these constraints and constraints are
- only widened whenever we have CI testing proves them to be safe.
+ ### APIServer - Enforce OpenShift's defined kubelet version skew policies
+
+ The API Server Operator will set `Upgradeable=False` whenever any of the nodes
+ within the cluster are at the skew limit; that is, when an upgrade of the API
+ Server would exceed the allowable kubelet version skew. For instance, if
+ OpenShift has a defined kubelet version skew of N-1, the API Server Operator
+ would report `Upgradeable=True` if all of the nodes are at N, and
+ `Upgradeable=False` if at least one of the nodes is not up to date. If the
+ kubelet skew policy were to change, allowing for a version skew of N-2, the API
+ Server Operator would report `Upgradeable=True` if all of the nodes are at N or
+ N-1, and `Upgradeable=False` if any of the nodes are at N-2.

  These changes will need to be backported to 4.7 prior to 4.7 EOL.

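The skew rule that the added APIServer section describes can be read as "would a one-minor-version step forward of the API Server leave any kubelet further behind than the policy allows?". The following Python sketch illustrates that reading only. The function names `minor_version` and `upgradeable`, the `max_skew` parameter, and the assumption that the API Server and all kubelets share the same major version are hypothetical and are not taken from the enhancement or from any operator's code.

```python
import re
from typing import Iterable


def minor_version(version: str) -> int:
    """Return the minor number of a version string such as 'v1.20.0'.

    Hypothetical helper: assumes a 'vMAJOR.MINOR...' layout.
    """
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return int(match.group(2))


def upgradeable(api_server_version: str,
                kubelet_versions: Iterable[str],
                max_skew: int) -> bool:
    """Illustrative check: True if upgrading the API Server by one minor
    version keeps every kubelet within max_skew minor versions of it.

    Assumption: major versions match, so only minor numbers are compared.
    """
    next_minor = minor_version(api_server_version) + 1
    return all(
        next_minor - minor_version(kubelet) <= max_skew
        for kubelet in kubelet_versions
    )
```

With `max_skew=1`, a cluster whose API Server is at v1.20 is reported as upgradeable only when every kubelet is also at v1.20; with `max_skew=2`, kubelets at v1.19 are tolerated and v1.18 is not, which matches the N-1 and N-2 cases in the text.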
@@ -304,8 +287,8 @@ that's broadly scoped as "EUS 4.6 to EUS 4.10 Validator"?

  - CI tests are necessary which attempt to upgrade while violating kubelet to API
  compatibility, ie: 4.6 to 4.7 upgrade with MachineConfigPools paused, then check
- for Upgradeable=False condition to be set by MCO assuming that our rules only allow
- for N-1 skew.
+ for Upgradeable=False condition to be set by the API Server Operator, assuming
+ that our rules only allow for N-1 skew.
  - CI tests are necessary which install an OLM Operator which expresses a maxKubeVersion
  or maxOCPVersion equal to the current cluster version and checks for Upgradeable=False
  on OLM
@@ -393,6 +376,32 @@ The idea is to find the best form of an argument why this enhancement should _no

  ## Alternatives

+ ### MCO Kubelet Skew Enforcement
+
+ Instead of the API Server Operator enforcing kubelet skew compliance through
+ the `Upgradeable` flag, the MCO could provide this functionality. Either of
+ these two operators are the obvious choice for such a check since they are
+ responsible for both halves of the kubelet-API Server interaction. It makes
+ more sense for the leading component to implement the check, however, since
+ it's the leading edge that's going to violate the skew compliance first. In the
+ case of OpenShift, that leading edge is the API Server and it makes more sense
+ for it to determine whether a step forward is going to violate the skew policy.
+ On top of that, the gating mechanism we have today is the `Upgradeable=False`
+ flag, which indicates that a particular operator cannot be upgraded, thereby
+ halting the upgrade of the entire cluster. It doesn't make sense for the MCO to
+ assert this condition, since an upgrade of the MCO and its operands (RHCOS)
+ would actually reduce the skew. If the MCO were to use this mechanism to
+ enforce the skew, it would be a reinterpretation of the function of that flag
+ to instead indicate that the entire cluster cannot be upgraded. It's a subtle
+ but important distinction that preserves low coupling between individual
+ operators.
+
+ ### MCO Rollout Gating
+
+ (This section was written assuming that the MCO would be responsible for
+ enforcing the node skew policy, but this plan has since been modified to make
+ the API Server Operator responsible for this enforcement.)
+
  Rather than having MCO enforce version skew policies between OS managed
  components and operator managed components it could simply set Upgradeable=False
  whenever a rollout is in progress. This would preclude minor version upgrades in