Commit 0ba744e

eus-upgrades-mvp: don't enforce skew check in MCO
This changes the proposal to specify that the API Server Operator should instead be responsible for enforcing the kubelet version skew.
1 parent cef8989 commit 0ba744e

1 file changed, +39 -30 lines changed
enhancements/update/eus-upgrades-mvp.md

+39 -30
@@ -117,34 +117,17 @@ use.
 This allows us to *inform* the admin for removals that are more than one minor
 version away and *block* upgrades for removals which are imminent.
 
-### MCO - Enforce OpenShift's defined host component version skew policies
-
-The MCO, will set Upgradeable=False whenever any MachineConfigPool has one more
-more nodes present which fall outside of a defined list of constraints. For
-instance, if OpenShift has a defined Kubelet Version Skew of N-1, the node
-constraints enforced by the MCO defined in OCP 4.7 (Kube 1.20) would be as follows:
-
-```yaml
-node.status.nodeInfo.kubeletVersion:
-- v1.20
-```
-
-If the policy were to change allowing for a version skew of N-2, v1.19 would be
-added to the list of acceptable matches. As a result a cluster which had been
-upgraded from 4.6 to 4.7 would allow a subsequent upgrade to 4.8 as long as all
-kubelets were either v1.19 or v1.20. The 4.8 MCO would then evaluate the Upgradeable
-condition based on its constraints, if v1.19 weren't allowed it would then
-inhibit upgrades to 4.9. This means the MCO must set Upgradeable=False until it
-has confirmed constraints have been met.
-
-```yaml
-node.status.nodeInfo.kubeletVersion:
-- v1.20
-- v1.19
-```
-
-The MCO is not responsible for defining these constraints and constraints are
-only widened whenever we have CI testing proves them to be safe.
+### APIServer - Enforce OpenShift's defined kubelet version skew policies
+
+The API Server Operator will set `Upgradeable=False` whenever any of the nodes
+within the cluster are at the skew limit; that is, when an upgrade of the API
+Server would exceed the allowable kubelet version skew. For instance, if
+OpenShift has a defined kubelet version skew of N-1, the API Server Operator
+would report `Upgradeable=True` if all of the nodes are at N, and
+`Upgradeable=False` if at least one of the nodes is not up to date. If the
+kubelet skew policy were to change, allowing for a version skew of N-2, the API
+Server Operator would report `Upgradeable=True` if all of the nodes are at N or
+N-1, and `Upgradeable=False` if any of the nodes are at N-2.
 
 These changes will need to be backported to 4.7 prior to 4.7 EOL.
 
@@ -304,8 +287,8 @@ that's broadly scoped as "EUS 4.6 to EUS 4.10 Validator"?
 
 - CI tests are necessary which attempt to upgrade while violating kubelet to API
 compatibility, ie: 4.6 to 4.7 upgrade with MachineConfigPools paused, then check
-for Upgradeable=False condition to be set by MCO assuming that our rules only allow
-for N-1 skew.
+for Upgradeable=False condition to be set by the API Server Operator, assuming
+that our rules only allow for N-1 skew.
 - CI tests are necessary which install an OLM Operator which expresses a maxKubeVersion
 or maxOCPVersion equal to the current cluster version and checks for Upgradeable=False
 on OLM
@@ -393,6 +376,32 @@ The idea is to find the best form of an argument why this enhancement should _no
 
 ## Alternatives
 
+### MCO Kubelet Skew Enforcement
+
+Instead of the API Server Operator enforcing kubelet skew compliance through
+the `Upgradeable` flag, the MCO could provide this functionality. Either of
+these two operators are the obvious choice for such a check since they are
+responsible for both halves of the kubelet-API Server interaction. It makes
+more sense for the leading component to implement the check, however, since
+it's the leading edge that's going to violate the skew compliance first. In the
+case of OpenShift, that leading edge is the API Server and it makes more sense
+for it to determine whether a step forward is going to violate the skew policy.
+On top of that, the gating mechanism we have today is the `Upgradeable=False`
+flag, which indicates that a particular operator cannot be upgraded, thereby
+halting the upgrade of the entire cluster. It doesn't make sense for the MCO to
+assert this condition, since an upgrade of the MCO and its operands (RHCOS)
+would actually reduce the skew. If the MCO were to use this mechanism to
+enforce the skew, it would be a reinterpretation of the function of that flag
+to instead indicate that the entire cluster cannot be upgraded. It's a subtle
+but important distinction that preserves low coupling between individual
+operators.
+
+### MCO Rollout Gating
+
+(This section was written assuming that the MCO would be responsible for
+enforcing the node skew policy, but this plan has since been modified to make
+the API Server Operator responsible for this enforcement.)
+
 Rather than having MCO enforce version skew policies between OS managed
 components and operator managed components it could simply set Upgradeable=False
 whenever a rollout is in progress. This would preclude minor version upgrades in
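
For reference, the skew rule in the added "APIServer" section can be sketched as below. This is an illustration only, not code from this commit or from the API Server Operator: the function names `kube_minor` and `upgradeable` and the `max_skew` parameter are hypothetical, and the sketch assumes all versions share the same major version and that the next upgrade moves the API Server forward by exactly one minor version.

```python
import re


def kube_minor(version: str) -> int:
    """Extract the minor version from a string such as 'v1.20.0+abc'.

    Hypothetical helper; assumes a 'vMAJOR.MINOR...' layout like the one
    reported in node.status.nodeInfo.kubeletVersion.
    """
    match = re.match(r"v?(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version: {version!r}")
    return int(match.group(2))


def upgradeable(api_server_version, kubelet_versions, max_skew=1):
    """Return True if upgrading the API Server by one minor version would
    keep every kubelet within max_skew minor versions of it.

    Assumption: the major version is ignored, and an empty node list is
    treated as upgradeable.
    """
    next_minor = kube_minor(api_server_version) + 1
    return all(
        next_minor - kube_minor(v) <= max_skew for v in kubelet_versions
    )


# N-1 policy: every node must already be at N.
print(upgradeable("v1.20.0", ["v1.20.0", "v1.20.0"]))              # all at N
print(upgradeable("v1.20.0", ["v1.20.0", "v1.19.0"]))              # one at N-1
# N-2 policy: nodes at N or N-1 are fine, a node at N-2 blocks.
print(upgradeable("v1.20.0", ["v1.20.0", "v1.19.0"], max_skew=2))
print(upgradeable("v1.20.0", ["v1.18.0"], max_skew=2))
```

The check is phrased from the API Server's side ("would my next step exceed the skew?"), which mirrors the argument in the "MCO Kubelet Skew Enforcement" alternative that the leading component should own the gate.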
