WIP: OCPBUGS-53427: pkg/operator/status: Drop kubelet skew guard, add RHEL guard #4956
base: release-4.18
Conversation
@wking: This pull request references Jira Issue OCPBUGS-53427, which is invalid.
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: wking. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
var (
	lastError      error
	kubeletVersion string
)
nodes, err := optr.GetAllManagedNodes(pools)
As part of being a WIP, should this be GetAllManagedNodes, or should I be looking at all nodes regardless of management? Or maybe just looking at unmanaged nodes? Or...?
Every place in the MCO where we're listing nodes, we're looking at ones that belong to a given MachineConfigPool, which generally means they're managed. So, for the sake of consistency, this should probably be GetAllManagedNodes(). I could be persuaded otherwise though.
I don't know enough about bare-RHEL though; are those Nodes also returned by GetAllManagedNodes?
pkg/operator/status.go
Outdated
if len(rhelNodes) > 0 {
	coStatus.Status = configv1.ConditionFalse
	coStatus.Reason = "RHELNodes"
	coStatus.Message = fmt.Sprintf("%d RHEL nodes, including %s, but OpenShift 4.19 requires RHCOS https://FIXME-DOC-LINK", len(rhelNodes), rhelNodes[0])
Any ideas where I should be linking folks running 4.18 today for "bare-RHEL is gone in 4.19, and here's how we recommend you migrate..."?
OSDOCS-12979 is probably for this. cc @gpei
@jhou1 yes, that's the doc Jira issue tracking it.
@@ -525,83 +501,33 @@ func (optr *Operator) cfeEvalCgroupsV1() (bool, error) {
	return nodeClusterConfig.Spec.CgroupMode == configv1.CgroupModeV1, nil
}

// isKubeletSkewSupported checks the version skew of kube-apiserver and node kubelet version.
// Returns the skew status. version skew > 2 is not supported.
func (optr *Operator) isKubeletSkewSupported(pools []*mcfgv1.MachineConfigPool) (skewStatus string, coStatus configv1.ClusterOperatorStatusCondition, err error) {
I pitch dropping this in my commit message, but thoughts about whether I should drop it in the dev branch? I'd like to, but looking for MCO-maintainer opinions first.
And then, if the MCO-maintainer opinion is "yes, please drop this in dev", I'd like MCO-maintainer opinions on whether you want me to use this same bug-series (because we'll need a dev bug anyway that says "4.19 won't need the bare-RHEL guard, because that's only a 4.18 -> 4.19 issue, so only 4.18's MCO needs that guard"), or if you want it NO-ISSUE, or you want a separate OCPBUGS series?
My opinion is that we can drop this in the dev branch. And since we'll need a bug, I think it's fine to use the same bug-series.
I've opened #4970 to drop this guard in the dev branch, and I'll rebase this pull and close this thread once that's been kicked around, merged, and verified.
Force-pushed from 68256ae to 5fc0354.
if kubeletVersion == "" {
	continue
}
osImage := node.Status.NodeInfo.OSImage
if strings.HasPrefix(osImage, "Red Hat Enterprise Linux") && !strings.HasPrefix(osImage, "Red Hat Enterprise Linux CoreOS") {
issue: We should consider OKD as well. Maybe something like this instead?
if (strings.HasPrefix(osImage, "Red Hat Enterprise Linux") || strings.HasPrefix(osImage, "CentOS Stream")) && !strings.Contains(osImage, "CoreOS") {
	// Do stuff.
}
For the record, we only support OKD on SCOS (CentOS Stream CoreOS) now, so no need to worry about the Fedora CoreOS case here.
I don't understand how OKD vs. OCP comes into this. Aren't bare-RHEL Nodes something that users add on their own, and so something that they could have added to either OKD or OCP clusters? RHCOS and SCOS are what the cluster will use when creating Machines on its own and managing kubelet versions on its own, but those Machines/Nodes are fine; for this 4.18 -> 4.19 guard we're just looking for Nodes where the OS/kubelet is being managed directly by the user.
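For illustration, a minimal standalone sketch of the check being discussed (the helper name, the exact prefixes, and the SCOS version string are assumptions, not the PR's final code):

package main

import (
	"fmt"
	"strings"
)

// isBareRHELOrCentOS reports whether an osImage string looks like a
// non-CoreOS RHEL or CentOS Stream node, which is what the guard wants
// to flag; CoreOS variants (RHCOS, SCOS) are left alone.
func isBareRHELOrCentOS(osImage string) bool {
	rhelLike := strings.HasPrefix(osImage, "Red Hat Enterprise Linux") ||
		strings.HasPrefix(osImage, "CentOS Stream")
	return rhelLike && !strings.Contains(osImage, "CoreOS")
}

func main() {
	for _, osImage := range []string{
		"Red Hat Enterprise Linux 8.6 (Ootpa)",                  // bare RHEL: flagged
		"Red Hat Enterprise Linux CoreOS 419.96.202503032242-0", // RHCOS: allowed
		"CentOS Stream CoreOS 9.20250303.0",                     // SCOS (illustrative version string): allowed
	} {
		fmt.Printf("%q -> flagged=%v\n", osImage, isBareRHELOrCentOS(osImage))
	}
}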
The kubelet skew guards are from 1471d2c (Bug 1986453: Check for API server and node versions skew, 2021-07-27, openshift#2658). But the Kube API server also landed similar guards in openshift/cluster-kube-apiserver-operator@9ce4f74775 (add KubeletVersionSkewController, 2021-08-26, openshift/cluster-kube-apiserver-operator#1199). openshift/enhancements@0ba744e750 (eus-upgrades-mvp: don't enforce skew check in MCO, 2021-04-29, openshift/enhancements#762) had shifted the proposal from MCO-guards to KAS-guards, so I'm not entirely clear on why the MCO guards landed at all. But it's convenient for me that they did, because while I'm dropping them here, I'm recycling the Node lister for a new check. 4.19 is dropping bare-RHEL support, and I want the Node lister to look for RHEL entries like:

osImage: Red Hat Enterprise Linux 8.6 (Ootpa)

but we are ok with RHCOS entries like:

osImage: Red Hat Enterprise Linux CoreOS 419.96.202503032242-0
Force-pushed from 5fc0354 to 9915680.
@wking: The following tests failed; say /retest to rerun all failed tests.
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Closes: OCPBUGS-53427
- What I did
The kubelet skew guards are from 1471d2c (#2658). But the Kube API server also landed similar guards in
openshift/cluster-kube-apiserver-operator@9ce4f74775 (openshift/cluster-kube-apiserver-operator#1199).
openshift/enhancements@0ba744e750 (openshift/enhancements#762) had shifted the proposal from MCO-guards to KAS-guards, so I'm not entirely clear on why the MCO guards landed at all. But it's convenient for me that they did, because while I'm dropping them here, I'm recycling the Node lister for a new check.
4.19 is dropping bare-RHEL support, and I want the Node lister to look for RHEL entries like:

osImage: Red Hat Enterprise Linux 8.6 (Ootpa)

but we are ok with RHCOS entries like:

osImage: Red Hat Enterprise Linux CoreOS 419.96.202503032242-0
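To make the intent concrete, here is a rough, self-contained sketch of that guard (the function name and message wording are illustrative assumptions; the real change lives in pkg/operator/status.go and also carries the still-FIXME doc link):

package statussketch

import (
	"fmt"
	"strings"

	configv1 "github.com/openshift/api/config/v1"
	corev1 "k8s.io/api/core/v1"
)

// bareRHELUpgradeableCondition is a sketch (not the PR's exact code): it scans
// the listed Nodes' osImage strings and returns an Upgradeable condition that
// is False while bare-RHEL Nodes exist. LastTransitionTime handling is omitted.
func bareRHELUpgradeableCondition(nodes []*corev1.Node) configv1.ClusterOperatorStatusCondition {
	var rhelNodes []string
	for _, node := range nodes {
		osImage := node.Status.NodeInfo.OSImage
		if strings.HasPrefix(osImage, "Red Hat Enterprise Linux") &&
			!strings.HasPrefix(osImage, "Red Hat Enterprise Linux CoreOS") {
			rhelNodes = append(rhelNodes, node.Name)
		}
	}
	coStatus := configv1.ClusterOperatorStatusCondition{
		Type:   configv1.OperatorUpgradeable,
		Status: configv1.ConditionTrue,
	}
	if len(rhelNodes) > 0 {
		coStatus.Status = configv1.ConditionFalse
		coStatus.Reason = "RHELNodes"
		coStatus.Message = fmt.Sprintf("%d RHEL nodes, including %s, but OpenShift 4.19 requires RHCOS", len(rhelNodes), rhelNodes[0])
	}
	return coStatus
}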
- How to verify it
Install a 4.18 cluster with this fix. Its machine-config ClusterOperator should be Upgradeable=True. Install a bare-RHEL node. The ClusterOperator should become Upgradeable=False and complain about that node. Remove the bare-RHEL node or somehow convert it to RHCOS. The ClusterOperator should become Upgradeable=True again.
- Description for the changelog
The machine-config operator now detects bare-RHEL Nodes and warns that they will not be compatible with OpenShift 4.19.