
WIP: OCPBUGS-53427: pkg/operator/status: Drop kubelet skew guard, add RHEL guard #4956

Open

wking wants to merge 1 commit into release-4.18 from only-rhcos-on-4.19

Conversation

wking
Member

@wking wking commented Mar 26, 2025

Closes: OCPBUGS-53427

- What I did

The kubelet skew guards are from 1471d2c (#2658). But the Kube API server also landed similar guards in
openshift/cluster-kube-apiserver-operator@9ce4f74775 (openshift/cluster-kube-apiserver-operator#1199).
openshift/enhancements@0ba744e750 (openshift/enhancements#762) had shifted the proposal from MCO-guards to KAS-guards, so I'm not entirely clear on why the MCO guards landed at all. But it's convenient for me that they did, because while I'm dropping them here, I'm recycling the Node lister for a new check.

4.19 is dropping bare-RHEL support, and I want the Node lister to look for RHEL entries like:

osImage: Red Hat Enterprise Linux 8.6 (Ootpa)

but we are ok with RHCOS entries like:

osImage: Red Hat Enterprise Linux CoreOS 419.96.202503032242-0

- How to verify it

Install a 4.18 cluster with this fix. Its machine-config ClusterOperator should be Upgradeable=True. Install a bare-RHEL node. The ClusterOperator should become Upgradeable=False and complain about that node. Remove the bare-RHEL node or somehow convert it to RHCOS. The ClusterOperator should become Upgradeable=True again.

- Description for the changelog

The machine-config operator now detects bare-RHEL Nodes and warns that they will not be compatible with OpenShift 4.19.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 26, 2025
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 26, 2025
@openshift-ci-robot
Contributor

@wking: This pull request references Jira Issue OCPBUGS-53427, which is invalid:

  • release note text must be set and not match the template OR release note type must be set to "Release Note Not Required". For more information you can reference the OpenShift Bug Process.
  • expected Jira Issue OCPBUGS-53427 to depend on a bug targeting a version in 4.19.0 and in one of the following states: VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE), CLOSED (DONE-ERRATA), but no dependents were found

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

> Closes: OCPBUGS-53427 …

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Contributor

openshift-ci bot commented Mar 26, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: wking
Once this PR has been reviewed and has the lgtm label, please assign lorbuschris for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

var (
	lastError      error
	kubeletVersion string
)
nodes, err := optr.GetAllManagedNodes(pools)
Member Author

As part of being a WIP, should this be GetAllManagedNodes, or should I be looking at all nodes regardless of management? Or maybe just looking at unmanaged nodes? Or...?

Member

Everywhere the MCO lists nodes, we're looking at ones that belong to a given MachineConfigPool, which generally means they're managed. So, for the sake of consistency, this should probably be GetAllManagedNodes(). I could be persuaded otherwise though.

Member Author

I don't know enough about bare-RHEL though; are those Nodes also returned by GetAllManagedNodes?

if len(rhelNodes) > 0 {
	coStatus.Status = configv1.ConditionFalse
	coStatus.Reason = "RHELNodes"
	coStatus.Message = fmt.Sprintf("%d RHEL nodes, including %s, but OpenShift 4.19 requires RHCOS https://FIXME-DOC-LINK", len(rhelNodes), rhelNodes[0])
Member Author

Any ideas where I should be linking folks running 4.18 today for "bare-RHEL is gone in 4.19, and here's how we recommend you migrate..."?

Member

OSDOCS-12979 is probably for this. cc @gpei


@jhou1 yes, that's the Doc jira issue tracking it

@@ -525,83 +501,33 @@ func (optr *Operator) cfeEvalCgroupsV1() (bool, error) {
return nodeClusterConfig.Spec.CgroupMode == configv1.CgroupModeV1, nil
}

// isKubeletSkewSupported checks the version skew of kube-apiserver and node kubelet version.
// Returns the skew status. version skew > 2 is not supported.
func (optr *Operator) isKubeletSkewSupported(pools []*mcfgv1.MachineConfigPool) (skewStatus string, coStatus configv1.ClusterOperatorStatusCondition, err error) {
Member Author

I pitch dropping this in my commit message, but thoughts about whether I should drop it in the dev branch? I'd like to, but looking for MCO-maintainer opinions first.

And then, if the MCO-maintainer opinion is "yes, please drop this in dev", I'd like MCO-maintainer opinions on whether you want me to use this same bug-series (because we'll need a dev bug anyway that says "4.19 won't need the bare-RHEL guard, because that's only a 4.18 -> 4.19 issue, so only 4.18's MCO needs that guard"), or if you want it NO-ISSUE, or you want a separate OCPBUGS series?

Member

My opinion is that we can drop this in the dev branch. And since we'll need a bug, I think it's fine to use the same bug-series.

Member Author

I've opened #4970 to drop this guard in the dev branch, and I'll rebase this pull and close this thread once that's been kicked around, merged, and verified.

@wking wking force-pushed the only-rhcos-on-4.19 branch 2 times, most recently from 68256ae to 5fc0354 Compare March 27, 2025 15:46
var (
lastError error
kubeletVersion string
)
nodes, err := optr.GetAllManagedNodes(pools)

if kubeletVersion == "" {
	continue
}
osImage := node.Status.NodeInfo.OSImage
if strings.HasPrefix(osImage, "Red Hat Enterprise Linux") && !strings.HasPrefix(osImage, "Red Hat Enterprise Linux CoreOS") {
Member

issue: We should consider OKD as well. Maybe something like this instead?

if (strings.HasPrefix(osImage, "Red Hat Enterprise Linux") || strings.HasPrefix(osImage, "CentOS Stream")) && !strings.Contains(osImage, "CoreOS") {
	// Do stuff.
}

For the record, we only support OKD on SCOS (CentOS Stream CoreOS) now, so no need to worry about the Fedora CoreOS case here.

Member Author

I don't understand how OKD vs. OCP comes into this. Aren't bare-RHEL Nodes something users add on their own, and so something they could have added to either OKD or OCP clusters? RHCOS and SCOS are what the cluster uses when creating Machines and managing kubelet versions on its own, but those Machines/Nodes are fine; for this 4.18 -> 4.19 guard we're just looking for Nodes where the user manages the OS/kubelet directly.


The kubelet skew guards are from 1471d2c (Bug 1986453: Check for
API server and node versions skew, 2021-07-27, openshift#2658).  But the Kube
API server also landed similar guards in
openshift/cluster-kube-apiserver-operator@9ce4f74775 (add
KubeletVersionSkewController, 2021-08-26,
openshift/cluster-kube-apiserver-operator#1199).
openshift/enhancements@0ba744e750 (eus-upgrades-mvp: don't enforce
skew check in MCO, 2021-04-29, openshift/enhancements#762) had shifted
the proposal from MCO-guards to KAS-guards, so I'm not entirely clear
on why the MCO guards landed at all.  But it's convenient for me that
they did, because while I'm dropping them here, I'm recycling the Node
lister for a new check.

4.19 is dropping bare-RHEL support, and I want the Node lister to look
for RHEL entries like:

  osImage: Red Hat Enterprise Linux 8.6 (Ootpa)

but we are ok with RHCOS entries like:

  osImage: Red Hat Enterprise Linux CoreOS 419.96.202503032242-0
@wking wking force-pushed the only-rhcos-on-4.19 branch from 5fc0354 to 9915680 Compare April 3, 2025 21:10
Contributor

openshift-ci bot commented Apr 4, 2025

@wking: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-vsphere-ovn-upi 9915680 link false /test e2e-vsphere-ovn-upi
ci/prow/verify 9915680 link true /test verify
ci/prow/e2e-azure-ovn-upgrade-out-of-change 9915680 link false /test e2e-azure-ovn-upgrade-out-of-change
ci/prow/e2e-gcp-op-techpreview 9915680 link false /test e2e-gcp-op-techpreview
ci/prow/e2e-vsphere-ovn-upi-zones 9915680 link false /test e2e-vsphere-ovn-upi-zones
ci/prow/unit 9915680 link true /test unit

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
