Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPNODE-3029: WIP: handle required minimum kubelet version featuregate rollout #4929

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

haircommander
Copy link
Member

- What I did
When a cluster admin rolls out a minimum kubelet verison, the cluster-config-operator (eventually) will render
a new set of featuregates based on the new minimum version (enabling ones that are safe given the new minimum).

However, on a rollback, we need to make sure the MCS won't return a rendered config using the old features, in the very
odd case a cluster admin creates a newly old node and somehow overrides the osImage (unlikely). Thus, we use the similar condition
as MCS to serve the config--once one node has updated, we treat the MCP as using the new config. Then, we write the status to the
node config object so the node authorization plugin can allow older nodes that are now deemed safe.

- How to verify it

- Description for the changelog

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 19, 2025
@openshift-ci openshift-ci bot requested review from umohnani8 and yuqi-zhang March 19, 2025 20:43
Copy link
Contributor

openshift-ci bot commented Mar 19, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: haircommander
Once this PR has been reviewed and has the lgtm label, please assign djoshy for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@haircommander
Copy link
Member Author

some background openshift/enhancements#1766

@haircommander haircommander changed the title WIP: handle required minimum kubelet version featuregate rollout OCPNODE-3029: WIP: handle required minimum kubelet version featuregate rollout Mar 19, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 19, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 19, 2025
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented Mar 19, 2025

@haircommander: This pull request references OCPNODE-3029 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the spike to target the "4.19.0" version, but no target version was set.

In response to this:

- What I did
When a cluster admin rolls out a minimum kubelet verison, the cluster-config-operator (eventually) will render
a new set of featuregates based on the new minimum version (enabling ones that are safe given the new minimum).

However, on a rollback, we need to make sure the MCS won't return a rendered config using the old features, in the very
odd case a cluster admin creates a newly old node and somehow overrides the osImage (unlikely). Thus, we use the similar condition
as MCS to serve the config--once one node has updated, we treat the MCP as using the new config. Then, we write the status to the
node config object so the node authorization plugin can allow older nodes that are now deemed safe.

- How to verify it

- Description for the changelog

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 21, 2025
}
go ctrl.writebackMinimumKubeletVersionIfAppropriate(updatedPools, renderedVersions, nodeConfig, func() ([]*mcfgv1.MachineConfigPool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any particular reason this is a separate goroutine?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it could be async in case it takes a while to have the mcps rollback and I didn't know what could be blocked by having it sync

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a better solution might be to push this onto one of the work queues by doing something like this:

func (ctrl *Controller) syncKubeletConfig(key string) error {
	// Key lookup stuff above here.
	// Here, we detect that we need to do this for the current kubeletconfig, so we just kick that off.
	if ctrl.writeMinimumKubeletVersion[kubeletCfg.Name] {
		defer func() {
			delete(ctrl.writeMinimumKubeletVersion, kubeletCfg.Name)
		}()
		return ctrl.writebackMinimumKubeletVersionIfAppropriate(...)
	}

	//
	// Bulk of function above here.
	//

	if ctrl.isMinimumKubeletVersionWritebackNeeded(...) {
		// Here, we update our internal state to indicate that we need to perform this action
		// and enqueue the action.
		ctrl.writeMinimumKubeletVersion[kubeletcfg.Name] = true
		return ctrl.enqueue(kubeletCfg)
	}

	// End of function
}

We'll probably want to use a sync.Map since the work queue has multiple workers that could mutate the controller struct state at any given time.

if node.Spec.MinimumKubeletVersion == node.Status.MinimumKubeletVersion ||
node.Status.MinimumKubeletVersion == renderedKubeletVersion ||
node.Spec.MinimumKubeletVersion != renderedKubeletVersion {
klog.InfoS("Skipping writeback to nodes.config.Status.MinimumKubeletVersion because situation not correct",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: why are these conditions not correct? isn't this first one just saying that the spec and status match what we expect?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the status matches the spec, no need to do the update (though I probably should actually check renderedKubeletVersion there too)
for the second, if the status already matches the rendered then we've already done the write back for this rendered version
for the third: we're rendering a different version than what is set in the spec..
I think I have to rework this condition though

for _, mcp := range mcps {
if mcp.Status.UpdatedMachineCount > 0 {
poolsUpdated[mcp.Name] = true
} else if _, updated := poolsUpdated[mcp.Name]; !updated {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the logic here. The if condition checks if at least 1 node is updated, and if not, whether a previous loop had updated it? Wouldn't it be easier to check if the mcp itself is updated directly, and then break out of the loop if any pool doesn't meet either condition?

Note that we should be careful of pools with no nodes in them, which I think would be covered if we copy the server check

@haircommander haircommander force-pushed the required-minimum-kubelet-version branch from bdece5e to b6c210d Compare March 26, 2025 15:51
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 26, 2025
When a cluster admin rolls out a minimum kubelet verison, the cluster-config-operator (eventually) will render
a new set of featuregates based on the new minimum version (enabling ones that are safe given the new minimum).

However, on a rollback, we need to make sure the MCS won't return a rendered config using the old features, in the very
odd case a cluster admin creates a newly old node and somehow overrides the osImage (unlikely). Thus, we use the similar condition
as MCS to serve the config--once one node has updated, we treat the MCP as using the new config. Then, we write the status to the
node config object so the node authorization plugin can allow older nodes that are now deemed safe.

Signed-off-by: Peter Hunt <[email protected]>
@haircommander haircommander force-pushed the required-minimum-kubelet-version branch from b6c210d to c0c5483 Compare March 26, 2025 19:51
Copy link
Contributor

openshift-ci bot commented Mar 27, 2025

@haircommander: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/bootstrap-unit c0c5483 link false /test bootstrap-unit
ci/prow/unit c0c5483 link true /test unit
ci/prow/e2e-aws-ovn c0c5483 link true /test e2e-aws-ovn
ci/prow/verify c0c5483 link true /test verify
ci/prow/e2e-hypershift c0c5483 link true /test e2e-hypershift
ci/prow/e2e-azure-ovn-upgrade-out-of-change c0c5483 link false /test e2e-azure-ovn-upgrade-out-of-change
ci/prow/okd-scos-e2e-aws-ovn c0c5483 link false /test okd-scos-e2e-aws-ovn
ci/prow/e2e-aws-ovn-upgrade c0c5483 link true /test e2e-aws-ovn-upgrade
ci/prow/e2e-gcp-op-ocl c0c5483 link false /test e2e-gcp-op-ocl
ci/prow/e2e-aws-ovn-upgrade-out-of-change c0c5483 link false /test e2e-aws-ovn-upgrade-out-of-change

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

}
go ctrl.writebackMinimumKubeletVersionIfAppropriate(updatedPools, renderedVersions, nodeConfig, func() ([]*mcfgv1.MachineConfigPool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a better solution might be to push this onto one of the work queues by doing something like this:

func (ctrl *Controller) syncKubeletConfig(key string) error {
	// Key lookup stuff above here.
	// Here, we detect that we need to do this for the current kubeletconfig, so we just kick that off.
	if ctrl.writeMinimumKubeletVersion[kubeletCfg.Name] {
		defer func() {
			delete(ctrl.writeMinimumKubeletVersion, kubeletCfg.Name)
		}()
		return ctrl.writebackMinimumKubeletVersionIfAppropriate(...)
	}

	//
	// Bulk of function above here.
	//

	if ctrl.isMinimumKubeletVersionWritebackNeeded(...) {
		// Here, we update our internal state to indicate that we need to perform this action
		// and enqueue the action.
		ctrl.writeMinimumKubeletVersion[kubeletcfg.Name] = true
		return ctrl.enqueue(kubeletCfg)
	}

	// End of function
}

We'll probably want to use a sync.Map since the work queue has multiple workers that could mutate the controller struct state at any given time.

}

node.Status.MinimumKubeletVersion = renderedKubeletVersion
if _, err := ctrl.configClient.ConfigV1().Nodes().Update(context.TODO(), node, metav1.UpdateOptions{}); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: Since it's possible that multiple goroutines could be running this simultaneously, it is possible that this object could be updated before we call .Update(). A potential solution is to wrap everything within this function in a RetryOnConflict(): https://github.com/kubernetes/client-go/blob/master/util/retry/util.go#L103 .

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 1, 2025
@openshift-merge-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants