Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-53408: wait for build and ensure OS image is actually new #4924

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

cheesesashimi
Copy link
Member

@cheesesashimi cheesesashimi commented Mar 18, 2025

- What I did

Occasionally, there is a delay between the time that a new rendered MachineConfig is produced and OCL begins a build. However, a couple of things happen in the interim:

  • The RenderController updates the MachineConfigPool. Because of the delay mentioned above, the NodeController begins updating all of the nodes with only the new rendered MachineConfig. The OS image remains the same because the NodeController is not ensuring that the image pullspec on the MachineOSConfig is the same as the MachineOSBuild.
  • Because of the work done to the MCD in OCPBUGS-43896: add revert logic to OCL path in MCD #4825, the original check that we had to determine whether the image pullspecs were the same is no longer present. Additionally, the logic change there makes it possible for an OS update to always occur whenever OCL is enabled, further bypassing that check.

This fixes the behavior by making the following changes.

  1. Update the Node Controller to ensure that the both the MachineOSBuild's MachineConfig reference matches the MCP's current rendered MachineConfig while also checking that the MachineOSConfig's image pullspec matches the MachineOSBuild's. In the situation where the MachineOSBuild's pullspec is empty, this check will fail and the Node Controller will requeue.
  2. Update the MCD so that even when OCL is enabled, if the OS images are the same, the OS update process is skipped.

- How to verify it

This bug is difficult to reproduce. However, it seems to occur when multiple OCL-enabled pools are updating simultaneously.

- Description for the changelog
Wait for build and ensure OS image is actually new

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 18, 2025
Copy link
Contributor

openshift-ci bot commented Mar 18, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Copy link
Contributor

openshift-ci bot commented Mar 18, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheesesashimi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 18, 2025
@cheesesashimi cheesesashimi force-pushed the zzlotnik/fix-mcd-and-node-controller branch 2 times, most recently from cab64f7 to 8d2ede1 Compare March 21, 2025 16:24
@cheesesashimi cheesesashimi changed the title check for image pullspec and don't consider OCL for OS updates OCPBUGS-53408: wait for build and ensure OS image is actually new Mar 21, 2025
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 21, 2025
@openshift-ci-robot
Copy link
Contributor

@cheesesashimi: This pull request references Jira Issue OCPBUGS-53408, which is invalid:

  • expected the bug to target the "4.19.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@cheesesashimi
Copy link
Member Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Mar 21, 2025
@cheesesashimi cheesesashimi marked this pull request as ready for review March 21, 2025 16:29
@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Mar 21, 2025
@openshift-ci-robot
Copy link
Contributor

@cheesesashimi: This pull request references Jira Issue OCPBUGS-53408, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 21, 2025
@cheesesashimi cheesesashimi force-pushed the zzlotnik/fix-mcd-and-node-controller branch from 8d2ede1 to e121461 Compare March 21, 2025 19:51
@cheesesashimi
Copy link
Member Author

/retest required

Copy link
Contributor

openshift-ci bot commented Mar 26, 2025

@cheesesashimi: The /retest command does not accept any targets.
The following commands are available to trigger required jobs:

/test e2e-aws-ovn
/test e2e-aws-ovn-upgrade
/test e2e-gcp-op
/test e2e-gcp-op-single-node
/test e2e-hypershift
/test images
/test unit
/test verify

The following commands are available to trigger optional jobs:

/test bootstrap-unit
/test e2e-agent-compact-ipv4
/test e2e-aws-disruptive
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-fips-op
/test e2e-aws-ovn-ocb-techpreview
/test e2e-aws-ovn-upgrade-ocb-techpreview
/test e2e-aws-ovn-upgrade-out-of-change
/test e2e-aws-ovn-workers-rhel8
/test e2e-aws-proxy
/test e2e-aws-serial
/test e2e-aws-single-node
/test e2e-aws-upgrade-single-node
/test e2e-aws-workers-rhel8
/test e2e-azure
/test e2e-azure-ovn-upgrade
/test e2e-azure-ovn-upgrade-out-of-change
/test e2e-azure-upgrade
/test e2e-gcp-op-ocl
/test e2e-gcp-op-techpreview
/test e2e-gcp-ovn-rt-upgrade
/test e2e-gcp-rt
/test e2e-gcp-rt-op
/test e2e-gcp-single-node
/test e2e-gcp-upgrade
/test e2e-metal-assisted
/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-ipv6
/test e2e-openstack
/test e2e-openstack-dualstack
/test e2e-openstack-externallb
/test e2e-openstack-hypershift
/test e2e-openstack-parallel
/test e2e-openstack-singlestackv6
/test e2e-ovirt
/test e2e-ovirt-upgrade
/test e2e-ovn-step-registry
/test e2e-vsphere
/test e2e-vsphere-ovn-upi
/test e2e-vsphere-ovn-upi-zones
/test e2e-vsphere-ovn-zones
/test e2e-vsphere-upgrade
/test okd-e2e-aws
/test okd-e2e-gcp-op
/test okd-e2e-upgrade
/test okd-e2e-vsphere
/test okd-images
/test okd-scos-e2e-aws-ovn
/test okd-scos-images
/test security

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-machine-config-operator-main-bootstrap-unit
pull-ci-openshift-machine-config-operator-main-e2e-agent-compact-ipv4
pull-ci-openshift-machine-config-operator-main-e2e-aws-ovn
pull-ci-openshift-machine-config-operator-main-e2e-aws-ovn-upgrade
pull-ci-openshift-machine-config-operator-main-e2e-aws-ovn-upgrade-out-of-change
pull-ci-openshift-machine-config-operator-main-e2e-azure-ovn-upgrade-out-of-change
pull-ci-openshift-machine-config-operator-main-e2e-gcp-op
pull-ci-openshift-machine-config-operator-main-e2e-gcp-op-ocl
pull-ci-openshift-machine-config-operator-main-e2e-gcp-op-single-node
pull-ci-openshift-machine-config-operator-main-e2e-gcp-op-techpreview
pull-ci-openshift-machine-config-operator-main-e2e-hypershift
pull-ci-openshift-machine-config-operator-main-images
pull-ci-openshift-machine-config-operator-main-okd-scos-e2e-aws-ovn
pull-ci-openshift-machine-config-operator-main-security
pull-ci-openshift-machine-config-operator-main-unit
pull-ci-openshift-machine-config-operator-main-verify

In response to this:

/retest required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@cheesesashimi cheesesashimi force-pushed the zzlotnik/fix-mcd-and-node-controller branch 6 times, most recently from 6723c30 to b398da2 Compare April 1, 2025 19:55
Occasionally, there is a delay between the time that a new rendered
MachineConfig is produced and OCL begins a build. However, a couple of
things happen in the interim:

- The RenderController updates the MachineConfigPool. Because of the
  delay mentioned above, the NodeController begins updating all of the
  nodes with only the new rendered MachineConfig. The OS image remains
  the same because the NodeController is not ensuring that the image
  pullspec on the MachineOSConfig is the same as the MachineOSBuild.
- Because of the work done to the MCD in
  openshift#4825, the
  original check that we had to determine whether the image pullspecs
  were the same is no longer present. Additionally, the logic change
  there makes it possible for an OS update to always occur whenever OCL
  is enabled, further bypassing that check.

This fixes that by doing two things:

1. Update the Node Controller to ensure that the both the
   MachineOSBuild's MachineConfig reference matches the MCP's current
   rendered MachineConfig while also checking that the MachineOSConfig's
   image pullspec matches the MachineOSBuild's. In the situation where
   the MachineOSBuild's pullspec is empty, this check will fail and the
   Node Controller will requeue.
2. Update the MCD so that even when OCL is enabled, if the OS images are
   the same, the OS update process is skipped.
@cheesesashimi cheesesashimi force-pushed the zzlotnik/fix-mcd-and-node-controller branch from b398da2 to c805f52 Compare April 2, 2025 13:45
Copy link
Contributor

openshift-ci bot commented Apr 2, 2025

@cheesesashimi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-op c805f52 link true /test e2e-gcp-op
ci/prow/e2e-gcp-op-single-node c805f52 link true /test e2e-gcp-op-single-node
ci/prow/e2e-aws-ovn-upgrade c805f52 link true /test e2e-aws-ovn-upgrade
ci/prow/e2e-azure-ovn-upgrade-out-of-change c805f52 link false /test e2e-azure-ovn-upgrade-out-of-change
ci/prow/e2e-aws-ovn c805f52 link true /test e2e-aws-ovn
ci/prow/e2e-hypershift c805f52 link true /test e2e-hypershift
ci/prow/e2e-gcp-op-techpreview c805f52 link false /test e2e-gcp-op-techpreview
ci/prow/verify c805f52 link true /test verify
ci/prow/e2e-gcp-op-ocl c805f52 link false /test e2e-gcp-op-ocl
ci/prow/bootstrap-unit c805f52 link false /test bootstrap-unit
ci/prow/e2e-agent-compact-ipv4 c805f52 link false /test e2e-agent-compact-ipv4

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants