OCPBUGS-51273: Don't crashloop for HAProxy init container #4963

cybertron · 2025-03-31T20:39:56Z

Previously, the HAProxy pod crashlooped whenever its init container failed, which is a normal, expected condition when HAProxy starts before CoreDNS. This causes problems in CI because a pod that crashes more than 3 times in a row is considered a failure. The init container usually passes well before that, but we are hitting a timing issue during upgrades: when the node is about to reboot after MCO updates the pod definitions, the check takes longer than normal because ostree is updating the node at the same time.

Since everything is behaving as expected here, let's stop failing the pod for an expected situation. This change puts the api-int call in a loop so that the init container keeps running until CoreDNS is ready, and harmless timing issues never trigger error reporting.
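
For illustration, the looping behavior described above might look roughly like the sketch below. This is not the actual diff: the `wait_until_ready` helper, the retry interval, and the example URL are all assumptions made for the sake of the example.

```shell
#!/bin/bash
# Hypothetical helper (not taken from the PR): retry a command until it
# succeeds instead of exiting non-zero, so that a slow dependency never
# shows up as a container failure.
wait_until_ready() {
    local interval="$1"   # seconds to sleep between attempts
    shift
    local attempts=1
    # "$@" is the readiness check; keep looping while it fails.
    until "$@"; do
        echo "attempt ${attempts} failed, retrying in ${interval}s" >&2
        attempts=$((attempts + 1))
        sleep "${interval}"
    done
    echo "ready after ${attempts} attempt(s)" >&2
}

# Example invocation (assumed URL and interval, shown for illustration only):
# wait_until_ready 5 curl -o /dev/null -kLfs https://api-int.example.com:6443/readyz
```

The trade-off in this shape is that the container never exits with an error while the check is failing, so a genuinely broken api-int endpoint would show up as a pod stuck in init rather than as a crashloop.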

- What I did

- How to verify it

- Description for the changelog

Previously, the HAProxy pod crashlooped whenever its init container failed, which is a normal, expected condition when HAProxy starts before CoreDNS. This causes problems in CI because a pod that crashes more than 3 times in a row is considered a failure. The init container usually passes well before that, but we are hitting a timing issue during upgrades: when the node is about to reboot after MCO updates the pod definitions, the check takes longer than normal because ostree is updating the node at the same time. Since everything is behaving as expected here, let's stop failing the pod for an expected situation. This change puts the api-int call in a loop so that the init container keeps running until CoreDNS is ready, and harmless timing issues never trigger error reporting.

openshift-ci-robot · 2025-03-31T20:40:06Z

@cybertron: This pull request references Jira Issue OCPBUGS-51273, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.19.0) matches configured target version for branch (4.19.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jadhaj

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2025-03-31T20:40:38Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cybertron
Once this PR has been reviewed and has the lgtm label, please assign dkhater-redhat for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

cybertron · 2025-04-01T13:27:57Z

/retest-required

Not used in hypershift.

cybertron · 2025-04-02T18:58:09Z

/retest-required

openshift-ci · 2025-04-02T21:24:52Z

@cybertron: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-gcp-op-ocl	`90907d5`	link	false	`/test e2e-gcp-op-ocl`
ci/prow/bootstrap-unit	`90907d5`	link	false	`/test bootstrap-unit`
ci/prow/e2e-azure-ovn-upgrade-out-of-change	`90907d5`	link	false	`/test e2e-azure-ovn-upgrade-out-of-change`
ci/prow/e2e-hypershift	`90907d5`	link	true	`/test e2e-hypershift`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci bot requested review from jadhaj, rvanderp3 and stephenfin March 31, 2025 20:40
