OCPQE-27285: Fix the failures in qe ci jobs #374

sunzhaohua2 · 2025-01-21T02:12:45Z

Fixed some issues found in QE CI jobs. @huali9 @miyadav @shellyyang1989 PTAL

When one case fails, the remaining cases are interrupted. Set --fail-fast=false so that the other cases still run.
The spot case failed on arm clusters. Add instance types for arm clusters.
Update GetArchitectureFromMachineSetNodes to get the architecture from an annotation.
Skip the spot case on customer VPC clusters, because the termination-simulator pod fails with a network error:

termination-simulator-p862v                           1/1     Running   0              31s
$ oc logs -f termination-simulator-p862v          
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/main/aarch64/APKINDEX.tar.gz
ERROR: https://dl-cdn.alpinelinux.org/alpine/v3.14/main: network error (check Internet connection and firewall)
WARNING: Ignoring https://dl-cdn.alpinelinux.org/alpine/v3.14/main: No such file or directory
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/community/aarch64/APKINDEX.tar.gz

If machine creation fails, stop waiting.
Some fields are not defaulted on customer VPC clusters. Add them when creating a machine from a minimal providerSpec.
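
To illustrate the "get arch from annotation" item above, here is a minimal Go sketch of reading an architecture out of a MachineSet's annotations. It is not the code from this PR: `archFromAnnotations` is a hypothetical helper, and the annotation key `capacity.cluster-autoscaler.kubernetes.io/labels` with a `kubernetes.io/arch=<arch>` entry is an assumption about what the cluster sets, so verify it against a real MachineSet before relying on it.

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed annotation key and label name (not confirmed by this PR thread).
const (
	labelsAnnotation = "capacity.cluster-autoscaler.kubernetes.io/labels"
	archLabel        = "kubernetes.io/arch"
)

// archFromAnnotations is a hypothetical helper. It parses a comma-separated
// list of "key=value" pairs stored under labelsAnnotation and returns the
// value of the architecture label.
func archFromAnnotations(annotations map[string]string) (string, error) {
	raw, ok := annotations[labelsAnnotation]
	if !ok {
		return "", fmt.Errorf("annotation %q not found", labelsAnnotation)
	}
	for _, pair := range strings.Split(raw, ",") {
		key, value, found := strings.Cut(strings.TrimSpace(pair), "=")
		if found && key == archLabel {
			return value, nil
		}
	}
	return "", fmt.Errorf("label %q not present in annotation %q", archLabel, labelsAnnotation)
}

func main() {
	annotations := map[string]string{
		labelsAnnotation: "kubernetes.io/arch=arm64",
	}
	arch, err := archFromAnnotations(annotations)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(arch) // prints "arm64"
}
```

Reading the architecture from the MachineSet itself avoids depending on a node already existing, which may matter when a MachineSet is scaled to zero.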

openshift-ci-robot · 2025-01-21T02:12:48Z

@sunzhaohua2: This pull request references OCPQE-27285 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

Fixed some issues found in qe ci jobs, @huali9 @miyadav @shellyyang1989 PTAL

When one case failed, other cases will be interrupted, set --fail-fast=false so that other cases can be tested.

Spot case failed on arm clusters, add instance types for arm clusters

Update GetArchitectureFromMachineSetNodes to get arch from annotation

Skip spot case on disconnected clusters, as error:
termination-simulator-p862v                           1/1     Running   0              31s
$ oc logs -f termination-simulator-p862v          
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/main/aarch64/APKINDEX.tar.gz
ERROR: https://dl-cdn.alpinelinux.org/alpine/v3.14/main: network error (check Internet connection and firewall)
WARNING: Ignoring https://dl-cdn.alpinelinux.org/alpine/v3.14/main: No such file or directory
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/community/aarch64/APKINDEX.tar.gz
Pushed golang image to internal registry so that disconnected cluster can access

Webhook: if machine creation failed, then stop waiting.

Some fields are not defaulted on azure customer vpc clusters, add them when creating a machine from a minimal providerSpec

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot · 2025-01-21T02:14:07Z

@sunzhaohua2: This pull request references OCPQE-27285 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

miyadav · 2025-01-21T03:56:12Z

Thanks @sunzhaohua2. I tried to build an image from this PR and test it pre-merge, but that doesn't work. Maybe we can merge. Can you think of any other way to test?
{ could not resolve inputs: could not determine inputs for step [input:cluster-api-actuator-pkg-test]: could not resolve base image from ci-ln-tfckgw2/cluster-api-actuator-pkg-test:latest: imagestreamtags.image.openshift.io "cluster-api-actuator-pkg-test" not found}

sunzhaohua2 · 2025-01-21T04:38:53Z

Thanks Milind for helping to check. I tested locally with the command below, and all cases run even if some of them fail. The previously failing cases passed when I ran them locally on an azure versioned-installer-customer_vpc-disconnected-fully_private_cluster-arm cluster.
$ hack/ci-integration.sh --junit-report=junit_cluster-api-actuator-testutils.xml --label-filter='disruptive' -p > disruptive.log

miyadav · 2025-01-21T09:59:38Z

Yeah, it worked well locally for me too, thanks for checking.

/lgtm

sunzhaohua2 · 2025-02-05T03:10:25Z

@JoelSpeed can you help to take a look when you have time?

JoelSpeed · 2025-02-05T11:16:48Z

hack/ci-integration.sh

@@ -7,7 +7,7 @@ go run ./vendor/github.com/onsi/ginkgo/v2/ginkgo \
    -v \
    --timeout=115m \
    --grace-period=5m \
-    --fail-fast \
+    --fail-fast=false \


JoelSpeed (Feb 5, 2025): Out of interest, why do we switch to false here?

sunzhaohua2 (Feb 6, 2025): Currently, if one case fails, the remaining cases are marked as [INTERRUPTED]. I changed this so that the other test cases continue to run even if one test fails.

JoelSpeed (Feb 6, 2025): Are we sure that the failed test case won't interfere with the later test cases? Will it definitely be cleaned up properly?

I wouldn't want the failure to then impact the later tests somehow.

sunzhaohua2 (Feb 6, 2025): I ran the disruptive cases locally on an azure versioned-installer-fully_private_cluster-proxy cluster and didn't find the failed case interfering with the later test cases. I will check on aws and gcp as well.

sunzhaohua2 (Feb 14, 2025): I ran them several times on aws/gcp and didn't find the failed case interfering with the later test cases. Switching to false here looks safe.

pkg/framework/machinesets.go

pkg/infra/spot.go

pkg/infra/webhooks.go

JoelSpeed · 2025-02-12T15:28:02Z

Apart from a pending reply on #374 (comment), I think this looks ok and we can look to get this merged once we work out this thread

sunzhaohua2 · 2025-02-13T10:43:30Z

Thanks @JoelSpeed. I ran the disruptive cases on aws and gcp and didn't find failures impacting the later tests.
I am checking the spot instance case. termination-simulator fails not only on disconnected clusters but also on customer_vpc clusters, and I am not sure how to skip customer_vpc clusters.

 $ oc logs -f termination-simulator-62htc -c iptables                                                                                                        
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/main/x86_64/APKINDEX.tar.gz
ERROR: https://dl-cdn.alpinelinux.org/alpine/v3.14/main: network error (check Internet connection and firewall)
WARNING: Ignoring https://dl-cdn.alpinelinux.org/alpine/v3.14/main: No such file or directory
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/community/x86_64/APKINDEX.tar.gz
^C

I am wondering whether to change the script so that iptables and bind-tools are included in a container image that I push to quay.io/openshifttest, which disconnected clusters can access, or whether we just run this case on dev clusters and skip it on QE clusters.

miyadav · 2025-02-13T11:21:06Z

Apart from a pending reply on #374 (comment), I think this looks ok and we can look to get this merged once we work out this thread

Thanks @JoelSpeed I ran disruptive cases on aws and gcp, didn't find failures impact the later tests. I am checking the spot instance case, termination-simulator is not only failed in discconected clusters, but also failed in customer_vpc clusters, I am not sure how to skip customer_vpc cluster.
 $ oc logs -f termination-simulator-62htc -c iptables                                                                                                        
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/main/x86_64/APKINDEX.tar.gz
ERROR: https://dl-cdn.alpinelinux.org/alpine/v3.14/main: network error (check Internet connection and firewall)
WARNING: Ignoring https://dl-cdn.alpinelinux.org/alpine/v3.14/main: No such file or directory
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/community/x86_64/APKINDEX.tar.gz
^C
I am thinking if need to change the script to make iptables and bind-tools are included in a container image, and I push it to quay.io/openshifttest, then disconnected cluster can access, or we just run this case on dev clusters and skip on qe clusters.

JoelSpeed · 2025-02-13T11:47:21Z

I am wondering whether to change the script so that iptables and bind-tools are included in a container image that I push to quay.io/openshifttest, which disconnected clusters can access, or whether we just run this case on dev clusters and skip it on QE clusters.

Would this be a lot of work?

Does merging this PR in its current state make things awkward/those tests start failing anywhere? Do we need to add some skips until we can work out a solution to the spot problem?

openshift-ci-robot · 2025-02-14T07:40:00Z

@sunzhaohua2: This pull request references OCPQE-27285 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

sunzhaohua2 · 2025-02-14T07:53:03Z

I tried to build the image yesterday, but I hit a problem when using docker buildx to build a multi-platform image. I have skipped the failing tests on customer vpc clusters, so this should be able to merge. I will check whether I can improve it later.

JoelSpeed · 2025-02-14T10:56:57Z

/approve
/lgtm

openshift-ci · 2025-02-14T10:57:38Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [JoelSpeed]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci · 2025-02-14T10:59:45Z

@sunzhaohua2: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-openstack-operator	`5f891b6`	link	false	`/test e2e-openstack-operator`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

sunzhaohua2 · 2025-02-17T03:37:03Z

/cherry-pick release-4.18

openshift-cherrypick-robot · 2025-02-17T03:37:49Z

@sunzhaohua2: new pull request created: #380

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 21, 2025

openshift-ci bot requested review from elmiko and sub-mod January 21, 2025 02:13

sunzhaohua2 force-pushed the fix-failure branch from 9fb6312 to 17ab877 Compare January 21, 2025 03:17

openshift-ci bot assigned miyadav Jan 21, 2025

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 21, 2025

JoelSpeed reviewed Feb 5, 2025

sunzhaohua2 force-pushed the fix-failure branch from 17ab877 to 8e12d3f Compare February 6, 2025 08:43

openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 6, 2025

Fix the failures in qe ci jobs

5f891b6

sunzhaohua2 force-pushed the fix-failure branch from 8e12d3f to 5f891b6 Compare February 14, 2025 07:38

openshift-ci bot assigned JoelSpeed Feb 14, 2025

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 14, 2025

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2025

openshift-merge-bot bot merged commit 135fde7 into openshift:master Feb 14, 2025
8 of 9 checks passed

openshift-cherrypick-robot mentioned this pull request Feb 17, 2025

[release-4.18] NO-JIRA: Fix the failures in qe ci jobs #380

Merged

sunzhaohua2 mentioned this pull request Feb 28, 2025

[release-4.17] NO-JIRA:Fix the failures in qe ci jobs #383

Merged

miyadav mentioned this pull request Mar 11, 2025

OCPQE-27287 adding ordered argument to avoid failures #372

Closed
