OCPQE-27285: Fix the failures in qe ci jobs #374

Merged: 1 commit merged on Feb 14, 2025

Conversation

sunzhaohua2
Contributor

@sunzhaohua2 sunzhaohua2 commented Jan 21, 2025

Fixed some issues found in qe ci jobs, @huali9 @miyadav @shellyyang1989 PTAL

  • When one case fails, the remaining cases are interrupted; set --fail-fast=false so that the other cases still run.
  • The spot case failed on arm clusters; add instance types for arm clusters.
  • Update GetArchitectureFromMachineSetNodes to get the architecture from the annotation.
  • Skip the spot case on customer vpc clusters, because it fails with this error:
termination-simulator-p862v                           1/1     Running   0              31s
$ oc logs -f termination-simulator-p862v          
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/main/aarch64/APKINDEX.tar.gz
ERROR: https://dl-cdn.alpinelinux.org/alpine/v3.14/main: network error (check Internet connection and firewall)
WARNING: Ignoring https://dl-cdn.alpinelinux.org/alpine/v3.14/main: No such file or directory
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/community/aarch64/APKINDEX.tar.gz
  • If machine creation fails, stop waiting.
  • Some fields are not defaulted on customer vpc clusters; add them when creating a machine from a minimal providerSpec.
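
For readers following the arm-related items above, here is a minimal sketch of reading the architecture from an annotation and picking an instance type from it. It is not the code in this PR: the annotation key is an assumption about where the arch label is stored, `archFromAnnotations` and `spotInstanceType` are hypothetical helpers, and the instance types are illustrative placeholders only.

```go
package main

import (
	"fmt"
	"strings"
)

// archLabelsAnnotation is ASSUMED to be the MachineSet annotation that carries
// node labels such as "kubernetes.io/arch=arm64". Verify the key on a real
// cluster before relying on it.
const archLabelsAnnotation = "capacity.cluster-autoscaler.kubernetes.io/labels"

// archFromAnnotations (hypothetical helper) extracts the architecture from a
// comma-separated "key=value" label list stored in the annotation. It returns
// an error when the annotation or the arch label is missing.
func archFromAnnotations(annotations map[string]string) (string, error) {
	raw, ok := annotations[archLabelsAnnotation]
	if !ok {
		return "", fmt.Errorf("annotation %q not found", archLabelsAnnotation)
	}
	for _, pair := range strings.Split(raw, ",") {
		key, value, found := strings.Cut(strings.TrimSpace(pair), "=")
		if found && key == "kubernetes.io/arch" && value != "" {
			return value, nil
		}
	}
	return "", fmt.Errorf("no kubernetes.io/arch label in %q", raw)
}

// spotInstanceType (hypothetical helper) maps an architecture to an instance
// type. The values are illustrative placeholders, not the ones used in the PR.
func spotInstanceType(arch string) (string, error) {
	switch arch {
	case "amd64":
		return "m5.large", nil
	case "arm64":
		return "m6g.large", nil
	default:
		return "", fmt.Errorf("unsupported architecture %q", arch)
	}
}

func main() {
	annotations := map[string]string{
		archLabelsAnnotation: "kubernetes.io/arch=arm64",
	}
	arch, err := archFromAnnotations(annotations)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	instanceType, err := spotInstanceType(arch)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(arch, instanceType)
}
```

Returning an error for an unknown architecture, rather than falling back to an amd64 type, makes a mismatch show up as a clear test failure instead of a machine that never becomes ready.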

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jan 21, 2025
@openshift-ci-robot

openshift-ci-robot commented Jan 21, 2025

@sunzhaohua2: This pull request references OCPQE-27285 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

Fixed some issues found in qe ci jobs, @huali9 @miyadav @shellyyang1989 PTAL

  • When one case fails, the remaining cases are interrupted; set --fail-fast=false so that the other cases still run.
  • The spot case failed on arm clusters; add instance types for arm clusters.
  • Update GetArchitectureFromMachineSetNodes to get the architecture from the annotation.
  • Skip the spot case on disconnected clusters, because it fails with this error:
termination-simulator-p862v                           1/1     Running   0              31s
$ oc logs -f termination-simulator-p862v          
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/main/aarch64/APKINDEX.tar.gz
ERROR: https://dl-cdn.alpinelinux.org/alpine/v3.14/main: network error (check Internet connection and firewall)
WARNING: Ignoring https://dl-cdn.alpinelinux.org/alpine/v3.14/main: No such file or directory
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/community/aarch64/APKINDEX.tar.gz
  • Pushed the golang image to the internal registry so that disconnected clusters can access it.
  • Webhook: if machine creation fails, stop waiting.
  • Some fields are not defaulted on azure customer vpc clusters; add them when creating a machine from a minimal providerSpec.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from elmiko and sub-mod January 21, 2025 02:13
@miyadav
Member

miyadav commented Jan 21, 2025

Thanks @sunzhaohua2, I was trying to see whether we can build an image from this PR and test it pre-merge, but it won't work that way. Maybe we can merge; can you think of any other way to test?
{ could not resolve inputs: could not determine inputs for step [input:cluster-api-actuator-pkg-test]: could not resolve base image from ci-ln-tfckgw2/cluster-api-actuator-pkg-test:latest: imagestreamtags.image.openshift.io "cluster-api-actuator-pkg-test" not found}

@sunzhaohua2
Contributor Author

Thanks Milind for helping to check. I tested locally with the command below; all cases run even if some of them fail. I re-tested the failed ones locally and they passed on azure versioned-installer-customer_vpc-disconnected-fully_private_cluster-arm.
$ hack/ci-integration.sh --junit-report=junit_cluster-api-actuator-testutils.xml --label-filter='disruptive' -p > disruptive.log

@miyadav
Member

miyadav commented Jan 21, 2025

Yeah, it worked well locally for me too, thanks for checking.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 21, 2025
@sunzhaohua2
Contributor Author

@JoelSpeed can you help to take a look when you have time?

@@ -7,7 +7,7 @@ go run ./vendor/github.com/onsi/ginkgo/v2/ginkgo \
   -v \
   --timeout=115m \
   --grace-period=5m \
-  --fail-fast \
+  --fail-fast=false \
Contributor

Out of interest, why do we switch to false here?

Contributor Author

Currently, if one case fails, the remaining cases are marked as [INTERRUPTED]. I changed this so that the other test cases continue to run even if one test fails.

Contributor

Are we sure that the failed test case won't interfere with the later test cases? It will definitely be cleaned up properly?

I wouldn't want the failure to then impact the later tests somehow

Contributor Author

I ran the disruptive cases locally on an azure versioned-installer-fully_private_cluster-proxy cluster and didn't see the failed case interfere with the later test cases. I will check on aws and gcp as well.

Contributor Author

I ran them several times on aws/gcp and didn't see the failed case interfere with the later test cases. Switching to false here looks safe.
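
To make the trade-off discussed in this thread concrete, here is a toy model of the behaviour described above. It is not Ginkgo's implementation: `spec` and `runSpecs` are hypothetical names, the spec names are made up, and the "interrupted" status simply mirrors the [INTERRUPTED] marking mentioned in the thread.

```go
package main

import "fmt"

// spec is a toy test case: a name plus a function reporting pass/fail.
type spec struct {
	name string
	run  func() bool
}

// runSpecs executes specs in order and returns one status string per spec.
// With failFast set, every spec after the first failure is marked
// "interrupted" without running; otherwise all specs run.
func runSpecs(specs []spec, failFast bool) []string {
	statuses := make([]string, 0, len(specs))
	failed := false
	for _, s := range specs {
		if failFast && failed {
			statuses = append(statuses, s.name+": interrupted")
			continue
		}
		if s.run() {
			statuses = append(statuses, s.name+": passed")
		} else {
			statuses = append(statuses, s.name+": failed")
			failed = true
		}
	}
	return statuses
}

func main() {
	// Hypothetical spec names; the second one always fails.
	specs := []spec{
		{"create machine", func() bool { return true }},
		{"spot termination", func() bool { return false }},
		{"scale machineset", func() bool { return true }},
	}
	fmt.Println(runSpecs(specs, true))
	fmt.Println(runSpecs(specs, false))
}
```

The model also shows why the cleanup question matters: with fail-fast disabled, specs after the failure do run, so they depend on the failed spec having cleaned up after itself.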

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 6, 2025
@JoelSpeed
Contributor

Apart from a pending reply on #374 (comment), I think this looks ok and we can look to get this merged once we work out this thread

@sunzhaohua2
Contributor Author

Thanks @JoelSpeed, I ran the disruptive cases on aws and gcp and didn't see failures impact the later tests.
I am checking the spot instance case: termination-simulator fails not only on disconnected clusters but also on customer_vpc clusters, and I am not sure how to skip customer_vpc clusters.

 $ oc logs -f termination-simulator-62htc -c iptables                                                                                                        
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/main/x86_64/APKINDEX.tar.gz
ERROR: https://dl-cdn.alpinelinux.org/alpine/v3.14/main: network error (check Internet connection and firewall)
WARNING: Ignoring https://dl-cdn.alpinelinux.org/alpine/v3.14/main: No such file or directory
fetch https://dl-cdn.alpinelinux.org/alpine/v3.14/community/x86_64/APKINDEX.tar.gz
^C

I am wondering whether we need to change the script so that iptables and bind-tools are included in a container image that I push to quay.io/openshifttest, so that disconnected clusters can access it, or whether we should just run this case on dev clusters and skip it on qe clusters.

@miyadav
Member

miyadav commented Feb 13, 2025

@JoelSpeed
Contributor

I am wondering whether we need to change the script so that iptables and bind-tools are included in a container image that I push to quay.io/openshifttest, so that disconnected clusters can access it, or whether we should just run this case on dev clusters and skip it on qe clusters.

Would this be a lot of work?

Does merging this PR in its current state make things awkward/those tests start failing anywhere? Do we need to add some skips until we can work out a solution to the spot problem?

@openshift-ci-robot

openshift-ci-robot commented Feb 14, 2025

@sunzhaohua2: This pull request references OCPQE-27285 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

@sunzhaohua2
Contributor Author

Would this be a lot of work?

Does merging this PR in its current state make things awkward/those tests start failing anywhere? Do we need to add some skips until we can work out a solution to the spot problem?

I tried to build the image yesterday, but ran into a problem when using docker buildx to build an image that supports multiple platforms. I have skipped the failing tests on customer vpc clusters, so this should be able to merge. I will check whether I can improve it later.

@JoelSpeed
Contributor

/approve
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 14, 2025
Contributor

openshift-ci bot commented Feb 14, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2025
Contributor

openshift-ci bot commented Feb 14, 2025

@sunzhaohua2: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-openstack-operator | 5f891b6 | link | false | /test e2e-openstack-operator |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 135fde7 into openshift:master Feb 14, 2025
8 of 9 checks passed
@sunzhaohua2
Contributor Author

/cherry-pick release-4.18

@openshift-cherrypick-robot

@sunzhaohua2: new pull request created: #380

In response to this:

/cherry-pick release-4.18
