NO-JIRA: Env override NETWORKING_E2E_BOND_MTU #29630

Open
wants to merge 1 commit into
base: main

@mgencur mgencur commented Mar 31, 2025

This commit introduces the NETWORKING_E2E_BOND_MTU environment variable. The test that creates a "bond" interface can read it to override the default MTU. The default is used when .status.clusterNetworkMTU is undefined on the Network "cluster"; in that case the kernel sets the MTU automatically. The .status.clusterNetworkMTU field might not be defined when using a custom CNI plugin such as Cilium.

We have run into test failures when testing Hypershift/HostedControlPlane. When the management cluster has a specific clusterNetworkMTU and the "hosted" cluster uses the Cilium CNI, the hosted cluster might use a larger MTU than the management cluster. In that case, the test fails with the following error:

ERRORED: error configuring pod [e2e-test-bond-tnxmg/pod1] networking: [e2e-test-bond-tnxmg/pod1/24b00190-fbfa-4ac5-94b6-69fb4b697a04:bondnad1]: error adding container to network "bondnad1": Invalid MTU (1500). The requested MTU for bond is bigger than that of the slave link (net1), slave MTU (1400)

Can be seen in this run

This PR allows overriding the default value 1500 from the error above with a value matching the slave MTU.

This commit introduces the NETWORKING_E2E_BOND_MTU variable.
The test for creating a "bond" interface can read it to override
the default value. The default value is used when .status.clusterNetworkMTU is undefined on the Network resource "cluster"; in that case it is automatically set by the kernel.

The .status.clusterNetworkMTU might not be defined when using a custom CNI plugin such as Cilium.
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Mar 31, 2025
@openshift-ci-robot

@mgencur: This pull request explicitly references no jira issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@mgencur
Author

mgencur commented Mar 31, 2025

/cherrypick release-1.20
/cherrypick release-1.19

@openshift-cherrypick-robot

@mgencur: once the present PR merges, I will cherry-pick it on top of release-1.19, release-1.20 in new PRs and assign them to you.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci openshift-ci bot requested review from danwinship and trozet March 31, 2025 12:48
Contributor

openshift-ci bot commented Mar 31, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mgencur
Once this PR has been reviewed and has the lgtm label, please assign adambkaplan for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mgencur added a commit to mgencur/release that referenced this pull request Mar 31, 2025
…platform

This commit brings a workflow for running conformance tests on a hosted
cluster created by kubevirt. The management cluster is bare metal.

* Introduce step hypershift-kubevirt-health-check-nodecount for cases when HostedCluster doesn't get Complete until Cilium CNI (or other
  network stack) is installed on Nodes. In this case, we only check that the number of available nodes matches the one from the HostedCluster resource.
* Exclude the same tests as for the hypershift-aws-conformance-cilium workflow
  as they fail for Cilium in general (not specific to kubevirt)
* Exclude "StatefulSet Basic" and "StatefulSet Non-retain". Copied from the hypershift-kubevirt-baremetalds-conformance workflow. These tests are flaky on Kubevirt.
* Exclude "[Feature:bond]" until
  openshift/origin#29630 is merged.
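
The node-count comparison mentioned in the first bullet could look roughly like the sketch below. It is written in Go purely for illustration and is not the actual hypershift-kubevirt-health-check-nodecount step; the `nodeInfo` type, the `Available` flag, and the `checkNodeCount` function are hypothetical stand-ins for whatever the step reads from the cluster.

```go
package main

import "fmt"

// nodeInfo is a hypothetical, simplified stand-in for a cluster Node.
type nodeInfo struct {
	Name      string
	Available bool
}

// checkNodeCount compares the number of available nodes with the count
// requested by the HostedCluster resource (passed in as desired).
func checkNodeCount(desired int, nodes []nodeInfo) error {
	available := 0
	for _, n := range nodes {
		if n.Available {
			available++
		}
	}
	if available != desired {
		return fmt.Errorf("expected %d available nodes, found %d", desired, available)
	}
	return nil
}

func main() {
	// Example data only; a real check would list Nodes from the API.
	nodes := []nodeInfo{
		{Name: "worker-0", Available: true},
		{Name: "worker-1", Available: true},
	}
	if err := checkNodeCount(2, nodes); err != nil {
		fmt.Println("health check failed:", err)
		return
	}
	fmt.Println("node count matches")
}
```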
@mgencur
Author

mgencur commented Apr 1, 2025

/retest

mgencur added a commit to mgencur/release that referenced this pull request Apr 1, 2025
…platform

This commit brings a workflow for running conformance tests on a hosted
cluster created by kubevirt. The management cluster is bare metal.

* Introduce step hypershift-kubevirt-health-check-nodecount for cases when HostedCluster doesn't get Complete until Cilium CNI (or other
  network stack) is installed on Nodes. In this case, we only check that the number of available nodes matches the one from the HostedCluster resource.
* Exclude the same tests as for the hypershift-aws-conformance-cilium workflow
  as they fail for Cilium in general (not specific to kubevirt)
* Exclude "StatefulSet Basic" and "StatefulSet Non-retain". Copied from the hypershift-kubevirt-baremetalds-conformance workflow. These tests are flaky on Kubevirt.
* Exclude "[Feature:bond]" until
  openshift/origin#29630 is merged.
Contributor

openshift-ci bot commented Apr 1, 2025

@mgencur: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade-rollback | 8ee656a | link | false | /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade-rollback |
| ci/prow/e2e-aws-ovn-etcd-scaling | 8ee656a | link | false | /test e2e-aws-ovn-etcd-scaling |
| ci/prow/e2e-metal-ipi-ovn-dualstack-local-gateway | 8ee656a | link | false | /test e2e-metal-ipi-ovn-dualstack-local-gateway |
| ci/prow/e2e-gcp-ovn-etcd-scaling | 8ee656a | link | false | /test e2e-gcp-ovn-etcd-scaling |
| ci/prow/e2e-metal-ipi-ovn-kube-apiserver-rollout | 8ee656a | link | false | /test e2e-metal-ipi-ovn-kube-apiserver-rollout |
| ci/prow/e2e-vsphere-ovn-dualstack-primaryv6 | 8ee656a | link | false | /test e2e-vsphere-ovn-dualstack-primaryv6 |
| ci/prow/e2e-azure-ovn-etcd-scaling | 8ee656a | link | false | /test e2e-azure-ovn-etcd-scaling |
| ci/prow/e2e-gcp-fips-serial | 8ee656a | link | false | /test e2e-gcp-fips-serial |
| ci/prow/okd-e2e-gcp | 8ee656a | link | false | /test okd-e2e-gcp |
| ci/prow/e2e-aws-disruptive | 8ee656a | link | false | /test e2e-aws-disruptive |
| ci/prow/e2e-vsphere-ovn-etcd-scaling | 8ee656a | link | false | /test e2e-vsphere-ovn-etcd-scaling |
| ci/prow/e2e-azure-ovn-upgrade | 8ee656a | link | false | /test e2e-azure-ovn-upgrade |
| ci/prow/e2e-gcp-disruptive | 8ee656a | link | false | /test e2e-gcp-disruptive |
| ci/prow/e2e-openstack-serial | 8ee656a | link | false | /test e2e-openstack-serial |
| ci/prow/e2e-metal-ipi-virtualmedia | 8ee656a | link | false | /test e2e-metal-ipi-virtualmedia |

Full PR test history. Your PR dashboard.

openshift-trt bot commented Apr 1, 2025

Job Failure Risk Analysis for sha: 8ee656a

Job Name: Failure Risk

* pull-ci-openshift-origin-main-e2e-aws-disruptive: Medium
  [sig-node] static pods should start after being created
  Potential external regression detected for High Risk Test analysis
  ---
  [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available
  Potential external regression detected for High Risk Test analysis
* pull-ci-openshift-origin-main-e2e-aws-ovn-etcd-scaling: Low
  [bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Degraded
  This test has passed 50.00% of 2 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:rare Network:ovn NetworkStack:ipv4 Owner:eng Platform:aws SecurityMode:default Topology:ha Upgrade:none] in the last week.
* pull-ci-openshift-origin-main-e2e-azure-ovn-etcd-scaling: Medium
  [bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available
  Potential external regression detected for High Risk Test analysis
* pull-ci-openshift-origin-main-e2e-gcp-ovn-etcd-scaling: Low
  [bz-Cloud Compute] clusteroperator/control-plane-machine-set should not change condition/Degraded
  This test has passed 0.00% of 1 runs on release 4.19 [Architecture:amd64 FeatureSet:default Installer:ipi JobTier:rare Network:ovn NetworkStack:ipv4 Owner:eng Platform:gcp SecurityMode:default Topology:ha Upgrade:none] in the last week.
* pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-dualstack-local-gateway: IncompleteTests
  Tests for this run (102) are below the historical average (2402): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

Labels
jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

3 participants