
METAL-966: Add metal3 provider #175

Open
wants to merge 5 commits into
base: main

Conversation

honza
Member

@honza honza commented Jun 7, 2024

No description provided.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 7, 2024
@openshift-ci-robot

openshift-ci-robot commented Jun 7, 2024

@honza: This pull request references METAL-966 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.


@openshift-ci openshift-ci bot requested review from damdo and nrb June 7, 2024 19:29
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 11, 2024
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 11, 2024
@honza honza force-pushed the add-metal3 branch 3 times, most recently from 2d6320e to 3e68300 Compare June 13, 2024 17:42
Member

@damdo damdo left a comment


Hey @honza, the changes you are adding here look reasonable to me.

However, we recently merged #169, which slightly changes how we handle InfraCluster generation for each platform and moves some of the switch/case around. Would you be able to adapt this PR to that change?

Before merging this PR we would need to add e2e tests (similar to what we already do for powervs/gcp/vsphere in this repo), which will require us to get openshift/cluster-api-provider-metal3#18 merged first.

@honza honza force-pushed the add-metal3 branch 3 times, most recently from 8d95a47 to ff1c497 Compare July 30, 2024 08:57
@honza honza force-pushed the add-metal3 branch 2 times, most recently from 410528e to 5eed1ef Compare August 6, 2024 08:45
@honza honza force-pushed the add-metal3 branch 2 times, most recently from bd716a1 to 1399782 Compare August 20, 2024 14:58
@honza honza force-pushed the add-metal3 branch 5 times, most recently from ef1f476 to 1fc88f7 Compare September 15, 2024 00:11
@honza honza force-pushed the add-metal3 branch 2 times, most recently from a04c708 to dcfcf2e Compare October 1, 2024 14:44
@honza honza force-pushed the add-metal3 branch 2 times, most recently from 2fc2992 to 44bd53d Compare February 24, 2025 22:41
@honza honza force-pushed the add-metal3 branch 2 times, most recently from 249ab16 to 5a298c1 Compare March 18, 2025 14:48
@nrb
Contributor

nrb commented Mar 18, 2025

/lgtm

though deferring to @damdo for approval

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 18, 2025
Member

@damdo damdo left a comment


Thanks for the changes @honza
and sorry it took me a while to review it again.

Changes mostly LGTM, but I am still seeing some errors in the cluster-capi-operator logs from the e2e run:

I0318 17:10:57.753893       1 corecluster_controller.go:97] "Reconciling core cluster" logger="CoreClusterController" controller="CoreClusterController" controllerGroup="config.openshift.io" controllerKind="ClusterOperator" ClusterOperator="cluster-api" namespace="" name="cluster-api" reconcileID="e28bbbab-82a2-4116-880a-098c385c4475"
I0318 17:10:57.754254       1 clusteroperator_controller.go:48] "Reconciling \"cluster-api\" ClusterObject" logger="ClusterOperatorController" controller="ClusterOperatorController" controllerGroup="config.openshift.io" controllerKind="ClusterOperator" ClusterOperator="cluster-api" namespace="" name="cluster-api" reconcileID="2d247a45-e414-4aae-a453-63ef6fa63f70"
I0318 17:10:57.754339       1 infracluster_controller.go:87] "Reconciling InfraCluster" logger="InfraClusterController" controller="InfraClusterController" controllerGroup="config.openshift.io" controllerKind="ClusterOperator" ClusterOperator="cluster-api" namespace="" name="cluster-api" reconcileID="de31d60e-ed68-4975-b5e6-52bf532bb85f"
I0318 17:10:57.764557       1 corecluster_controller.go:139] "Finished reconciling core cluster" logger="CoreClusterController" controller="CoreClusterController" controllerGroup="config.openshift.io" controllerKind="ClusterOperator" ClusterOperator="cluster-api" namespace="" name="cluster-api" reconcileID="e28bbbab-82a2-4116-880a-098c385c4475"
E0318 17:10:57.764599       1 controller.go:316] "Reconciler error" err="failed to ensure core cluster: failed to get infra cluster openshift-cluster-api/ostest-g6k4h: no matches for kind \"BareMetalCluster\" in version \"infrastructure.cluster.x-k8s.io/v1beta1\"" controller="CoreClusterController" controllerGroup="config.openshift.io" controllerKind="ClusterOperator" ClusterOperator="cluster-api" namespace="" name="cluster-api" reconcileID="e28bbbab-82a2-4116-880a-098c385c4475"

To fix that, you'll need to change the corecluster controller's mapOCPPlatformToInfraClusterKindAndVersion function so that it returns the Metal3Cluster Kind, rather than BareMetalCluster, for the bare metal platform.

Once that's done, let's see what the e2e results say, and then I can add my labels. Thanks!
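
For illustration only, the kind of mapping change discussed in this comment might look like the sketch below. It is not the operator's code: `infraClusterKindAndVersion`, its string-based signature, and the default branch are all hypothetical, and `Metal3Cluster` is assumed to be the kind the metal3 provider serves, based on the "no matches for kind BareMetalCluster" error and the Metal3Machine references in the logs.

```go
package main

import "fmt"

// infraClusterKindAndVersion is a hypothetical stand-in for the operator's
// platform-to-InfraCluster mapping. The real function's signature and the
// other platform cases are not shown in this conversation.
func infraClusterKindAndVersion(platform string) (kind, version string) {
	switch platform {
	case "BareMetal":
		// Assumption: the metal3 provider serves Metal3Cluster. The e2e
		// error shows that no BareMetalCluster kind is registered.
		return "Metal3Cluster", "v1beta1"
	default:
		// Assumed default: derive the kind from the platform name, which
		// would explain how "BareMetalCluster" was produced.
		return platform + "Cluster", "v1beta1"
	}
}

func main() {
	for _, p := range []string{"GCP", "BareMetal"} {
		kind, version := infraClusterKindAndVersion(p)
		fmt.Printf("%s -> %s (%s)\n", p, kind, version)
	}
}
```

The explicit case is needed because the OpenShift platform name (BareMetal) and the provider's kind prefix (Metal3) differ, so a name-derived default cannot produce the right kind.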

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Mar 19, 2025
Contributor

openshift-ci bot commented Mar 19, 2025

New changes are detected. LGTM label has been removed.

Contributor

openshift-ci bot commented Mar 19, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dtantsur
Once this PR has been reviewed and has the lgtm label, please ask for approval from nrb. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@damdo
Member

damdo commented Mar 19, 2025

It looks like the CAPI Machine is not going into Running phase.

I0319 18:51:24.078987       1 machine_controller_noderef.go:72] "Waiting for infrastructure provider to report spec.providerID" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="openshift-cluster-api/baremetal-machineset-xw5rw" namespace="openshift-cluster-api" name="baremetal-machineset-xw5rw" reconcileID="0725aa5b-f094-402b-ac0b-850e6991183d" Cluster="openshift-cluster-api/ostest-mj8qv" MachineSet="openshift-cluster-api/baremetal-machineset" Cluster="openshift-cluster-api/ostest-mj8qv" Metal3Machine="openshift-cluster-api/baremetal-machineset-xw5rw"

It seems the capm3-controller-manager is unable to set the providerID.
I am not familiar with the logs there, but maybe you can spot the issue, @honza.
see: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-capi-operator/175/pull-ci-openshift-cluster-capi-operator-main-e2e-metal3-capi-techpreview/1902388542647046144/artifacts/e2e-metal3-capi-techpreview/gather-extra/artifacts/pods/openshift-cluster-api_capm3-controller-manager-bd5f89998-2wkcn_manager.log

@honza
Member Author

honza commented Mar 20, 2025

Yes, that's expected. We're still working on a fix in our provider. Thanks

@damdo
Member

damdo commented Apr 4, 2025

@honza Hey, any news on the provider fix? Do you have a link to the upstream bug?
It would be nice to get this merged.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 4, 2025
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 4, 2025
@honza
Member Author

honza commented Apr 7, 2025

/test e2e-metal3-capi-techpreview

@damdo
Member

damdo commented Apr 7, 2025

@honza it was too early to test because the image for openshift/cluster-api-provider-metal3#38 hadn't finished building yet. Now it has finished, so it should be in the testing payload:

/test e2e-metal3-capi-techpreview

Contributor

openshift-ci bot commented Apr 7, 2025

@honza: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-azure-ovn-techpreview | fafd89d | link | false | /test e2e-azure-ovn-techpreview |
| ci/prow/regression-clusterinfra-cucushift-rehearse-capi-aws-ipi | fafd89d | link | false | /test regression-clusterinfra-cucushift-rehearse-capi-aws-ipi |


@damdo
Member

damdo commented Apr 7, 2025

Seeing lots of logs like:

I0407 13:14:31.226229       1 metal3machine_manager.go:1830] "The node does not match expected providerID. Considering other nodes " logger="controllers.Metal3Machine.Metal3Machine-controller" metal3-machine="openshift-cluster-api/baremetal-machineset-ngv9d" machine="baremetal-machineset-ngv9d" cluster="ostest-p4lxz" metal3-cluster="ostest-p4lxz" node="worker-2.ostest.test.metalkube.org" providerID="baremetalhost:///openshift-machine-api/ostest-worker-2/843374c7-1596-43a5-9277-f03a7e043d6a"
I0407 13:14:31.232067       1 metal3machine_manager.go:1332] "requeuing, could not find node with label: metal3.io/uuid=e47f3e48-812b-4381-a090-c2df1c031e88" logger="controllers.Metal3Machine.Metal3Machine-controller" metal3-machine="openshift-cluster-api/baremetal-machineset-ngv9d" machine="baremetal-machineset-ngv9d" cluster="ostest-p4lxz" metal3-cluster="ostest-p4lxz"
E0407 13:14:31.232096       1 metal3machine_controller.go:275] "Failed to set the target node providerID" err="requeuing, could not find node with label: metal3.io/uuid=e47f3e48-812b-4381-a090-c2df1c031e88. Object will be requeued after 30s" logger="controllers.Metal3Machine" providerID=""
I0407 13:14:47.818590       1 metal3labelsync_controller.go:150] "Could not find Node Ref on Machine object, will retry" logger="controllers.Metal3LabelSync.metal3-label-sync-controller" metal3-label-sync="openshift-cluster-api/ostest-extraworker-0"
I0407 13:14:55.791395       1 metal3machinetemplate_manager.go:64] "Fetching metal3Machine objects" logger="controllers.Metal3MachineTemplate.Metal3MachineTemplate-controller" metal3-machine-template="openshift-cluster-api/baremetal-machine-template"
I0407 13:14:55.795896       1 metal3machine_manager.go:674] "Updating machine" logger="controllers.Metal3Machine.Metal3Machine-controller" metal3-machine="openshift-cluster-api/baremetal-machineset-ngv9d" machine="baremetal-machineset-ngv9d" cluster="ostest-p4lxz" metal3-cluster="ostest-p4lxz"
I0407 13:14:55.799148       1 metal3machine_manager.go:1110] "Deleting nodeReuseLabelName from host, if any" logger="controllers.Metal3Machine.Metal3Machine-controller" metal3-machine="openshift-cluster-api/baremetal-machineset-ngv9d" machine="baremetal-machineset-ngv9d" cluster="ostest-p4lxz" metal3-cluster="ostest-p4lxz"
I0407 13:14:55.799926       1 metal3machine_manager.go:715] "Finished updating machine" logger="controllers.Metal3Machine.Metal3Machine-controller" metal3-machine="openshift-cluster-api/baremetal-machineset-ngv9d" machine="baremetal-machineset-ngv9d" cluster="ostest-p4lxz" metal3-cluster="ostest-p4lxz"
I0407 13:14:55.818989       1 metal3machine_manager.go:1830] "The node does not match expected providerID. Considering other nodes " logger="controllers.Metal3Machine.Metal3Machine-controller" metal3-machine="openshift-cluster-api/baremetal-machineset-ngv9d" machine="baremetal-machineset-ngv9d" cluster="ostest-p4lxz" metal3-cluster="ostest-p4lxz" node="master-0.ostest.test.metalkube.org" providerID="baremetalhost:///openshift-machine-api/ostest-master-0/c77efb99-282b-417d-b0f4-bb3a27a7243e"

In the CAPM3 provider logs: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_cluster-capi-operator/175/pull-ci-openshift-cluster-capi-operator-main-e2e-metal3-capi-techpreview/1909210066729308160/artifacts/e2e-metal3-capi-techpreview/gather-extra/artifacts/pods/openshift-cluster-api_capm3-controller-manager-fdf8cfc7f-5qkbh_manager.log
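
The log messages above suggest a two-step node lookup: compare each node's providerID with the expected one, then fall back to the `metal3.io/uuid` label, and requeue if neither matches. The sketch below loosely mirrors that flow as read from the logs. It is not the provider's implementation: the `node` type, `findTargetNode`, and the sample provider IDs are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a pared-down, hypothetical stand-in for a Kubernetes Node.
type node struct {
	Name       string
	ProviderID string
	Labels     map[string]string
}

// errNodeNotFound signals that the caller should requeue and try again.
var errNodeNotFound = errors.New("could not find node; requeue")

// findTargetNode first looks for a node whose providerID equals
// wantProviderID, then for a node labelled metal3.io/uuid=hostUUID.
func findTargetNode(nodes []node, wantProviderID, hostUUID string) (node, error) {
	if wantProviderID != "" {
		for _, n := range nodes {
			if n.ProviderID == wantProviderID {
				return n, nil
			}
		}
	}
	if hostUUID != "" {
		for _, n := range nodes {
			// Reading a missing key from a nil map yields "", so this is safe.
			if n.Labels["metal3.io/uuid"] == hostUUID {
				return n, nil
			}
		}
	}
	return node{}, errNodeNotFound
}

func main() {
	// Sample data: neither node carries the expected ID or the uuid label,
	// so the lookup fails in the same way the logs report.
	nodes := []node{
		{Name: "worker-2", ProviderID: "example://worker-2"},
		{Name: "master-0", ProviderID: "example://master-0"},
	}
	_, err := findTargetNode(nodes, "example://expected", "e47f3e48-812b-4381-a090-c2df1c031e88")
	fmt.Println("lookup error:", err)
}
```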

@honza
Member Author

honza commented Apr 7, 2025

I0407 13:14:31.226229 1 metal3machine_manager.go:1830] "The node does not match expected providerID. Considering other nodes "

We're still using the old image. In the latest main, the log message above is emitted from a different line: https://github.com/openshift/cluster-api-provider-metal3/blob/fbcbf9a7597a0c943624cd4415da69612040cfa3/baremetal/metal3machine_manager.go#L2004
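
The check described here relies on the klog header, which records the source file and line of each log call (`Lmmdd hh:mm:ss.uuuuuu threadid file:line] msg`). As a rough sketch, the call site can be pulled out of a log line and compared by hand against the source at a given commit. The `callSite` helper and its regular expression are illustrative, not part of any project in this thread.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// klogHeader matches the standard klog prefix and captures file and line.
var klogHeader = regexp.MustCompile(`^[IWEF]\d{4} \d{2}:\d{2}:\d{2}\.\d{6}\s+\d+ ([^:\s]+):(\d+)\]`)

// callSite returns the source file and line number recorded in a klog line.
// ok is false when the line does not start with a klog header.
func callSite(logLine string) (file string, line int, ok bool) {
	m := klogHeader.FindStringSubmatch(logLine)
	if m == nil {
		return "", 0, false
	}
	n, err := strconv.Atoi(m[2])
	if err != nil {
		return "", 0, false
	}
	return m[1], n, true
}

func main() {
	logLine := `I0407 13:14:31.226229       1 metal3machine_manager.go:1830] "The node does not match expected providerID. Considering other nodes "`
	file, line, ok := callSite(logLine)
	fmt.Println(file, line, ok) // prints: metal3machine_manager.go 1830 true
}
```

A line number that differs from the linked source (1830 in the log versus L2004 at the linked commit) is a quick hint that the running binary was built from older code.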

@honza
Member Author

honza commented Apr 7, 2025

The CI build was accepted; the new provider code is in 4.19.0-0.ci-2025-04-07-122331

/test e2e-metal3-capi-techpreview

Labels
jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

8 participants