Test: When testing KCP remediation Should replace unhealthy machines is failing #10885

Sunnatillo · 2024-07-17T09:22:58Z

Which jobs are failing?

capi-e2e-main capi-e2e-mink8s-main

capi-e2e: [It] When testing KCP remediation Should replace unhealthy machines
{Timed out after 300.001s.
Timed out waiting for Cluster kcp-remediation-yqc57p/kcp-remediation-5y5a3d to provision
Expected
    <string>: Provisioning
to equal
    <string>: Provisioned failed [FAILED] Timed out after 300.001s.
Timed out waiting for Cluster kcp-remediation-yqc57p/kcp-remediation-5y5a3d to provision
Expected
    <string>: Provisioning
to equal
    <string>: Provisioned
In [It] at: /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/cluster_helpers.go:144 @ 07/17/24 07:32:51.824

There were additional failures detected after the initial failure. These are visible in the timeline
}

INFO: Waiting for the cluster infrastructure to be provisioned
  STEP: Waiting for cluster to enter the provisioned phase @ 07/17/24 06:51:19.97
  [FAILED] in [It] - /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/cluster_helpers.go:144 @ 07/17/24 06:56:19.972
  [PANICKED] in [AfterEach] - /usr/local/go/src/runtime/panic.go:261 @ 07/17/24 06:56:19.973
  << Timeline
  [FAILED] Timed out after 300.001s.
  Timed out waiting for Cluster kcp-remediation-1b56mw/kcp-remediation-k2optk to provision
  Expected
      <string>: Provisioning
  to equal
      <string>: Provisioned
  In [It] at: /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/cluster_helpers.go:144 @ 07/17/24 06:56:19.972
  Full Stack Trace

Which tests are failing?

When testing KCP remediation Should replace unhealthy machines is failing

Since when has it been failing?

2024-07-16 20:04 UTC

Testgrid link

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1813470576841330688

Reason for failure (if possible)

No response

Anything else we need to know?

No response

Label(s) to be applied

/kind failing-test
/area ci


chrischdi · 2024-07-17T09:27:58Z

Maybe related to:

🐛 Cluster should be provisoned when cpRef and endpoint is set #10873

Or

🐛 capd: fix nil pointer in dockermachinepool controller #10876

sbueringer · 2024-07-17T09:31:29Z

cc @vincepri

Sunnatillo · 2024-07-17T13:16:33Z

EDITED: It is expected to be so.

The Docker machine is not becoming ready:
{"ts":1721221506274.7917,"caller":"controller/controller.go:324","msg":"Reconciler error","controller":"dockermachine","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"DockerMachine","DockerMachine":{"name":"kcp-remediation-t802hp-control-plane-xlczm","namespace":"kcp-remediation-sixb0a"},"namespace":"kcp-remediation-sixb0a","name":"kcp-remediation-t802hp-control-plane-xlczm","reconcileID":"6ca775c3-8a8b-47da-aae8-b50ccecd3dea","err":"failed to exec DockerMachine bootstrap: failed to run cloud config: stdout: Waiting for signal...\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsig

sbueringer · 2024-07-18T10:35:08Z

I opened a quick revert PR to verify whether the failure was introduced by #10873

(see: #10898)

Sunnatillo · 2024-07-18T10:51:24Z

This line is the problem. In the KCP remediation test, the cluster does not become ready when the workload cluster is initialized.

cluster-api/internal/controllers/cluster/cluster_controller_phases.go

Line 56 in ff54cbf

(cluster.Spec.ControlPlaneRef == nil || cluster.Status.ControlPlaneReady) &&

Sunnatillo · 2024-07-18T10:56:29Z

if (cluster.Spec.InfrastructureRef == nil || cluster.Status.InfrastructureReady) &&
		(cluster.Spec.ControlPlaneRef == nil || cluster.Status.ControlPlaneReady) &&
		cluster.Spec.ControlPlaneEndpoint.IsValid()

The clause (cluster.Spec.ControlPlaneRef == nil || cluster.Status.ControlPlaneReady) is problematic because in the KCP remediation test the control plane never becomes ready, and the test expects remediation to start in that state. With that clause, the cluster never reaches the Provisioned phase, so remediation never starts.
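To make the effect of that clause concrete, here is a minimal, self-contained sketch. It does not use the real Cluster API types: `cluster`, `endpoint`, and both helper functions are hypothetical stand-ins that only model the fields the quoted condition reads. The second helper drops the control plane clause purely for comparison; it is not meant to represent the actual change in any PR.

```go
package main

import "fmt"

// endpoint is a simplified stand-in for the cluster's control plane endpoint.
type endpoint struct {
	Host string
	Port int32
}

// isValid assumes an endpoint is valid once both host and port are set.
func (e endpoint) isValid() bool { return e.Host != "" && e.Port != 0 }

// cluster is a hypothetical, flattened view of the spec/status fields
// that the quoted condition reads.
type cluster struct {
	HasInfrastructureRef bool
	HasControlPlaneRef   bool
	InfrastructureReady  bool
	ControlPlaneReady    bool
	Endpoint             endpoint
}

// provisionedWithCPCheck mirrors the three-clause condition quoted above.
func provisionedWithCPCheck(c cluster) bool {
	return (!c.HasInfrastructureRef || c.InfrastructureReady) &&
		(!c.HasControlPlaneRef || c.ControlPlaneReady) &&
		c.Endpoint.isValid()
}

// provisionedWithoutCPCheck omits the control plane clause (comparison only).
func provisionedWithoutCPCheck(c cluster) bool {
	return (!c.HasInfrastructureRef || c.InfrastructureReady) &&
		c.Endpoint.isValid()
}

func main() {
	// State resembling the remediation test: infrastructure is ready and the
	// endpoint is set, but the control plane never reports ready.
	c := cluster{
		HasInfrastructureRef: true,
		HasControlPlaneRef:   true,
		InfrastructureReady:  true,
		ControlPlaneReady:    false,
		Endpoint:             endpoint{Host: "172.18.0.3", Port: 6443},
	}
	fmt.Println(provisionedWithCPCheck(c))    // prints false
	fmt.Println(provisionedWithoutCPCheck(c)) // prints true
}
```

With the control plane clause in place, a cluster in this state stays in Provisioning indefinitely, which matches the 300s timeout seen in the job output.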

Sunnatillo · 2024-07-18T10:58:14Z

@fabriziopandini
We can merge the revert PR to make CI green.

fabriziopandini · 2024-07-18T14:48:39Z

I don't think it is correct to revert #10873 entirely, because cluster phases must be improved to account for the use case where the infrastructure ref is not provided.

I have created #10903 with a small fix that should bring the tests back to green.

fabriziopandini added the priority/critical-urgent label Jul 17, 2024

k8s-ci-robot removed the needs-priority label Jul 17, 2024

fabriziopandini added the triage/accepted label Jul 17, 2024

k8s-ci-robot removed the needs-triage label Jul 17, 2024

fabriziopandini assigned Sunnatillo Jul 17, 2024

fabriziopandini mentioned this issue Jul 18, 2024

🌱 Partially revert changes for ":bug: Cluster should be provisoned when cpRef and endpoint is set" #10903

Merged

k8s-ci-robot closed this as completed in #10903 Jul 18, 2024
