Test: When testing KCP remediation Should replace unhealthy machines is failing #10885

Closed
Sunnatillo opened this issue Jul 17, 2024 · 8 comments · Fixed by #10903
Assignees
Labels
area/ci Issues or PRs related to ci kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@Sunnatillo
Contributor

Sunnatillo commented Jul 17, 2024

Which jobs are failing?

capi-e2e-main
capi-e2e-mink8s-main

capi-e2e: [It] When testing KCP remediation Should replace unhealthy machines
{Timed out after 300.001s.
Timed out waiting for Cluster kcp-remediation-yqc57p/kcp-remediation-5y5a3d to provision
Expected
    <string>: Provisioning
to equal
    <string>: Provisioned
failed [FAILED] Timed out after 300.001s.
Timed out waiting for Cluster kcp-remediation-yqc57p/kcp-remediation-5y5a3d to provision
Expected
    <string>: Provisioning
to equal
    <string>: Provisioned
In [It] at: /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/cluster_helpers.go:144 @ 07/17/24 07:32:51.824

There were additional failures detected after the initial failure. These are visible in the timeline
}
INFO: Waiting for the cluster infrastructure to be provisioned
  STEP: Waiting for cluster to enter the provisioned phase @ 07/17/24 06:51:19.97
  [FAILED] in [It] - /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/cluster_helpers.go:144 @ 07/17/24 06:56:19.972
  [PANICKED] in [AfterEach] - /usr/local/go/src/runtime/panic.go:261 @ 07/17/24 06:56:19.973
  << Timeline
  [FAILED] Timed out after 300.001s.
  Timed out waiting for Cluster kcp-remediation-1b56mw/kcp-remediation-k2optk to provision
  Expected
      <string>: Provisioning
  to equal
      <string>: Provisioned
  In [It] at: /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/cluster_helpers.go:144 @ 07/17/24 06:56:19.972
  Full Stack Trace

Which tests are failing?

When testing KCP remediation Should replace unhealthy machines is failing

Since when has it been failing?

July 16, 2024, 20:04 UTC

Testgrid link

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-main/1813470576841330688

Reason for failure (if possible)

No response

Anything else we need to know?

No response

Label(s) to be applied

/kind failing-test
/area ci

@k8s-ci-robot k8s-ci-robot added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. area/ci Issues or PRs related to ci needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 17, 2024
@chrischdi
Member

chrischdi commented Jul 17, 2024

Maybe related to:

Or

@sbueringer
Member

cc @vincepri

@fabriziopandini fabriziopandini added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Jul 17, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates an issue lacks a `priority/foo` label and requires one. label Jul 17, 2024
@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 17, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jul 17, 2024
@Sunnatillo
Contributor Author

Sunnatillo commented Jul 17, 2024

EDITED: This behavior is expected.

The Docker machine is not becoming ready:
{"ts":1721221506274.7917,"caller":"controller/controller.go:324","msg":"Reconciler error","controller":"dockermachine","controllerGroup":"infrastructure.cluster.x-k8s.io","controllerKind":"DockerMachine","DockerMachine":{"name":"kcp-remediation-t802hp-control-plane-xlczm","namespace":"kcp-remediation-sixb0a"},"namespace":"kcp-remediation-sixb0a","name":"kcp-remediation-t802hp-control-plane-xlczm","reconcileID":"6ca775c3-8a8b-47da-aae8-b50ccecd3dea","err":"failed to exec DockerMachine bootstrap: failed to run cloud config: stdout: Waiting for signal...\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsignal hold\nsig

@sbueringer
Member

I opened a quick revert PR to verify whether this was introduced by #10873

(see: #10898)

@Sunnatillo
Contributor Author

Sunnatillo commented Jul 18, 2024

This line is the problem. In the KCP remediation test, the cluster does not become ready when the workload cluster is initialized.

(cluster.Spec.ControlPlaneRef == nil || cluster.Status.ControlPlaneReady) &&

@Sunnatillo
Contributor Author

if (cluster.Spec.InfrastructureRef == nil || cluster.Status.InfrastructureReady) &&
		(cluster.Spec.ControlPlaneRef == nil || cluster.Status.ControlPlaneReady) &&
		cluster.Spec.ControlPlaneEndpoint.IsValid()

(cluster.Spec.ControlPlaneRef == nil || cluster.Status.ControlPlaneReady) is problematic because in the KCP remediation test the control plane never becomes ready, and that is exactly what should trigger remediation. With that clause, the cluster never reaches the Provisioned phase, so remediation never starts.
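To make the argument concrete, here is a minimal sketch of how the quoted condition evaluates. The `clusterState` struct and both function names are hypothetical stand-ins, not the real Cluster API types, and the state used in `main` (infrastructure ready, endpoint valid, control plane never ready) is an assumption about what the KCP remediation test produces. The second function only drops the control plane clause for comparison; it is not meant to represent the fix in #10903.

```go
package main

import "fmt"

// clusterState is a hypothetical, simplified stand-in for the fields the
// quoted condition reads. It is not the real Cluster API type.
type clusterState struct {
	HasInfrastructureRef bool
	InfrastructureReady  bool
	HasControlPlaneRef   bool
	ControlPlaneReady    bool
	EndpointValid        bool
}

// provisionedWithControlPlaneCheck mirrors the three clauses of the
// condition quoted in this comment.
func provisionedWithControlPlaneCheck(c clusterState) bool {
	return (!c.HasInfrastructureRef || c.InfrastructureReady) &&
		(!c.HasControlPlaneRef || c.ControlPlaneReady) &&
		c.EndpointValid
}

// provisionedWithoutControlPlaneCheck omits the control plane clause,
// purely to show which clause blocks the phase transition.
func provisionedWithoutControlPlaneCheck(c clusterState) bool {
	return (!c.HasInfrastructureRef || c.InfrastructureReady) &&
		c.EndpointValid
}

func main() {
	// Assumed state in the KCP remediation test: the control plane is
	// intentionally kept from becoming ready.
	c := clusterState{
		HasInfrastructureRef: true,
		InfrastructureReady:  true,
		HasControlPlaneRef:   true,
		ControlPlaneReady:    false,
		EndpointValid:        true,
	}
	fmt.Println("with control plane clause:", provisionedWithControlPlaneCheck(c))
	fmt.Println("without control plane clause:", provisionedWithoutControlPlaneCheck(c))
}
```

Under these assumptions the first check stays false for as long as the control plane is not ready, which matches the test timing out while the cluster is still in Provisioning.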

@Sunnatillo
Contributor Author

@fabriziopandini
We can merge the revert PR to make CI green.

@fabriziopandini
Member

I don't think it is correct to revert #10873 entirely, because the cluster phases must be improved to account for the use case where the infrastructure ref is not provided.

I have created #10903 to make a small fix that should bring the tests back to green.
