
Timed out after 1200.000s. Timed out waiting for all control-plane machines in Cluster k8s-upgrade-and-conformance #11296


Closed
adilGhaffarDev opened this issue Oct 16, 2024 · 6 comments
Assignees
Labels
  • kind/flake: Categorizes issue or PR as related to a flaky test.
  • needs-priority: Indicates an issue lacks a `priority/foo` label and requires one.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@adilGhaffarDev
Contributor

adilGhaffarDev commented Oct 16, 2024

Which jobs are flaking?

  • periodic-cluster-api-e2e-mink8s-main
  • periodic-cluster-api-e2e-main

Which tests are flaking?

capi-e2e [It] When upgrading a workload cluster using ClusterClass with a HA control plane [ClusterClass] Should create and upgrade a workload cluster and eventually run kubetest

ref: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-main/1845590172469563392

Since when has it been flaking?

Flaking more frequently since 11-10-2024.

Testgrid link

https://storage.googleapis.com/k8s-triage/index.html?text=Timed%20out%20waiting%20for%20all%20control-plane%20machines%20in%20Cluster%20k8s-upgrade-and-conformance&job=.*-cluster-api-.*&xjob=.*-provider-.*%7C.*-operator-.*

https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-main/1845590172469563392

Reason for failure (if possible)

The kubelet reports `Network plugin returns error: cni plugin not initialized`, so the node never reaches the expected Ready condition.
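For context, that message surfaces in the node's Ready condition in the workload cluster. A minimal client-go sketch (illustrative names and kubeconfig path, not part of the e2e framework) that prints it:

```go
// Minimal sketch: list the workload-cluster nodes and print their Ready
// condition. On the flaking runs Ready stays False with the kubelet message
// "Network plugin returns error: cni plugin not initialized".
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical kubeconfig path for the workload cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/tmp/workload.kubeconfig")
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	nodes, err := cs.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		for _, c := range n.Status.Conditions {
			if c.Type == corev1.NodeReady {
				fmt.Printf("%s Ready=%s reason=%s message=%q\n", n.Name, c.Status, c.Reason, c.Message)
			}
		}
	}
}
```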

Anything else we need to know?

No response

Label(s) to be applied

/kind flake
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot added the kind/flake, needs-priority, and needs-triage labels on Oct 16, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@adilGhaffarDev
Contributor Author

/assign
I am looking into this one

@sbueringer
Member

sbueringer commented Oct 18, 2024

Just adding it here as well for future-us.

This issue is caused by the following chain of events:

  • The test fails because kube-proxy on the worker Machine is not working.
  • kube-proxy on the worker Machine is not working because the kubelet is not working.
  • The kubelet is not working because a 1.31 kubelet is talking to a 1.30 kube-apiserver (a kubelet newer than the apiserver violates the Kubernetes version skew policy).
  • The worker was upgraded to 1.31 while 1.30 apiservers were still around.
  • The Cluster topology controller thought the control plane upgrade was already done (but a 1.30 CP Machine still existed).
  • KCP reported .spec.version == .status.version == v1.31.
  • KCP calculates .status.version as min_version(Machines with a healthy apiserver), and the 1.30 CP Machine reported APIServerHealthy false (even though it was still reachable), so it was excluded from the calculation.

We will fix the version calculation via #11304
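A minimal sketch (not the actual KCP code; names are illustrative) of the aggregation described above, showing how dropping the "unhealthy" v1.30 Machine lets status.version jump to v1.31 while a v1.30 apiserver is still serving:

```go
// Sketch of the buggy aggregation: status.version is the minimum version
// across control-plane Machines, but only Machines whose apiserver is
// considered healthy are counted. Skipping the v1.30 Machine that reports
// APIServerHealthy=false makes the minimum v1.31 too early.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/version"
)

type machine struct {
	name             string
	version          string
	apiServerHealthy bool
}

func lowestVersion(machines []machine) string {
	var lowest *version.Version
	for _, m := range machines {
		if !m.apiServerHealthy {
			continue // the still-reachable v1.30 Machine is dropped here
		}
		v := version.MustParseSemantic(m.version)
		if lowest == nil || v.LessThan(lowest) {
			lowest = v
		}
	}
	if lowest == nil {
		return ""
	}
	return "v" + lowest.String()
}

func main() {
	machines := []machine{
		{"cp-1", "v1.31.0", true},
		{"cp-2", "v1.31.0", true},
		{"cp-3", "v1.30.0", false}, // reachable, but reported unhealthy
	}
	// Prints v1.31.0 even though a v1.30 apiserver still exists, so the
	// topology controller starts the worker upgrade too early.
	fmt.Println(lowestVersion(machines))
}
```

Counting all control-plane Machines in the minimum, regardless of reported apiserver health, would keep status.version at v1.30 until the old Machine is actually gone; the actual change in #11304 may differ in the details.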

@adilGhaffarDev
Contributor Author

We are not seeing this flake anymore: https://storage.googleapis.com/k8s-triage/index.html?text=Timed%20out%20waiting%20for%20all%20control-plane%20machines%20in%20Cluster%20k8s-upgrade-and-conformance&job=.*-cluster-api-.*&xjob=.*-provider-.*%7C.*-operator-
We can close this issue.
Thank you @chrischdi and @sbueringer for fixing this.

@sbueringer
Member

Perfect! Thx for checking / confirming

/close

@k8s-ci-robot
Contributor

@sbueringer: Closing this issue.

In response to this:

Perfect! Thx for checking / confirming

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
