
Timed out after 1200.000s. Timed out waiting for all control-plane machines in Cluster k8s-upgrade-and-conformance #11296


Closed
adilGhaffarDev opened this issue Oct 16, 2024 · 6 comments
Assignees
Labels
  • kind/flake: Categorizes issue or PR as related to a flaky test.
  • needs-priority: Indicates an issue lacks a `priority/foo` label and requires one.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@adilGhaffarDev
Contributor

adilGhaffarDev commented Oct 16, 2024

Which jobs are flaking?

  • periodic-cluster-api-e2e-mink8s-main
  • periodic-cluster-api-e2e-main

Which tests are flaking?

capi-e2e [It] When upgrading a workload cluster using ClusterClass with a HA control plane [ClusterClass] Should create and upgrade a workload cluster and eventually run kubetest

ref: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-main/1845590172469563392

Since when has it been flaking?

Flaking more frequently since 11-10-2024.

Testgrid link

https://storage.googleapis.com/k8s-triage/index.html?text=Timed%20out%20waiting%20for%20all%20control-plane%20machines%20in%20Cluster%20k8s-upgrade-and-conformance&job=.*-cluster-api-.*&xjob=.*-provider-.*%7C.*-operator-.*

https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-main/1845590172469563392

Reason for failure (if possible)

The kubelet reports `Network plugin returns error: cni plugin not initialized`, so the node never reaches the expected Ready condition.
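For context, that message surfaces in the node's Ready condition in the workload cluster. A minimal client-go sketch (illustrative names and kubeconfig path, not part of the e2e framework) that prints it:

```go
// Minimal sketch: list the workload-cluster nodes and print their Ready
// condition. On the flaking runs Ready stays False with the kubelet message
// "Network plugin returns error: cni plugin not initialized".
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Hypothetical kubeconfig path for the workload cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/tmp/workload.kubeconfig")
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	nodes, err := cs.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		for _, c := range n.Status.Conditions {
			if c.Type == corev1.NodeReady {
				fmt.Printf("%s Ready=%s reason=%s message=%q\n", n.Name, c.Status, c.Reason, c.Message)
			}
		}
	}
}
```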

Anything else we need to know?

No response

Label(s) to be applied

/kind flake
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot added the kind/flake, needs-priority, and needs-triage labels on Oct 16, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@adilGhaffarDev
Contributor Author

/assign
I am looking into this one

@sbueringer
Member

sbueringer commented Oct 18, 2024

Just adding it here as well for future-us.

This issue is caused by the following chain of events:

  • The test fails because kube-proxy on the worker Machine is not working.
  • kube-proxy on the worker Machine is not working because the kubelet is not working.
  • The kubelet is not working because a 1.31 kubelet is talking to a 1.30 kube-apiserver (a kubelet newer than the apiserver violates the Kubernetes version skew policy).
  • The worker was upgraded to 1.31 while 1.30 apiservers were still around.
  • The Cluster topology controller thought the control plane upgrade was already done (but a 1.30 CP Machine still existed).
  • KCP reported .spec.version == .status.version == v1.31.
  • KCP calculates .status.version as min_version(Machines with a healthy apiserver), and the 1.30 CP Machine reported APIServerHealthy false (even though it was still reachable), so it was excluded from the calculation.

We will fix the version calculation via #11304
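A minimal sketch (not the actual KCP code; names are illustrative) of the aggregation described above, showing how dropping the "unhealthy" v1.30 Machine lets status.version jump to v1.31 while a v1.30 apiserver is still serving:

```go
// Sketch of the buggy aggregation: status.version is the minimum version
// across control-plane Machines, but only Machines whose apiserver is
// considered healthy are counted. Skipping the v1.30 Machine that reports
// APIServerHealthy=false makes the minimum v1.31 too early.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/version"
)

type machine struct {
	name             string
	version          string
	apiServerHealthy bool
}

func lowestVersion(machines []machine) string {
	var lowest *version.Version
	for _, m := range machines {
		if !m.apiServerHealthy {
			continue // the still-reachable v1.30 Machine is dropped here
		}
		v := version.MustParseSemantic(m.version)
		if lowest == nil || v.LessThan(lowest) {
			lowest = v
		}
	}
	if lowest == nil {
		return ""
	}
	return "v" + lowest.String()
}

func main() {
	machines := []machine{
		{"cp-1", "v1.31.0", true},
		{"cp-2", "v1.31.0", true},
		{"cp-3", "v1.30.0", false}, // reachable, but reported unhealthy
	}
	// Prints v1.31.0 even though a v1.30 apiserver still exists, so the
	// topology controller starts the worker upgrade too early.
	fmt.Println(lowestVersion(machines))
}
```

Counting all control-plane Machines in the minimum, regardless of reported apiserver health, would keep status.version at v1.30 until the old Machine is actually gone; the actual change in #11304 may differ in the details.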

@adilGhaffarDev
Contributor Author

We are not seeing this flake anymore: https://storage.googleapis.com/k8s-triage/index.html?text=Timed%20out%20waiting%20for%20all%20control-plane%20machines%20in%20Cluster%20k8s-upgrade-and-conformance&job=.*-cluster-api-.*&xjob=.*-provider-.*%7C.*-operator-
We can close this issue.
Thank you @chrischdi and @sbueringer for fixing this.

@sbueringer
Member

Perfect! Thx for checking / confirming

/close

@k8s-ci-robot
Contributor

@sbueringer: Closing this issue.

In response to this:

Perfect! Thx for checking / confirming

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
