Description
We recently found an issue in KCP upgrades. It only led to failures in the self-hosted tests, but in general KCP was upgrading through the entire control plane while only static pods came up (kube-proxy and CNI did not). The Nodes also weren't ready during the upgrade.
I think we should improve our e2e test coverage to detect if the control plane is in a state like this during upgrades.
This could be either additional validation while the upgrade is running, or we could, for example, deploy a StatefulSet with PDBs that runs on CP nodes. With the StatefulSet approach, the test would have failed on the issue from #10947.
Bonus: The StatefulSet with PDBs that we deploy could potentially also use volumes, which would give us some extra test coverage for "wait for volume detach". It's unclear, though, whether we have a CSI implementation for CAPD that would work for that (maybe the provisioner that comes with kind out-of-the-box).
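To make the "additional validation while the upgrade is running" option a bit more concrete, here is a minimal sketch of the kind of check the e2e test could run periodically during the upgrade. It is not existing test framework code. `NodeSnapshot` and `PodSnapshot` are hypothetical, simplified stand-ins for the fields we would read from `corev1.Node` and `corev1.Pod`, and the function names, the pod name prefixes and the `maxUnhealthy` threshold are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// NodeSnapshot is a hypothetical, simplified view of a control plane Node.
// Ready stands in for the Node's Ready condition.
type NodeSnapshot struct {
	Name  string
	Ready bool
}

// PodSnapshot is a hypothetical, simplified view of a Pod.
// Ready stands in for the Pod's Ready condition.
type PodSnapshot struct {
	Name     string
	NodeName string
	Ready    bool
}

// hasAllRequiredPods reports whether the node runs at least one ready Pod
// for every required name prefix (e.g. "kube-proxy-").
func hasAllRequiredPods(nodeName string, pods []PodSnapshot, prefixes []string) bool {
	for _, prefix := range prefixes {
		found := false
		for _, pod := range pods {
			if pod.NodeName == nodeName && pod.Ready && strings.HasPrefix(pod.Name, prefix) {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

// UnhealthyNodes returns the names of nodes that are either not Ready or
// are missing a ready Pod for one of the required prefixes.
func UnhealthyNodes(nodes []NodeSnapshot, pods []PodSnapshot, prefixes []string) []string {
	var unhealthy []string
	for _, node := range nodes {
		if !node.Ready || !hasAllRequiredPods(node.Name, pods, prefixes) {
			unhealthy = append(unhealthy, node.Name)
		}
	}
	return unhealthy
}

// ValidateControlPlane returns an error if more than maxUnhealthy nodes are
// unhealthy. During a rolling upgrade one node is usually still coming up,
// so maxUnhealthy would typically be 1 (an assumption, not a KCP guarantee).
func ValidateControlPlane(nodes []NodeSnapshot, pods []PodSnapshot, prefixes []string, maxUnhealthy int) error {
	unhealthy := UnhealthyNodes(nodes, pods, prefixes)
	if len(unhealthy) > maxUnhealthy {
		return fmt.Errorf("%d control plane nodes unhealthy (max %d): %s",
			len(unhealthy), maxUnhealthy, strings.Join(unhealthy, ", "))
	}
	return nil
}
```

In a real test the snapshots would be built from the workload cluster's Nodes and Pods on every poll, with prefixes such as `kube-proxy-` and `kindnet-` (the latter assuming kind's default CNI in CAPD). A state like the one described above, where the whole control plane is rolled while only static pods are running, would exceed the threshold and fail the test.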