Node orphaned and stuck after early deletion of Machine #7237
This is a tricky one. We need to think carefully about if/how we can handle this use case... the infra provider could surface the node ref even if it only pops up during the delete, but that doesn't really ensure it happens before deletion kicks in. |
Thanks for your insights @fabriziopandini! I had a feeling this might not be easily solved, but I wanted to at least create an issue after spending the time to figure out what happened. 😄 |
/triage accepted |
We have observed the same issue when working in a slow CloudStack environment. Basically, we are trying to scale up our cluster and request new machines. The CAPI Machines get created, then the MachineHealthCheck activates and deletes them while the node is still coming up, before the nodeRef is set. Increasing the MachineHealthCheck timeout is one option (a sketch of that workaround follows this comment). What would the team think about some process that cleans up orphaned nodes? They could be detected as nodes on a cluster that are missing the CAPI annotations.
|
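For reference, a minimal sketch of the timeout workaround mentioned above. Names, selector, and timeout values are illustrative; the assumption is only that `nodeStartupTimeout` on a v1beta1 MachineHealthCheck controls how long CAPI waits for a Node to appear before remediating the Machine:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-mhc              # illustrative name
  namespace: default
spec:
  clusterName: my-cluster       # illustrative cluster name
  # Give slow infrastructure more time to bring the Node up before
  # the MachineHealthCheck remediates (deletes) the Machine.
  nodeStartupTimeout: 30m
  selector:
    matchLabels:
      cluster.x-k8s.io/deployment-name: my-workers   # illustrative
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
```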
I'm not sure about just cleaning up all Nodes which are not annotated by Cluster API, and I wonder if that breaks some use cases. Also, our annotations are not set immediately, so we have to be careful regarding race conditions. One alternative could look like this:
Essentially, during deletion, if the nodeRef is not set, we could list the Nodes of the workload cluster filtered by a machine-name label and then delete the match. (I think setting the label is something the bootstrap provider would have to implement; Cluster API can act on it in case a Node with that label exists, and if not, the behavior stays as it is today.) A minimal sketch of this idea is below. |
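Here is that flow sketched in Go, using a controller-runtime client pointed at the workload cluster. The label key is hypothetical; nothing sets such a label by default today, a bootstrap provider would have to stamp it on Nodes:

```go
// Package nodeorphan is a hypothetical sketch, not part of Cluster API.
package nodeorphan

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// machineNameLabel is an assumed label key that would correlate a Node
// with its Machine even before status.nodeRef is set.
const machineNameLabel = "cluster.x-k8s.io/machine-name"

// deleteNodeForMachine could be called from the Machine deletion flow
// when nodeRef is empty: list workload-cluster Nodes carrying the
// Machine's name label and delete any match. If no labeled Node exists,
// nothing happens and the behavior is the same as today.
func deleteNodeForMachine(ctx context.Context, workload client.Client, machineName string) error {
	var nodes corev1.NodeList
	if err := workload.List(ctx, &nodes, client.MatchingLabels{machineNameLabel: machineName}); err != nil {
		return err
	}
	for i := range nodes.Items {
		if err := workload.Delete(ctx, &nodes.Items[i]); err != nil && !apierrors.IsNotFound(err) {
			return err
		}
	}
	return nil
}
```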
The CloudNodeLifecycleController monitors nodes and would delete a node if it's not found in the cloud provider. See https://github.com/kubernetes/cloud-provider/blob/a40fcba9db0de1bd377f8c146e41a2c2809ea3cd/controllers/nodelifecycle/node_lifecycle_controller.go#L147. A simplified sketch of that check follows. |
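Paraphrasing the linked code as a simplified sketch (not the actual implementation): for every Node, ask the cloud provider whether the backing instance still exists, and delete the Node if it does not:

```go
// Package cnlc paraphrases the cloud node lifecycle check.
package cnlc

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	cloudprovider "k8s.io/cloud-provider"
)

// reapDeletedInstances deletes Nodes whose backing cloud instance no
// longer exists, which is the core of what the
// CloudNodeLifecycleController does in its monitoring loop.
func reapDeletedInstances(ctx context.Context, cs kubernetes.Interface, instances cloudprovider.InstancesV2) error {
	nodes, err := cs.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for i := range nodes.Items {
		node := &nodes.Items[i]
		exists, err := instances.InstanceExists(ctx, node)
		if err != nil {
			return err
		}
		if !exists {
			// The instance is gone from the cloud provider: the Node is orphaned.
			if err := cs.CoreV1().Nodes().Delete(ctx, node.Name, metav1.DeleteOptions{}); err != nil {
				return err
			}
		}
	}
	return nil
}
```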
Not all providers have a cloud-provider. See for example Metal3. |
Reading the last comments, I'm starting to think that a pragmatic way to start addressing this problem is for the providers without a cloud provider to implement their own cleanup loop, deleting nodes without a corresponding machine; CAPI already provides a couple of building blocks that should make this possible, like the cluster cache tracker. In a follow-up iteration we can eventually pick one of the controllers implementing this cleanup loop and try to make it an "optional, generic controller" hosted in CAPI and used by all the providers that need it, but I will defer this to when we have a few implementations to look at and can try to figure out what is generic and what instead should be pluggable/extensible. A sketch of such a loop follows this comment. |
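To make that concrete, here is a hedged sketch of such a cleanup loop in Go, under two assumptions that are not in the thread: Nodes are correlated to Machines via `spec.providerID`, and a grace period protects freshly joined Nodes from being deleted before the Machine controller has caught up:

```go
// Package orphancleanup sketches the provider-side loop suggested above.
package orphancleanup

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// nodeGracePeriod is an assumption: skip Nodes younger than this so the
// loop never races the Machine controller, which sets nodeRef and node
// annotations asynchronously after a Node joins.
const nodeGracePeriod = 10 * time.Minute

// cleanupOrphanedNodes lists Machines on the management cluster and
// deletes workload-cluster Nodes whose spec.providerID is claimed by no
// Machine. It would run periodically, e.g. from a provider controller.
func cleanupOrphanedNodes(ctx context.Context, mgmt, workload client.Client, namespace string) error {
	var machines clusterv1.MachineList
	if err := mgmt.List(ctx, &machines, client.InNamespace(namespace)); err != nil {
		return err
	}
	claimed := map[string]bool{}
	for i := range machines.Items {
		if pid := machines.Items[i].Spec.ProviderID; pid != nil {
			claimed[*pid] = true
		}
	}
	var nodes corev1.NodeList
	if err := workload.List(ctx, &nodes); err != nil {
		return err
	}
	for i := range nodes.Items {
		node := &nodes.Items[i]
		if time.Since(node.CreationTimestamp.Time) < nodeGracePeriod {
			continue // too young to judge safely
		}
		if node.Spec.ProviderID == "" || claimed[node.Spec.ProviderID] {
			continue
		}
		// No Machine claims this Node: it is orphaned, delete it.
		if err := workload.Delete(ctx, node); err != nil && !apierrors.IsNotFound(err) {
			return err
		}
	}
	return nil
}
```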
This issue has not been updated in over 1 year, and should be re-triaged. You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted |
/priority important-longterm |
This problem is addressed by CPI controllers; if some infrastructure provider doesn't have a CPI construct, they should think about building at least a subset of it to solve this problem (and potentially many others that require deep knowledge of the infrastructure, e.g. volumes). /close |
@fabriziopandini: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
FYI, https://github.com/kubernetes-sigs/cloud-provider-kind should be a small CPI implementation (but I did not have time to look into it yet) |
What steps did you take and what happened:
This is a bit tricky because it depends on timing. We accidentally stumbled across it in the Metal3 provider because we made a mistake in our e2e tests. The gist is that we didn't wait for a Machine to become Running before it was deleted as part of a change to the MachineDeployment (but the underlying infrastructure was provisioned).
It goes something like this:
1. A Machine is created and the infrastructure provider starts provisioning the underlying infrastructure.
2. The Machine is deleted (in our case through a change to the MachineDeployment) before the Node has joined, so status.nodeRef is never set.
3. The infrastructure finishes provisioning anyway and the Node joins the workload cluster.
4. Since the Machine has no nodeRef, deleting it does not remove the Node, which is left orphaned and stuck.
What did you expect to happen:
The Node should be removed together with the Machine.
Anything else you would like to add:
Logs from CAPI about the relevant Machine/Node when this happens (`grep <machine-name>`):

For comparison, this is what the logs look like for a Machine/Node where this does not happen. Note the message when setting the NodeRef; it is missing in the logs above. There are also some other differences, like draining, but I guess that could be because no draining is needed when the Node never got ready.
Environment:
- Kubernetes version (use `kubectl version`): v1.23.8
- OS (e.g. from `/etc/os-release`): CentOS Stream 9

/kind bug