
Node should be deleted after infraMachine is gone #2565


Closed
enxebre opened this issue Mar 6, 2020 · 8 comments · Fixed by #2570

Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor.

Comments

@enxebre
Member

enxebre commented Mar 6, 2020

What steps did you take and what happened:
Delete a machine

What did you expect to happen:
Currently, when we signal a machine for deletion, we respect pod safety by honouring PDBs (draining), then we delete the node, and then we signal the InfraMachine for deletion.

    // Drain node before deletion
    if _, exists := m.ObjectMeta.Annotations[clusterv1.ExcludeNodeDrainingAnnotation]; !exists {
        logger.Info("Draining node", "node", m.Status.NodeRef.Name)
        if err := r.drainNode(ctx, cluster, m.Status.NodeRef.Name, m.Name); err != nil {
            r.recorder.Eventf(m, corev1.EventTypeWarning, "FailedDrainNode", "error draining Machine's node %q: %v", m.Status.NodeRef.Name, err)
            return ctrl.Result{}, err
        }
        r.recorder.Eventf(m, corev1.EventTypeNormal, "SuccessfulDrainNode", "success draining Machine's node %q", m.Status.NodeRef.Name)
    }

    logger.Info("Deleting node", "node", m.Status.NodeRef.Name)
    var deleteNodeErr error
    waitErr := wait.PollImmediate(2*time.Second, 10*time.Second, func() (bool, error) {
        if deleteNodeErr = r.deleteNode(ctx, cluster, m.Status.NodeRef.Name); deleteNodeErr != nil && !apierrors.IsNotFound(deleteNodeErr) {
            return false, nil
        }
        return true, nil
    })
    if waitErr != nil {
        logger.Error(deleteNodeErr, "Timed out deleting node, moving on", "node", m.Status.NodeRef.Name)
        r.recorder.Eventf(m, corev1.EventTypeWarning, "FailedDeleteNode", "error deleting Machine's node: %v", deleteNodeErr)
    }
}

if ok, err := r.reconcileDeleteExternal(ctx, m); !ok || err != nil {
    // Return early and don't remove the finalizer if we got an error or
    // the external reconciliation deletion isn't ready.
    return ctrl.Result{}, err
}

Node deletion is one of the possible mechanisms to force a stateful pod to be deleted.
https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/#delete-pods
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/pod-safety.md#pod-safety-consistency-guarantees-and-storage-implications

In the edge case where we delete an unreachable node (thereby removing its pods from etcd) while a stateful process is still running on the underlying host, we would immediately free up the pod name in the apiserver. "This would let the StatefulSet controller create a replacement Pod with that same identity; this can lead to the duplication of a still-running Pod, and if said Pod can still communicate with the other members of the StatefulSet, will violate the at most one semantics that StatefulSet is designed to guarantee."

To mitigate this, it would be safer to delete the node only after the owned infraMachine is gone. This gives stronger guarantees that the underlying host has been destroyed, so there is no chance of a "leaked" stateful process still running.
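A rough sketch of the proposed ordering, reusing the helpers from the snippet above (r.drainNode, r.reconcileDeleteExternal, r.deleteNode) and eliding the NodeRef guard and event recording for brevity; this is only an illustration of the idea, not necessarily the exact change that lands for this issue:

    // Drain first, while the Node still exists, to honour PDBs.
    if _, exists := m.ObjectMeta.Annotations[clusterv1.ExcludeNodeDrainingAnnotation]; !exists {
        if err := r.drainNode(ctx, cluster, m.Status.NodeRef.Name, m.Name); err != nil {
            return ctrl.Result{}, err
        }
    }

    // Signal the InfraMachine for deletion and requeue until it is actually gone,
    // keeping the Machine finalizer and leaving the Node untouched in the meantime.
    if ok, err := r.reconcileDeleteExternal(ctx, m); !ok || err != nil {
        return ctrl.Result{}, err
    }

    // The InfraMachine is gone, so the underlying host has been destroyed and no
    // "leaked" stateful process can still be running; it is now safe to delete the Node.
    if deleteNodeErr := r.deleteNode(ctx, cluster, m.Status.NodeRef.Name); deleteNodeErr != nil && !apierrors.IsNotFound(deleteNodeErr) {
        return ctrl.Result{}, deleteNodeErr
    }

    // Finalizer removal would follow here, as it does today.

The only substantive change is ordering: deleteNode moves after reconcileDeleteExternal has reported the external InfraMachine object as fully removed.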

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

  • Cluster-api version:
  • Minikube/KIND version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

/kind bug

@k8s-ci-robot added the kind/bug label Mar 6, 2020
@enxebre
Member Author

enxebre commented Mar 6, 2020

cc @JoelSpeed @detiber @ncdc

@ncdc
Contributor

ncdc commented Mar 6, 2020

👍 This has been in my head for a while, but I never got around to filing it. Thank you! This will also speed up deletion, because right now, when we remove the Node from the apiserver, the kubelet is still running and trying to update its .status, which fails. I think this causes the shutdown of the kubelet process to take longer than it otherwise would (definitely noticeable delays).

@JoelSpeed
Contributor

JoelSpeed commented Mar 6, 2020

Should the machine controller be deleting the Node at all? I thought this was the responsibility of the K8s cloud controller manager?

From the K8s docs on cloud-controller-manager

node controller - responsible for updating kubernetes nodes using cloud APIs and deleting kubernetes nodes that were deleted on your cloud.

My expectation of how this worked would have been:

  • Machine is deleted by something
  • Machine controller cordons/drains node
  • Machine controller deletes InfrastructureMachine
  • InfrastructureMachine controller sends request to cloud provider to terminate instance
  • Instance shuts down
  • Cloud controller manager removes the Node from the cluster

I think the important part of this is that the Node is removed after the kubelet has stopped, as @ncdc suggested. Is there a reason we can't rely on the cloud controller manager for this and have to do it ourselves? (I'm guessing it has something to do with there being many providers, and not all of them implementing the node deletion part of the CCM?)

EDIT: I see this was added in #809 to ensure that nodes are cleaned up if there is no CCM

@enxebre
Member Author

enxebre commented Mar 6, 2020

I see the cloud controller manager as something orthogonal here. We want to provide the same provider-agnostic CAPI semantics/behaviour/UX, and not every environment is guaranteed to have a cloud controller manager, nor to delete the node, e.g. libvirt, baremetal...
Plus, we are the authoritative source voluntarily destroying the underlying infra, so I see no point in keeping the node.

@detiber
Member

detiber commented Mar 6, 2020

Just to echo (and expand a bit on) some of the points above regarding the reasons we added the Node deletion. It was indeed to remain consistent in cases where a provider may not have an associated CCM, but also to provide consistency in cases where the CCMs for different providers exhibit different behaviours. I believe maintaining consistency here is important.

We can definitely talk about trying to get to a point in the future where a CCM is required for a given infrastructure provider; however, we would also have to make sure that there is consistent behaviour between them as well.

@enxebre
Member Author

enxebre commented Mar 6, 2020

/assign @enxebre

enxebre added a commit to enxebre/cluster-api that referenced this issue Mar 6, 2020
@vincepri
Member

vincepri commented Mar 6, 2020

/milestone v0.3.x

@k8s-ci-robot added this to the v0.3.x milestone Mar 6, 2020
@ncdc
Contributor

ncdc commented Mar 6, 2020

/lifecycle active

@k8s-ci-robot added the lifecycle/active label Mar 6, 2020
enxebre added commits to enxebre/cluster-api that referenced this issue Mar 9, 2020