
Node should be deleted after infraMachine is gone #2565


Closed
enxebre opened this issue Mar 6, 2020 · 8 comments · Fixed by #2570

Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor.

Comments

@enxebre
Member

enxebre commented Mar 6, 2020

What steps did you take and what happened:
Delete a machine

What did you expect to happen:
Currently, when we signal a machine for deletion, we respect pod safety by honouring PDBs (draining), then we delete the node, and then we signal the InfraMachine for deletion.

    // Drain node before deletion
    if _, exists := m.ObjectMeta.Annotations[clusterv1.ExcludeNodeDrainingAnnotation]; !exists {
        logger.Info("Draining node", "node", m.Status.NodeRef.Name)
        if err := r.drainNode(ctx, cluster, m.Status.NodeRef.Name, m.Name); err != nil {
            r.recorder.Eventf(m, corev1.EventTypeWarning, "FailedDrainNode", "error draining Machine's node %q: %v", m.Status.NodeRef.Name, err)
            return ctrl.Result{}, err
        }
        r.recorder.Eventf(m, corev1.EventTypeNormal, "SuccessfulDrainNode", "success draining Machine's node %q", m.Status.NodeRef.Name)
    }

    logger.Info("Deleting node", "node", m.Status.NodeRef.Name)
    var deleteNodeErr error
    waitErr := wait.PollImmediate(2*time.Second, 10*time.Second, func() (bool, error) {
        if deleteNodeErr = r.deleteNode(ctx, cluster, m.Status.NodeRef.Name); deleteNodeErr != nil && !apierrors.IsNotFound(deleteNodeErr) {
            return false, nil
        }
        return true, nil
    })
    if waitErr != nil {
        logger.Error(deleteNodeErr, "Timed out deleting node, moving on", "node", m.Status.NodeRef.Name)
        r.recorder.Eventf(m, corev1.EventTypeWarning, "FailedDeleteNode", "error deleting Machine's node: %v", deleteNodeErr)
    }
}

if ok, err := r.reconcileDeleteExternal(ctx, m); !ok || err != nil {
    // Return early and don't remove the finalizer if we got an error or
    // the external reconciliation deletion isn't ready.
    return ctrl.Result{}, err
}

Node deletion is one of the possible mechanisms to force a stateful pod to be deleted.
https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/#delete-pods
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/storage/pod-safety.md#pod-safety-consistency-guarantees-and-storage-implications

In the edge case where we delete an unreachable node (thereby removing its pods from etcd) while a stateful process is still running on the underlying host, we would immediately free up the pod name in the apiserver. "This would let the StatefulSet controller create a replacement Pod with that same identity; this can lead to the duplication of a still-running Pod, and if said Pod can still communicate with the other members of the StatefulSet, will violate the at most one semantics that StatefulSet is designed to guarantee."

To mitigate this, it would be safer to delete the node only after the owned infraMachine is gone. This gives stronger guarantees that the underlying host has been destroyed, so there is no chance of a "leaked" stateful process still running.
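A rough sketch of the proposed ordering, reusing the helpers from the snippet above (r.drainNode, r.reconcileDeleteExternal, r.deleteNode) and eliding the NodeRef guard and event recording for brevity; this is only an illustration of the idea, not necessarily the exact change that lands for this issue:

    // Drain first, while the Node still exists, to honour PDBs.
    if _, exists := m.ObjectMeta.Annotations[clusterv1.ExcludeNodeDrainingAnnotation]; !exists {
        if err := r.drainNode(ctx, cluster, m.Status.NodeRef.Name, m.Name); err != nil {
            return ctrl.Result{}, err
        }
    }

    // Signal the InfraMachine for deletion and requeue until it is actually gone,
    // keeping the Machine finalizer and leaving the Node untouched in the meantime.
    if ok, err := r.reconcileDeleteExternal(ctx, m); !ok || err != nil {
        return ctrl.Result{}, err
    }

    // The InfraMachine is gone, so the underlying host has been destroyed and no
    // "leaked" stateful process can still be running; it is now safe to delete the Node.
    if deleteNodeErr := r.deleteNode(ctx, cluster, m.Status.NodeRef.Name); deleteNodeErr != nil && !apierrors.IsNotFound(deleteNodeErr) {
        return ctrl.Result{}, deleteNodeErr
    }

    // Finalizer removal would follow here, as it does today.

The only substantive change is ordering: deleteNode moves after reconcileDeleteExternal has reported the external InfraMachine object as fully removed.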

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

  • Cluster-api version:
  • Minikube/KIND version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):

/kind bug

@k8s-ci-robot added the kind/bug label Mar 6, 2020
@enxebre
Member Author

enxebre commented Mar 6, 2020

cc @JoelSpeed @detiber @ncdc

@ncdc
Contributor

ncdc commented Mar 6, 2020

👍 This has been in my head for a while, but I never got around to filing it. Thank you! This will also speed up deletion, because right now, when we remove the Node from the apiserver, the kubelet is still running and trying to update its .status, which fails. I think this causes the shutdown of the kubelet process to take longer than it otherwise would (definitely noticeable delays).

@JoelSpeed
Contributor

JoelSpeed commented Mar 6, 2020

Should the machine controller be deleting the Node at all? I thought this was the responsibility of the K8s cloud controller manager?

From the K8s docs on cloud-controller-manager

node controller - responsible for updating kubernetes nodes using cloud APIs and deleting kubernetes nodes that were deleted on your cloud.

My expectation of how this worked would have been:

  • Machine is deleted by something
  • Machine controller cordons/drains node
  • Machine controller deletes InfrastructureMachine
  • InfrastructureMachine controller sends request to cloud provider to terminate instance
  • Instance shuts down
  • Cloud controller manager removes the Node from the cluster

I think the important part of this is that the Node is removed after the kubelet has stopped, as @ncdc suggested. Is there a reason we can't rely on the cloud controller manager for this and have to do it ourselves? (I'm guessing it has something to do with there being many providers, and not all of them implementing the node deletion part of the CCM?)

EDIT: I see this was added in #809 to ensure that nodes are cleaned up if there is no CCM

@enxebre
Member Author

enxebre commented Mar 6, 2020

I see the cloud controller manager as something orthogonal here. We want to provide the same provider-agnostic CAPI semantics/behaviour/UX, and not every environment is guaranteed to have a cloud controller manager, nor to delete the node, e.g. libvirt, baremetal...
Plus, we are the authoritative source voluntarily destroying the underlying infra, so I see no point in keeping the node.

@detiber
Member

detiber commented Mar 6, 2020

Just to echo (and expand a bit on) some of the points above regarding the reasons we added the Node deletion. It was indeed to remain consistent in cases where a provider may not have an associated CCM, but also to provide consistency in cases where the CCMs for different providers exhibit different behaviours. I believe maintaining consistency here is important.

We can definitely talk about trying to get to a point in the future where a CCM is required for a given infrastructure provider; however, we would also have to make sure that there is consistent behaviour between them as well.

@enxebre
Member Author

enxebre commented Mar 6, 2020

/assign @enxebre

enxebre added a commit to enxebre/cluster-api that referenced this issue Mar 6, 2020
@vincepri
Member

vincepri commented Mar 6, 2020

/milestone v0.3.x

@k8s-ci-robot added this to the v0.3.x milestone Mar 6, 2020
@ncdc
Contributor

ncdc commented Mar 6, 2020

/lifecycle active

@k8s-ci-robot added the lifecycle/active label Mar 6, 2020
enxebre added commits to enxebre/cluster-api that referenced this issue Mar 9, 2020