# Machine deletion process

Machine deletions occur in various cases, for example:
* Control plane (e.g. KCP) or MachineDeployment rollouts
* Machine remediations
* Scale-downs of MachineDeployments

This page describes how Cluster API deletes Machines.

Machine deletion can be broken down into the following phases:
1. Machine deletion is triggered (i.e. the `metadata.deletionTimestamp` is set)
2. Machine controller waits until all pre-drain hooks have succeeded, if any are registered
   * Pre-drain hooks can be registered by adding annotations with the `pre-drain.delete.hook.machine.cluster.x-k8s.io` prefix to the Machine object (see the example after this list)
3. Machine controller checks if the Machine should be drained, drain is skipped if:
   * The Machine has the `machine.cluster.x-k8s.io/exclude-node-draining` annotation
   * The `Machine.spec.nodeDrainTimeout` field is set and has already expired (unset or `0` means no timeout)
4. If the Machine should be drained, the Machine controller evicts all relevant Pods from the Node (see details in [Node drain](#node-drain))
5. Machine controller checks if we should wait until all volumes are detached, this is skipped if:
   * The Machine has the `machine.cluster.x-k8s.io/exclude-wait-for-node-volume-detach` annotation
   * The `Machine.spec.nodeVolumeDetachTimeout` field is set and has already expired (unset or `0` means no timeout)
6. If we should wait for volume detach, the Machine controller waits until `Node.status.volumesAttached` is empty
   * Typically the volumes are detached by CSI after the corresponding Pods have been evicted during drain
7. Machine controller waits until all pre-terminate hooks have succeeded, if any are registered
   * Pre-terminate hooks can be registered by adding annotations with the `pre-terminate.delete.hook.machine.cluster.x-k8s.io` prefix to the Machine object
8. Machine controller deletes the `InfrastructureMachine` object (e.g. `DockerMachine`) of the Machine and waits until it is gone
9. Machine controller deletes the `BootstrapConfig` object (e.g. `KubeadmConfig`) of the Machine and waits until it is gone
10. Machine controller deletes the Node object in the workload cluster
    * Node deletion is retried until either the Node object is gone or `Machine.spec.nodeDeletionTimeout` has expired (`0` means no timeout, but the field defaults to 10s)

Note: There are cases where Node drain, waiting for volume detach and Node deletion are skipped. For these please take a look at the
implementation of the [`isDeleteNodeAllowed` function](https://github.com/kubernetes-sigs/cluster-api/blob/v1.8.0/internal/controllers/machine/machine_controller.go#L346).
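
For illustration, here is a minimal sketch of a Machine that registers both hook types and sets the deletion-related timeouts. Only the annotation prefixes, the skip annotations and the `spec` fields mentioned above are defined by Cluster API; the hook name/value suffixes and the concrete timeout values are made up:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  name: my-machine                  # hypothetical name
  annotations:
    # Hypothetical hook names and values; only the prefixes are prescribed by Cluster API.
    pre-drain.delete.hook.machine.cluster.x-k8s.io/my-hook: my-controller
    pre-terminate.delete.hook.machine.cluster.x-k8s.io/my-hook: my-controller
    # Uncomment to skip Node drain and/or waiting for volume detach entirely.
    # machine.cluster.x-k8s.io/exclude-node-draining: ""
    # machine.cluster.x-k8s.io/exclude-wait-for-node-volume-detach: ""
spec:
  ...
  nodeDrainTimeout: 10m             # unset or 0 means no timeout
  nodeVolumeDetachTimeout: 5m       # unset or 0 means no timeout
  nodeDeletionTimeout: 10s          # 0 means no timeout; defaults to 10s
```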

## Node drain

This section describes details of the Node drain process.

Node drain can be broken down into the following phases:
* Node is cordoned (i.e. the `Node.spec.unschedulable` field is set, which leads to the `node.kubernetes.io/unschedulable:NoSchedule` taint being added to the Node)
  * This prevents Pods that have already been evicted from being rescheduled to the same Node. Please only tolerate this taint
    if you know what you are doing! Otherwise the Machine controller can get stuck continuously evicting the same Pods.
* Machine controller calculates the list of Pods that should be evicted. These are all Pods on the Node, except:
  * Pods belonging to an existing DaemonSet (orphaned DaemonSet Pods have to be evicted as well)
  * Mirror Pods, i.e. Pods with the `kubernetes.io/config.mirror` annotation (usually static Pods managed by kubelet, like `kube-apiserver`)
* If there are no (more) Pods that have to be evicted and all Pods that have been evicted are gone, Node drain is completed
* Otherwise an eviction will be triggered for all Pods that have to be evicted. There are various reasons why an eviction call could fail:
  * The eviction would violate a PodDisruptionBudget, i.e. not enough Pod replicas would be available if the Pod were evicted (see the example after this list)
  * The namespace is terminating; in this case the `kube-controller-manager` is responsible for setting the `.metadata.deletionTimestamp` on the Pod
  * Other errors, e.g. a connection issue when calling the eviction API of the workload cluster
* Please note that when an eviction goes through, this only means that the `.metadata.deletionTimestamp` is set on the Pod, but the
  Pod also has to be terminated and the Pod object has to go away for the drain to complete.
* These steps are repeated every 20s until all relevant Pods have been drained from the Node

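
The PodDisruptionBudget case is worth a closer look. As a minimal sketch (names and numbers made up to match the example condition further below), a PDB like the following causes eviction calls to be rejected as soon as evicting another Pod would drop the number of available replicas below `minAvailable`, and the drain only makes progress once replacement Pods become available elsewhere or the PDB is relaxed:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx                 # hypothetical name
  namespace: test-namespace   # hypothetical namespace
spec:
  minAvailable: 10            # evictions are rejected if fewer than 10 Pods would remain available
  selector:
    matchLabels:
      app: nginx              # hypothetical label of the workload's Pods
```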
Special cases:
* If the Node doesn't exist anymore, Node drain is entirely skipped
* If the Node is `unreachable` (i.e. the Node `Ready` condition is in status `Unknown`, see the example below):
  * Pods with a `.metadata.deletionTimestamp` more than 1s in the past are ignored
  * Pod evictions will use 1s `GracePeriodSeconds`, i.e. the `terminationGracePeriodSeconds` field from the Pod spec will be ignored.
  * Note: PodDisruptionBudgets are still respected, because both of these changes are only relevant if the call to trigger the Pod eviction goes through.
    But Pod eviction calls are rejected when PodDisruptionBudgets would be violated by the eviction.

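
For reference, an unreachable Node typically looks like the following shortened sketch (Node name, reason and message are illustrative); the `Ready` condition with status `Unknown` is what triggers the special handling above:

```yaml
apiVersion: v1
kind: Node
metadata:
  name: my-cluster-md-0-wxtcg-mtg57-k9qvz   # hypothetical name
spec:
  taints:
  # typically added by the Kubernetes node lifecycle controller once the kubelet stops reporting
  - key: node.kubernetes.io/unreachable
    effect: NoExecute
  - key: node.kubernetes.io/unreachable
    effect: NoSchedule
status:
  conditions:
  - type: Ready
    status: "Unknown"
    reason: NodeStatusUnknown
    message: Kubelet stopped posting node status.
```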
### Observability

The drain process can be observed through the `DrainingSucceeded` condition on the Machine and various logs.

**Example condition**

To determine which Pods are blocking the drain and why, you can take a look at the `DrainingSucceeded` condition on the Machine, e.g.:
```yaml
status:
  ...
  conditions:
  ...
  - lastTransitionTime: "2024-08-30T13:36:27Z"
    message: |-
      Drain not completed yet:
      * Pods with deletionTimestamp that still exist: cert-manager/cert-manager-756d54fb98-hcb6k
      * Pods with eviction failed:
        * Cannot evict pod as it would violate the pod's disruption budget. The disruption budget nginx needs 10 healthy pods and has 10 currently: test-namespace/nginx-deployment-6886c85ff7-2jtqm, test-namespace/nginx-deployment-6886c85ff7-7ggsd, test-namespace/nginx-deployment-6886c85ff7-f6z4s, ... (7 more)
    reason: Draining
    severity: Info
    status: "False"
    type: DrainingSucceeded
```

**Example logs**

When cordoning the Node:
```text
I0830 12:50:13.961156 17 machine_controller.go:716] "Cordoning Node" ... Node="my-cluster-md-0-wxtcg-mtg57-k9qvz"
```

When starting the drain:
```text
I0830 12:50:13.961156 17 machine_controller.go:716] "Draining Node" ... Node="my-cluster-md-0-wxtcg-mtg57-k9qvz"
```

Immediately before Pods are evicted:
```text
I0830 12:52:58.739093 17 drain.go:172] "Drain not completed yet, there are still Pods on the Node that have to be drained" ... Node="my-cluster-md-0-wxtcg-mtg57-ssfg8" podsToTriggerEviction="test-namespace/nginx-deployment-6886c85ff7-4r297, test-namespace/nginx-deployment-6886c85ff7-5gl2h, test-namespace/nginx-deployment-6886c85ff7-64tf9, test-namespace/nginx-deployment-6886c85ff7-9k5gp, test-namespace/nginx-deployment-6886c85ff7-9mdjw, ... (5 more)" podsWithDeletionTimestamp="kube-system/calico-kube-controllers-7dc5458bc6-rdjj4, kube-system/coredns-7db6d8ff4d-9cbhn"
```

On log level 4 it is possible to observe details of the Pod evictions, e.g.:
```text
I0830 13:29:56.211951 17 drain.go:224] "Evicting Pod" ... Node="my-cluster-2-md-0-wxtcg-mtg57-24lvh" Pod="test-namespace/nginx-deployment-6886c85ff7-77fpw"
I0830 13:29:56.211951 17 drain.go:229] "Pod eviction successfully triggered" ... Node="my-cluster-2-md-0-wxtcg-mtg57-24lvh" Pod="test-namespace/nginx-deployment-6886c85ff7-77fpw"
```

After Pods have been evicted, either the drain is directly completed:
```text
I0830 13:29:56.235398 17 machine_controller.go:727] "Drain completed, remaining Pods on the Node have been evicted" ... Node="my-cluster-2-md-0-wxtcg-mtg57-24lvh"
```

or we are requeuing:
```text
I0830 13:29:56.235398 17 machine_controller.go:736] "Drain not completed yet, requeuing in 20s" ... Node="my-cluster-2-md-0-wxtcg-mtg57-24lvh" podsFailedEviction="test-namespace/nginx-deployment-6886c85ff7-77fpw, test-namespace/nginx-deployment-6886c85ff7-8dq4q, test-namespace/nginx-deployment-6886c85ff7-8gjhf, test-namespace/nginx-deployment-6886c85ff7-jznjw, test-namespace/nginx-deployment-6886c85ff7-l5nj8, ... (5 more)" podsWithDeletionTimestamp="kube-system/calico-kube-controllers-7dc5458bc6-rdjj4, kube-system/coredns-7db6d8ff4d-9cbhn"
```

Eventually the Machine controller should log:
```text
I0830 13:29:56.235398 17 machine_controller.go:702] "Drain completed" ... Node="my-cluster-2-md-0-wxtcg-mtg57-24lvh"
```

If this doesn't happen, please take a closer look at the logs to determine which Pods still have to be evicted or haven't gone away yet
(i.e. the deletionTimestamp is set but the Pod objects still exist).

### Related documentation

For more information, please see:
* [Disruptions: Pod disruption budgets](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-budgets)
* [Specifying a Disruption Budget for your Application](https://kubernetes.io/docs/tasks/run-application/configure-pdb/)
* [API-initiated eviction](https://kubernetes.io/docs/concepts/scheduling-eviction/api-eviction/)