
Commit 06fb511

Add Node drain documentation
Signed-off-by: Stefan Büringer [email protected]
1 parent 3232abc commit 06fb511


5 files changed: +140 −1 lines changed

docs/book/src/SUMMARY.md

Lines changed: 1 addition & 0 deletions
@@ -27,6 +27,7 @@
 - [Scaling](./tasks/automated-machine-management/scaling.md)
 - [Autoscaling](./tasks/automated-machine-management/autoscaling.md)
 - [Healthchecking](./tasks/automated-machine-management/healthchecking.md)
+- [Machine deletion process](./tasks/automated-machine-management/machine_deletions.md)
 - [Experimental Features](./tasks/experimental-features/experimental-features.md)
 - [MachinePools](./tasks/experimental-features/machine-pools.md)
 - [MachineSetPreflightChecks](./tasks/experimental-features/machineset-preflight-checks.md)

docs/book/src/tasks/automated-machine-management/index.md

Lines changed: 1 addition & 0 deletions
@@ -5,3 +5,4 @@ This section details some tasks related to automated Machine management.
 - [Scaling](./scaling.md)
 - [Autoscaling](./autoscaling.md)
 - [Healthchecking](./healthchecking.md)
+- [Machine deletion process](./machine_deletions.md)

docs/book/src/tasks/automated-machine-management/machine_deletions.md

Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
# Machine deletion process

Machine deletions occur in various cases, for example:
* Control plane (e.g. KCP) or MachineDeployment rollouts
* Machine remediations
* Scale downs of MachineDeployments

This page describes how Cluster API deletes Machines.

Machine deletion can be broken down into the following phases:
1. Machine deletion is triggered (i.e. the `metadata.deletionTimestamp` is set)
2. Machine controller waits until all pre-drain hooks have succeeded, if any are registered
    * Pre-drain hooks can be registered by adding annotations with the `pre-drain.delete.hook.machine.cluster.x-k8s.io` prefix to the Machine object (see the example Machine manifest below)
3. Machine controller checks if the Machine should be drained; drain is skipped if:
    * The Machine has the `machine.cluster.x-k8s.io/exclude-node-draining` annotation
    * The `Machine.spec.nodeDrainTimeout` field is set and has already expired (unset or `0` means no timeout)
4. If the Machine should be drained, the Machine controller evicts all relevant Pods from the Node (see details in [Node drain](#node-drain))
5. Machine controller checks if we should wait until all volumes are detached; this is skipped if:
    * The Machine has the `machine.cluster.x-k8s.io/exclude-wait-for-node-volume-detach` annotation
    * The `Machine.spec.nodeVolumeDetachTimeout` field is set and has already expired (unset or `0` means no timeout)
6. If we should wait for volume detach, the Machine controller waits until `Node.status.volumesAttached` is empty
    * Typically the volumes are detached by CSI after the corresponding Pods have been evicted during drain
7. Machine controller waits until all pre-terminate hooks have succeeded, if any are registered
    * Pre-terminate hooks can be registered by adding annotations with the `pre-terminate.delete.hook.machine.cluster.x-k8s.io` prefix to the Machine object
8. Machine controller deletes the `InfrastructureMachine` object (e.g. `DockerMachine`) of the Machine and waits until it is gone
9. Machine controller deletes the `BootstrapConfig` object (e.g. `KubeadmConfig`) of the Machine and waits until it is gone
10. Machine controller deletes the Node object in the workload cluster
    * Node deletion will be retried until either the Node object is gone or `Machine.spec.nodeDeletionTimeout` has expired (`0` means no timeout, but the field defaults to 10s)

Note: There are cases where Node drain, waiting for volume detach and Node deletion are skipped entirely. For these, please take a look at the
implementation of the [`isDeleteNodeAllowed` function](https://github.com/kubernetes-sigs/cluster-api/blob/v1.8.0/internal/controllers/machine/machine_controller.go#L346).
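
For orientation, the following is a hypothetical, partial Machine manifest sketching where the hooks, skip annotations and timeouts mentioned above are set. The Machine name, namespace, hook suffix (`/my-controller`) and timeout values are made up for illustration; only the annotation keys/prefixes and field names come from the steps above.
```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  name: my-machine               # hypothetical name
  namespace: default
  annotations:
    # Pre-drain hook: deletion blocks before Node drain until the hook owner removes this annotation.
    pre-drain.delete.hook.machine.cluster.x-k8s.io/my-controller: ""
    # Pre-terminate hook: deletion blocks before the InfrastructureMachine is deleted until this annotation is removed.
    pre-terminate.delete.hook.machine.cluster.x-k8s.io/my-controller: ""
    # Uncomment to skip Node drain entirely:
    # machine.cluster.x-k8s.io/exclude-node-draining: ""
    # Uncomment to skip waiting for volume detach:
    # machine.cluster.x-k8s.io/exclude-wait-for-node-volume-detach: ""
spec:
  # Illustrative values; unset or 0 means no timeout (nodeDeletionTimeout defaults to 10s).
  nodeDrainTimeout: 10m
  nodeVolumeDetachTimeout: 5m
  nodeDeletionTimeout: 10s
  # Other required Machine fields (clusterName, bootstrap, infrastructureRef, ...) omitted for brevity.
```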

## Node drain

This section describes details of the Node drain process in Cluster API. Cluster API implements Node drain aligned
with `kubectl drain`. One major difference is that the Cluster API controller does not actively wait during `Reconcile`
until all Pods are drained from the Node. Instead, it continuously evicts Pods and requeues after 20s until all relevant
Pods have been drained from the Node or until the `Machine.spec.nodeDrainTimeout` is reached (if configured).

Node drain can be broken down into the following phases:
* Node is cordoned (i.e. the `Node.spec.unschedulable` field is set, which leads to the `node.kubernetes.io/unschedulable:NoSchedule` taint being added to the Node)
  * This prevents Pods that have already been evicted from being rescheduled to the same Node. Please only tolerate this taint
    if you know what you are doing! Otherwise the Machine controller can get stuck continuously evicting the same Pods.
* Machine controller calculates the list of Pods that should be evicted. These are all Pods on the Node, except:
  * Pods belonging to an existing DaemonSet (orphaned DaemonSet Pods have to be evicted as well)
  * Mirror Pods, i.e. Pods with the `kubernetes.io/config.mirror` annotation (usually static Pods managed by kubelet, like `kube-apiserver`)
* If there are no (more) Pods that have to be evicted and all Pods that have been evicted are gone, Node drain is completed
* Otherwise an eviction is triggered for all Pods that have to be evicted. There are various reasons why an eviction call can fail:
  * The eviction would violate a PodDisruptionBudget, i.e. not enough Pod replicas would be available if the Pod were evicted (see the example PodDisruptionBudget after this list)
  * The namespace is terminating; in this case the `kube-controller-manager` is responsible for setting the `.metadata.deletionTimestamp` on the Pod
  * Other errors, e.g. a connection issue when calling the eviction API of the workload cluster
* Please note that when an eviction goes through, this only means that the `.metadata.deletionTimestamp` is set on the Pod, but the
  Pod also has to be terminated and the Pod object has to go away for the drain to complete.
* These steps are repeated every 20s until all relevant Pods have been drained from the Node
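
As an illustration of the PodDisruptionBudget case mentioned above, a budget roughly matching the `nginx` example that shows up in the condition and logs below could look like this. The name, namespace, selector and `minAvailable` value are made up for illustration.
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx                  # matches the budget name shown in the example condition below
  namespace: test-namespace
spec:
  # With only 10 replicas healthy and minAvailable: 10, every eviction would violate the budget,
  # so the Machine controller keeps retrying and requeues every 20s.
  minAvailable: 10
  selector:
    matchLabels:
      app: nginx               # hypothetical label of the nginx Deployment's Pods
```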

Special cases:
* If the Node doesn't exist anymore, Node drain is entirely skipped
* If the Node is `unreachable` (i.e. the Node `Ready` condition is in status `Unknown`):
  * Pods with `.metadata.deletionTimestamp` more than 1s in the past are ignored
  * Pod evictions will use 1s `GracePeriodSeconds`, i.e. the `terminationGracePeriodSeconds` field from the Pod spec will be ignored (see the eviction sketch after this list)
  * Note: PodDisruptionBudgets are still respected, because both of these changes are only relevant if the call to trigger the Pod eviction goes through.
    But Pod eviction calls are rejected when PodDisruptionBudgets would be violated by the eviction.
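
To make the unreachable-Node behavior concrete, such an eviction roughly corresponds to an `Eviction` object like the following sketch; the Pod name and namespace are reused from the example logs below, and this is not a verbatim dump of the request the controller sends.
```yaml
apiVersion: policy/v1
kind: Eviction
metadata:
  name: nginx-deployment-6886c85ff7-77fpw   # Pod to evict (name taken from the example logs below)
  namespace: test-namespace
deleteOptions:
  gracePeriodSeconds: 1                     # overrides the Pod's terminationGracePeriodSeconds for unreachable Nodes
```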

### Observability

The drain process can be observed through the `DrainingSucceeded` condition on the Machine and various logs.

**Example condition**

To determine which Pods are blocking the drain and why, you can take a look at the `DrainingSucceeded` condition on the Machine, e.g.:
```yaml
status:
  ...
  conditions:
  ...
  - lastTransitionTime: "2024-08-30T13:36:27Z"
    message: |-
      Drain not completed yet:
      * Pods with deletionTimestamp that still exist: cert-manager/cert-manager-756d54fb98-hcb6k
      * Pods with eviction failed:
        * Cannot evict pod as it would violate the pod's disruption budget. The disruption budget nginx needs 10 healthy pods and has 10 currently: test-namespace/nginx-deployment-6886c85ff7-2jtqm, test-namespace/nginx-deployment-6886c85ff7-7ggsd, test-namespace/nginx-deployment-6886c85ff7-f6z4s, ... (7 more)
    reason: Draining
    severity: Info
    status: "False"
    type: DrainingSucceeded
```

**Example logs**

When cordoning the Node:
```text
I0830 12:50:13.961156 17 machine_controller.go:716] "Cordoning Node" ... Node="my-cluster-md-0-wxtcg-mtg57-k9qvz"
```

When starting the drain:
```text
I0830 12:50:13.961156 17 machine_controller.go:716] "Draining Node" ... Node="my-cluster-md-0-wxtcg-mtg57-k9qvz"
```

Immediately before Pods are evicted:
```text
I0830 12:52:58.739093 17 drain.go:172] "Drain not completed yet, there are still Pods on the Node that have to be drained" ... Node="my-cluster-md-0-wxtcg-mtg57-ssfg8" podsToTriggerEviction="test-namespace/nginx-deployment-6886c85ff7-4r297, test-namespace/nginx-deployment-6886c85ff7-5gl2h, test-namespace/nginx-deployment-6886c85ff7-64tf9, test-namespace/nginx-deployment-6886c85ff7-9k5gp, test-namespace/nginx-deployment-6886c85ff7-9mdjw, ... (5 more)" podsWithDeletionTimestamp="kube-system/calico-kube-controllers-7dc5458bc6-rdjj4, kube-system/coredns-7db6d8ff4d-9cbhn"
```

On log level 4 it is possible to observe details of the Pod evictions, e.g.:
```text
I0830 13:29:56.211951 17 drain.go:224] "Evicting Pod" ... Node="my-cluster-2-md-0-wxtcg-mtg57-24lvh" Pod="test-namespace/nginx-deployment-6886c85ff7-77fpw"
I0830 13:29:56.211951 17 drain.go:229] "Pod eviction successfully triggered" ... Node="my-cluster-2-md-0-wxtcg-mtg57-24lvh" Pod="test-namespace/nginx-deployment-6886c85ff7-77fpw"
```

After Pods have been evicted, either the drain is directly completed:
```text
I0830 13:29:56.235398 17 machine_controller.go:727] "Drain completed, remaining Pods on the Node have been evicted" ... Node="my-cluster-2-md-0-wxtcg-mtg57-24lvh"
```

or we are requeuing:
```text
I0830 13:29:56.235398 17 machine_controller.go:736] "Drain not completed yet, requeuing in 20s" ... Node="my-cluster-2-md-0-wxtcg-mtg57-24lvh" podsFailedEviction="test-namespace/nginx-deployment-6886c85ff7-77fpw, test-namespace/nginx-deployment-6886c85ff7-8dq4q, test-namespace/nginx-deployment-6886c85ff7-8gjhf, test-namespace/nginx-deployment-6886c85ff7-jznjw, test-namespace/nginx-deployment-6886c85ff7-l5nj8, ... (5 more)" podsWithDeletionTimestamp="kube-system/calico-kube-controllers-7dc5458bc6-rdjj4, kube-system/coredns-7db6d8ff4d-9cbhn"
```

Eventually the Machine controller should log:
```text
I0830 13:29:56.235398 17 machine_controller.go:702] "Drain completed" ... Node="my-cluster-2-md-0-wxtcg-mtg57-24lvh"
```

If this doesn't happen, please take a closer look at the logs to determine which Pods still have to be evicted or haven't gone away yet
(i.e. deletionTimestamp is set but the Pod objects still exist).

### Related documentation

For more information, please see:
* [Disruptions: Pod disruption budgets](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/#pod-disruption-budgets)
* [Specifying a Disruption Budget for your Application](https://kubernetes.io/docs/tasks/run-application/configure-pdb/)
* [API-initiated eviction](https://kubernetes.io/docs/concepts/scheduling-eviction/api-eviction/)

internal/controllers/machine/drain/drain.go

Lines changed: 3 additions & 0 deletions
@@ -59,6 +59,9 @@ func (d *Helper) CordonNode(ctx context.Context, node *corev1.Node) error {
 		return nil
 	}
 
+	log := ctrl.LoggerFrom(ctx)
+	log.Info("Cordoning Node")
+
 	patch := client.MergeFrom(node.DeepCopy())
 	node.Spec.Unschedulable = true
 	if err := d.Client.Patch(ctx, node, patch); err != nil {

internal/controllers/machine/machine_controller.go

Lines changed: 1 addition & 1 deletion
@@ -702,7 +702,7 @@ func (r *Reconciler) drainNode(ctx context.Context, cluster *clusterv1.Cluster,
 	podsToBeDrained := podDeleteList.Pods()
 
 	if len(podsToBeDrained) == 0 {
-		log.Info("Drain completed, no Pods to drain on the Node")
+		log.Info("Drain completed")
 		return ctrl.Result{}, nil
 	}
