diff --git a/keps/sig-apps/4212-declarative-node-maintenance/README.md b/keps/sig-apps/4212-declarative-node-maintenance/README.md new file mode 100644 index 00000000000..61560c220df --- /dev/null +++ b/keps/sig-apps/4212-declarative-node-maintenance/README.md @@ -0,0 +1,2304 @@ + +# KEP-4212: Declarative Node Maintenance + + + + + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Cluster Autoscaler](#cluster-autoscaler) + - [kubelet](#kubelet) + - [Motivation Summary](#motivation-summary) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories](#user-stories) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Story 3](#story-3) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [Kubectl](#kubectl) + - [NodeMaintenance API](#nodemaintenance-api) + - [NodeMaintenance Admission](#nodemaintenance-admission) + - [NodeMaintenance Controller](#nodemaintenance-controller) + - [Idle](#idle) + - [Finalizers and Deletion of the NodeMaintenance](#finalizers-and-deletion-of-the-nodemaintenance) + - [Cordon](#cordon) + - [Uncordon (Complete)](#uncordon-complete) + - [Drain](#drain) + - [Pod Selection](#pod-selection) + - [Pod Selection and DrainTargets Example](#pod-selection-and-draintargets-example) + - [PodTypes and Label Selectors Progression](#podtypes-and-label-selectors-progression) + - [Status](#status) + - [Supported Stage Transitions](#supported-stage-transitions) + - [DaemonSet Controller](#daemonset-controller) + - [kubelet: Graceful Node Shutdown](#kubelet-graceful-node-shutdown) + - [kubelet: Static Pods](#kubelet-static-pods) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + - [Out-of-tree Implementation](#out-of-tree-implementation) + - [Use a Node Object Instead of Introducing a New NodeMaintenance API](#use-a-node-object-instead-of-introducing-a-new-nodemaintenance-api) + - [Use Taint Based Eviction for Node Maintenance](#use-taint-based-eviction-for-node-maintenance) + - [Names considered for the new API](#names-considered-for-the-new-api) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. 

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
  - [ ] e2e Tests for all Beta API Operations (endpoints)
  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
- [ ] (R) Graduation criteria is in place
  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary

This KEP proposes adding a declarative API to manage node maintenance. This API can be used to
implement additional capabilities around node draining.

## Motivation

The goal of this KEP is to analyze and improve node maintenance in Kubernetes.

Node maintenance is a request from a cluster administrator to remove all pods from one or more
nodes so that they can be disconnected from the cluster to perform a software upgrade (OS, kubelet,
etc.), a hardware or firmware upgrade, or simply to remove the nodes because they are no longer
needed.

Kubernetes currently supports this use case with `kubectl drain` in the following way:
1. There are running pods on node A, some of which are protected with PodDisruptionBudgets (PDB).
2. Set the node `Unschedulable` (cordon) to prevent new pods from being scheduled there.
3. Evict (default behavior) pods from node A by using the eviction API (see the [kubectl drain workflow](https://raw.githubusercontent.com/kubernetes/website/f2ef324ac22e5d9378f2824af463777182817ca6/static/images/docs/kubectl_drain.svg)).
4. Proceed with the maintenance and shut down or restart the node.
5. On platforms and nodes that support it, the kubelet will try to detect the imminent shutdown and
   then attempt to perform a [Graceful Node Shutdown](https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown):
   - delay the shutdown pending graceful termination of remaining pods
   - terminate remaining pods in reverse priority order (see [pod-priority-graceful-node-shutdown](https://kubernetes.io/docs/concepts/architecture/nodes/#pod-priority-graceful-node-shutdown))

The main problem is that the current approach tries to solve this in an application-agnostic way
and will simply attempt to remove all the pods currently running on the node. 
Since this approach
cannot be applied generically to all pods, the Kubernetes project has defined special
[drain filters](https://github.com/kubernetes/kubernetes/blob/56cc5e77a10ba156694309d9b6159d4cd42598e1/staging/src/k8s.io/kubectl/pkg/drain/filters.go#L153-L162)
that either skip groups of pods, or require an admin to consent to those groups being either
skipped or deleted. This means that, without knowledge of all the underlying applications on the
cluster, the admin has to make a potentially harmful decision.

From an application owner or developer perspective, the only standard tool they have is
a PodDisruptionBudget. This is sufficient in a basic scenario with a simple multi-replica
application. The edge-case applications where this does not work are very important to
the cluster admin, as they can block the node drain, and, in turn, very important to the
application owner, as the admin can then override the pod disruption budget and disrupt their
sensitive application anyway.

List of cases where the current solution is not optimal:

1. Without extra manual effort, an application running with a single replica has to settle for
   application downtime during the node drain. It cannot use PDBs with
   `minAvailable: 1` or `maxUnavailable: 0`, or it will block node maintenance. Not every user
   needs high availability either, due to a preference for a simpler deployment model, lack of
   application support for HA, or to minimize compute costs. Also, any automated solution needs
   to edit the PDB to account for the additional pod that needs to be spun up to move the workload
   from one node to another. This has been discussed in the issue [kubernetes/kubernetes#66811](https://github.com/kubernetes/kubernetes/issues/66811)
   and in the issue [kubernetes/kubernetes#114877](https://github.com/kubernetes/kubernetes/issues/114877).
2. Similar to the first point, it is difficult to use PDBs for applications that can have a variable
   number of pods; for example, applications with a configured horizontal pod autoscaler (HPA). These
   applications cannot be disrupted during a low load when they have only one pod. However, it is
   possible to disrupt the pods during a high load of the application (pods > 1) without
   experiencing application downtime. If the minimum number of pods is 1, PDBs cannot be used
   without blocking the node drain. This has been discussed in the issue [kubernetes/kubernetes#93476](https://github.com/kubernetes/kubernetes/issues/93476).
3. Graceful termination of DaemonSet pods is currently only supported on Linux as part of the
   Graceful Node Shutdown feature. The length of the shutdown is again not application-specific and
   is set cluster-wide (optionally by priority) by the cluster admin. This only partially
   [takes into account](https://github.com/kubernetes/kubernetes/blob/a31030543c47aac36cf323b885cfb6d8b0a2435f/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L368-L373)
   `.spec.terminationGracePeriodSeconds` of each pod and may cause premature termination of the
   application. This has been discussed in the issue [kubernetes/kubernetes#75482](https://github.com/kubernetes/kubernetes/issues/75482)
   and in the issue [kubernetes-sigs/cluster-api#6158](https://github.com/kubernetes-sigs/cluster-api/issues/6158).
4. Data corruption can occur if the node is shut down prematurely. 
It would be great if applications could perform data migration and synchronization of
   cached writes to the underlying storage before the pod deletion occurs. This is not easy to
   quantify even with the pod's `.spec.terminationGracePeriodSeconds`, as the time depends on the
   size of the data and the speed of the storage. This has been discussed in the issue [kubernetes/kubernetes#116618](https://github.com/kubernetes/kubernetes/issues/116618)
   and in the issue [kubernetes/kubernetes#115148](https://github.com/kubernetes/kubernetes/issues/115148).
5. During the Graceful Node Shutdown, the kubelet terminates the pods in order of their priority.
   The DaemonSet controller runs its own scheduling logic and creates the pods again. This causes a
   race. Such pods should be removed and not recreated, but higher priority pods that have not yet
   been terminated should be recreated if they are missing. This has been discussed in the issue
   [kubernetes/kubernetes#122912](https://github.com/kubernetes/kubernetes/issues/122912).
6. The Graceful Node Shutdown feature is not always reliable. If Dbus or the kubelet is restarted
   during the shutdown, pods may be ungracefully terminated, leading to application disruption and
   data loss. New applications can get scheduled on such a node, which can also be harmful.
   This has been discussed in issues [kubernetes/kubernetes#122674](https://github.com/kubernetes/kubernetes/issues/122674),
   [kubernetes/kubernetes#120613](https://github.com/kubernetes/kubernetes/issues/120613) and [kubernetes/kubernetes#112443](https://github.com/kubernetes/kubernetes/issues/112443).
7. There is no way to gracefully terminate static pods during a node shutdown
   [kubernetes/kubernetes#122674](https://github.com/kubernetes/kubernetes/issues/122674), and the
   lifecycle/termination is not clearly defined for static pods [kubernetes/kubernetes#16627](https://github.com/kubernetes/kubernetes/issues/16627).
8. Different pod termination mechanisms are not synchronized with each other. So, for example, the
   taint manager may prematurely terminate pods that are currently under Graceful Node Shutdown.
   This can also happen with other mechanisms (e.g., different types of evictions). This has been
   discussed in the issue [kubernetes/kubernetes#124448](https://github.com/kubernetes/kubernetes/issues/124448)
   and in the issue [kubernetes/kubernetes#72129](https://github.com/kubernetes/kubernetes/issues/72129).
9. There is not enough metadata about why the node drain was requested or why the pods are
   terminating. This has been discussed in the issue [kubernetes/kubernetes#30586](https://github.com/kubernetes/kubernetes/issues/30586)
   and in the issue [kubernetes/kubernetes#116965](https://github.com/kubernetes/kubernetes/issues/116965).

Approaches and workarounds used by other projects to deal with these shortcomings:
- https://github.com/medik8s/node-maintenance-operator uses a declarative approach that tries to
  mimic `kubectl drain` (and uses the kubectl implementation under the hood).
- https://github.com/kubereboot/kured performs automatic node reboots and relies on the
  `kubectl drain` implementation to achieve that.
- https://github.com/strimzi/drain-cleaner prevents Kafka or ZooKeeper pods from being drained
  until they are fully synchronized. It is implemented by intercepting eviction requests with a
  validating admission webhook. The synchronization is also protected by a PDB with the
  `.spec.maxUnavailable` field set to 0. See the experience reports for more information.
- https://github.com/kubevirt/kubevirt intercepts eviction requests with a validating admission
  webhook to block eviction and to start a virtual machine live migration from one node to another.
  Normally, the workload is also guarded by a PDB with the `.spec.minAvailable` field set to 1.
  During the migration the value is increased to 2.
- https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler has an eviction process
  that takes inspiration from kubectl and builds additional logic on top of it.
  See [Cluster Autoscaler](#cluster-autoscaler) for more details.
- https://github.com/kubernetes-sigs/karpenter taints the node during the node drain. It then
  attempts to evict all the pods on the node by calling the Eviction API. It prioritizes
  non-critical pods and non-DaemonSet pods.
- https://github.com/aws/aws-node-termination-handler watches for a predefined set of events
  (spot instance termination, EC2 termination, etc.), then cordons and drains the node. It relies
  on the `kubectl` implementation.
- https://github.com/openshift/machine-config-operator updates/drains nodes by using a cordon and
  relies on the `kubectl drain` implementation.
- https://github.com/foriequal0/pod-graceful-drain intercepts eviction/deletion requests to
  gracefully and slowly terminate the pod.

Experience Reports:
- Federico Valeri, [Drain Cleaner: What's this?](https://strimzi.io/blog/2021/09/24/drain-cleaner/), Sep 24, 2021, description
  of the use case and implementation of drain cleaner
- Tommer Amber, [Solution!! Avoid Kubernetes/Openshift Node Drain Failure due to active PodDisruptionBudget](https://medium.com/@tamber/solution-avoid-kubernetes-openshift-node-drain-failure-due-to-active-poddisruptionbudget-df68efed2c4f), Apr 30, 2022 - the user
  is unhappy about the manual intervention required to perform node maintenance and gives the
  unfortunate advice to cluster admins to simply override the PDBs. This can have negative
  consequences for user applications, including data loss. This also discourages the use of PDBs.
  We have also seen an interest in the issue [kubernetes/kubernetes#83307](https://github.com/kubernetes/kubernetes/issues/83307)
  for overriding evictions, which led to the addition of the `--disable-eviction` flag to
  `kubectl drain`. There are other examples of this approach on the web.
- Kevin Reeuwijk, [How to handle blocking PodDisruptionBudgets on K8s with distributed storage](https://www.spectrocloud.com/blog/how-to-handle-blocking-poddisruptionbudgets-on-kubernetes-with-distributed-storage), June 6, 2022 - a simple
  shell script example on how to drain the node in a safer way. It does a normal eviction, then
  looks for a pet application (Rook-Ceph in this case) and does a hard delete if it does not see it.
  This approach is not plagued by the loss of data resiliency, but it does require maintaining a
  list of pet applications, which can be prone to mistakes. In the end, the cluster admin has to do
  the job of the application maintainer.
- Artur Rodrigues, [Impossible Kubernetes node drains](https://www.artur-rodrigues.com/tech/2023/03/30/impossible-kubectl-drains.html), 30 Mar, 2023 - discusses
  the problem with node drains and offers a workaround to restart the application without the
  application owner's consent, but acknowledges that this may be problematic without knowledge
  of the application.
- Jack Roper, [How to Delete Pods from a Kubernetes Node with Examples](https://spacelift.io/blog/kubectl-delete-pod), 05 Jul, 2023 - also
  discusses the problem of blocking PDBs and offers several workarounds. Similar to others, it also
  offers a force deletion, but also a less destructive method of scaling up the application.
  However, this also interferes with application deployment and has to be supported by the
  application.

### Cluster Autoscaler

The Cluster Autoscaler accepts a `drain-priority-config` option, which is similar to Graceful Node
Shutdown in that it gives each priority a shutdown grace period. It also has a
`max-graceful-termination-sec` option for pod termination and a `max-pod-eviction-time` option
after which the eviction is forfeited.

Each pod is first analyzed to see if it is drainable. Part of the logic is similar to kubectl and
its drain filters (see [Cluster Autoscaler rules](https://github.com/kubernetes/autoscaler/blob/554366f979b11aeb82df335a793e4d7a1acfadb4/cluster-autoscaler/simulator/drainability/rules/rules.go#L50-L77)):
- Mirror pods are skipped.
- Terminating pods are skipped.
- Pods and ReplicaSets/ReplicationControllers without owning controllers are blocking by default
  (the check can be modified with the `skip-nodes-with-custom-controller-pods` option).
- System pods (in the `kube-system` namespace) without a matching PDB are blocking by default
  (the check can be modified with the `skip-nodes-with-system-pods` option).
- Pods with the `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` annotation are blocking.
- Pods with local storage are blocking unless they have a
  `cluster-autoscaler.kubernetes.io/safe-to-evict-local-volumes` annotation
  (the check can be modified with the `skip-nodes-with-local-storage` option, e.g., this check
  is skipped on [AKS](https://learn.microsoft.com/en-us/azure/aks/cluster-autoscaler?tabs=azure-cli#cluster-autoscaler-profile-settings)).
- Pods with PDBs that do not have `disruptionsAllowed` are blocking.

This can be enhanced with other rules and overrides.

It uses this logic to first check whether all pods can be removed from a node. If not, it will
report those nodes. Then it will group all pods by priority and evict them gradually from the
lowest to the highest priority. This may include DaemonSet pods.

### kubelet

Graceful Node Shutdown is a part of the current solution for node maintenance. Unfortunately, it is
not possible to rely solely on this feature as a go-to solution for graceful node and workload
termination.

- The Graceful Node Shutdown feature is not application aware and may prematurely disrupt workloads
  and lead to data loss.
- The kubelet controls the shutdown process using Dbus and systemd, and can delay (but not entirely
  block) it using the systemd inhibitor. However, if Dbus or the kubelet is restarted during the
  node shutdown, the shutdown might not be registered again, and pods might be terminated
  ungracefully. Also, new workloads can get scheduled on the node while the node is shutting down.
  Cluster admins should, therefore, plan the maintenance in advance and ensure that pods are
  gracefully removed before attempting to shut down or restart the machine.
- The kubelet has no way of reliably detecting ongoing maintenance if the node is restarted in the
  meantime.
- Graceful termination of static pods during a shutdown is not possible today. It is also not
  currently possible to prevent them from starting back up immediately after the machine has been
  restarted and the kubelet has started again, if the node is still under maintenance.

### Motivation Summary

To sum up: some applications solve the disruption problem by introducing validating admission
webhooks. This has some drawbacks. The webhooks are not easily discoverable by cluster admins, and
they can block evictions for other applications if they are misconfigured or misbehave. The
eviction API is not intended to be extensible in this way. The webhook approach is therefore not
recommended.

Some drainers solve the node drain by depending on the kubectl logic, or by extending/rewriting it
with additional rules and logic.

As seen in the experience reports and GitHub issues, some admins solve their problems by simply
ignoring PDBs, which can cause unnecessary disruptions or data loss. Some solve this by adjusting
the application deployment, but they have to ensure that the application supports this.

kubelet's Graceful Node Shutdown feature is a best-effort solution for unplanned shutdowns, but
it is not sufficient to ensure application and data safety.

### Goals
- Introduce the NodeMaintenance API.
- Introduce a node maintenance controller that creates EvictionRequests.
- Deprecate `kubectl drain` in favor of NodeMaintenance, or at least print a warning.
- Make Graceful Node Shutdown prefer NodeMaintenance during a node shutdown as an opt-in feature
  for better reliability and application safety.
- Implement a NodeMaintenanceAwareKubelet feature that defines a lifecycle for static pods during a
  maintenance.
- Implement a NodeMaintenanceAwareDaemonSet feature to prevent the scheduling of DaemonSet pods on
  nodes during a maintenance.

### Non-Goals
- Introduce a node maintenance period, a nodeDrainTimeout (similar to the [cluster-api](https://cluster-api.sigs.k8s.io/developer/architecture/controllers/control-plane)
  nodeDrainTimeout), or an optional TTL field as an upper bound on the duration of node maintenance,
  after which the node maintenance would be garbage collected and the node made schedulable again.
- Solve node lifecycle management or automatic shutdown after the node drain is completed.
  Implementation of this is better suited for other cluster components and actors who can use the
  node maintenance as a building block to achieve their desired goals.
- Synchronize all pod termination mechanisms (see #8 in the [Motivation](#motivation) section), so
  that they do not terminate pods under NodeMaintenance/EvictionRequests.

## Proposal

Most of these issues stem from the lack of a standardized way of detecting the start of a node
drain. This KEP proposes the introduction of a NodeMaintenance object that would signal an intent
to gracefully remove pods from given nodes. The intent will be implemented by the newly proposed
[EvictionRequests API KEP](https://github.com/kubernetes/enhancements/issues/4563), which ensures
graceful pod removal or migration, the ability to measure progress, and a fallback to eviction if
progress is lost. 
The NodeMaintenance implementation should also utilize the node's existing
`.spec.unschedulable` field, which prevents new pods from being scheduled on such a node.

We will deprecate `kubectl drain` as the main mechanism for draining nodes and drive the whole
process via a declarative API. This API can be used either manually or programmatically by other
drain implementations (e.g., cluster autoscalers).

To support workload migration, a new controller should be introduced to observe the NodeMaintenance
objects and then select pods for eviction. The pods should be selected by node (`nodeSelector`) and
evicted gradually, by creating EvictionRequest objects, according to the workload they are running.
Controllers can then implement the migration/termination either by reacting to the EvictionRequests
API or by reacting to the NodeMaintenance API if they need more details.

### User Stories

#### Story 1

As a cluster admin, I want to have a simple interface to initiate a node drain/maintenance without
any required manual interventions. I want to have the ability to manually switch between the
maintenance phases (Planning, Cordon, Drain, Drain Complete, Maintenance Complete). I also want to
observe the node drain via the API and check on its progress, and to be able to discover workloads
that are blocking the node drain.

#### Story 2

As an application owner, I want to run single-replica applications without disruptions and have the
ability to easily migrate the workload pods from one node to another. This also applies to
applications with a larger number of replicas that prefer to surge (upscale) pods first rather than
downscale.

#### Story 3

Cluster or node autoscalers that take on the role of `kubectl drain` want to signal the intent to
drain a node using the same API and provide a similar experience to the CLI counterpart.

### Notes/Constraints/Caveats (Optional)

- This KEP depends on the [EvictionRequests API KEP](https://github.com/kubernetes/enhancements/issues/4563).

### Risks and Mitigations

A misconfigured `.spec.nodeSelector` could select all the nodes (or just all master nodes) in the
cluster. This can cause the cluster to get into a degraded and unrecoverable state.

An admission plugin ([NodeMaintenance Admission](#nodemaintenance-admission)) is introduced to
issue a warning in this scenario.

## Design Details

### Kubectl

`kubectl drain`: as we can see in the [Motivation](#motivation) section, kubectl is heavily used
either manually or as a library by other projects. It is safer to keep the old behavior of this
command. However, we will deprecate it along with all the library functions. We can print a
deprecation warning when this command is used, and promote NodeMaintenance instead. Additionally,
pods that support eviction requests and have
`interceptor.evictionrequest.coordination.k8s.io/priority_${INTERCEPTOR_CLASS}` annotations could be
skipped when proceeding with the API-initiated eviction.

The `kubectl cordon` and `kubectl uncordon` commands will be enhanced to warn the user when making
a node un/schedulable collides with an existing NodeMaintenance object. In that case, the node
maintenance controller will reconcile the node back to the old value. Because of this, we can make
these commands a no-op when the node is under a NodeMaintenance.

### NodeMaintenance API

NodeMaintenance objects serve as an intent to remove or migrate pods from a set of nodes. 
We will
include Cordon and Drain toggles to support the following states/stages of the maintenance:
1. Planning: this is to let the users know that maintenance will be performed on a particular set
   of nodes in the future. Configured with `.spec.stage=Idle`.
2. Cordon: stop accepting (scheduling) new pods. Configured with `.spec.stage=Cordon`.
3. Drain: gives an intent to drain all selected nodes by creating `EvictionRequest` objects for the
   node's pods. Configured with `.spec.stage=Drain`.
4. Drain Complete: all targeted pods have been drained from all the selected nodes. The nodes can
   be upgraded, restarted, or shut down. The configuration is still kept at `.spec.stage=Drain` and
   the `Drained` condition is set to `"True"` on the node maintenance object.
5. Maintenance Complete: make the nodes schedulable again once the node maintenance is done.
   Configured with `.spec.stage=Complete`.

```golang

// +enum
type NodeMaintenanceStage string

const (
  // Idle does not interact with the cluster.
  Idle NodeMaintenanceStage = "Idle"
  // Cordon cordons all selected nodes by making them unschedulable.
  Cordon NodeMaintenanceStage = "Cordon"
  // Drain:
  // 1. Cordons all selected nodes by making them unschedulable.
  // 2. Gives an intent to drain all selected nodes by creating EvictionRequest objects for the
  //    node's pods.
  Drain NodeMaintenanceStage = "Drain"
  // Complete:
  // 1. Removes all EvictionRequest objects requested by this NodeMaintenance.
  // 2. Uncordons all selected nodes by making them schedulable again, unless another
  //    maintenance is still in progress for them.
  Complete NodeMaintenanceStage = "Complete"
)

type NodeMaintenance struct {
  ...
  Spec NodeMaintenanceSpec
  Status NodeMaintenanceStatus
}

type NodeMaintenanceSpec struct {
  // NodeSelector selects nodes for this node maintenance.
  // +required
  NodeSelector *v1.NodeSelector

  // The order of the stages is Idle -> Cordon -> Drain -> Complete.
  //
  // - The Cordon or Drain stage can be skipped by setting the stage to Complete.
  // - The NodeMaintenance object is moved to the Complete stage on deletion unless the Idle stage has been set.
  //
  // The default value is Idle.
  Stage NodeMaintenanceStage

  // DrainPlan is executed from the first entry to the last entry during the Drain stage.
  // DrainPlanEntry podType fields should be in the following order:
  // Default -> DaemonSet -> Static
  // DrainPlanEntry priority fields should be in ascending order for each podType.
  // If the priority and podType are the same, concrete selectors are executed first.
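  //
  // For example, a plan satisfying these ordering rules could look like this
  // (an illustrative sequence, not a default):
  //  - {podPriority: 1000, podType: Default, podSelector: app=batch}
  //  - {podPriority: 1000, podType: Default}
  //  - {podPriority: 2147483647, podType: Default}
  //  - {podPriority: 2147483647, podType: DaemonSet}
  //  - {podPriority: 2147483647, podType: Static}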
+ // + // The following entries are injected into the drainPlan on the NodeMaintenance admission: + // - podPriority: 1000000000 // highest priority for user defined priority classes + // podType: "Default" + // - podPriority: 2000000000 // system-cluster-critical priority class + // podType: "Default" + // - podPriority: 2000001000 // system-node-critical priority class + // podType: "Default" + // - podPriority: 2147483647 // maximum value + // podType: "Default" + // - podPriority: 1000000000 // highest priority for user defined priority classes + // podType: "DaemonSet" + // - podPriority: 2000000000 // system-cluster-critical priority class + // podType: "DaemonSet" + // - podPriority: 2000001000 // system-node-critical priority class + // podType: "DaemonSet" + // - podPriority: 2147483647 // maximum value + // podType: "DaemonSet" + // - podPriority: 1000000000 // highest priority for user defined priority classes + // podType: "Static" + // - podPriority: 2000000000 // system-cluster-critical priority class + // podType: "Static" + // - podPriority: 2000001000 // system-node-critical priority class + // podType: "Static" + // - podPriority: 2147483647 // maximum value + // podType: "Static" + // + // Duplicate entries are not allowed. + // This field is immutable. + DrainPlan []DrainPlanEntry + + // Reason for the maintenance. + Reason string +} + +const ( +// Default selects all pods except DaemonSet and Static pods. +Default PodType = "Default" +// DaemonSet selects DaemonSet pods. +DaemonSet PodType = "DaemonSet" +// Static selects static pods. +Static PodType = "Static" +) + +type DrainPlanEntry struct { + // PodSelector selects pods according to their labels. + // This can help to select which pods of the same priority should be evicted first. + // +optional + PodSelector *metav1.LabelSelector + // PodPriority specifies a pod priority. + // Pods with a priority less or equal to this value are selected. + PodPriority int32 + // PodType selects pods according to the pod type: + // - Default selects all pods except DaemonSet and Static pods. + // - DaemonSet selects DaemonSet pods. + // - Static selects static pods. + PodType PodType +} + +type NodeMaintenanceStatus struct { + // StageStatuses tracks the statuses of started stages. + StageStatuses []StageStatus + DrainStatus DrainStatus + Conditions []metav1.Condition +} + +type StageStatus struct { + // Name of the Stage. + Name NodeMaintenanceStage + // StartTimestamp is the time that indicates the start of this stage. + StartTimestamp *metav1.Time +} + +type DrainStatus struct { + // ReachedDrainTargets indicates which pods on all selected nodes are currently being targeted + // for eviction. Some of the nodes may have reached higher drain targets. This field tracks + // only the lowest drain targets among all nodes. Consult the status of each node to observe + // its current drain targets. + // + // Once eviction of the Default PodType finishes, DaemonSet PodType entries appear. + // Once the eviction of DaemonSet PodType finishes, Static PodType entries appear. + // The PodPriority for these entries is increased over time according to the .spec.DrainPlan + // as the lower-priority pods finish eviction. + // The next entry in the .spec.DrainPlan is selected once all the nodes have reached their + // DrainTargets. + // If there are multiple NodeMaintenances for a node, the least powerful DrainTargets among + // them are selected and set for that node. 

### NodeMaintenance Admission

A `nodemaintenance` admission plugin will be introduced.

It will validate all incoming requests for CREATE, UPDATE, and DELETE operations on the
NodeMaintenance objects. All nodes matching the `.spec.nodeSelector` must pass an authorization
check for the DELETE operation.

Also, if the `.spec.nodeSelector` matches all cluster nodes, a warning will be produced indicating
that the cluster may get into a degraded and unrecoverable state. The warning is non-blocking, and
such a NodeMaintenance is still valid and can proceed.

### NodeMaintenance Controller

A node maintenance controller will be introduced and added to `kube-controller-manager`. It will
observe NodeMaintenance objects and have the following main features:

#### Idle

In the `Idle` stage, the controller should not touch the pods or nodes that match the selector of
the NodeMaintenance object in any way.
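
The following is a rough sketch of the controller's top-level reconciliation. The helper names are
illustrative stubs, not proposed function names; the behavior of each stage is detailed in the
subsections below:

```golang
// Controller is a trimmed-down stand-in for the node maintenance controller.
type Controller struct {
  // clients, listers, etc. omitted
}

func (c *Controller) cordonSelectedNodes(nm *NodeMaintenance) error                   { return nil } // stub
func (c *Controller) createEvictionRequestsForDrainTargets(nm *NodeMaintenance) error { return nil } // stub
func (c *Controller) completeMaintenance(nm *NodeMaintenance) error                   { return nil } // stub

// reconcile dispatches on the requested stage. Cordoning is repeated in the Drain stage
// because Drain implies Cordon; Complete uncordons the nodes and removes the
// EvictionRequests owned by this maintenance.
func (c *Controller) reconcile(nm *NodeMaintenance) error {
  switch nm.Spec.Stage {
  case Idle:
    return nil // do not touch matching pods or nodes
  case Cordon:
    return c.cordonSelectedNodes(nm)
  case Drain:
    if err := c.cordonSelectedNodes(nm); err != nil {
      return err
    }
    return c.createEvictionRequestsForDrainTargets(nm)
  case Complete:
    return c.completeMaintenance(nm)
  }
  return nil
}
```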

#### Finalizers and Deletion of the NodeMaintenance

When the stage is not `Idle`, a `nodemaintenance.k8s.io/maintenance-completion` finalizer is placed
on the NodeMaintenance object to ensure uncordon and removal of EvictionRequests upon deletion.

When a deletion of the NodeMaintenance object is detected, its `.spec.stage` is set to `Complete`.
The finalizer is not removed until the `Complete` stage has been completed.

#### Cordon

When a `Cordon` or `Drain` stage is detected on the NodeMaintenance object, the controller
will set (and reconcile) `.spec.unschedulable` to `true` on all nodes that satisfy
`.spec.nodeSelector`. It should alert via events if too many updates occur and a race to change
this field is detected.

An alternative to prevent raciness is to make the scheduler aware of active NodeMaintenances and
not schedule new pods there.

#### Uncordon (Complete)

When a `Complete` stage is detected on the NodeMaintenance object, the controller sets
`.spec.unschedulable` back to `false` on all nodes that satisfy `.spec.nodeSelector`, unless
another maintenance is still in progress for the node.

When the node maintenance is canceled (reaches the `Complete` stage without all of its pods
terminating), the controller will attempt to remove all EvictionRequests that match the node
maintenance and are not required by another maintenance in progress.
- If there are foreign finalizers on the EvictionRequest, it should only remove its own requester
  finalizer (see [Drain](#drain)).
- If the interceptor does not support a cancellation and has set
  `.status.evictionRequestCancellationPolicy` to `Forbid`, deletion of the EvictionRequest object
  will not be attempted.

Consequences for pods:
1. Pods whose interceptors have not yet initiated the eviction process will continue to run
   unchanged.
2. Pods whose interceptors have initiated the eviction process and support cancellation
   (`.status.evictionRequestCancellationPolicy=Allow`) should cancel the eviction and keep the pods
   available.
3. Pods whose interceptors have initiated the eviction process and do not support cancellation
   (`.status.evictionRequestCancellationPolicy=Forbid`) should continue the eviction and eventually
   terminate the pods.

#### Drain

When a `Drain` stage is detected on the NodeMaintenance object, EvictionRequest objects are created
for the selected pods ([Pod Selection](#pod-selection)).
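
A minimal sketch of how the controller could derive the EvictionRequest for a pod. The trimmed-down
type and helper names below are illustrative stand-ins for the EvictionRequest API proposed in
KEP-4563, not its actual definition:

```golang
const requesterFinalizer = "requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io"

// EvictionRequestSketch carries only the fields the node maintenance controller would set;
// the interceptors are resolved later, on admission.
type EvictionRequestSketch struct {
  Name                    string
  Namespace               string
  Finalizers              []string
  PodName                 string
  PodUID                  string
  ProgressDeadlineSeconds int32
}

func evictionRequestForPod(podNamespace, podName, podUID string) EvictionRequestSketch {
  return EvictionRequestSketch{
    // The pod UID prefix keeps the name unique even if a pod with the same name
    // is re-created on the node.
    Name:       podUID + "-" + podName,
    Namespace:  podNamespace,
    Finalizers: []string{requesterFinalizer},
    PodName:    podName,
    PodUID:     podUID,
    // 30 minutes; gives the interceptors time to recover from disruptions.
    ProgressDeadlineSeconds: 1800,
  }
}
```

For example, the controller would create the following object: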

```yaml
apiVersion: v1alpha1
kind: EvictionRequest
metadata:
  finalizers:
  - requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io
  name: f5823a89-e03f-4752-b013-445643b8c7a0-muffin-orders-6b59d9cb88-ks7wb
  namespace: blue-deployment
spec:
  podRef:
    name: muffin-orders-6b59d9cb88-ks7wb
    uid: f5823a89-e03f-4752-b013-445643b8c7a0
  progressDeadlineSeconds: 1800
```

On admission, this is resolved to the following EvictionRequest object according to the pod:

```yaml
apiVersion: v1alpha1
kind: EvictionRequest
metadata:
  finalizers:
  - requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io
  labels:
    app: muffin-orders
  name: f5823a89-e03f-4752-b013-445643b8c7a0-muffin-orders-6b59d9cb88-ks7wb
  namespace: blue-deployment
spec:
  podRef:
    name: muffin-orders-6b59d9cb88-ks7wb
    uid: f5823a89-e03f-4752-b013-445643b8c7a0
  progressDeadlineSeconds: 1800
  interceptors:
  - interceptorClass: deployment.apps.k8s.io
    priority: 10000
    role: controller
```

The node maintenance controller requests the removal of a pod from a node by the presence of the
EvictionRequest. Setting `progressDeadlineSeconds` to 1800 (30m) should give potential interceptors
enough time to recover from a disruption and continue with the graceful eviction. If the
interceptors are unable to terminate the pod, or if there are no interceptors, the eviction request
controller will attempt to evict these pods until they are deleted.

The only job of the node maintenance controller is to make sure that the EvictionRequest objects
exist and have the `requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io`
finalizer.

#### Pod Selection

The pods to be evicted (by EvictionRequest) would first be selected by node (`.spec.nodeSelector`).
NodeMaintenance should eventually remove all the pods from each node. To do this in a graceful
manner, the controller will first ensure that lower priority pods are evicted/terminated first for
the same pod type. The user can also target some pods earlier than others with a label selector.

DaemonSet and static pods typically run critical workloads that should be scaled down last.

<<[UNRESOLVED Pod Selection Priority]>>
Should user daemon sets (priority up to 1000000000) be scaled down first?
<<[/UNRESOLVED]>>

To achieve this, we will ensure that the NodeMaintenance `.spec.drainPlan` always contains the
following entries:

```yaml
spec:
  drainPlan:
  - podPriority: 1000000000 # highest priority for user defined priority classes
    podType: "Default"
  - podPriority: 2000000000 # system-cluster-critical priority class
    podType: "Default"
  - podPriority: 2000001000 # system-node-critical priority class
    podType: "Default"
  - podPriority: 2147483647 # maximum value
    podType: "Default"
  - podPriority: 1000000000 # highest priority for user defined priority classes
    podType: "DaemonSet"
  - podPriority: 2000000000 # system-cluster-critical priority class
    podType: "DaemonSet"
  - podPriority: 2000001000 # system-node-critical priority class
    podType: "DaemonSet"
  - podPriority: 2147483647 # maximum value
    podType: "DaemonSet"
  - podPriority: 1000000000 # highest priority for user defined priority classes
    podType: "Static"
  - podPriority: 2000000000 # system-cluster-critical priority class
    podType: "Static"
  - podPriority: 2000001000 # system-node-critical priority class
    podType: "Static"
  - podPriority: 2147483647 # maximum value
    podType: "Static"
  ...
```
+``` + +If not they will be added during the NodeMaintenance admission. + +The node maintenance controller resolves this plan across intersecting NodeMaintenances. To indicate +which pods are being evicted on which node, the controller populates +`.status.maintenanceStatus.drainTargets` on each node object. It also populates +`.status.drainStatus.reachedDrainTargets` of the NodeMaintenance to track the lowest drain targets +among all nodes (pods that are being evicted everywhere). These status fields are updated during the +`Drain` stage to incrementally select pods with higher priority and pod type (`Default` +->`DaemonSet` -> `Static`). It is also possible to partition the updates for the same priorities +according to the pod labels. + +If there is only a single NodeMaintenance present, it selects the first entry from the +`.spec.drainPlan` and makes sure that all the targeted pods are terminated. It then selects +the next entry and repeats the process. If a new pod appears that matches the previous entries, it +will also be evicted. + +If there are multiple NodeMaintenances, we have to first resolve the lowest priority entry from the +`.spec.drainPlan` among them for the intersecting nodes. Non-intersecting nodes may have a higher +priority or pod type. The next entry in the plan can be selected once all the nodes of a +NodeMaintenance have finished eviction of all pods and all the NodeMaintenances of intersecting +nodes have finished eviction of pods for the current drain targets. See the +[Pod Selection and DrainTargets Example](#pod-selection-and-draintargets-example) for additional +details. + +A similar kind of drain plan, albeit with fewer features is offered today by the +[Graceful Node Shutdown](https://kubernetes.io/docs/concepts/architecture/nodes/#graceful-node-shutdown) +feature and by the Cluster Autoscaler's [drain-priority-config](https://github.com/kubernetes/autoscaler/pull/6139). +The downside of these configurations is that they have `shutdownGracePeriodSeconds` which sets a +limit on how long the termination of pods should take. This is not application-aware and some +applications may require more time to gracefully shut down. Allowing such hard-coded timeouts may +result in unnecessary application disruptions or data corruption. + +To support the eviction of `DaemonSet` and `Static` pods, the daemon set controller and kubelet +should observe NodeMaintenance objects and EvictionRequests to coordinate the scale down of the pods +on the targeted nodes. + +To ensure more streamlined experience we will not support the default kubectl [drain filters](https://github.com/kubernetes/kubernetes/blob/56cc5e77a10ba156694309d9b6159d4cd42598e1/staging/src/k8s.io/kubectl/pkg/drain/filters.go#L153-L162). +Instead, it should be possible to create the NodeMaintenance object with just a `spec.nodeSelector`. +The only thing that can be configured is which pods should be scaled down first. + +NodeMaintenance alternatives to kubectl drain filters: +- `daemonSetFilter`: Removal of these pods should be supported by the DaemonSet controller. +- `mirrorPodFilter`: Removal of these pods should be supported by the kubelet. +- `skipDeletedFilter`: Creating EvictionRequest of already terminating pods should have no downside + and be informative for the user. +- `unreplicatedFilter`: Actors who own pods without a controller owner reference should have the + opportunity to register an interceptor to gracefully terminate their pods. 
Many drain solutions + today evict these types of pods indiscriminately. +- `localStorageFilter`: Actors who own pods with local storage (having `EmptyDir` volumes) should + have the opportunity to register an interceptor to gracefully terminate their pods. Many drain + solutions today evict these types of pods indiscriminately. + +#### Pod Selection and DrainTargets Example + +If two Node Maintenances are created at the same time for the same node. Then, for the intersecting +nodes, the entry with the lowest priority in the drainPlan is resolved first. + +```yaml +apiVersion: v1alpha1 +kind: NodeMaintenance +metadata: + name: "maintenance-a" +spec: + nodeSelector: + # selects nodes one and two + stage: Drain + drainPlan: + - podPriority: 5000 + podType: Default + - podPriority: 15000 + podType: Default + - podPriority: 3000 + podType: DaemonSet + ... +status: + drainStatus: + podsPendingEvictionRequest: 130 + activeEvictionRequests: 17 + drainMessage: "Draining" + reachedDrainTargets: + - podPriority: 5000 + podType: Default + ... +--- +apiVersion: v1alpha1 +kind: NodeMaintenance +metadata: + name: "maintenance-b" +spec: + nodeSelector: + # selects nodes one and three + stage: Drain + drainPlan: + - podPriority: 10000 + podType: Default + - podPriority: 15000 + podType: Default + - podPriority: 4000 + podType: DaemonSet + ... +status: + drainStatus: + podsPendingEvictionRequest: 145 + activeEvictionRequests: 35 + drainMessage: "Draining (limited by maintenance-a)" + reachedDrainTargets: + - podPriority: 5000 + podType: Default + ... +--- +apiVersion: v1 +kind: Node +metadata: + name: "one" +status: + maintenanceStatus: + podsPendingEvictionRequest: 100 + activeEvictionRequests: 10 + drainMessage: "Draining" + drainTargets: + - podPriority: 5000 + podType: Default + ... +--- +apiVersion: v1 +kind: Node +metadata: + name: "two" +status: + maintenanceStatus: + podsPendingEvictionRequest: 30 + activeEvictionRequests: 7 + drainMessage: "Draining" + drainTargets: + - podPriority: 5000 + podType: Default + ... +--- +apiVersion: v1 +kind: Node +metadata: + name: "three" +status: + maintenanceStatus: + podsPendingEvictionRequest: 45 + activeEvictionRequests: 25 + drainMessage: "Draining" + drainTargets: + - podPriority: 10000 + podType: Default + ... +``` + +If the node three is drained, then it has to wait for the node one, because the drain plan +specifies that all the pods with priority 10000 or lower should be evicted first before moving on to +the next entry. + +```yaml +apiVersion: v1alpha1 +kind: NodeMaintenance +metadata: + name: "maintenance-b" +spec: + nodeSelector: + # selects nodes one and three + stage: Drain + drainPlan: + - podPriority: 10000 + podType: Default + - podPriority: 15000 + podType: Default + - podPriority: 4000 + podType: DaemonSet + ... +status: + drainStatus: + podsPendingEvictionRequest: 145 + activeEvictionRequests: 5 + drainMessage: "Draining (limited by maintenance-a)" + reachedDrainTargets: + - podPriority: 5000 + podType: Default +--- +apiVersion: v1 +kind: Node +metadata: + name: "three" +status: + maintenanceStatus: + podsPendingEvictionRequest: 45 + activeEvictionRequests: 0 + drainMessage: "Waiting for maintenance-a." + drainTargets: + - podPriority: 10000 + podType: Default + ... +``` + +If the node one is drained, we still have to wait for the `maintenance-a` to drain node two. If we +were to start evicting higher priority pods from node one earlier, we would not conform to the +drainPlan of `maintenance-a`. 
The plan specifies that all the pods with priority 5000 or lower +should be evicted first before moving on to the next entry. + + +```yaml +apiVersion: v1alpha1 +kind: NodeMaintenance +metadata: + name: "maintenance-a" +spec: + nodeSelector: + # selects nodes one and two + stage: Drain + drainPlan: + - podPriority: 5000 + podType: Default + - podPriority: 15000 + podType: Default + - podPriority: 3000 + podType: DaemonSet + ... +status: + drainStatus: + podsPendingEvictionRequest: 130 + activeEvictionRequests: 2 + drainMessage: "Draining" + reachedDrainTargets: + - podPriority: 5000 + podType: Default + ... +--- +apiVersion: v1alpha1 +kind: NodeMaintenance +metadata: + name: "maintenance-b" +spec: + nodeSelector: + # selects nodes one and three + stage: Drain + drainPlan: + - podPriority: 10000 + podType: Default + - podPriority: 15000 + podType: Default + - podPriority: 4000 + podType: DaemonSet + ... +status: + drainStatus: + podsPendingEvictionRequest: 145 + activeEvictionRequests: 0 + drainMessage: "Waiting for maintenance-a." + reachedDrainTargets: + - podPriority: 5000 + podType: Default + ... +--- +apiVersion: v1 +kind: Node +metadata: + name: "one" +status: + maintenanceStatus: + podsPendingEvictionRequest: 100 + activeEvictionRequests: 0 + drainMessage: "Waiting for maintenance-a." + drainTargets: + - podPriority: 5000 + podType: Default + ... +--- +apiVersion: v1 +kind: Node +metadata: + name: "two" +status: + maintenanceStatus: + podsPendingEvictionRequest: 30 + activeEvictionRequests: 2 + drainMessage: "Draining" + drainTargets: + - podPriority: 5000 + podType: Default + ... +--- +apiVersion: v1 +kind: Node +metadata: + name: "three" +status: + maintenanceStatus: + podsPendingEvictionRequest: 45 + activeEvictionRequests: 0 + drainMessage: "Waiting for maintenance-a." + drainTargets: + - podPriority: 10000 + podType: Default + ... +``` + +Once the node two drains, we can increment the drainTargets. + + +```yaml +apiVersion: v1alpha1 +kind: NodeMaintenance +metadata: + name: "maintenance-a" +spec: + nodeSelector: + # selects nodes one and two + stage: Drain + drainPlan: + - podPriority: 5000 + podType: Default + - podPriority: 15000 + podType: Default + - podPriority: 3000 + podType: DaemonSet + ... +status: + drainStatus: + podsPendingEvictionRequest: 91 + activeEvictionRequests: 39 + drainMessage: "Draining" + reachedDrainTargets: + - podPriority: 10000 + podType: Default + ... +--- +apiVersion: v1alpha1 +kind: NodeMaintenance +metadata: + name: "maintenance-b" +spec: + nodeSelector: + # selects nodes one and three + stage: Drain + drainPlan: + - podPriority: 10000 + podType: Default + - podPriority: 15000 + podType: Default + - podPriority: 4000 + podType: DaemonSet + ... +status: + drainStatus: + podsPendingEvictionRequest: 115 + activeEvictionRequests: 30 + drainMessage: "Draining" + reachedDrainTargets: + - podPriority: 10000 + podType: Default + ... +--- +apiVersion: v1 +kind: Node +metadata: + name: "one" +status: + maintenanceStatus: + podsPendingEvictionRequest: 70 + activeEvictionRequests: 30 + drainMessage: "Draining (limited by maintenance-b)" + drainTargets: + - podPriority: 10000 + podType: Default + ... +--- +apiVersion: v1 +kind: Node +metadata: + name: "two" +status: + maintenanceStatus: + podsPendingEvictionRequest: 21 + activeEvictionRequests: 9 + drainMessage: "Draining" + drainTargets: + - podPriority: 15000 + podType: Default + ... 
+--- +apiVersion: v1 +kind: Node +metadata: + name: "three" +status: + maintenanceStatus: + podsPendingEvictionRequest: 45 + activeEvictionRequests: 0 + drainMessage: "Waiting for maintenance-b." + drainTargets: + - podPriority: 10000 + podType: Default + ... +``` + +The progress of the drain should not be backtracked. If an intersecting `maintenance-c` is created, +the node one progress should stay the same regardless of the node maintenance drainPlan. + + +```yaml +apiVersion: v1alpha1 +kind: NodeMaintenance +metadata: + name: "maintenance-c" +spec: + nodeSelector: + # selects nodes one and four + stage: Drain + drainPlan: + - podPriority: 2000 + podType: Default + - podPriority: 15000 + podType: Default + ... +status: + drainStatus: + podsPendingEvictionRequest: 90 + activeEvictionRequests: 35 + drainMessage: "Draining" + reachedDrainTargets: + - podPriority: 2000 + podType: Default + ... +--- +apiVersion: v1 +kind: Node +metadata: + name: "one" +status: + maintenanceStatus: + podsPendingEvictionRequest: 70 + activeEvictionRequests: 30 + drainMessage: "Draining (limited by maintenance-b, maintenance-c)" + drainTargets: + - podPriority: 10000 + podType: Default + ... +--- +apiVersion: v1 +kind: Node +metadata: + name: "four" +status: + maintenanceStatus: + podsPendingEvictionRequest: 20 + activeEvictionRequests: 5 + drainMessage: "Draining" + drainTargets: + - podPriority: 2000 + podType: Default + ... +--- +``` + +This is done to ensure that the pre-conditions of the older maintenances (`maintenance-a` and +`maintenance-b`) are not broken. When we remove workloads with priority 15000, our pre-condition is +that workloads with priority 5000 that might depend on these 15000 priority workloads are gone. If +we allow rescheduling of the lower priority pods, this assumption is broken. + +Unfortunately, a similar precondition is broken for the `maintenance-c`, so we can at least emit an +event saying that we are fast-forwarding `maintenance-c` due to existing older maintenance(s). In +the extreme scenario, node one may already be turned off and creating a new maintenance that +assumes priority X pods are still running will not help to bring it back. Emitting an event would +help with observability and might help cluster admins better schedule node maintenances. + +##### PodTypes and Label Selectors Progression + +An example progression for the following drain plan might look as follows: + + +```yaml +spec: + stage: Drain + drainPlan: + - podPriority: 1000 + podType: Default + - podPriority: 2000 + podType: Default + podSelector: + matchLabels: + app: postgres + - podPriority: 2147483647 + podType: Default + - podPriority: 1000 + podType: DaemonSet + - podPriority: 2147483647 + podType: DaemonSet + - podPriority: 2147483647 + podType: Static +status: + nodeStatuses: + - nodeRef: + name: five + drainTargets: + - podPriority: 1000 + podType: Default + - podPriority: 1000 + podType: Default + podSelector: + matchLabels: + app: postgres + ... +``` + +```yaml +status: + nodeStatuses: + - nodeRef: + name: five + drainTargets: + - podPriority: 1000 + podType: Default + - podPriority: 2000 + podType: Default + podSelector: + matchLabels: + app: postgres + ... +``` + +```yaml +status: + nodeStatuses: + - nodeRef: + name: five + drainTargets: + - podPriority: 2147483647 + podType: Default + - podPriority: 2147483647 + podType: Default + podSelector: + matchLabels: + app: postgres + ... 
+``` + +```yaml +status: + nodeStatuses: + - nodeRef: + name: five + drainTargets: + - podPriority: 2147483647 + podType: Default + - podPriority: 2147483647 + podType: Default + podSelector: + matchLabels: + app: postgres + - podPriority: 1000 + podType: DaemonSet + ... +``` + +```yaml +status: + nodeStatuses: + - nodeRef: + name: five + drainTargets: + - podPriority: 2147483647 + podType: Default + - podPriority: 2147483647 + podType: Default + podSelector: + matchLabels: + app: postgres + - podPriority: 2147483647 + podType: DaemonSet + ... +``` + +```yaml +status: + nodeStatuses: + - nodeRef: + name: five + drainTargets: + - podPriority: 2147483647 + podType: Default + - podPriority: 2147483647 + podType: Default + podSelector: + matchLabels: + app: postgres + - podPriority: 2147483647 + podType: DaemonSet + - podPriority: 2147483647 + podType: Static + ... +``` + +#### Status + +The controller can show progress by reconciling: +- `.status.stageStatuses` should be amended when a new stage is selected. This is used to track + which stages have been started. Additional metadata can be added to this struct in the future. +- `.status.drainStatus.drainTargets` should be updated during a `Drain` stage. The drain + targets should be resolved according to the [Pod Selection](#pod-selection) and [Pod Selection and DrainTargets Example](#pod-selection-and-draintargets-example). +- `.status.drainStatus.drainMessage` should be updated during a `Drain` stage. The message + should be resolved according to [Pod Selection and DrainTargets Example](#pod-selection-and-draintargets-example). +- `.status.drainStatus.podsPendingEvictionRequest`, to indicate how many pods remain without a + matching EvictionRequest on the first node. +- `.status.drainStatus.activeEvictionRequests`, to indicate how many pods are being evicted from the + first node with EvictionRequests. Each EvictionRequest should match a pod that is not in a + terminal phase (`Succeeded` or `Failed`). +- To keep track of the entire maintenance the controller will reconcile a `Drained` condition and + set it to true if all pods pending eviction/termination have terminated on all target nodes + when drain is requested by the maintenance object. +- NodeMaintenance condition or annotation can be set on the node object to advertise the current + phase of the maintenance. +#### Supported Stage Transitions + +The following transitions should be validated by the API server. + +- Idle -> _Deletion_ + - Planning a maintenance in the future and canceling/deleting it without any consequence. +- (Idle) -> Cordon -> (Complete) -> _Deletion_. + - Make a set of nodes unschedulable and then schedulable again. + - The complete stage will always be run even without specifying it. +- (Idle) -> (Cordon) -> Drain -> (Complete) -> _Deletion_. + - Make a set of nodes unschedulable, drain them, and then make them schedulable again. + - Cordon and Complete stages will always be run, even without specifying them. +- (Idle) -> Complete -> _Deletion_. + - Make a set of nodes schedulable. + +The stage transitions are invoked either manually by the cluster admin or by a higher-level +controller. For a simple drain, cluster admin can simply create the NodeMaintenance with +`stage: Drain` directly. + +### DaemonSet Controller + +The DaemonSet workloads should be tied to the node lifecycle because they typically run critical +workloads where availability is paramount. 
+
+### DaemonSet Controller
+
+DaemonSet workloads should be tied to the node lifecycle because they typically run critical
+workloads where availability is paramount. Therefore, the DaemonSet controller should respond to
+an EvictionRequest only if a NodeMaintenance is in progress on that node and the DaemonSet pod
+falls within the `drainTargets`. For example, if we observe the following NodeMaintenance:
+
+```yaml
+apiVersion: v1alpha1
+kind: NodeMaintenance
+...
+status:
+  nodeStatuses:
+  - nodeRef:
+      name: six
+    drainTargets:
+    - podPriority: 2147483647
+      podType: Default
+    - podPriority: 5000
+      podType: DaemonSet
+   ...
+```
+
+To fulfill the EvictionRequest API, the DaemonSet controller should register itself as a controller
+interceptor. To do this, it should ensure that the following annotation is present on its own pods.
+
+```yaml
+interceptor.evictionrequest.coordination.k8s.io/priority_daemonset.apps.k8s.io: "10000/controller"
+```
+
+The controller should respond to the EvictionRequest object when it observes its own class
+(`daemonset.apps.k8s.io`) in `.status.activeInterceptorClass`.
+
+For the above node maintenance, the controller should not react to EvictionRequests of DaemonSet
+pods with a priority greater than 5000. This state should not normally occur, as EvictionRequest
+should be coordinated with NodeMaintenance. If it does occur, we should not encourage this flow:
+the controller should not set the `.status.activeInterceptorCompleted` field, even though setting
+this field is required for normal workloads.
+
+If the DaemonSet pods have a priority equal to or less than 5000, the EvictionRequest status should
+be updated appropriately as follows, and the targeted pod should be deleted by the DaemonSet
+controller:
+
+```yaml
+apiVersion: v1alpha1
+kind: EvictionRequest
+metadata:
+  finalizers:
+  - requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io
+  labels:
+    app: critical-ds
+  name: ae9b4bc6-e4ca-4f8e-962b-2d4459b1f684-critical-ds-5nxjs
+  namespace: critical-workloads
+spec:
+  podRef:
+    name: critical-ds-5nxjs
+    uid: ae9b4bc6-e4ca-4f8e-962b-2d4459b1f684
+  progressDeadlineSeconds: 1800
+  interceptors:
+  - interceptorClass: daemonset.apps.k8s.io
+    priority: 10000
+    role: controller
+status:
+  activeInterceptorClass: daemonset.apps.k8s.io
+  activeInterceptorCompleted: false
+  progressTimestamp: "2024-04-22T21:40:32Z"
+  expectedInterceptorFinishTime: "2024-04-22T21:41:32Z" # now
+  terminationGracePeriodSeconds:
+  failedAPIEvictionCounter: 0
+  message: "critical-ds is terminating the pod due to node maintenance (OS upgrade)."
+  conditions: []
+```
+
+Once the pod is terminated and removed from the node, it should not be re-scheduled on the node by
+the DaemonSet controller until the node maintenance is complete.
+
+### kubelet: Graceful Node Shutdown
+
+The current Graceful Node Shutdown feature has a couple of drawbacks when compared to
+NodeMaintenance:
+- It is application agnostic, as it only provides a static grace period before the shutdown based
+  on priority. This does not always give the application enough time to react and can lead to data
+  loss or loss of application availability.
+- DaemonSet pods may be running important services (critical priority) that should remain available
+  even during part of the shutdown. The DaemonSet controller has no visibility into the kubelet
+  shutdown procedure and cannot infer which DaemonSets should stop running. The controller needs to
+  know which DaemonSets should run on each node, and with which priorities, so that it can
+  reconcile accordingly.
+
+To support these use cases, we could introduce a new kubelet configuration option called
+`preferNodeMaintenanceDuringGracefulShutdown`, sketched below.
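+
+A rough sketch of such a configuration (`shutdownGracePeriodByPodPriority` is the existing
+Graceful Node Shutdown field; the name and placement of the new option are illustrative):
+
+```yaml
+apiVersion: kubelet.config.k8s.io/v1beta1
+kind: KubeletConfiguration
+# Existing Graceful Node Shutdown configuration: grace periods per pod-priority range.
+shutdownGracePeriodByPodPriority:
+- priority: 0
+  shutdownGracePeriodSeconds: 120
+- priority: 2000000000
+  shutdownGracePeriodSeconds: 60
+# Proposed option: on shutdown, create a NodeMaintenance that reuses only the
+# priorities above (without the grace periods) instead of the timeout-based flow.
+preferNodeMaintenanceDuringGracefulShutdown: true
+```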
+
+This would result in the following behavior:
+
+When a shutdown is detected, the kubelet would create a NodeMaintenance object for that node.
+Then it would block the shutdown, indefinitely if needed, until all the pods have terminated. The
+kubelet could pass the priorities from `shutdownGracePeriodByPodPriority` to the NodeMaintenance,
+just without the `shutdownGracePeriodSeconds`. This would give applications a chance to react and
+gracefully leave the node without a timeout. [Pod Selection](#pod-selection) would ensure that user
+workloads are terminated first and critical pods are terminated last.
+
+By default, all user workloads will be asked to terminate at once. The EvictionRequest API ensures
+that an interceptor is selected or the eviction API is called, so pod termination should start
+quickly. NodeMaintenance could then be used even with spot instances.
+
+The NodeMaintenance object should survive kubelet restarts, so the kubelet would always know
+whether the node is under shutdown (maintenance). The cluster admin would have to remove the
+NodeMaintenance object after the node restart to indicate that the node is healthy and can run pods
+again. Admins are expected to deal with the lifecycle of planned NodeMaintenances, so reacting to
+an unplanned one should not be a significant burden.
+
+If there is no connection to the apiserver (apiserver down, network issues, etc.) and the
+NodeMaintenance object cannot be created, we would fall back to the original behavior of the
+Graceful Node Shutdown feature. If the connection is restored, we would stop the Graceful Node
+Shutdown and proceed with the NodeMaintenance.
+
+The NodeMaintenance would ensure that all pods are removed, including DaemonSet and static pods.
+
+### kubelet: Static Pods
+
+Currently, there is no standard solution for terminating static pods. With NodeMaintenance, we can
+declaratively advertise what state each node should be in, and this can include static pods as
+well.
+
+Since static pods usually run the most critical workloads, they should be terminated last according
+to [Pod Selection](#pod-selection).
+
+Similar to [DaemonSets](#daemonset-controller), static pods should be tied to the node lifecycle
+because they typically run critical workloads where availability is paramount. Therefore, the
+kubelet should respond to an EvictionRequest only if a NodeMaintenance is in progress on that node
+and the `Static` pod falls within the `drainTargets`. For example, if we observe the following
+NodeMaintenance:
+
+```yaml
+apiVersion: v1alpha1
+kind: NodeMaintenance
+...
+status:
+  nodeStatuses:
+  - nodeRef:
+      name: six
+    drainTargets:
+    - podPriority: 2147483647
+      podType: Default
+    - podPriority: 2147483647
+      podType: DaemonSet
+    - podPriority: 7000
+      podType: Static
+   ...
+```
+
+To fulfill the EvictionRequest API, the kubelet should register itself as a controller interceptor.
+To do this, it should ensure that the following annotation is present on the static pods it
+manages.
+
+```yaml
+interceptor.evictionrequest.coordination.k8s.io/priority_kubelet.k8s.io: "10000/controller"
+```
+
+The kubelet should respond to the EvictionRequest object when it observes its own class
+(`kubelet.k8s.io`) in `.status.activeInterceptorClass`.
+
+For the above node maintenance, the kubelet should not react to EvictionRequests of static pods
+with a priority greater than 7000. This state should not normally occur, as EvictionRequest should
+be coordinated with NodeMaintenance.
+If it does occur, we should not encourage this flow: the kubelet should not set the
+`.status.activeInterceptorCompleted` field, even though setting this field is required for normal
+workloads.
+
+If the static pods have a priority equal to or less than 7000, the EvictionRequest status should be
+updated appropriately as follows, and the targeted pod should be terminated by the kubelet:
+
+```yaml
+apiVersion: v1alpha1
+kind: EvictionRequest
+metadata:
+  finalizers:
+  - requester.evictionrequest.coordination.k8s.io/name_nodemaintenance.k8s.io
+  labels:
+    app: critical-static-workload
+  name: 08deef1c-1838-42a5-a3a8-3a6d0558c7f9-critical-static-workload
+  namespace: critical-workloads
+spec:
+  podRef:
+    name: critical-static-workload
+    uid: 08deef1c-1838-42a5-a3a8-3a6d0558c7f9
+  progressDeadlineSeconds: 1800
+  interceptors:
+  - interceptorClass: kubelet.k8s.io
+    priority: 10000
+    role: controller
+status:
+  activeInterceptorClass: kubelet.k8s.io
+  activeInterceptorCompleted: false
+  progressTimestamp: "2024-04-22T22:10:05Z"
+  expectedInterceptorFinishTime: "2024-04-22T22:11:05Z" # now
+  terminationGracePeriodSeconds:
+  failedAPIEvictionCounter: 0
+  message: "critical-static-workload is terminating the pod due to node maintenance (OS upgrade)."
+  conditions: []
+```
+
+Once the pod is terminated and removed from the node, it should not be restarted on the node by
+the kubelet until the node maintenance is complete.
+
+### Test Plan
+
+[ ] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+
+##### Prerequisite testing updates
+
+##### Unit tests
+
+- ``: `` - ``
+
+##### Integration tests
+
+- :
+
+##### e2e tests
+
+- :
+
+### Graduation Criteria
+
+### Upgrade / Downgrade Strategy
+
+### Version Skew Strategy
+
+## Production Readiness Review Questionnaire
+
+### Feature Enablement and Rollback
+
+###### How can this feature be enabled / disabled in a live cluster?
+
+- [x] Feature gate
+  - Feature gate name: DeclarativeNodeMaintenance - this feature gate enables the NodeMaintenance
+    API and the node maintenance controller, which creates `EvictionRequest` objects
+  - Components depending on the feature gate: kube-apiserver, kube-controller-manager
+  - Feature gate name: NodeMaintenanceAwareKubelet - enables the kubelet integration with
+    NodeMaintenance (see `kep.yaml`)
+  - Components depending on the feature gate: kubelet
+  - Feature gate name: NodeMaintenanceAwareDaemonSet - enables the DaemonSet controller integration
+    with NodeMaintenance (see `kep.yaml`)
+  - Components depending on the feature gate: kube-controller-manager
+
+###### Does enabling the feature change any default behavior?
+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+###### What happens if we reenable the feature if it was previously rolled back?
+
+###### Are there any tests for feature enablement/disablement?
+
+### Rollout, Upgrade and Rollback Planning
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+###### What specific metrics should inform a rollback?
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+### Monitoring Requirements
+
+###### How can an operator determine if the feature is in use by workloads?
+
+###### How can someone using this feature know that it is working for their instance?
+
+- [ ] Events
+  - Event Reason:
+- [ ] API .status
+  - Condition name:
+  - Other field:
+- [ ] Other (treat as last resort)
+  - Details:
+
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+- [ ] Metrics
+  - Metric name:
+  - [Optional] Aggregation method:
+  - Components exposing the metric:
+- [ ] Other (treat as last resort)
+  - Details:
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+### Dependencies
+
+###### Does this feature depend on any specific services running in the cluster?
+
+### Scalability
+
+###### Will enabling / using this feature result in any new API calls?
+
+###### Will enabling / using this feature result in introducing new API types?
+
+###### Will enabling / using this feature result in any new calls to the cloud provider?
+
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+### Troubleshooting
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+###### What are other known failure modes?
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
+## Implementation History
+
+- 2024-12-04: Evacuation API was renamed to EvictionRequest API.
+
+## Drawbacks
+
+## Alternatives
+
+### Out-of-tree Implementation
+
+We could implement the NodeMaintenance or EvictionRequest API out-of-tree first as a CRD.
+
+The KEP aims to solve graceful termination of any pod in the cluster. This is not possible with a
+third-party CRD, as we need integration with core components.
+
+- We would like to solve the lifecycle of static pods during node maintenance. This means that
+  static pods should be terminated during the drain according to the `drainPlan`, and they should
+  stay terminated after a kubelet restart if the node is still under maintenance. This requires
+  integration with the kubelet. See [kubelet: Static Pods](#kubelet-static-pods) for more details.
+- We would like to improve the Graceful Node Shutdown feature. Terminating pods via NodeMaintenance
+  will improve application safety and availability. It will also improve the reliability of the
+  Graceful Node Shutdown feature. However, this also requires the kubelet to interact with a
+  NodeMaintenance. See [kubelet](#kubelet) and
+  [kubelet: Graceful Node Shutdown](#kubelet-graceful-node-shutdown) for more details.
+- We would also like to solve the lifecycle of DaemonSet pods during a NodeMaintenance. Usually
+  these pods run important or critical services that should be terminated at the right time
+  during the node drain. To solve this, integration with NodeMaintenance is required. See
+  [DaemonSet Controller](#daemonset-controller) for more details.
+
+Also, one of the disadvantages of using a CRD is that it would be more difficult to get real-world
+adoption and thus important feedback on this feature.
+This is mainly because the NodeMaintenance feature coordinates the node drain and provides good
+observability of the whole process. Third-party components that are both cluster-admin and
+application-developer facing can depend on this feature, use it, and build on top of it.
+
+### Use a Node Object Instead of Introducing a New NodeMaintenance API
+
+As an alternative, it would be possible to signal the node maintenance by marking the node object
+instead of introducing a new API. However, it is better to decouple this from the node for reasons
+of extensibility and complexity.
+
+Advantages of the NodeMaintenance API approach:
+- It allows us to implement incremental scale-down of pods by various attributes according to a
+  drainPlan across multiple nodes.
+- Additional features can be added to the NodeMaintenance API in the future.
+- It helps to decouple RBAC permissions and general update responsibility from the node object.
+- It is easier to manage a NodeMaintenance lifecycle compared to the node object.
+- Two or more different actors may want to maintain the same node in two different overlapping time
+  slots. Creating two different NodeMaintenance objects would help with tracking each maintenance
+  along with the reason behind it.
+- Observability is better achieved with an additional object.
+
+### Use Taint Based Eviction for Node Maintenance
+
+To signal the start of the eviction, we could simply taint a node with the `NoExecute` taint. This
+taint should be easily recognizable and have a standard name, such as
+`node.kubernetes.io/maintenance`. Other actors could observe the creation of such a taint and
+migrate or delete their pods. To ensure pods are not removed prematurely, application owners would
+have to set a toleration on their pods for this maintenance taint. Such applications could also set
+`.spec.tolerations[].tolerationSeconds`, which would give a deadline for the pods to be removed by
+the NoExecuteTaintManager. A sketch of this flow follows the list of disadvantages below.
+
+This approach has the following disadvantages:
+- Taints and tolerations do not support PDBs, which are the main mechanism for preventing voluntary
+  disruptions. Users who want to avoid the disruptions caused by the maintenance taint would have
+  to specify the toleration in the pod definition and ensure it is present at all times. This would
+  also have an impact on the controllers, which would have to pollute the pod definitions with
+  these tolerations even though the users did not specify them in their pod template. The
+  controllers could override users' tolerations, which the users might not be happy about. It is
+  also hard to make such behaviors consistent across all the controllers.
+- Taints are used as a mechanism for involuntary disruption: getting pods off a node for some
+  reason (e.g., the node is not ready). Modifying the taint mechanism to be less harmful
+  (e.g., by adding PDB support) is not possible due to the original requirements.
+- It is not possible to incrementally scale down according to pod priorities, labels, etc.
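+
+For illustration only, this rejected flow would have looked roughly like the following (the node
+name, pod, and `tolerationSeconds` value are hypothetical):
+
+```yaml
+# The maintenance taint applied to a node:
+apiVersion: v1
+kind: Node
+metadata:
+  name: worker-1
+spec:
+  taints:
+  - key: node.kubernetes.io/maintenance
+    effect: NoExecute
+---
+# A pod that wants to delay its eviction would need a toleration such as this,
+# giving it up to 300 seconds before the NoExecuteTaintManager deletes it:
+apiVersion: v1
+kind: Pod
+metadata:
+  name: slow-to-drain
+spec:
+  tolerations:
+  - key: node.kubernetes.io/maintenance
+    operator: Exists
+    effect: NoExecute
+    tolerationSeconds: 300
+  containers:
+  - name: app
+    image: registry.k8s.io/pause:3.9
+```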
+ +### Names considered for the new API + +These names are considered as an alternative to NodeMaintenance: + +- NodeIsolation +- NodeDetachment +- NodeClearance +- NodeQuarantine +- NodeDisengagement +- NodeVacation + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-apps/4212-declarative-node-maintenance/kep.yaml b/keps/sig-apps/4212-declarative-node-maintenance/kep.yaml new file mode 100644 index 00000000000..56a819a34e6 --- /dev/null +++ b/keps/sig-apps/4212-declarative-node-maintenance/kep.yaml @@ -0,0 +1,70 @@ +title: Declarative Node Maintenance +kep-number: 4212 +authors: + - "@atiratree" +owning-sig: sig-apps +participating-sigs: + - sig-apps + - sig-autoscaling + - sig-cli + - sig-cluster-lifecycle + - sig-node + - sig-scheduling +status: provisional +creation-date: 2023-09-15 +reviewers: + - "@adammw" + - "@alvaroaleman" + - "@aojea" + - "@dbenque" + - "@dchen1107" + - "@evrardjp" + - "@fabiand" + - "@fabriziopandini" + - "@kfox1111" + - "@kwilczynski" + - "@mmerkes" + - "@pwschuurman" + - "@razo7" + - "@rptaylor" + - "@rthallisey" + - "@sbueringer" + - "@sftim" + - "@soltysh" + - "@thockin" + - "@vincepri" + - "@wangzhen127" +approvers: + - "@dchen1107" + - "@fabriziopandini" + - "@soltysh" + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.33" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: DeclarativeNodeMaintenance + components: + - kube-apiserver + - kube-controller-manager + - name: NodeMaintenanceAwareKubelet + components: + - kubelet + - name: NodeMaintenanceAwareDaemonSet + components: + - kube-controller-manager +disable-supported: true + +## The following PRR answers are required at beta release +#metrics: