
KEP-4212: Declarative Node Maintenance

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

This KEP proposes adding a declarative API to manage node maintenance. This API can be used to implement additional capabilities around node draining.

Motivation

The goal of this KEP is to analyze and improve node maintenance in Kubernetes.

Node maintenance is a request from a cluster administrator to remove all pods from one or more nodes so that they can be disconnected from the cluster to perform a software upgrade (OS, kubelet), a hardware upgrade, or simply to remove a node that is no longer needed.

Kubernetes supports this use case today with kubectl drain, which works as follows:

  1. There are running pods on node A, some of which are protected with PodDisruptionBudgets (PDB).
  2. Set the node Unschedulable (cordon) to prevent new pods from being scheduled there.
  3. Evict (default behavior) pods from node A by using the eviction API (see the kubectl drain workflow).
  4. Proceed with the maintenance and shut down the node.
  5. Kubelet can try to delay the shutdown to allow the remaining pods to terminate gracefully (graceful-node-shutdown). The Kubelet also takes pod priority into account (pod-priority-graceful-node-shutdown).

The main problem is that the current approach is application agnostic: it simply tries to remove all pods from the node. Because this cannot be applied generically to every pod, the Kubernetes project has defined special drain filters that either skip groups of pods or require the admin to explicitly consent to skipping or deleting them. This means that, without knowledge of all the underlying applications on the cluster, the admin has to make a potentially harmful decision.

From an application owner's or developer's perspective, the only standard tool available is a PodDisruptionBudget (PDB). This is sufficient in the basic scenario of a simple multi-replica application. The edge-case applications where this does not work are very important to the cluster admin, because they can block the node drain, and in turn very important to the application owner, because the admin may then override the pod disruption budget and disrupt their sensitive application anyway.

List of cases where the current solution is not optimal:

  1. Without extra manual effort, an application running with a single replica has to settle for experiencing application downtime during the node drain. They cannot use PDBs with minAvailable: 1 or maxUnavailable: 0, or they will block node maintenance. Not every user needs high availability either, due to a preference for a simpler deployment model, lack of application support for HA, or to minimize compute costs. Also, any automated solution needs to edit the PDB to account for the additional pod that needs to be spun to move the workload from one node to another. This has been discussed in issue kubernetes/kubernetes#66811 and in issue kubernetes/kubernetes#114877.
  2. Similar to the first point, it is difficult to use PDBs for applications that can have a variable number of pods; for example, applications with a configured horizontal pod autoscaler (HPA). These applications cannot be disrupted during a low load when they have only one pod. However, it is possible to disrupt the pods during a high load without experiencing application downtime. If the minimum number of pods is 1, PDBs cannot be used without blocking the node drain. This has been discussed in issue kubernetes/kubernetes#93476
  3. Graceful deletion of DaemonSet pods is currently only supported as part of (Linux) graceful node shutdown. The length of the shutdown is again not application specific and is set cluster-wide (optionally by priority) by the cluster admin. This does not take into account .spec.terminationGracePeriodSeconds of each pod and may cause premature termination of the application. This has been discussed in issue kubernetes/kubernetes#75482 and in issue kubernetes-sigs/cluster-api#6158.
  4. There are cases where data corruption can occur due to a premature node shutdown. It would be great if applications could perform data migration and synchronization of cached writes to the underlying storage before the pod deletion occurs. This is not easy to quantify with a grace period (such as the pod's .spec.terminationGracePeriodSeconds or the kubelet's shutdownGracePeriod), as the time depends on the size of the data and the speed of the storage. This has been discussed in issue kubernetes/kubernetes#116618 and in issue kubernetes/kubernetes#115148.
  5. There is not enough metadata about why the node drain was requested. This has been discussed in issue kubernetes/kubernetes#30586.

Approaches and workarounds used by other projects to deal with these shortcomings:

Experience Reports:

  • Federico Valeri, Drain Cleaner: What's this?, Sep 24, 2021, description of the use case and implementation of drain cleaner
  • Tommer Amber, Solution!! Avoid Kubernetes/Openshift Node Drain Failure due to active PodDisruptionBudget, Apr 30, 2022 - the user is unhappy about the manual intervention required to perform node maintenance and gives the unfortunate advice to cluster admins to simply override the PDBs. This can have negative consequences for user applications, including data loss, and it also discourages the use of PDBs. We have also seen interest in issue kubernetes/kubernetes#83307 in overriding evictions, which led to the addition of the --disable-eviction flag to kubectl drain. There are other examples of this approach on the web.
  • Kevin Reeuwijk, How to handle blocking PodDisruptionBudgets on K8s with distributed storage, June 6, 2022 - a simple shell script example of how to drain a node in a safer way. It performs a normal eviction, then looks for a pet application (Rook-Ceph in this case) and does a hard delete if it does not see it. This approach is not plagued by the loss of data resiliency, but it does require maintaining a list of pet applications, which can be prone to mistakes. In the end, the cluster admin has to do the job of the application maintainer.
  • Artur Rodrigues, Impossible Kubernetes node drains, 30 Mar, 2023 - discusses the problem with node drains and offers a workaround of restarting the application without the application owner's consent, while acknowledging that this may be problematic without knowledge of the application.
  • Jack Roper, How to Delete Pods from a Kubernetes Node with Examples, 05 Jul, 2023 - also discusses the problem of blocking PDBs and offers several workarounds. Similar to the others, it offers force deletion, but also a less destructive method of scaling up the application. However, this interferes with application deployment and has to be supported by the application.

To sum up: some projects solve this by introducing validating admission webhooks. This has a couple of disadvantages. The webhooks are not easily discoverable by cluster admins, and they can block evictions of other applications if they are misconfigured or misbehave. The eviction API is not intended to be extensible in this way, so the webhook approach is not recommended.

As seen in the experience reports and GitHub issues, some admins solve their problems by simply ignoring PDBs, which can cause unnecessary disruptions or data loss. Others work around the problem by manipulating the application's deployment, but they first have to know that the application supports this.

Goals

  • Kubectl drain should not evict and disrupt applications with an evacuation capability; instead it should politely ask them, by creating a NodeMaintenance object, to migrate their pods to another node or to remove them.
  • Introduce a node maintenance controller that will help controllers like the deployment controller to migrate their pods.
  • Deployment controller should use .spec.strategy.rollingUpdate.maxSurge to evacuate its pods from a node that is under maintenance.

Non-Goals

  • The PDB controller should detect and account for applications with evacuation capability when calculating PDB status.
  • Introduce a field that could include non-critical daemon set pods (i.e. those without the system-cluster-critical or system-node-critical priority) in a node maintenance/drain request. The daemon set controller would then gracefully shut down these pods. Critical pods could be overridden by the priority list mentioned below.
  • NodeMaintenance could include a plan of which pods to target first. Similar to graceful node shutdown, we could include a list of priorities to decide which pods should be terminated first. This list could optionally include pod timeouts, but could also wait for all the pods of a given priority class to finish without a timeout. This could also be used to target daemon set pods of certain priorities (see the point above). We could also introduce drain profiles based on these lists. The cluster admin could then choose or create such a profile based on their needs. The logic for processing the decision list would be contained in the node maintenance controller, which would set an intent on selected pods to shut down via the EvacuationRequest condition.
  • Introduce a node maintenance period, nodeDrainTimeout (similar to cluster-api nodeDrainTimeout) or a TTL optional field as an upper bound on the duration of node maintenance. Then the node maintenance would be garbage collected and the node made schedulable again.

Proposal

Most of these issues stem from the lack of a standardized way of detecting the start of a node drain. This KEP proposes the introduction of a NodeMaintenance object that would signal an intent to gracefully remove pods from given nodes. The application pods should then signal back that they are being removed or migrated from the node. The implementation should also utilize the node's existing .spec.unschedulable field, which prevents new pods from being scheduled on such a node.

We will focus primarily on kubectl drain as a consumer of the NodeMaintenance API, but it can also be used by other drain implementations (e.g. node autoscalers) or manually. We will first introduce the API and then later modify the behavior of the Kubernetes system to fix all the node drain issues we mentioned earlier.

To support workload migration, a new controller should be introduced to observe NodeMaintenance objects and then mark pods for migration or removal with a condition. The pods would at first be selected according to the node (nodeSelector), but the selection mechanism can be extended later. Controllers can then implement the migration. The advantage of this approach is that controllers do not have to be aware of the NodeMaintenance object (no RBAC changes required); they only have to observe the pods they own and react by migrating them. The first candidate is the deployment controller, since its workloads support surging to another node, which is the safest way to migrate. This would help to eliminate downtime not only for single-replica applications, but for HA applications as well.

User Stories (Optional)

Story 1

As a cluster admin, I want to have a simple interface to initiate a node drain/maintenance without any required manual interventions. I want the ability to manually switch between the maintenance phases (Planning, Cordon, Drain, Drain Complete, Maintenance Complete). I also want to observe the node drain via the API, check on its progress, and be able to discover workloads that are blocking the node drain.

Story 2

As an application owner, I want to run single replica applications without disruptions and have the ability to easily migrate the workload pods from one node to another.

Story 3

As a cluster or node autoscaler that takes on the role of kubectl drain, I want to signal the intent to drain a node using the same API and provide an experience similar to the CLI counterpart.

Story 4

I want to be able to use a similar approach for general descheduling of pods that happens outside of node maintenance.

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

Design Details

Kubectl

The kubectl drain command will be changed to create a NodeMaintenance object instead of marking the node unschedulable. We will also change the implementation to skip applications that support workload migration. This will be detected by observing an EvacuationRequest condition on the pod and the subsequent appearance of an EvacuationInitiated condition within a reasonable timeframe (3 minutes). At first, only deployments with a positive .spec.strategy.rollingUpdate.maxSurge value are expected to respond to this request. If the cluster does not support the NodeMaintenance API, kubectl will perform the node drain in a backwards compatible way.
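For illustration, a minimal sketch of this skip decision (the helper name and plumbing are hypothetical, not the final kubectl implementation; requestedAt would be taken from the EvacuationRequest condition's LastTransitionTime):

import (
    "time"

    v1 "k8s.io/api/core/v1"
)

// shouldSkipEvacuatingPod is a hypothetical helper: kubectl drain would skip a
// pod whose owning controller has acknowledged the evacuation request by
// setting EvacuationInitiated=True, and fall back to eviction if no
// acknowledgement appears within the timeout.
func shouldSkipEvacuatingPod(pod *v1.Pod, requestedAt, now time.Time) bool {
    const acceptanceTimeout = 3 * time.Minute
    for _, c := range pod.Status.Conditions {
        if c.Type == "EvacuationInitiated" && c.Status == v1.ConditionTrue {
            return true // the owning controller is handling the pod
        }
    }
    // No acknowledgement yet; keep waiting until the timeout expires,
    // after which the regular eviction/deletion path takes over.
    return now.Sub(requestedAt) < acceptanceTimeout
}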

kubectl cordon and kubectl uncordon commands will be enhanced with a warning that is shown when making a node un/schedulable collides with an existing NodeMaintenance object. As a consequence, the node maintenance controller will reconcile the node back to the old value.

NodeMaintenance API

NodeMaintenance objects serve as an intent to remove or migrate pods from a set of nodes. We will include Cordon and Drain toggles to support the following phases of the maintenance:

  1. Planning: this is to let the users know that maintenance will be performed on a particular set of nodes in the future. Configured with .spec.cordon=false and .spec.drain=false.
  2. Cordon: stop accepting (scheduling) new pods. Configured with .spec.cordon=true and .spec.drain=false.
  3. Drain: gives an intent to drain all selected nodes by setting an EvacuationRequest condition with Reason="NodeMaintenance" on the node's pods. Configured with .spec.cordon=true and .spec.drain=true.
  4. Drain Complete: all targeted pods have been drained from all the selected nodes. The nodes can be upgraded, restarted, or shut down. The configuration is still kept at .spec.cordon=true and .spec.drain=true.
  5. Maintenance Complete: make the nodes schedulable again once the node maintenance is done. Set .spec.cordon=false and .spec.drain=false back again.
type NodeMaintenance struct {
    ...
    Spec NodeMaintenanceSpec
    Status NodeMaintenanceStatus
}

type NodeMaintenanceSpec struct {
    // +required
    NodeSelector *v1.NodeSelector
    // When set to true, cordons all selected nodes by making them unschedulable.
    Cordon  bool
    // When set to true, gives an intent to drain all selected nodes by setting
    // an EvacuationRequest condition on the node's pods.
    //
    // Drain cannot be set to true, unless Cordon is also set to true.
    Drain bool
    Reason string
}

type NodeMaintenanceStatus struct {
    // Mapping of a node name to the maintenance status.
    // +optional
    Nodes map[string]NodeMaintenanceNodeStatus
    Conditions []metav1.Condition
}

type NodeMaintenanceNodeStatus struct {
    // Number of pods this node maintenance is requesting to terminate on this node.
    PodsPendingEvacuation int32
    // Number of pods that have accepted the EvacuationRequest by reporting the EvacuationInitiated
    // pod condition and are therefore actively being evacuated or terminated.
    PodsEvacuating int32
}

const (
    // DrainedCondition is a condition set by the node-maintenance controller that signals
    // whether all pods pending termination have terminated on all target nodes when drain is
    // requested by the maintenance object.
    DrainedCondition = "Drained"
)
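For illustration, a NodeMaintenance object for the Drain phase of a single node might look as follows, using the types proposed above (the node name, hostname label value, reason, and the standard object metadata are examples only):

maintenance := NodeMaintenance{
    ObjectMeta: metav1.ObjectMeta{Name: "worker-1-os-upgrade"},
    Spec: NodeMaintenanceSpec{
        // Select a single node by its well-known hostname label.
        NodeSelector: &v1.NodeSelector{
            NodeSelectorTerms: []v1.NodeSelectorTerm{{
                MatchExpressions: []v1.NodeSelectorRequirement{{
                    Key:      "kubernetes.io/hostname",
                    Operator: v1.NodeSelectorOpIn,
                    Values:   []string{"worker-1"},
                }},
            }},
        },
        Cordon: true, // Drain requires Cordon to also be true.
        Drain:  true,
        Reason: "OS security update",
    },
}

Setting Cordon and Drain back to false corresponds to the Maintenance Complete phase described above.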

Pod API

We will introduce two new condition types:

  1. EvacuationRequest condition should be set by a node maintenance controller on the pod to signal a request to evacuate the pod from the node. A reason should be given to identify the requester, in our case EvacuationByNodeMaintenance (similar to how DisruptionTarget condition behaves). The requester has the ability to withdraw the request by removing the condition or setting the condition status to False. Other controllers can also use this condition to request evacuation. For example, a descheduler could set this condition to True and give a EvacuationByDescheduler reason. Such a controller should not overwrite an existing request and should wait for either the pod deletion or removal of the evacuation request. The owning controller of the pod should observe the pod's conditions and respond to the EvacuationRequest by accepting it and setting an EvacuationInitiated condition to True in the pod conditions.
  2. EvacuationInitiated condition should be set by the owning controller to signal that work is being done to either remove or evacuate/migrate the pod to another node. The draining process/controller should wait a reasonable amount of time (3 minutes) to observe the appearance of the condition or change of the condition status to True. The draining process should then skip such a pod and leave its management to the owning controller. If EvacuationInitiated condition does not appear after 3 minutes, the draining process will begin evicting or deleting the pod. If the owning controller is unable to remove or migrate the pod, it should set the EvacuationInitiated condition status back to False to give the eviction a chance to start.
type PodConditionType string

const (
    ...
    EvacuationRequest PodConditionType = "EvacuationRequest"
    EvacuationInitiated PodConditionType = "EvacuationInitiated"
)

const (
    ...
    PodReasonNodeMaintenance = "NodeMaintenance"
)
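A rough sketch of how an owning controller might acknowledge the request, assuming the condition types above (the function and the EvacuationAccepted reason are illustrative only):

import (
    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// acknowledgeEvacuation sets EvacuationInitiated=True on a pod that carries an
// EvacuationRequest condition. It returns true if the pod status was modified;
// the caller would persist the change via the pod's status subresource.
func acknowledgeEvacuation(pod *v1.Pod) bool {
    requested := false
    for _, c := range pod.Status.Conditions {
        if c.Type == "EvacuationRequest" && c.Status == v1.ConditionTrue {
            requested = true
            break
        }
    }
    if !requested {
        return false
    }
    for i, c := range pod.Status.Conditions {
        if c.Type == "EvacuationInitiated" {
            if c.Status == v1.ConditionTrue {
                return false // already acknowledged
            }
            pod.Status.Conditions[i].Status = v1.ConditionTrue
            pod.Status.Conditions[i].LastTransitionTime = metav1.Now()
            return true
        }
    }
    pod.Status.Conditions = append(pod.Status.Conditions, v1.PodCondition{
        Type:               "EvacuationInitiated",
        Status:             v1.ConditionTrue,
        Reason:             "EvacuationAccepted", // illustrative reason
        LastTransitionTime: metav1.Now(),
    })
    return true
}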

NodeMaintenance Controller

Node maintenance controller will be introduced and added to kube-controller-manager. It will observe NodeMaintenance objects and have the following two main features:

Cordon

When a true value is detected in .spec.cordon of the NodeMaintenance object, the controller will set .spec.unschedulable to true on all nodes that satisfy .spec.nodeSelector. On the other hand, if a false value is detected or the NodeMaintenance object is removed, the controller will set .spec.unschedulable back to false.
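A minimal sketch of this reconciliation, assuming a client-go client; the node selector evaluation and conflict/error handling are omitted, and the helper name is hypothetical:

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// reconcileCordon brings a node's .spec.unschedulable in line with the cordon
// toggle of the matching NodeMaintenance. desiredUnschedulable is true while a
// matching NodeMaintenance has .spec.cordon=true, and false once the toggle is
// cleared or the object is deleted.
func reconcileCordon(ctx context.Context, client kubernetes.Interface, nodeName string, desiredUnschedulable bool) error {
    node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
    if err != nil {
        return err
    }
    if node.Spec.Unschedulable == desiredUnschedulable {
        return nil // already reconciled
    }
    node.Spec.Unschedulable = desiredUnschedulable
    _, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
    return err
}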

Drain

Prerequisite for Drain is a complete Cordon. This is also enforced on the API level.

When a true value is detected on .spec.drain of the NodeMaintenance object, the EvacuationRequest condition is set on selected pods. The condition should have Reason="NodeMaintenance" and message equal to .spec.reason of the NodeMaintenance object. The pods would be selected according to the node (.spec.nodeSelector) and a subset of the default kubectl drain filters.

Used drain filters:

  • daemonSetFilter: skips daemon sets to keep critical workloads alive.
  • mirrorPodFilter: skips static mirror pods.

Omitted drain filters:

  • skipDeletedFilter: updating the condition of already terminating pods should have no downside and will be informative for the user.
  • unreplicatedFilter: actors who own pods without a controller owner reference should have the opportunity to evacuate their pods. It is a noop if the owner does not respond.
  • localStorageFilter: we can leave the responsibility of deciding whether to evacuate a pod with local storage (EmptyDir volumes) to the owning workload. For example, the controller of a deployment that has .spec.strategy.rollingUpdate.maxSurge defined assumes that it is safe to remove the pod and its EmptyDir volume.

The selection process can be later enhanced to target daemon set pods according to the priority or pod type.

Controllers that own these marked pods will observe them and start a removal or migration from the nodes upon detecting the EvacuationRequest condition. They will also indicate this by setting the EvacuationInitiated condition on the pod.

The node maintenance controller would also remove the EvacuationRequest condition from the targeted pods if the NodeMaintenance object is removed prematurely or if .spec.drain is set back to false. The condition will only be removed if the reason of the condition is NodeMaintenance. If the reason has a different value, then the condition is owned by another controller (e.g. a descheduler) and we should keep it.
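The reason-based ownership check might look roughly like this (a sketch only; the helper name is hypothetical and the real controller would persist the change through the pod's status subresource):

import v1 "k8s.io/api/core/v1"

// clearNodeMaintenanceEvacuationRequest removes an EvacuationRequest condition
// only if it was set by the node maintenance controller (Reason ==
// "NodeMaintenance"). Requests owned by other requesters, such as a
// descheduler, are left untouched.
func clearNodeMaintenanceEvacuationRequest(pod *v1.Pod) bool {
    for i, c := range pod.Status.Conditions {
        if c.Type != "EvacuationRequest" {
            continue
        }
        if c.Reason != "NodeMaintenance" {
            return false // owned by another requester; keep the condition
        }
        pod.Status.Conditions = append(pod.Status.Conditions[:i], pod.Status.Conditions[i+1:]...)
        return true
    }
    return false
}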

The controller can show progress by reconciling the following (a sketch of this status computation follows the list):

  • .status.nodes["worker-1"].PodsPendingEvacuation, to show how many pods remain to be removed from the node "worker-1".
  • .status.nodes["worker-1"].PodsEvacuating, to show how many pods have been accepted for the evacuation from the node "worker-1". These are the pods that have the EvacuationInitiated condition set to True.
  • To keep track of the entire maintenance the controller will reconcile a Drained condition and set it to true if all pods pending evacuation/termination have terminated on all target nodes when drain is requested by the maintenance object.
  • NodeMaintenance condition or annotation can be set on the node object to advertise the current phase of the maintenance.
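A sketch of the per-node status computation, assuming the proposed NodeMaintenanceNodeStatus type (the helper name is hypothetical):

import v1 "k8s.io/api/core/v1"

// aggregateNodeStatus computes the per-node counters for one selected node.
// pods are the pods on that node that the maintenance targets and that have
// not yet terminated.
func aggregateNodeStatus(pods []*v1.Pod) NodeMaintenanceNodeStatus {
    status := NodeMaintenanceNodeStatus{}
    for _, pod := range pods {
        status.PodsPendingEvacuation++
        for _, c := range pod.Status.Conditions {
            if c.Type == "EvacuationInitiated" && c.Status == v1.ConditionTrue {
                status.PodsEvacuating++
                break
            }
        }
    }
    // The Drained condition would be set to True once PodsPendingEvacuation
    // reaches zero on every selected node while .spec.drain is true.
    return status
}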

Deployment and ReplicaSet Controllers

The replica set controller will watch its pods and count the number of pods it observes with an EvacuationRequest condition. It will then store this count in .status.ReplicasToEvacuate.

The deployment controller will watch its ReplicaSets and react when it observes a positive number of pods in .status.ReplicasToEvacuate. If the owning object of the targeted pods is a Deployment with a positive .spec.strategy.rollingUpdate.maxSurge value, the controller will create surge pods by scaling up the ReplicaSet. The new pods will not be scheduled on the maintained node because the .spec.unschedulable field is set to true on that node. As soon as the surge pods become available, the deployment controller will scale down the ReplicaSet. The replica set controller will then in turn delete the pods with the EvacuationRequest condition.

For completeness, the deployment controller will also track the total number of targeted pods of all its ReplicaSets under its .status.ReplicasToEvacuate.

If the node maintenance prematurely ends before the surge process has a chance to complete, the deployment controller will scale down the ReplicaSet which will then remove the extra pods that were created during the surge.

type ReplicaSetStatus struct {
    ...
    ReplicasToEvacuate int32
    ...
}
type DeploymentStatus struct {
    ...
    ReplicasToEvacuate int32
    ...
}
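A sketch of how the replica set controller might derive this count during its sync (the helper is for illustration only; the result would be written to the proposed .status.ReplicasToEvacuate field):

import v1 "k8s.io/api/core/v1"

// countReplicasToEvacuate counts the pods owned by a ReplicaSet that currently
// carry an EvacuationRequest condition with status True.
func countReplicasToEvacuate(pods []*v1.Pod) int32 {
    var count int32
    for _, pod := range pods {
        for _, c := range pod.Status.Conditions {
            if c.Type == "EvacuationRequest" && c.Status == v1.ConditionTrue {
                count++
                break
            }
        }
    }
    return count
}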

To provide a response to the drain process that the evacuation has begun, the deployment controller will annotate all replica sets that support the evacuation with the annotation deployment.kubernetes.io/evacuation-ready. For now, this will apply to replica sets of deployments with .spec.strategy.rollingUpdate.maxSurge. When this annotation is present, the replica set controller will respond to all evacuation requests by setting the EvacuationInitiated condition on all of its pods that have the EvacuationRequest condition.

Node Drain Process: Before and After

The following diagrams describe how the node drain process will change with respect to each component.

Current state of node drain:

[diagram: drain-before]

Proposed node drain:

[diagram: drain-after]

Test Plan

[ ] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates
Unit tests
  • <package>: <date> - <test coverage>
Integration tests
  • :
e2e tests
  • :

Graduation Criteria

Upgrade / Downgrade Strategy

Version Skew Strategy

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate
    • Feature gate name: DeclarativeNodeMaintenance - this feature gate enables the NodeMaintenance API and the node maintenance controller, which sets the EvacuationRequest condition on pods
    • Components depending on the feature gate: kube-apiserver, kube-controller-manager
    • Feature gate name: NodeMaintenanceDeployment - this feature gate enables pod surging in the deployment controller when the EvacuationRequest condition appears.
    • Components depending on the feature gate: kube-apiserver, kube-controller-manager
  • Other
    • Describe the mechanism: changes to kubectl drain, kubectl cordon and kubectl uncordon will be behind an alpha env variable called KUBECTL_ENABLE_DECLARATIVE_NODE_MAINTENANCE
    • Will enabling / disabling the feature require downtime of the control plane? No
    • Will enabling / disabling the feature require downtime or reprovisioning of a node? No
Does enabling the feature change any default behavior?
Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
What happens if we reenable the feature if it was previously rolled back?
Are there any tests for feature enablement/disablement?

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?
What specific metrics should inform a rollback?
Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?
How can someone using this feature know that it is working for their instance?
  • Events
    • Event Reason:
  • API .status
    • Condition name:
    • Other field:
  • Other (treat as last resort)
    • Details:
What are the reasonable SLOs (Service Level Objectives) for the enhancement?
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name:
    • [Optional] Aggregation method:
    • Components exposing the metric:
  • Other (treat as last resort)
    • Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?

Dependencies

Does this feature depend on any specific services running in the cluster?

Scalability

Will enabling / using this feature result in any new API calls?
Will enabling / using this feature result in introducing new API types?
Will enabling / using this feature result in any new calls to the cloud provider?
Will enabling / using this feature result in increasing size or count of the existing API objects?
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?
What are other known failure modes?
What steps should be taken if SLOs are not being met to determine the problem?

Implementation History

Drawbacks

Future Improvements

Disruption Controller: Eviction Protection and Observability

With the .spec.MinAvailable and .spec.MaxUnavailable options, a PDB could unfortunately allow surge pods to be evicted, because .status.CurrentHealthy could become higher than .status.desiredHealthy. This could disrupt an ongoing migration of an application. To support mission-critical applications, we might do one of the following:

  • Consider pods with an EvacuationInitiated condition as disrupted and thus decreasing .status.CurrentHealthy.
  • Or increase the .status.desiredHealthy by the number of pods with the EvacuationInitiated condition. This would keep the .status.DisruptionsAllowed value low enough not to disrupt the migration.

We can also count the evacuating pods in the status for observability, as this feature changes the behavior of the PodDisruptionBudget status.

type PodDisruptionBudgetStatus struct {
    ...
    // total number of pods that are evacuating and expected to be terminated
    EvacuatingPods int32
    ...
}
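Following the second option above, the disruption controller could account for evacuating pods roughly like this (a sketch only; the helper name is hypothetical):

import v1 "k8s.io/api/core/v1"

// adjustForEvacuations counts pods that have accepted an evacuation
// (EvacuationInitiated=True) and raises the desired healthy count by that
// number, which keeps DisruptionsAllowed low enough not to disrupt an ongoing
// migration. The evacuating count would also be published as the proposed
// .status.EvacuatingPods field.
func adjustForEvacuations(pods []*v1.Pod, desiredHealthy int32) (evacuating, adjustedDesiredHealthy int32) {
    for _, pod := range pods {
        for _, c := range pod.Status.Conditions {
            if c.Type == "EvacuationInitiated" && c.Status == v1.ConditionTrue {
                evacuating++
                break
            }
        }
    }
    return evacuating, desiredHealthy + evacuating
}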

Alternatives

Out-of-tree Implementation

We could implement the NodeMaintenance API out-of-tree first, as a CRD with a node maintenance controller.

One of the problems is that it would be difficult to get real-world adoption, and thus important feedback, on this feature. This is mainly because the feature has to be implemented in and integrated with multiple components to observe the benefits, and those components are both admin facing and application developer facing.

There is a Node Maintenance Operator project that provides a similar API and has some adoption. However, it is not at a level where applications could depend on this API being present in the cluster, so it does not make much sense to implement the application migration logic against it, as it cannot be applied everywhere. As shown in the motivation part of this KEP, there is a big appetite for a unified and stable API that everyone could use to implement the new capabilities.

Use a Node Object Instead of Introducing a New NodeMaintenance API

As an alternative, it would be possible to signal the node maintenance by marking the node object instead of introducing a new API. But it is probably better to decouple this from the node for extensibility reasons. As we can see, the kubectl drain logic is a bit complex, and it may be possible to move this logic to a controller in the future and make the node maintenance purely declarative.

Additional benefits of the NodeMaintenance API approach:

  • It helps to decouple RBAC permissions from the node object.
  • Two or more different actors may want to maintain the same node in two different overlapping time slots. Creating two different NodeMaintenance objects would help with tracking each maintenance together with the reason behind it.

Use Taint Based Eviction for Node Maintenance

To signal the start of the eviction, we could simply taint a node with the NoExecute taint. This taint should be easily recognizable and have a standard name, such as node.kubernetes.io/maintenance. Other actors could observe the creation of such a taint and migrate or delete their pods. To ensure pods are not removed prematurely, application owners would have to set a toleration on their pods for this maintenance taint. Such applications could also set .spec.tolerations[].tolerationSeconds, which would give a deadline for the pods to be removed by the NoExecuteTaintManager.

This approach has the following disadvantages:

  • Taints and tolerations do not support PDBs, which are the main mechanism for preventing voluntary disruptions. People who want to avoid the disruptions caused by the maintenance taint would have to specify the toleration in the pod definition and ensure it is present at all times. This would also have an impact on the controllers, which would have to pollute the pod definitions with these tolerations even though the users did not specify them in their pod template. The controllers could override users' tolerations, which the users might not be happy about. It is also hard to make such behavior consistent across all the controllers.
  • Taints are used as a mechanism for involuntary disruption; to get pods out of the node for some reason (e.g. node is not ready). Modifying the taint mechanism to be less harmful (e.g. by adding a PDB support) is not possible due to the original requirements.

Names considered for the new API

These names are considered as an alternative to NodeMaintenance:

  • NodeIsolation
  • NodeDetachment
  • NodeClearance
  • NodeQuarantine
  • NodeDisengagement
  • NodeVacation

Infrastructure Needed (Optional)