- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Future Improvements
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
This KEP proposes adding a declarative API to manage node maintenance. This API can be used to implement additional capabilities around node draining.
The goal of this KEP is to analyze and improve node maintenance in Kubernetes.
Node maintenance is a request from a cluster administrator to remove all pods from one or more nodes so that they can be disconnected from the cluster to perform a software upgrade (OS, kubelet), a hardware upgrade, or simply to remove the node because it is no longer needed.
Kubernetes has existing support for this use case in the following way with `kubectl drain`:
- There are running pods on node A, some of which are protected with PodDisruptionBudgets (PDBs).
- Set the node `Unschedulable` (cordon) to prevent new pods from being scheduled there.
- Evict (default behavior) pods from node A by using the eviction API (see the kubectl drain workflow).
- Proceed with the maintenance and shut down the node.
- Kubelet can try to delay the shutdown to allow the remaining pods to terminate gracefully (graceful-node-shutdown). The Kubelet also takes pod priority into account (pod-priority-graceful-node-shutdown).
The main problem is that the current approach is application agnostic and simply tries to remove all of the pods from the node. Since this approach cannot be applied generically to all pods, the Kubernetes project has defined special drain filters that either skip groups of pods or require the admin to consent to those groups being either skipped or deleted. This means that, without knowledge of all the underlying applications on the cluster, the admin has to make a potentially harmful decision.
From an application owner or developer perspective, the only standard tool they have is a PodDisruptionBudget (PDB). This is sufficient in a basic scenario with a simple multi-replica application. The edge-case applications where this does not work are very important to the cluster admin, as they can block the node drain, and in turn very important to the application owner, as the admin can then override the pod disruption budget and disrupt their sensitive application anyway.
List of cases where the current solution is not optimal:
- Without extra manual effort, an application running with a single replica has to settle for experiencing application downtime during the node drain. It cannot use PDBs with `minAvailable: 1` or `maxUnavailable: 0`, or it will block node maintenance. Not every user needs high availability either, due to a preference for a simpler deployment model, lack of application support for HA, or to minimize compute costs. Also, any automated solution needs to edit the PDB to account for the additional pod that needs to be spun up to move the workload from one node to another. This has been discussed in issue kubernetes/kubernetes#66811 and in issue kubernetes/kubernetes#114877.
- Similar to the first point, it is difficult to use PDBs for applications that can have a variable number of pods; for example, applications with a configured horizontal pod autoscaler (HPA). These applications cannot be disrupted during low load when they have only one pod. However, it is possible to disrupt the pods during high load without experiencing application downtime. If the minimum number of pods is 1, PDBs cannot be used without blocking the node drain. This has been discussed in issue kubernetes/kubernetes#93476.
- Graceful deletion of DaemonSet pods is currently only supported as part of (Linux) graceful node shutdown. The length of the shutdown is again not application specific and is set cluster-wide (optionally by priority) by the cluster admin. This does not take into account the `.spec.terminationGracePeriodSeconds` of each pod and may cause premature termination of the application. This has been discussed in issue kubernetes/kubernetes#75482 and in issue kubernetes-sigs/cluster-api#6158.
- There are cases during a node shutdown when data corruption can occur due to premature node shutdown. It would be great if applications could perform data migration and synchronization of cached writes to the underlying storage before the pod deletion occurs. This is not easy to quantify even with a pod's `.spec.shutdownGracePeriod`, as the time depends on the size of the data and the speed of the storage. This has been discussed in issue kubernetes/kubernetes#116618 and in issue kubernetes/kubernetes#115148.
- There is not enough metadata about why the node drain was requested. This has been discussed in issue kubernetes/kubernetes#30586.
Approaches and workarounds used by other projects to deal with these shortcomings:
- https://github.com/medik8s/node-maintenance-operator uses a declarative approach that tries to mimic `kubectl drain` (and uses the kubectl implementation under the hood).
- https://github.com/kubereboot/kured performs automatic node reboots and relies on the `kubectl drain` implementation to achieve that.
- https://github.com/strimzi/drain-cleaner prevents Kafka or ZooKeeper pods from being drained until they are fully synchronized. It is implemented by intercepting eviction requests with a validating admission webhook. The synchronization is also protected by a PDB with the `.spec.maxUnavailable` field set to 0. See the experience reports for more information.
- https://github.com/kubevirt/kubevirt intercepts eviction requests with a validating admission webhook to block eviction and to start a virtual machine live migration from one node to another. Normally, the workload is also guarded by a PDB with the `.spec.minAvailable` field set to 1. During the migration the value is increased to 2.
Experience Reports:
- Federico Valeri, Drain Cleaner: What's this?, Sep 24, 2021, description of the use case and implementation of drain cleaner
- Tommer Amber, Solution!! Avoid Kubernetes/Openshift Node Drain Failure due to active PodDisruptionBudget, Apr 30, 2022 - the user is unhappy about the manual intervention required to perform node maintenance and gives the unfortunate advice to cluster admins to simply override the PDBs. This can have negative consequences for user applications, including data loss, and it also discourages the use of PDBs. We have also seen interest in issue kubernetes/kubernetes#83307 in overriding evictions, which led to the addition of the `--disable-eviction` flag to `kubectl drain`. There are other examples of this approach on the web.
- Kevin Reeuwijk, How to handle blocking PodDisruptionBudgets on K8s with distributed storage, June 6, 2022 - a simple shell script example of how to drain a node in a safer way. It does a normal eviction, then looks for a pet application (Rook-Ceph in this case) and does a hard delete if it does not see it. This approach is not plagued by the loss of data resiliency, but it does require maintaining a list of pet applications, which can be prone to mistakes. In the end, the cluster admin has to do the job of the application maintainer.
- Artur Rodrigues, Impossible Kubernetes node drains, 30 Mar, 2023 - discusses the problem with node drains and offers a workaround of restarting the application without the application owner's consent, but acknowledges that this may be problematic without knowledge of the application.
- Jack Roper, How to Delete Pods from a Kubernetes Node with Examples, 05 Jul, 2023 - also discusses the problem of blocking PDBs and offers several workarounds. Similar to the others, it suggests force deletion, but also a less destructive method of scaling up the application. However, this also interferes with the application deployment and has to be supported by the application.
To sum up: some projects solve this by introducing validating admission webhooks. This has a couple of disadvantages. The webhooks are not easily discoverable by cluster admins, and they can block evictions of other applications if they are misconfigured or misbehave. The eviction API is not intended to be extensible in this way, so the webhook approach is not recommended.
As seen in the experience reports and GitHub issues, some admins solve their problems by simply ignoring PDBs, which can cause unnecessary disruptions or data loss. Others solve this by manipulating the application deployment, but they first have to make sure that the application supports this.
- Kubectl drain should not evict and disrupt applications with evacuation capability; it should instead politely ask them to migrate their pods to another node or to remove them by creating a NodeMaintenance object.
- Introduce a node maintenance controller that will help controllers like deployment controller to migrate their pods.
- Deployment controller should use `.spec.strategy.rollingUpdate.maxSurge` to evacuate its pods from a node that is under maintenance.
- The PDB controller should detect and account for applications with evacuation capability when calculating PDB status.
- Introduce a field that could include non-critical daemon set pods (those without the `system-cluster-critical` or `system-node-critical` priority) in the node maintenance/drain request. The daemon set controller would then gracefully shut down these pods. Critical pods could be overridden by the priority list mentioned below.
- NodeMaintenance could include a plan of which pods to target first. Similar to graceful node shutdown, we could include a list of priorities to decide which pods should be terminated first. This list could optionally include pod timeouts, but could also wait for all the pods of a given priority class to finish first without a timeout. This could also be used to target daemon set pods of certain priorities (see the point above). We could also introduce drain profiles based on these lists. The cluster admin could then choose or create such a profile based on his/her needs. The logic for processing the decision list would be contained in the node maintenance controller, which would set an intent on selected pods to shut down via the EvacuationRequest condition.
- Introduce a node maintenance period, a nodeDrainTimeout (similar to cluster-api's nodeDrainTimeout), or an optional TTL field as an upper bound on the duration of node maintenance. After it elapses, the node maintenance would be garbage collected and the node made schedulable again.
Most of these issues stem from the lack of a standardized way of detecting the start of a node drain.
This KEP proposes the introduction of a NodeMaintenance object that would signal an intent to gracefully remove pods from given nodes. The application pods should then signal back that they are being removed or migrated from the node. The implementation should also utilize the node's existing `.spec.unschedulable` field, which prevents new pods from being scheduled on such a node.
We will focus primarily on `kubectl drain` as a consumer of the NodeMaintenance API, but it can also be used by other drain implementations (e.g. node autoscalers) or manually. We will first introduce the API and then later modify the behavior of the Kubernetes system to fix all the node drain issues we mentioned earlier.
To support workload migration, a new controller should be introduced to observe the NodeMaintenance
objects and then mark pods for migration or removal with a condition. The pods would be selected
according to the node at first (`nodeSelector`), but the selection mechanism can be extended later.
Controllers can then implement the migration. The advantage of this approach is that controllers do
not have to be aware of the NodeMaintenance object (no RBAC changes required). They only have to
observe pods they own and react by migrating them. The first candidate is a deployment controller,
since its workloads support surging to another node, which is the safest way to migrate. This would
help to eliminate downtime not only for single replica applications, but for HA applications as
well.
As a cluster admin, I want to have a simple interface to initiate a node drain/maintenance without any required manual intervention. I want the ability to manually switch between the maintenance phases (Planning, Cordon, Drain, Drain Complete, Maintenance Complete). I also want to observe the node drain via the API and check on its progress, and I want to be able to discover workloads that are blocking the node drain.
As an application owner, I want to run single replica applications without disruptions and have the ability to easily migrate the workload pods from one node to another.
Cluster or node autoscalers that take on the role of `kubectl drain` want to signal the intent to drain a node using the same API and provide a similar experience to the CLI counterpart.
I want to be able to use a similar approach for general descheduling of pods that happens outside of node maintenance.
The `kubectl drain` command will be changed to create a NodeMaintenance object instead of marking the node unschedulable. We will also change the implementation to skip applications that support workload migration. This will be detected by observing an `EvacuationRequest` condition on the pod and the subsequent appearance of the `EvacuationInitiated` condition within a reasonable timeframe (3 minutes). At first, only deployments with a `.spec.strategy.rollingUpdate.maxSurge` value are expected to respond to this request. If the cluster doesn't support the NodeMaintenance API, kubectl will perform the node drain in a backwards compatible way.
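For illustration, the skip decision described above could look roughly like the following sketch. It assumes the proposed `EvacuationRequest`/`EvacuationInitiated` pod condition types; the helper names are illustrative and not part of the proposal.

```go
package example

import (
	"time"

	v1 "k8s.io/api/core/v1"
)

const (
	// Proposed pod condition types from this KEP.
	evacuationRequestCondition   v1.PodConditionType = "EvacuationRequest"
	evacuationInitiatedCondition v1.PodConditionType = "EvacuationInitiated"
	// Grace period for the owning controller to respond to the request.
	evacuationGracePeriod = 3 * time.Minute
)

func podCondition(pod *v1.Pod, condType v1.PodConditionType) *v1.PodCondition {
	for i := range pod.Status.Conditions {
		if pod.Status.Conditions[i].Type == condType {
			return &pod.Status.Conditions[i]
		}
	}
	return nil
}

// shouldSkipEviction returns true when the drain process should leave the pod
// to its owning controller instead of evicting it.
func shouldSkipEviction(pod *v1.Pod, now time.Time) bool {
	request := podCondition(pod, evacuationRequestCondition)
	if request == nil || request.Status != v1.ConditionTrue {
		return false // no evacuation requested; fall back to normal eviction
	}
	initiated := podCondition(pod, evacuationInitiatedCondition)
	if initiated != nil && initiated.Status == v1.ConditionTrue {
		return true // the owning controller is already migrating or removing the pod
	}
	// Give the owning controller up to 3 minutes to react before evicting.
	return now.Sub(request.LastTransitionTime.Time) < evacuationGracePeriod
}
```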
The `kubectl cordon` and `kubectl uncordon` commands will be enhanced with a warning that is shown when making a node un/schedulable collides with an existing NodeMaintenance object. As a consequence, the node maintenance controller will reconcile the node back to the old value.
NodeMaintenance objects serve as an intent to remove or migrate pods from a set of nodes. We will include Cordon and Drain toggles to support the following phases of the maintenance:
- Planning: this is to let the users know that maintenance will be performed on a particular set of nodes in the future. Configured with `.spec.cordon=false` and `.spec.drain=false`.
- Cordon: stop accepting (scheduling) new pods. Configured with `.spec.cordon=true` and `.spec.drain=false`.
- Drain: gives an intent to drain all selected nodes by setting an `EvacuationRequest` condition with `Reason="NodeMaintenance"` on the node's pods. Configured with `.spec.cordon=true` and `.spec.drain=true`.
- Drain Complete: all targeted pods have been drained from all the selected nodes. The nodes can be upgraded, restarted, or shut down. The configuration is still kept at `.spec.cordon=true` and `.spec.drain=true`.
- Maintenance Complete: make the nodes schedulable again once the node maintenance is done. Set `.spec.cordon=false` and `.spec.drain=false` back again.
type NodeMaintenance struct {
...
Spec NodeMaintenanceSpec
Status NodeMaintenanceStatus
}
type NodeMaintenanceSpec struct {
// +required
NodeSelector *v1.NodeSelector
// When set to true, cordons all selected nodes by making them unschedulable.
Cordon bool
// When set to true, gives an intent to drain all selected nodes by setting
// an EvacuationRequest condition on the node's pods.
//
// Drain cannot be set to true, unless Cordon is also set to true.
Drain bool
Reason string
}
type NodeMaintenanceStatus struct {
// Mapping of a node name to the maintenance status.
// +optional
Nodes map[string]NodeMaintenanceNodeStatus
Conditions []metav1.Condition
}
type NodeMaintenanceNodeStatus struct {
// Number of pods this node maintenance is requesting to terminate on this node.
PodsPendingEvacuation int32
// Number of pods that have accepted the EvacuationRequest by reporting the EvacuationInitiated
// pod condition and are therefore actively being evacuated or terminated.
PodsEvacuating int32
}
const (
// DrainedCondition is a condition set by the node-maintenance controller that signals
// whether all pods pending termination have terminated on all target nodes when drain is
// requested by the maintenance object.
DrainedCondition = "Drained"
)
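For illustration only, a NodeMaintenance object expressing the Drain phase could be constructed as follows, assuming the Go types defined above are in scope. The hostname label value, the reason text, and the helper name are examples, and the client plumbing for creating the object is omitted.

```go
package example

import v1 "k8s.io/api/core/v1"

// drainWorker1 returns a hypothetical NodeMaintenance in the Drain phase that
// targets the single node "worker-1" via its hostname label.
func drainWorker1() NodeMaintenance {
	return NodeMaintenance{
		Spec: NodeMaintenanceSpec{
			NodeSelector: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					MatchExpressions: []v1.NodeSelectorRequirement{{
						Key:      "kubernetes.io/hostname",
						Operator: v1.NodeSelectorOpIn,
						Values:   []string{"worker-1"},
					}},
				}},
			},
			Cordon: true, // Drain cannot be requested without Cordon
			Drain:  true,
			Reason: "planned kernel upgrade",
		},
	}
}
```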
We will introduce two new condition types:
- `EvacuationRequest`: this condition should be set by a node maintenance controller on the pod to signal a request to evacuate the pod from the node. A reason should be given to identify the requester, in our case `EvacuationByNodeMaintenance` (similar to how the `DisruptionTarget` condition behaves). The requester has the ability to withdraw the request by removing the condition or by setting the condition status to `False`. Other controllers can also use this condition to request evacuation. For example, a descheduler could set this condition to `True` and give an `EvacuationByDescheduler` reason. Such a controller should not overwrite an existing request and should wait for either the pod deletion or the removal of the evacuation request. The owning controller of the pod should observe the pod's conditions and respond to the `EvacuationRequest` by accepting it and setting an `EvacuationInitiated` condition to `True` in the pod conditions.
- `EvacuationInitiated`: this condition should be set by the owning controller to signal that work is being done to either remove or evacuate/migrate the pod to another node. The draining process/controller should wait a reasonable amount of time (3 minutes) to observe the appearance of this condition or the change of its status to `True`. The draining process should then skip such a pod and leave its management to the owning controller. If the `EvacuationInitiated` condition does not appear after 3 minutes, the draining process will begin evicting or deleting the pod. If the owning controller is unable to remove or migrate the pod, it should set the `EvacuationInitiated` condition status back to `False` to give the eviction a chance to start.
type PodConditionType string
const (
...
EvacuationRequest PodConditionType = "EvacuationRequest"
EvacuationInitiated PodConditionType = "EvacuationInitiated"
)
const (
...
PodReasonNodeMaintenance = "NodeMaintenance"
)
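A minimal sketch of how an owning controller could accept an evacuation request, using the condition types above; the helper name and the reason/message values are illustrative, not part of the proposal.

```go
package example

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// acceptEvacuation returns an updated condition list with EvacuationInitiated=True
// when the pod carries an active EvacuationRequest, or nil if nothing has to change.
func acceptEvacuation(pod *v1.Pod) []v1.PodCondition {
	requested := false
	for _, c := range pod.Status.Conditions {
		if c.Type == "EvacuationRequest" && c.Status == v1.ConditionTrue {
			requested = true
		}
		if c.Type == "EvacuationInitiated" && c.Status == v1.ConditionTrue {
			return nil // the request has already been accepted
		}
	}
	if !requested {
		return nil
	}
	return append(pod.Status.Conditions, v1.PodCondition{
		Type:               "EvacuationInitiated",
		Status:             v1.ConditionTrue,
		Reason:             "SurgeReplacementCreated", // illustrative reason
		Message:            "a replacement pod is being surged to another node",
		LastTransitionTime: metav1.Now(),
	})
}
```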
A node maintenance controller will be introduced and added to `kube-controller-manager`. It will observe NodeMaintenance objects and have the following two main features:
When a `true` value is detected in `.spec.cordon` of the NodeMaintenance object, the controller will set `.spec.unschedulable` to `true` on all nodes that satisfy `.spec.nodeSelector`. On the other hand, if a `false` value is detected or the NodeMaintenance object is removed, the controller will set `.spec.unschedulable` back to `false`.
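The cordon reconciliation could look roughly like the sketch below. The `matchesSelector` helper and the surrounding controller plumbing (informers, work queues) are placeholders rather than a concrete implementation.

```go
package example

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reconcileCordon makes every node selected by the maintenance object match the
// desired .spec.cordon value, and reverts nodes when the maintenance is removed.
func reconcileCordon(ctx context.Context, client kubernetes.Interface, selector *v1.NodeSelector, cordon bool) error {
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for i := range nodes.Items {
		node := &nodes.Items[i]
		if !matchesSelector(selector, node) || node.Spec.Unschedulable == cordon {
			continue
		}
		node.Spec.Unschedulable = cordon
		if _, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
			return err
		}
	}
	return nil
}

// matchesSelector stands in for proper v1.NodeSelector evaluation (e.g. the
// node-affinity helpers used by the scheduler).
func matchesSelector(selector *v1.NodeSelector, node *v1.Node) bool {
	// ... omitted for brevity ...
	return true
}
```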
A prerequisite for `Drain` is a complete `Cordon`. This is also enforced on the API level.
When a `true` value is detected in `.spec.drain` of the NodeMaintenance object, the `EvacuationRequest` condition is set on the selected pods. The condition should have `Reason="NodeMaintenance"` and a message equal to `.spec.reason` of the NodeMaintenance object.
The pods would be selected according to the node (`.spec.nodeSelector`) and a subset of the default kubectl drain filters.
Used drain filters:
- `daemonSetFilter`, skips daemon sets to keep critical workloads alive.
- `mirrorPodFilter`, skips static mirror pods.
Omitted drain filters:
- `skipDeletedFilter`: updating the condition of already terminating pods should have no downside and will be informative for the user.
- `unreplicatedFilter`: actors who own pods without a controller owner reference should have the opportunity to evacuate their pods. It is a noop if the owner does not respond.
- `localStorageFilter`, we can leave the responsibility of whether to evacuate a pod with local storage (having `EmptyDir` volumes) to the owning workload. For example, a controller of a deployment that has `.spec.strategy.rollingUpdate.maxSurge` defined assumes that it is safe to remove the pod and the `EmptyDir` volume.
The selection process can be later enhanced to target daemon set pods according to the priority or pod type.
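The pod selection with the filters above could look roughly like this sketch; the filter helpers are simplified stand-ins for the kubectl drain filter logic.

```go
package example

import (
	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// selectPodsForEvacuation applies the drain filters described above: DaemonSet
// pods and static mirror pods are skipped, everything else gets the
// EvacuationRequest condition.
func selectPodsForEvacuation(podsOnNode []v1.Pod) []v1.Pod {
	var selected []v1.Pod
	for _, pod := range podsOnNode {
		if isDaemonSetPod(&pod) || isMirrorPod(&pod) {
			continue
		}
		selected = append(selected, pod)
	}
	return selected
}

func isDaemonSetPod(pod *v1.Pod) bool {
	owner := metav1.GetControllerOf(pod)
	return owner != nil && owner.Kind == "DaemonSet"
}

func isMirrorPod(pod *v1.Pod) bool {
	_, ok := pod.Annotations[v1.MirrorPodAnnotationKey]
	return ok
}
```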
Controllers that own these marked pods would observe them and start a removal or migration from the nodes upon detecting the `EvacuationRequest` condition. They will also indicate this by setting the `EvacuationInitiated` condition on the pod.
The node maintenance controller would also remove the `EvacuationRequest` condition from the targeted pods if the `NodeMaintenance` object is removed prematurely or if `.spec.drain` is set back to `false`. The condition will only be removed if the reason of the condition is `NodeMaintenance`. If the reason has a different value, then it is owned by another controller (e.g. a descheduler) and we should keep the condition.
The controller can show progress by reconciling:
- `.status.nodes["worker-1"].PodsPendingEvacuation`, to show how many pods remain to be removed from the node `"worker-1"`.
- `.status.nodes["worker-1"].PodsEvacuating`, to show how many pods have been accepted for the evacuation from the node `"worker-1"`. These are the pods that have the `EvacuationInitiated` condition set to `True`.
- To keep track of the entire maintenance, the controller will reconcile a `Drained` condition and set it to true if all pods pending evacuation/termination have terminated on all target nodes when drain is requested by the maintenance object.
- A NodeMaintenance condition or annotation can be set on the node object to advertise the current phase of the maintenance.
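Deriving the per-node counters could be as simple as the sketch below, assuming the NodeMaintenanceNodeStatus type defined earlier is in scope; the helper name is illustrative.

```go
package example

import v1 "k8s.io/api/core/v1"

// nodeStatusFor computes the per-node progress counters from the pods that are
// currently on the node.
func nodeStatusFor(podsOnNode []v1.Pod) NodeMaintenanceNodeStatus {
	var status NodeMaintenanceNodeStatus
	for _, pod := range podsOnNode {
		for _, c := range pod.Status.Conditions {
			switch {
			case c.Type == "EvacuationRequest" && c.Status == v1.ConditionTrue:
				status.PodsPendingEvacuation++
			case c.Type == "EvacuationInitiated" && c.Status == v1.ConditionTrue:
				status.PodsEvacuating++
			}
		}
	}
	return status
}
```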
The replica set controller will watch its pods and count the number of pods it observes with an `EvacuationRequest` condition. It will then store this count in `.status.ReplicasToEvacuate`.
The deployment controller will watch its ReplicaSets and react when it observes a positive number of pods in `.status.ReplicasToEvacuate`. If the owning object of the targeted pods is a Deployment with a positive `.spec.strategy.rollingUpdate.maxSurge` value, the controller will create surge pods by scaling up the ReplicaSet. The new pods will not be scheduled on the maintained node because the `.spec.unschedulable` field would be set to true on that node. As soon as the surge pods become available, the deployment controller will scale down the ReplicaSet. The replica set controller will then in turn delete the pods with the `EvacuationRequest` condition.
For completeness, the deployment controller will also track the total number of targeted pods of all its ReplicaSets under its `.status.ReplicasToEvacuate`.
If the node maintenance prematurely ends before the surge process has a chance to complete, the deployment controller will scale down the ReplicaSet which will then remove the extra pods that were created during the surge.
type ReplicaSetStatus struct {
...
ReplicasToEvacuate int32
...
}
type DeploymentStatus struct {
...
ReplicasToEvacuate int32
...
}
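As a rough illustration of the surge decision, assuming the new `ReplicasToEvacuate` status field above; the real deployment controller logic would also have to handle percentage values of maxSurge, rounding, and interactions with ongoing rollouts.

```go
package example

import appsv1 "k8s.io/api/apps/v1"

// desiredSurge returns how many replacement pods the deployment controller could
// temporarily add for a ReplicaSet whose pods are being evacuated. Zero means
// the deployment does not opt in to evacuation.
func desiredSurge(d *appsv1.Deployment, replicasToEvacuate int32) int32 {
	ru := d.Spec.Strategy.RollingUpdate
	if d.Spec.Strategy.Type != appsv1.RollingUpdateDeploymentStrategyType || ru == nil || ru.MaxSurge == nil {
		return 0
	}
	// Simplification: treat maxSurge as an absolute number and ignore percentages.
	maxSurge := int32(ru.MaxSurge.IntValue())
	if maxSurge <= 0 || replicasToEvacuate <= 0 {
		return 0
	}
	if replicasToEvacuate < maxSurge {
		return replicasToEvacuate
	}
	return maxSurge
}
```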
To provide a response to the drain process that the evacuation has begun, the deployment controller will annotate all replica sets that support the evacuation with the `deployment.kubernetes.io/evacuation-ready` annotation. For now, this will apply to replica sets of deployments with `.spec.strategy.rollingUpdate.maxSurge`. When this annotation is present, the replica set controller will respond to all evacuation requests by setting the `EvacuationInitiated` condition on all of its pods that have the `EvacuationRequest` condition.
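A sketch of the replica set side, reusing the `acceptEvacuation` helper sketched earlier; the annotation key comes from this proposal, while the function name and the persistence step are illustrative.

```go
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	v1 "k8s.io/api/core/v1"
)

// Proposed annotation set by the deployment controller on replica sets that
// support evacuation.
const evacuationReadyAnnotation = "deployment.kubernetes.io/evacuation-ready"

// respondToEvacuations sets EvacuationInitiated on pods with an EvacuationRequest,
// but only when the owning Deployment has opted the ReplicaSet in.
func respondToEvacuations(rs *appsv1.ReplicaSet, pods []*v1.Pod) {
	if _, ok := rs.Annotations[evacuationReadyAnnotation]; !ok {
		return // the owning Deployment has not opted in
	}
	for _, pod := range pods {
		if updated := acceptEvacuation(pod); updated != nil {
			pod.Status.Conditions = updated
			// ... persist the pod status update via the API server ...
		}
	}
}
```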
The following diagrams describe how the node drain process will change with respect to each component.
Current state of node drain:
Proposed node drain:
[ ] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
- `<package>`: `<date>` - `<test coverage>`
- :
- :
- Feature gate
  - Feature gate name: DeclarativeNodeMaintenance - this feature gate enables the NodeMaintenance API and the node maintenance controller, which sets the `EvacuationRequest` condition on pods
    - Components depending on the feature gate: kube-apiserver, kube-controller-manager
  - Feature gate name: NodeMaintenanceDeployment - this feature gate enables pod surging in the deployment controller when the `EvacuationRequest` condition appears
    - Components depending on the feature gate: kube-apiserver, kube-controller-manager
- Other
  - Describe the mechanism: changes to kubectl drain, kubectl cordon and kubectl uncordon will be behind an alpha env variable called `KUBECTL_ENABLE_DECLARATIVE_NODE_MAINTENANCE`
- Will enabling / disabling the feature require downtime of the control plane? No
- Will enabling / disabling the feature require downtime or reprovisioning of a node? No
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
- Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details:
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- Other (treat as last resort)
- Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
With the `.spec.minAvailable` and `.spec.maxUnavailable` options, a PDB could unfortunately allow surge pods to be evicted because `.status.currentHealthy` could become higher than `.status.desiredHealthy`. This could disrupt an ongoing migration of an application. To support mission-critical applications, we might do one of the following:
- Consider pods with an `EvacuationInitiated` condition as disrupted, thus decreasing `.status.currentHealthy`.
- Or increase `.status.desiredHealthy` by the number of pods with the `EvacuationInitiated` condition. This would keep the `.status.disruptionsAllowed` value low enough not to disrupt the migration.
We can also count the evacuating pods in the status for observability, as this feature changes the behavior of the PodDisruptionBudget status.
type PodDisruptionBudgetStatus struct {
...
// total number of pods that are evacuating and expected to be terminated
EvacuatingPods int32
...
}
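Illustrative arithmetic for the second option above (raising desiredHealthy by the number of evacuating pods); the helper name is hypothetical, and the real disruption controller does considerably more bookkeeping.

```go
package example

import policyv1 "k8s.io/api/policy/v1"

// adjustForEvacuations raises desiredHealthy by the number of evacuating pods so
// that disruptionsAllowed stays low enough to protect an ongoing migration.
func adjustForEvacuations(status *policyv1.PodDisruptionBudgetStatus, evacuatingPods int32) {
	status.DesiredHealthy += evacuatingPods
	allowed := status.CurrentHealthy - status.DesiredHealthy
	if allowed < 0 {
		allowed = 0
	}
	status.DisruptionsAllowed = allowed
}
```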
We could implement the NodeMaintenance API out-of-tree first, as a CRD with a node maintenance controller.
One of the problems is that it would be difficult to get real-world adoption and thus important feedback on this feature. This is mainly due to the requirement that the feature be implemented and integrated with multiple components to observe the benefits, and those components are both admin and application developer facing.
There is a Node Maintenance Operator project that provides a similar API and has some adoption. But it is not at a level where applications could depend on this API being present in the cluster, so it does not make much sense to implement the app migration logic, as it cannot be applied everywhere. As shown in the motivation part of this KEP, there is a big appetite for a unified and stable API that could be used by everyone to implement the new capabilities.
As an alternative, it would be possible to signal the node maintenance by marking the node object
instead of introducing a new API. But it is probably better to decouple this from the node for
extensibility reasons. As we can see, the `kubectl drain` logic is a bit complex, and it may be
possible to move this logic to a controller in the future and make the node maintenance purely
declarative.
Additional benefits of the NodeMaintenance API approach:
- It helps to decouple RBAC permissions from the node object.
- Two or more different actors may want to maintain the same node in two different overlapping time slots. Creating two different NodeMaintenance objects would help with tracking each maintenance together with the reason behind it.
To signal the start of the eviction, we could simply taint a node with the `NoExecute` taint. This taint should be easily recognizable and have a standard name, such as `node.kubernetes.io/maintenance`. Other actors could observe the creation of such a taint and migrate or delete the pod. To ensure pods are not removed prematurely, application owners would have to set a toleration on their pods for this maintenance taint. Such applications could also set `.spec.tolerations[].tolerationSeconds`, which would give a deadline for the pods to be removed by the NoExecuteTaintManager.
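For illustration, a toleration for such a hypothetical maintenance taint might look like the sketch below; the taint key matches the example name above, and the 10-minute deadline is an example value only.

```go
package example

import v1 "k8s.io/api/core/v1"

// maintenanceToleration gives the pod's owner 10 minutes to move the pod before
// the NoExecute taint manager deletes it.
func maintenanceToleration() v1.Toleration {
	deadline := int64(600) // seconds
	return v1.Toleration{
		Key:               "node.kubernetes.io/maintenance",
		Operator:          v1.TolerationOpExists,
		Effect:            v1.TaintEffectNoExecute,
		TolerationSeconds: &deadline,
	}
}
```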
This approach has the following disadvantages:
- Taints and tolerations do not support PDBs, which is the main mechanism for preventing voluntary disruptions. People who want to avoid the disruptions caused by the maintenance taint would have to specify the toleration in the pod definition and ensure it is present at all times. This would also have an impact on the controllers, who would have to pollute the pod definitions with these tolerations, even though the users did not specify them in their pod template. The controllers could override users' tolerations, which the users might not be happy about. It is also hard to make such behaviors consistent across all the controllers.
- Taints are used as a mechanism for involuntary disruption, to get pods off the node for some reason (e.g. the node is not ready). Modifying the taint mechanism to be less harmful (e.g. by adding PDB support) is not possible due to the original requirements.
The following names were considered as alternatives to NodeMaintenance:
- NodeIsolation
- NodeDetachment
- NodeClearance
- NodeQuarantine
- NodeDisengagement
- NodeVacation