DaemonSet controller and Graceful Node Shutdown manager disagree when making workloads placement decision #122912
Comments
/sig node |
/cc @wzshiming @bobbypage |
Thank you for this, @kwilczynski. This is very informative for me. Since I believe controllers are generally owned by sig-apps, if fixes are needed in the DaemonSet controller, should sig-apps be involved in this issue, or at least in the PRs that end up addressing it on the DaemonSet side? |
@tuibeovince, thank you! We could use input from the other SIG on the best way to move forward here. Why? Because the DaemonSet controller works correctly and follows the expected behaviour. There have been discussions in the past about whether the DaemonSet controller should or should not observe the current node status, or any taints and tolerations set, and then either schedule workloads on a given node or not. However, the current behaviour is what has been agreed upon. |
/sig apps |
cc @mimowo I was discussing this with @kwilczynski. Do you have any insight into this problem? I remember there was some work done on StatefulSets and pod transitions, but I wasn't sure what happened with DaemonSets. |
I don't know about this problem.
The closest I can think of was #118716, but I am not sure how related it is. |
Potentially related:
|
This part seems problematic. The DaemonSets may be running important services (critical priority) that should stay available even during shutdown, so I would expect these pods to be admitted until their own priority starts being shut down. The DaemonSet controller's mission is to reconcile the current state (DaemonSet, Node, Pod) into the desired state. Introducing a simple condition or taint does not give the controller enough information: it needs to know which DaemonSets, at which priorities, should run on each node, and reconcile accordingly.
This might be surprising for users, or it would require additional kubelet configuration, and it could double the shutdown time. I think that, to fix this, there would have to be a signal about the progress of the graceful shutdown (by priority).
This is under discussion, and it is possible that we will add DaemonSet removal to the node drain flow as part of the newly discussed Declarative Node Maintenance KEP.
It is also part of the non-goals section: https://github.com/kubernetes/enhancements/blob/0e401d1e3fe2eae82b9e876b8bf30a6b7ef8ffba/keps/sig-apps/4212-declarative-node-maintenance/README.md#non-goals
Similar issues also occur when doing a node drain. Kubelet could offload this functionality to NodeMaintenance in the future and simply create such an object when graceful termination is observed. The node maintenance controller would sync the status of the NodeMaintenance object according to which priorities/types of pods are eligible to run on each node. The DaemonSet controller could then react to this and coordinate pods for both node shutdown and node maintenance. |
So my general understanding of the problem is that kubelet does have the ability to set conditions on the node. This is used to signal pressure on the node: the eviction manager sets a few node conditions in eviction.go (really, it sets an array of conditions), and these do get populated into the node status. I wonder if one possible solution would be to introduce a new node condition for this.
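As an illustration only, here is a minimal client-go sketch of how a component could publish such a condition; the GracefulShutdownInProgress condition type is hypothetical and does not exist in Kubernetes today:

```go
package nodeshutdownsketch

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// Hypothetical condition type used only for this sketch.
const conditionGracefulShutdown v1.NodeConditionType = "GracefulShutdownInProgress"

// setShutdownCondition marks the node with the hypothetical condition, in the
// same spirit as the pressure conditions the eviction manager maintains.
func setShutdownCondition(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	cond := v1.NodeCondition{
		Type:               conditionGracefulShutdown,
		Status:             v1.ConditionTrue,
		Reason:             "NodeShuttingDown",
		Message:            "graceful node shutdown is in progress",
		LastTransitionTime: metav1.NewTime(time.Now()),
	}

	// Update the condition in place if it already exists, otherwise append it.
	replaced := false
	for i := range node.Status.Conditions {
		if node.Status.Conditions[i].Type == cond.Type {
			node.Status.Conditions[i] = cond
			replaced = true
			break
		}
	}
	if !replaced {
		node.Status.Conditions = append(node.Status.Conditions, cond)
	}

	_, err = client.CoreV1().Nodes().UpdateStatus(ctx, node, metav1.UpdateOptions{})
	return err
}
```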
This is already possible with the graceful shutdown by pod priority. |
I mean, it should be possible to accept a new critical-priority workload even during a shutdown, if the lower-priority workloads are still running/shutting down. As mentioned above, the code does not seem to do this; see kubernetes/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go, lines 146 to 157 at 9c1c603.
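As a purely hypothetical sketch of what a priority-aware admission check could look like (the managerStub type and its fields are invented for illustration and are not part of the kubelet):

```go
package nodeshutdownsketch

import (
	v1 "k8s.io/api/core/v1"
)

// managerStub stands in for the shutdown manager; both fields are invented
// for this sketch.
type managerStub struct {
	shutdownInProgress bool
	// Highest pod priority whose termination has already started; pods at or
	// below this priority should no longer be admitted.
	highestPriorityTerminating int32
}

// admitDuringShutdown admits a pod during a graceful shutdown only if its
// priority band has not yet begun terminating. The actual handler referenced
// above rejects every pod unconditionally instead.
func (m *managerStub) admitDuringShutdown(pod *v1.Pod) bool {
	if !m.shutdownInProgress {
		return true
	}
	podPriority := int32(0)
	if pod.Spec.Priority != nil {
		podPriority = *pod.Spec.Priority
	}
	return podPriority > m.highestPriorityTerminating
}
```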
For example, you might have inconsistent behavior/bugs in the following case: Normal flow
Non-graceful flow
|
@atiratree, this is as intended. The code will terminate workloads in priority order (there is support for this), but it will not allow any new workloads, no matter what their status or priority might be, to be placed on the underlying node. This simplicity avoids dealing with various edge cases, such as the one you outlined. |
I think the simplicity can hurt some of the workloads. It also goes against the declarative nature of DaemonSets and the Kubelet configuration. It would be good to get some feedback/use cases from people running these DaemonSets, before graduating this feature to GA. @kubernetes/sig-storage-leads are you okay with critical priority DaemonSets being unavailable/terminated before the normal priority user workloads are terminated during a graceful shutdown? (please see #122912 (comment)) |
[...]
I believe we are terminating critical and highest-priority workloads last, starting with the lowest ones, per kubernetes/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go, lines 452 to 488 at e566bd7.
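A rough sketch of that ordering guarantee (not the actual kubelet code, which also groups Pods by the configured shutdown periods): Pods are processed from the lowest priority to the highest, so critical Pods go last.

```go
package nodeshutdownsketch

import (
	"sort"

	v1 "k8s.io/api/core/v1"
)

// podPriority returns a pod's priority, defaulting to zero when unset.
func podPriority(p *v1.Pod) int32 {
	if p.Spec.Priority != nil {
		return *p.Spec.Priority
	}
	return 0
}

// sortPodsForShutdown orders pods for termination: lowest priority first,
// critical and highest-priority pods last.
func sortPodsForShutdown(pods []*v1.Pod) []*v1.Pod {
	sorted := append([]*v1.Pod(nil), pods...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return podPriority(sorted[i]) < podPriority(sorted[j])
	})
	return sorted
}
```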
See:
Unless you had something else in mind? |
I meant this as a continuation of our thread about the edge cases, not the main/standard flow. |
/cc @ndixita |
/remove-kind bug |
A point of curiosity, but does this mean that the normal GracefulNodeShutdown does not necessarily look at a pod's priority? Shouldn't something like that be the default behaviour, with a non-priority-based graceful shutdown as an option? |
This is the default behaviour. There are two concepts here: the basic Graceful Node Shutdown, which uses a single shutdown grace period (with a portion of it reserved for critical Pods), and the priority-based variant (GracefulNodeShutdownBasedOnPodPriority), which lets you configure dedicated shutdown periods per Pod priority.
If no dedicated graceful shutdown periods are provided for Pods of different priorities, then the same graceful shutdown period is used for everything (something that the user also configures in order to enable this feature). |
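For reference, a minimal sketch of the priority-based configuration, using the documented KubeletConfiguration fields; the priority thresholds and durations below are placeholder values, not recommendations:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Dedicated shutdown periods per Pod priority range
# (GracefulNodeShutdownBasedOnPodPriority); used instead of the single
# shutdownGracePeriod / shutdownGracePeriodCriticalPods pair.
shutdownGracePeriodByPodPriority:
  - priority: 100000
    shutdownGracePeriodSeconds: 10
  - priority: 10000
    shutdownGracePeriodSeconds: 20
  - priority: 0
    shutdownGracePeriodSeconds: 30
```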
I am still trying to understand, but are pod priorities only dependent on the individual grace-period settings of each pod? |
Feel free to peruse the following KEP that was dedicated to this feature: |
/triage accepted |
To clarify: this issue does not impact DaemonSets or any other workloads after startup (or after the node has been brought back up following a shutdown). The problems described here pertain only to the window in which the Graceful Node Shutdown code has detected a restart or shutdown signal and the node is in the process of changing its state (to either restart or power off). |
Is there an ideal graceful flow for handling DaemonSet pods at this moment? Would it be something like: Pattern A:
or Pattern B:
|
@tuibeovince, pods scheduled by a DaemonSet should not be running on a node that is shutting down and has the Graceful Node Shutdown feature enabled. There was never an option, at least per the KEP, to allow certain types of workloads to keep running while the node is shutting down. So, no, we aren't handling DaemonSet pods specially; they are treated like any other pods and are subject to the same termination approach. |
/unassign @kwilczynski |
What happened?
Currently, when Graceful Node Shutdown is enabled and the node is shutting down, currently running workloads are terminated in order of workload type and priority, while taking into account the termination grace period that a given workload may have set.
However, even though DaemonSet-type workloads are also successfully vacated from a node that is shutting down, there are some other lingering issues with this specific workload type.
When a node shutdown is detected, the Shutdown Manager updates the node status to "NotReady" (this does not set the "Unschedulable" flag the way cordoning a node would), adds the necessary taints, and then proceeds to terminate the currently running workloads, taking into account the priorities of the different workload types and ensuring that crucial system workloads (often run as DaemonSets) are terminated last, in an attempt to ensure that the node does not lose network access, that metrics can still be sent, and that monitoring is not interrupted until the very end.
Part of preparing the underlying node for shutdown is adding an admission callback with a singular purpose: reject anything and everything attempting to start on this specific node. The code is very simple and makes no attempt to handle different types of workloads; everything is rejected uniformly while the node is shutting down.
(an example taken from kubernetes/kubernetes/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go, lines 146 to 157 at 9c1c603)
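The behaviour of the referenced lines can be summarised with the following simplified sketch (not a verbatim copy of the linked code): while a shutdown is in progress, every Pod is rejected, regardless of its type or priority.

```go
package nodeshutdownsketch

// admitResult is a stand-in for the kubelet's pod admission result.
type admitResult struct {
	Admit   bool
	Reason  string
	Message string
}

// admit rejects every pod while the node is shutting down; no distinction is
// made between workload types or priorities, so DaemonSet pods are rejected
// exactly like everything else.
func admit(shutdownInProgress bool) admitResult {
	if shutdownInProgress {
		return admitResult{
			Admit:   false,
			Reason:  "NodeShutdown",
			Message: "node is shutting down",
		}
	}
	return admitResult{Admit: true}
}
```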
This is where the DaemonSet controller and the Graceful Node Shutdown manager disagree: the former sees that its previously managed workloads were removed, assumes this is an invalid state, and attempts to reconcile it by starting the missing Pods again, whereas the latter sees a new Pod being scheduled onto the node that is currently shutting down and rejects it (like any other workload of any type). This cycle repeats for as long as the workload's restart policy and error budget tolerance allow, or indefinitely (for as long as the node is still pending its shutdown or restart).
(an example taken from kubernetes/kubernetes/pkg/controller/daemon/daemon_controller.go, lines 1278 to 1301 at 9c1c603)
(an example taken from kubernetes/kubernetes/pkg/controller/daemon/daemon_controller.go, lines 1304 to 1313 at 9c1c603)
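A condensed sketch of the reconciliation idea behind the referenced lines (not the actual controller code): for every candidate node, the controller decides whether a daemon Pod should run there and creates or deletes Pods to match that decision; the node's shutdown state plays no part in this decision, which is why Pods keep being recreated on a node that is shutting down.

```go
package daemonsketch

// nodeState is a simplified view of what the controller evaluates per node.
type nodeState struct {
	Name         string
	ShouldRun    bool // derived from node selectors, tolerations, resources, etc.
	HasDaemonPod bool
}

// reconcile returns the nodes that need a daemon pod created and the nodes
// whose daemon pod should be deleted. Nothing here consults the node's
// shutdown status.
func reconcile(nodes []nodeState) (toCreate, toDelete []string) {
	for _, n := range nodes {
		switch {
		case n.ShouldRun && !n.HasDaemonPod:
			toCreate = append(toCreate, n.Name)
		case !n.ShouldRun && n.HasDaemonPod:
			toDelete = append(toDelete, n.Name)
		}
	}
	return toCreate, toDelete
}
```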
As the worker node can run workloads that take a long time to stop and terminate, the number of Pods with scheduling failures can keep increasing over time. These failed attempts then have to be cleaned up manually, such as Pods left with the "NodeShutdown" status (do these types of statuses need manual clean-up too?), or Pods permanently stuck in the "Pending" or "Terminating" states. Workloads terminated as "Completed" should, provided that everything works as intended, be cleaned up and rescheduled automatically, especially once the node is back up and running again.
Some other workloads might report scheduling failures, leaving them in an "Error" state that also requires manual clean-up.
An example of this taking place during a node shutdown:
Peek.2024-01-17.17-21.webm
What did you expect to happen?
Workloads of most types (Pods, Deployments, ReplicaSets, DaemonSets, etc.) are correctly removed from the underlying node, and the remaining Pods are then terminated.
How can we reproduce it (as minimally and precisely as possible)?
Only a few steps are needed to reproduce this behaviour:
Enable the Graceful Node Shutdown feature using the relevant kubelet configuration properties (at minimum shutdownGracePeriod, optionally also shutdownGracePeriodCriticalPods).
A very simple kubelet configuration file as an example:
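A minimal sketch of such a configuration (the grace-period values here are placeholders; adjust them to your environment):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Enable Graceful Node Shutdown with a 30-second window, 10 seconds of which
# are reserved for critical Pods.
shutdownGracePeriod: 30s
shutdownGracePeriodCriticalPods: 10s
```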
An example test.yaml:
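A minimal DaemonSet sketch that could serve as this test manifest; the name test is assumed from the test-cj4n2 Pod seen in the logs below, and the image choice is arbitrary:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: test
spec:
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
        - name: test
          image: registry.k8s.io/pause:3.9
```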
Emit the org.freedesktop.login1.Manager.PrepareForShutdown signal to simulate the node being shut down. An example using the gdbus utility:
# gdbus emit --system --object-path /org/freedesktop/login1 --signal 'org.freedesktop.login1.Manager.PrepareForShutdown' true
You should see the following entries in the kubelet log (provided a sufficient log level is set):
Plus, kubelet should already be working towards terminating workloads on the given node.
After a few moments, the DaemonSet controller would attempt to schedule workloads, including the test one, back on the worker node that is currently shutting down.
Excerpts from logs captured when this takes place:
kwilczynski/a194305c0b2324aa1d254cccc40da5b3 (the affected Pod name is test-cj4n2)
kwilczynski/b2baa49657501c2ab8ad98e09d853ac2 (the affected Pod name is node-exporter-9cvc2)
Anything else we need to know?
An important characteristic of a DaemonSet is its ability to ignore most default worker node statuses (such as "NotReady" and "SchedulingDisabled"), to ignore the "Unschedulable" flag set on a given node, and also to ignore most of the default node taints. The basic idea is to allow workloads run as DaemonSets to run everywhere, including nodes that should not otherwise schedule any other types of workloads.
This design decision, which is also the key feature of a DaemonSet, is something that can interfere with the Graceful Node Shutdown feature.
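For reference, these are (roughly) the tolerations the DaemonSet controller adds to daemon Pods automatically, which is why cordoning or pressure conditions alone do not keep daemon Pods off a node:

```yaml
# Tolerations automatically added to DaemonSet Pods (abbreviated sketch; the
# network-unavailable toleration is only added for host-network Pods).
tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
  - key: node.kubernetes.io/disk-pressure
    operator: Exists
    effect: NoSchedule
  - key: node.kubernetes.io/memory-pressure
    operator: Exists
    effect: NoSchedule
  - key: node.kubernetes.io/pid-pressure
    operator: Exists
    effect: NoSchedule
  - key: node.kubernetes.io/unschedulable
    operator: Exists
    effect: NoSchedule
  - key: node.kubernetes.io/network-unavailable
    operator: Exists
    effect: NoSchedule
```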
How to fix this problem? Some viable fixes would be:
Some of the above could be controlled via a new kubelet setting, for example one that controls whether and when to evict DaemonSets.
Additionally, the "Unschedulable" flag could be set to "true" for good measure, to ensure that absolutely no workloads attempt to schedule themselves onto the node, or at least to make it harder for them to do so.
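As a minimal client-go sketch of what setting that flag amounts to (effectively what kubectl cordon does; the function here is illustrative, not kubelet code):

```go
package nodeshutdownsketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cordon marks the node as unschedulable so the scheduler stops placing new
// Pods on it. Note that DaemonSet Pods tolerate the resulting
// node.kubernetes.io/unschedulable taint, so this alone does not stop the
// DaemonSet controller.
func cordon(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Unschedulable = true
	_, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```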
Recently, there have been issues with status updates and synchronisation of different types of workloads when a node was shutting down. This appears to have since been fixed via the following Pull Requests:
However, some newer reports claim the contrary per:
Kubernetes version
Cloud provider
OS version
Install tools
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)
A lot of things. Probably too many to list. See the following for details: