
DaemonSet controller and Graceful Node Shutdown manager disagree when making workloads placement decision #122912


Open
Tracked by #2000
kwilczynski opened this issue Jan 22, 2024 · 26 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@kwilczynski
Member

kwilczynski commented Jan 22, 2024

What happened?

Currently, when the Graceful Node Shutdown feature is enabled and the node is shutting down, running workloads are terminated in order of workload type and priority, while taking into account the termination grace period that a given workload may have set.

However, even though DaemonSet-type workloads are also successfully vacated from a node that is shutting down, there are some lingering issues with this specific workload type.

When a node shutdown is detected, the Shutdown Manager updates the node status to "NotReady" (this does not set the "Unschedulable" flag the way cordoning a node would), adds the necessary taints, and then proceeds to terminate the currently running workloads, taking into account the priorities of the different workload types and ensuring that crucial system workloads (often run as DaemonSets) are terminated last, so that the node does not lose network access, metrics can still be sent, and monitoring is not interrupted until the very end.

Part of preparing the underlying node for shutdown is adding an admission callback with a singular purpose: reject anything and everything attempting to start on this specific node. The code is very simple and makes no attempt to handle different types of workloads; everything alike is rejected while the node is shutting down.

(an example taken from kubernetes/kubernetes/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L146-L157)

func (m *managerImpl) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
    nodeShuttingDown := m.ShutdownStatus() != nil
    if nodeShuttingDown {
        return lifecycle.PodAdmitResult{
            Admit:   false,
            Reason:  nodeShutdownNotAdmittedReason,
            Message: nodeShutdownNotAdmittedMessage,
        }
    }
    return lifecycle.PodAdmitResult{Admit: true}
}

This is where the DaemonSet controller and the Graceful Node Shutdown manager disagree: the former sees that its previously managed workloads were removed, assumes this is an invalid state, and attempts to reconcile it by starting the missing Pods again, whereas the latter sees a new Pod being scheduled on a node that is currently shutting down and rejects it (like any other workload of any type). This cycle repeats for as long as the workload's restart policy and error-budget tolerance allow, or indefinitely (for as long as the node is still pending its shutdown or restart).

(an example taken from kubernetes/kubernetes/pkg/controller/daemon/daemon_controller.go#L1278-L1301)

func NodeShouldRunDaemonPod(node *v1.Node, ds *apps.DaemonSet) (bool, bool) {
    pod := NewPod(ds, node.Name)
    // If the daemon set specifies a node name, check that it matches with node.Name.
    if !(ds.Spec.Template.Spec.NodeName == "" || ds.Spec.Template.Spec.NodeName == node.Name) {
        return false, false
    }
    taints := node.Spec.Taints
    fitsNodeName, fitsNodeAffinity, fitsTaints := predicates(pod, node, taints)
    if !fitsNodeName || !fitsNodeAffinity {
        return false, false
    }
    if !fitsTaints {
        // Scheduled daemon pods should continue running if they tolerate NoExecute taint.
        _, hasUntoleratedTaint := v1helper.FindMatchingUntoleratedTaint(taints, pod.Spec.Tolerations, func(t *v1.Taint) bool {
            return t.Effect == v1.TaintEffectNoExecute
        })
        return false, !hasUntoleratedTaint
    }
    return true, true
}

(an example taken from kubernetes/kubernetes/pkg/controller/daemon/daemon_controller.go#L1304-L1313)

func predicates(pod *v1.Pod, node *v1.Node, taints []v1.Taint) (fitsNodeName, fitsNodeAffinity, fitsTaints bool) {
    fitsNodeName = len(pod.Spec.NodeName) == 0 || pod.Spec.NodeName == node.Name
    // Ignore parsing errors for backwards compatibility.
    fitsNodeAffinity, _ = nodeaffinity.GetRequiredNodeAffinity(pod).Match(node)
    _, hasUntoleratedTaint := v1helper.FindMatchingUntoleratedTaint(taints, pod.Spec.Tolerations, func(t *v1.Taint) bool {
        return t.Effect == v1.TaintEffectNoExecute || t.Effect == v1.TaintEffectNoSchedule
    })
    fitsTaints = !hasUntoleratedTaint
    return
}

As the worker node can run workloads that take a long time to stop and terminate, the number of Pods with scheduling failures can keep increasing over time, leaving these failed attempts to be cleaned up manually, such as those with the "NodeShutdown" status (do these types of statuses need manual clean-up too?), or even permanently stuck in the "Pending" or "Terminating" states. Workloads terminated with the "Completed" status should, provided everything works as intended, be cleaned up and rescheduled automatically, especially once the node is back up and running again.

Some other workloads might report scheduling failures, leaving Pods in an "Error" state that would also require manual clean-up.
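For reference, one way to clean up such leftover Pods manually is by their phase; the command below is only an illustration and assumes the rejected Pods ended up in the Failed phase (as should be the case for the "NodeShutdown" status):

$ kubectl delete pods --all-namespaces --field-selector=status.phase=Failed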

An example of this taking place during a node shutdown:

Peek.2024-01-17.17-21.webm

What did you expect to happen?

Most workload types (Pods, Deployments, ReplicaSets, DaemonSets, etc.) should be correctly removed from the underlying node, and the remaining Pods then terminated.

How can we reproduce it (as minimally and precisely as possible)?

Only a few steps are needed to reproduce this behaviour:

  • Configure kubelet on a single worker node to enable Graceful Node Shutdown

This can be done using the following kubelet configuration properties:

shutdownGracePeriod: 3600s
shutdownGracePeriodCriticalPods: 1800s
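With these set to non-zero values, kubelet should register a systemd "delay" inhibitor lock via logind when it starts. One way to confirm the feature is active on the node (assuming systemd is in use) is to list the inhibitors and look for an entry owned by kubelet:

$ systemd-inhibit --list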

A very simple kubelet configuration file as an example:

apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: 0s
    cacheUnauthorizedTTL: 0s
cgroupDriver: systemd
clusterDNS:
- 172.17.0.10
clusterDomain: cluster.local
containerRuntimeEndpoint: ""
cpuManagerReconcilePeriod: 0s
evictionPressureTransitionPeriod: 0s
fileCheckFrequency: 0s
healthzBindAddress: 127.0.0.1
healthzPort: 10248
httpCheckFrequency: 0s
imageMaximumGCAge: 0s
imageMinimumGCAge: 0s
kind: KubeletConfiguration
logging:
  flushFrequency: 0
  options:
    json:
      infoBufferSize: "0"
  verbosity: 0
memorySwap: {}
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
resolvConf: /run/systemd/resolve/resolv.conf
rotateCertificates: true
runtimeRequestTimeout: 0s
shutdownGracePeriod: 3600s
shutdownGracePeriodCriticalPods: 1800s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
volumeStatsAggPeriod: 0s
  • Prepare a simple test workload (a DaemonSet)

An example test.yaml:

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: test
  namespace: default
  labels:
    app.kubernetes.io/name: test
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: test
  template:
    metadata:
      labels:
        app.kubernetes.io/name: test
    spec:
      nodeSelector:
        kubernetes.io/os: linux
      tolerations:
        - operator: Exists
      containers:
        - name: test
          image: docker.io/library/bash:latest
          command:
            - /bin/sh
            - -c
            - |
              ( while true ; do date ; sleep 1 ; done ) &
              pid=$!
              echo "Starting!"
              _term () {
                  kill $pid
                  echo "Caught SIGTERM!"
              }
              trap _term TERM
              wait $pid
              trap - TERM
              wait $pid
              exit 0
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - |
                    echo BEGIN preStop > /proc/1/fd/1
                    sleep 10
                    echo END preStop > /proc/1/fd/1
      terminationGracePeriodSeconds: 30
      restartPolicy: Always
  • Simulate the node shutdown

Emit the org.freedesktop.login1.Manager.PrepareForShutdown signal to simulate the node being shut down:

An example using the gdbus utility:

# gdbus emit --system --object-path /org/freedesktop/login1 --signal 'org.freedesktop.login1.Manager.PrepareForShutdown' true

You should see the following entries in the kubelet log (provided a sufficient log level is set):

Jan 10 07:28:44 worker-node01 kubelet[6188]: I0110 07:28:44.397678    6188 nodeshutdown_manager_linux.go:265] "Shutdown manager detected new shutdown event, isNodeShuttingDownNow" event=true
Jan 10 07:28:44 worker-node01 kubelet[6188]: I0110 07:28:44.397692    6188 nodeshutdown_manager_linux.go:273] "Shutdown manager detected new shutdown event" event="shutdown"
Jan 10 07:28:44 worker-node01 kubelet[6188]: I0110 07:28:44.397703    6188 nodeshutdown_manager_linux.go:322] "Shutdown manager processing shutdown event"
Jan 10 07:28:44 worker-node01 kubelet[6188]: I0110 07:28:44.398563    6188 event.go:376] "Event occurred" object="worker-node01" fieldPath="" kind="Node" apiVersion="" type="Normal" reason="Shutdown" message="Shutdown manager detected shutdown event"

At this point, kubelet should already be working on terminating workloads on the given node.
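One way to observe the create-and-reject cycle described earlier is to watch the Pods bound to the affected node (the node name below is simply the one used in this reproduction):

$ kubectl get pods --all-namespaces --field-selector=spec.nodeName=worker-node01 --watch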

After a few moments, the DaemonSet controller will attempt to schedule workloads, including the test one, back onto the worker node that is currently shutting down.

Excerpts from the logs captured when this takes place:

(the affected Pod name is test-cj4n2)

kwilczynski/a194305c0b2324aa1d254cccc40da5b3

(the affected Pod name is node-exporter-9cvc2)

kwilczynski/b2baa49657501c2ab8ad98e09d853ac2

Anything else we need to know?

An important characteristic of a DaemonSet is its ability to ignore most default worker node statuses (such as "NotReady" and "SchedulingDisabled"), to ignore the "Unschedulable" flag set on a given node, and to ignore most of the default node taints. The basic idea is to allow workloads run as DaemonSets to run everywhere, including on nodes that should not schedule any other types of workloads.

This design decision, which is also the key feature of a DaemonSet, is something that can interfere with the Graceful Node Shutdown feature.
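For illustration, the DaemonSet controller automatically adds tolerations such as the following to every daemon Pod (abbreviated here; the test DaemonSet above additionally tolerates everything via operator: Exists), which is why node taints alone do not keep these Pods away:

tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
  - key: node.kubernetes.io/disk-pressure
    operator: Exists
    effect: NoSchedule
  - key: node.kubernetes.io/memory-pressure
    operator: Exists
    effect: NoSchedule
  - key: node.kubernetes.io/pid-pressure
    operator: Exists
    effect: NoSchedule
  - key: node.kubernetes.io/unschedulable
    operator: Exists
    effect: NoSchedule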

How to fix this problem? Some viable fixes would be:

  • Do not remove DaemonSets from the node that is due to be shut down (not too dissimilar to what cordoning and draining a node would do)
  • Teach the DaemonSet controller about "NodeShutdown" status for a node so that it avoids attempting to schedule Pods there
  • Add a new type of taint and teach the DaemonSet controller to include it as part of its scheduling decision making
  • Teach the Graceful Node Shutdown manager to ignore DaemonSets until the very last moment before these types of workloads are due to be evicted

Some of the above could be controlled via a new setting for kubelet, like the behaviour that controls whether to evict DaemonSets and when.

Additionally, the "Unschedulable" flag could be set to "true" for good measure, to ensure that absolutely no workloads attempt to schedule themselves, or at least to make it harder for them to do so.
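As a rough sketch of the taint-based option above (nothing here exists today; the taint key and the helper function are hypothetical), the DaemonSet controller could short-circuit on a dedicated shutdown taint that daemon Pods do not receive a default toleration for:

package daemon

import (
    v1 "k8s.io/api/core/v1"
)

// taintNodeShutdown is a hypothetical taint key that the Graceful Node Shutdown
// manager would apply; it is not an existing, well-known Kubernetes taint.
const taintNodeShutdown = "node.kubernetes.io/shutdown"

// nodeIsShuttingDown reports whether the node carries the hypothetical shutdown
// taint with the NoSchedule effect.
func nodeIsShuttingDown(node *v1.Node) bool {
    for _, taint := range node.Spec.Taints {
        if taint.Key == taintNodeShutdown && taint.Effect == v1.TaintEffectNoSchedule {
            return true
        }
    }
    return false
}

NodeShouldRunDaemonPod (shown earlier) could then return false for "should schedule" while still returning true for "should continue running", leaving the already-running daemon Pods to the shutdown manager's priority-ordered termination.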


Recently, there have been issues with status updates and synchronisation of different types of workloads when a node was shutting down. This appears to have since been fixed via the following Pull Requests:

However, some newer reports claim the contrary per:

Kubernetes version

  • Kubernetes
$ kubectl version
Client Version: v1.29.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.0
  • OpenShift
$ oc version 
Client Version: 4.14.4
Kustomize Version: v5.0.1
Server Version: 4.14.4
Kubernetes Version: v1.27.6+d548052

Cloud provider

N/A

OS version

  • Kubernetes
# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ uname -a
Linux worker-node01 5.15.0-83-generic #92-Ubuntu SMP Mon Aug 14 09:30:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • OpenShift
# On Linux:
$ cat /etc/os-release 
NAME="Red Hat Enterprise Linux CoreOS"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION="414.92.202311222314-0"
VERSION_ID="4.14"
VARIANT="CoreOS"
VARIANT_ID=coreos
PLATFORM_ID="platform:el9"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 414.92.202311222314-0 (Plow)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:9::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.14/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.14"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.14"
OPENSHIFT_VERSION="4.14"
RHEL_VERSION="9.2"
OSTREE_VERSION="414.92.202311222314-0"

Install tools

  • Kubernetes
$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"29", GitVersion:"v1.29.0", GitCommit:"3f7a50f38688eb332e2a1b013678c6435d539ae6", GitTreeState:"clean", BuildDate:"2023-12-13T08:50:10Z", GoVersion:"go1.21.5", Compiler:"gc", Platform:"linux/amd64"}
  • OpenShift
$ openshift-install version
openshift-install 4.14.4
built from commit 82051a1c3d42e394be353bbe2eacf348ccb75970
release image registry.ci.openshift.org/ocp/release@sha256:e6e1d90b492d50438034e6edb46bdafa6c86ae3b80ef3328685912d89681fdee
release architecture amd64

Container runtime (CRI) and version (if applicable)

  • Kubernetes
$ crio --version
crio version 1.29.0
Version:        1.29.0
GitCommit:      d59bbdc252837107c9f5d235b8fb2650ff2b9d93
GitCommitDate:  2023-12-21T16:23:14Z
GitTreeState:   clean
BuildDate:      1970-01-01T00:00:00Z
GoVersion:      go1.21.1
Compiler:       gc
Platform:       linux/amd64
Linkmode:       static
BuildTags:      
  static
  netgo
  osusergo
  exclude_graphdriver_btrfs
  exclude_graphdriver_devicemapper
  seccomp
  apparmor
  selinux
LDFlags:          unknown
SeccompEnabled:   true
AppArmorEnabled:  false
$ conmon --version
conmon version 2.1.10
commit: 09bcded8e9c49cf1ff1fda403feac5a08f22535f-dirty
$ crun --version
crun version 1.12
commit: ce429cb2e277d001c2179df1ac66a470f00802ae
rundir: /run/user/1000/crun
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
$ runc --version
runc version 1.1.11
commit: v1.1.11-0-g4bccb38c
spec: 1.0.2-dev
go: go1.20.12
libseccomp: 2.5.4
  • OpenShift
$ crio --version
crio version 1.27.2-2.rhaos4.14.git9d684e2.el9
Version:        1.27.2-2.rhaos4.14.git9d684e2.el9
GitCommit:      unknown
GitCommitDate:  unknown
GitTreeState:   clean
GoVersion:      go1.20.10
Compiler:       gc
Platform:       linux/amd64
Linkmode:       dynamic
BuildTags:      
  rpm_crashtraceback
  libtrust_openssl
  selinux
  seccomp
  exclude_graphdriver_devicemapper
  exclude_graphdriver_btrfs
  containers_image_ostree_stub
LDFlags:           -compressdwarf=false -B 0x8c7c7209f406ead112dcaacb10b18ba6826efc5e -extldflags '-Wl,-z,relro -Wl,--as-needed  -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 '
SeccompEnabled:   true
AppArmorEnabled:  false
$ conmon --version
conmon version 2.1.7
commit: a3013c55c8bfc2d0597ffe8beedadba8aab5bbf6
$ runc --version
runc version 1.1.10
spec: 1.0.2-dev
go: go1.20.10
libseccomp: 2.5.2

Related plugins (CNI, CSI, ...) and versions (if applicable)

  • Kubernetes
$ crictl exec -it ee8826f196d0a calicoctl version
Client Version:    v3.25.0
Git commit:        3f7fe4d29
Cluster Version:   v3.25.0
Cluster Type:      k8s,kdd,bgp,kubeadm
  • OpenShift

A lot of things. Probably too many to list. See the following for details:

@kwilczynski kwilczynski added the kind/bug Categorizes issue or PR as related to a bug. label Jan 22, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 22, 2024
@kwilczynski
Member Author

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 22, 2024
@pacoxu
Member

pacoxu commented Jan 23, 2024

/cc @wzshiming @bobbypage

@tuibeovince

Thank you for this @kwilczynski

This is very informative for me.

I was just wondering: since controllers are generally owned by sig-apps, if there are fixes that need to be made to the DaemonSet controller, should sig-apps be concerned about this issue, or at least about the PRs that end up working on the DaemonSet side of it?

@kwilczynski
Member Author

@tuibeovince, thank you!

We could use input from the other SIG on the best way to move forward here. Why? Because the DaemonSet controller works correctly and follows the expected behaviour.

There have been discussions in the past about whether the DaemonSet controller should or should not observe the current node status or any taints and tolerations set, and then either schedule workloads on a given node or not. However, the current behaviour is what has been agreed upon.

@kannon92
Contributor

/sig apps

@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Jan 30, 2024
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Jan 30, 2024
@kannon92
Contributor

cc @mimowo

I was discussing this with @kwilczynski. Do you have any insight into this problem? I remember there was some work done on Statefulsets and pod transitions but wasn't sure what happened with DaemonSets.

@mimowo
Contributor

mimowo commented Jan 30, 2024

I was discussing this with @kwilczynski. Do you have any insight into this problem?

I don't know about this problem.

I remember there was some work done on Statefulsets and pod transitions but wasn't sure what happened with DaemonSets.

The closest I can think of was #118716, but I am not sure how related this is.

@kwilczynski
Member Author

kwilczynski commented Jan 31, 2024

@atiratree
Member

Part of the process of preparing the underlying node for shutdown is to add an admission callback that has a singular purpose—reject everything and anything attempting to start on this specific node. The code is very simple and makes no attempts to handle different types of workloads; everything in unison will be rejected during the time when the node is shutting down.

This part seems problematic. The DaemonSets may be running important services (critical priority) that should be available even during shutdown. So I would expect these pods to get admitted until their priority group is being shut down.

The DaemonSet controller's mission is to reconcile the current state (DaemonSet, Node, Pod) to achieve the desired state. A simple condition or taint does not carry enough information to instruct the DaemonSet controller. The controller needs to know which DaemonSets with what priorities should run on each node, and reconcile accordingly.

Teach the Graceful Node Shutdown manager to ignore DaemonSets until the very last moment before these types of workloads are due to be evicted

This might be surprising for users or you would need additional configuration for kubelet. And it could double the shutdown time.

I think, to fix this, there would have to be a signal about the progress of the graceful shutdown (by priority).

Do not remove DaemonSets from the node that is due to be shut down (not too dissimilar to what cordoning and draining a node would do)

This is under discussion, and it is possible that we will add DaemonSet removal to the node drain flow as part of the newly discussed Declarative Node Maintenance KEP.

https://github.com/kubernetes/enhancements/pull/4213/files#diff-a3337600d2065dcf21be514dcf3c3c62499168a12ae2e4f1aa2db55fddaddbecR223

Details
  1. Graceful deletion of DaemonSet pods is currently only supported as part of (Linux) graceful node
    shutdown. The length of the shutdown is again not application specific and is set cluster-wide
    (optionally by priority) by the cluster admin. This does not take into account
    .spec.terminationGracePeriodSeconds of each pod and may cause premature termination of
    the application. This has been discussed in issue kubernetes/kubernetes#75482
    and in issue kubernetes-sigs/cluster-api#6158.
  2. There are cases during a node shutdown, when data corruption can occur due to premature node
    shutdown. It would be great if applications could perform data migration and synchronization of
    cached writes to the underlying storage before the pod deletion occurs. This is not easy to
    quantify even with pod's .spec.shutdownGracePeriod, as the time depends on the size of the data
    and the speed of the storage. This has been discussed in issue kubernetes/kubernetes#116618
    and in issue kubernetes/kubernetes#115148.

It is also part of the non-goals section https://github.com/kubernetes/enhancements/blob/0e401d1e3fe2eae82b9e876b8bf30a6b7ef8ffba/keps/sig-apps/4212-declarative-node-maintenance/README.md#non-goals

Details
  • Introduce a field that could include non-critical daemon set pods (priority system-cluster-critical or system-node-critical) for node maintenance/drain request. The daemon set controller would then gracefully shut down these pods. Critical pods could be overridden by the priority list mentioned below.
  • NodeMaintenance could include a plan of which pods to target first. Similar to graceful node shutdown, we could include a list of priorities to decide which pods should be terminated first. This list could optionally include pod timeouts, but could also wait for all the pods of a given priority class to finish first without a timeout. This could also be used to target daemon set pods of certain priorities (see point above). We could also introduce drain profiles based on these lists. The cluster admin could then choose or create such a profile based on his/her needs. The logic for processing the decision list would be contained in the node maintenance controller, which would set an intent to selected pods to shut down via the EvacuationRequest condition.

Similar issues occur when also doing the node drain. Kubelet could offload this functionality to the NodeMaintenance in the future and just create such an object when graceful termination is observed. The node maintenance controller would sync the status of NodeMaintenance according to which priorities/types of pods are eligible to be running on each node. The daemon set controller could then react to this and coordinate pods for both node shutdown and node maintenance.

@kannon92
Contributor

kannon92 commented Feb 2, 2024

So my general understanding of the problem is that DaemonSets ignore the normal taints/tolerations (by design) and therefore keep running pods.

Kubelet does have the ability to set conditions on the node; this is how node pressure is reported. The eviction manager sets a few node conditions in eviction.go (really, it sets an array of conditions), and these do get populated to the node status.

I wonder if one possible solution would be to introduce a new node status condition for graceful shutdown. If the graceful shutdown manager sets a condition on the node, then we could potentially use that to enforce the desired behaviour on a DaemonSet.
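A rough sketch, purely for illustration, of what publishing such a condition could look like; the "GracefulShutdown" condition type, the helper name, and the reason string are all hypothetical and not part of the current kubelet or API:

package nodeshutdown

import (
    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NodeGracefulShutdown is a hypothetical node condition type.
const NodeGracefulShutdown v1.NodeConditionType = "GracefulShutdown"

// markNodeShuttingDown adds or refreshes the hypothetical condition on a node
// status; actually publishing it would go through the kubelet's regular node
// status update machinery.
func markNodeShuttingDown(status *v1.NodeStatus) {
    cond := v1.NodeCondition{
        Type:               NodeGracefulShutdown,
        Status:             v1.ConditionTrue,
        Reason:             "ShutdownSignalReceived",
        Message:            "node has begun a graceful shutdown",
        LastTransitionTime: metav1.Now(),
    }
    for i := range status.Conditions {
        if status.Conditions[i].Type == NodeGracefulShutdown {
            status.Conditions[i] = cond
            return
        }
    }
    status.Conditions = append(status.Conditions, cond)
}

The DaemonSet controller (or the scheduler) could then check for this condition before creating replacement daemon Pods on such a node.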

This part seems problematic. The DaemonSets may be running important services (critical priority) that should be available even during shutdown. So I would expect these pods to get admitted, until their priority is being shut down.

This is already possible with the gracefulshutdown by pod priority.

@atiratree
Member

This is already possible with the gracefulshutdown by pod priority.

I mean, it should be possible to accept a new critical priority workload even during a shutdown, if the low priority workloads are still running/shutting down. As mentioned above, the code does not seem to do this.

Details

func (m *managerImpl) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
    nodeShuttingDown := m.ShutdownStatus() != nil
    if nodeShuttingDown {
        return lifecycle.PodAdmitResult{
            Admit:   false,
            Reason:  nodeShutdownNotAdmittedReason,
            Message: nodeShutdownNotAdmittedMessage,
        }
    }
    return lifecycle.PodAdmitResult{Admit: true}
}

For example, you might have inconsistent behavior/bugs in the following case:

Normal flow

  1. daemonset application with critical priority is rolling out a new image on all nodes
  2. daemonset pod is terminated on node A
  3. new revision of daemonset pod is started on node A
  4. node A is being turned off and graceful shutdown is started
  5. the daemonset application runs properly during the graceful shutdown

Non-graceful flow

  1. daemonset application with critical priority is rolling out a new image on all nodes
  2. daemonset pod is terminated on node A
  3. node A is being turned off and graceful shutdown is started
  4. new revision of daemonset pod cannot be started on node A
  5. outage of the daemonset application on node A during the graceful shutdown

@kwilczynski
Member Author

This is already possible with the gracefulshutdown by pod priority.

I mean, it should be possible to accept a new critical priority workload even during a shutdown, if the low priority workloads are still running/shutting down. As mentioned above, the code does not seem to do this.

@atiratree, this is as intended. The code will terminate workloads in priority order (there is support for this), but it will not allow any new workloads, no matter what their status or priority might be, to be placed on the underlying node.

This simplicity avoids dealing with various edge cases, such as the one you outlined.

@atiratree
Member

atiratree commented Feb 6, 2024

This simplicity avoids dealing with various edge cases, such as the one you outlined.

I think the simplicity can hurt some of the workloads. It also goes against the declarative nature of DaemonSets and the Kubelet configuration.

It would be good to get some feedback/use cases from people running these DaemonSets, before graduating this feature to GA.

@kubernetes/sig-storage-leads are you okay with critical priority DaemonSets being unavailable/terminated before the normal priority user workloads are terminated during a graceful shutdown? (please see #122912 (comment))

@kwilczynski
Member Author

[...]

@kubernetes/sig-storage-leads are you okay with critical priority DaemonSets being unavailable/terminated before the normal priority user workloads are terminated during a graceful shutdown?

I believe we are terminating critical and highest-priority workloads last, starting with the lowest ones, per:

func groupByPriority(shutdownGracePeriodByPodPriority []kubeletconfig.ShutdownGracePeriodByPodPriority, pods []*v1.Pod) []podShutdownGroup {
    groups := make([]podShutdownGroup, 0, len(shutdownGracePeriodByPodPriority))
    for _, period := range shutdownGracePeriodByPodPriority {
        groups = append(groups, podShutdownGroup{
            ShutdownGracePeriodByPodPriority: period,
        })
    }
    for _, pod := range pods {
        var priority int32
        if pod.Spec.Priority != nil {
            priority = *pod.Spec.Priority
        }
        // Find the group index according to the priority.
        index := sort.Search(len(groups), func(i int) bool {
            return groups[i].Priority >= priority
        })
        // 1. Those higher than the highest priority default to the highest priority
        // 2. Those lower than the lowest priority default to the lowest priority
        // 3. Those boundary priority default to the lower priority
        // if priority of pod is:
        // groups[index-1].Priority <= pod priority < groups[index].Priority
        // in which case we want to pick lower one (i.e index-1)
        if index == len(groups) {
            index = len(groups) - 1
        } else if index < 0 {
            index = 0
        } else if index > 0 && groups[index].Priority > priority {
            index--
        }
        groups[index].Pods = append(groups[index].Pods, pod)
    }
    return groups
}

See:

Unless you had something else in mind?

@atiratree
Member

I meant this as a continuation of our thread about the edge cases, not the main/standard flow.

@ndixita
Contributor

ndixita commented Feb 7, 2024

/cc @ndixita

@ndixita
Contributor

ndixita commented Feb 7, 2024

/remove-kind bug
/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Feb 7, 2024
@tuibeovince

I believe we are terminating critical and highest-priority workloads last, starting with the lowest ones, per:

func groupByPriority(shutdownGracePeriodByPodPriority []kubeletconfig.ShutdownGracePeriodByPodPriority, pods []*v1.Pod) []podShutdownGroup {
    groups := make([]podShutdownGroup, 0, len(shutdownGracePeriodByPodPriority))
    for _, period := range shutdownGracePeriodByPodPriority {
        groups = append(groups, podShutdownGroup{
            ShutdownGracePeriodByPodPriority: period,
        })
    }
    for _, pod := range pods {
        var priority int32
        if pod.Spec.Priority != nil {
            priority = *pod.Spec.Priority
        }
        // Find the group index according to the priority.
        index := sort.Search(len(groups), func(i int) bool {
            return groups[i].Priority >= priority
        })
        // 1. Those higher than the highest priority default to the highest priority
        // 2. Those lower than the lowest priority default to the lowest priority
        // 3. Those boundary priority default to the lower priority
        // if priority of pod is:
        // groups[index-1].Priority <= pod priority < groups[index].Priority
        // in which case we want to pick lower one (i.e index-1)
        if index == len(groups) {
            index = len(groups) - 1
        } else if index < 0 {
            index = 0
        } else if index > 0 && groups[index].Priority > priority {
            index--
        }
        groups[index].Pods = append(groups[index].Pods, pod)
    }
    return groups
}

A point of curiosity, but does this mean that the normal GracefulNodeShutdown does not necessarily look into pod's priority? Shouldn't something like that be a more default behavior and a non-priority based Graceful shutdown be an option?

@kwilczynski
Member Author

kwilczynski commented Feb 8, 2024

I believe we are terminating critical and highest-priority workloads last, starting with the lowest ones, per:

func groupByPriority(shutdownGracePeriodByPodPriority []kubeletconfig.ShutdownGracePeriodByPodPriority, pods []*v1.Pod) []podShutdownGroup {
    groups := make([]podShutdownGroup, 0, len(shutdownGracePeriodByPodPriority))
    for _, period := range shutdownGracePeriodByPodPriority {
        groups = append(groups, podShutdownGroup{
            ShutdownGracePeriodByPodPriority: period,
        })
    }
    for _, pod := range pods {
        var priority int32
        if pod.Spec.Priority != nil {
            priority = *pod.Spec.Priority
        }
        // Find the group index according to the priority.
        index := sort.Search(len(groups), func(i int) bool {
            return groups[i].Priority >= priority
        })
        // 1. Those higher than the highest priority default to the highest priority
        // 2. Those lower than the lowest priority default to the lowest priority
        // 3. Those boundary priority default to the lower priority
        // if priority of pod is:
        // groups[index-1].Priority <= pod priority < groups[index].Priority
        // in which case we want to pick lower one (i.e index-1)
        if index == len(groups) {
            index = len(groups) - 1
        } else if index < 0 {
            index = 0
        } else if index > 0 && groups[index].Priority > priority {
            index--
        }
        groups[index].Pods = append(groups[index].Pods, pod)
    }
    return groups
}

A point of curiosity, but does this mean that the normal GracefulNodeShutdown does not necessarily look into pod's priority? Shouldn't something like that be a more default behavior and a non-priority based Graceful shutdown be an option?

This is the default behaviour. There are two concepts here:

  • The ability to set different graceful shutdown periods for Pods of different priorities (this is not the default)
  • Shutting down Pods based on their priority, from lowest to highest

If no dedicated graceful shutdown periods are provided for Pods of different priorities, then the same graceful shutdown period is used for everything (something that the user also configures in order to enable this feature).
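For completeness, the priority-based variant is configured through shutdownGracePeriodByPodPriority in the kubelet configuration; the priority values and periods below are only an example:

shutdownGracePeriodByPodPriority:
  # Pods with priority between 0 (inclusive) and 2000000000 get up to 120 seconds.
  - priority: 0
    shutdownGracePeriodSeconds: 120
  # Pods at or above priority 2000000000 (e.g. system-cluster-critical) are
  # terminated last, with up to 60 seconds.
  - priority: 2000000000
    shutdownGracePeriodSeconds: 60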

@tuibeovince

If no dedicated graceful shutdown periods are provided for Pods of different priorities, then the same graceful shutdown period is used for everything (something that the user also configures in order to enable this feature).

I am still trying to understand, but are pod priorities only dependent on the individual grace-period settings of each pod?

@kwilczynski
Member Author

If no dedicated graceful shutdown periods are provided for Pods of different priorities, then the same graceful shutdown period is used for everything (something that the user also configures in order to enable this feature).

I am still trying to understand, but are pod priorities only dependent on the individual grace-period settings of each pod?

Feel free to peruse the following KEP that was dedicated to this feature:

@kannon92
Contributor

/triage accepted
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 13, 2024
@kwilczynski
Member Author

To clarify:

This issue does not impact DaemonSets or any other workloads post-startup (or after the node has been brought back up following a shutdown). The problems described here pertain only to the window in which the Graceful Node Shutdown code has detected a restart or shutdown signal and the node is in the process of changing its state (to either restart or power off).

@tuibeovince

Normal flow

  1. daemonset application with critical priority is rolling out a new image on all nodes
  2. daemonset pod is terminated on node A
  3. new revision of daemonset pod is started on node A
  4. node A is being turned off and graceful shutdown is started
  5. the daemonset application runs properly during the graceful shutdown

Non-graceful flow

  1. daemonset application with critical priority is rolling out a new image on all nodes
  2. daemonset pod is terminated on node A
  3. node A is being turned off and graceful shutdown is started
  4. new revision of daemonset pod cannot be started on node A
  5. outage of the daemonset application on node A during the graceful shutdown

Is there an ideal graceful flow for handling Daemonset pods at this moment?

Would it be something like:

Pattern A:

  1. daemonset application with critical priority is rolling out a new image on all nodes
  2. daemonset pods are given a critical priority to be terminated last
  3. node A is being turned off and graceful shutdown is started
  4. node A drains as normal, ignoring DaemonSets; daemonset pods are not terminated and nothing new starts on node A
  5. the daemonset application runs properly during the graceful shutdown until the very last moment

or

Pattern B:

  1. daemonset application with critical priority is rolling out a new image on all nodes
  2. daemonset pod is terminated on node A
  3. nodeshutdown manager freely allows new revision of daemonset pod to start on node A
  4. node A is being turned off and graceful shutdown is started; no other attempts of deleting daemonset pods in node A
  5. the daemonset application runs properly during the graceful shutdown until the node shuts down

@kwilczynski
Member Author

kwilczynski commented May 7, 2024

@tuibeovince, DaemonSet-scheduled pods should not be running on a node that is shutting down and has the Graceful Node Shutdown feature enabled.

There was never an option, at least as per the KEP, to allow certain types of workloads to remain running while the node is shutting down. So, no, we aren't handling DaemonSet pods specially; we treat them like any other pods, subject to the same approach to termination.

@kwilczynski
Member Author

/unassign @kwilczynski
