
DaemonSet controller and Graceful Node Shutdown manager disagree when making workloads placement decision #122912


Open
Tracked by #2000
kwilczynski opened this issue Jan 22, 2024 · 26 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/node Categorizes an issue or PR as relevant to SIG Node. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@kwilczynski
Member

kwilczynski commented Jan 22, 2024

What happened?

Currently, when the Graceful Node Shutdown feature is enabled and the node is shutting down, running workloads are terminated in order of workload type and priority, while taking into account the termination grace period that a given workload may have set.

However, even though DaemonSet-type workloads are also successfully vacated from a node that is shutting down, there are some lingering issues with this specific workload type.

When a node shutdown is detected, the Shutdown Manager updates the node status to "NotReady" (this does not set the "Unschedulable" flag the way cordoning a node would), adds the necessary taints, and then proceeds to terminate the currently running workloads, taking into account the priorities of the different workload types and ensuring that crucial system workloads (often run as DaemonSets) are terminated last, so that the node does not lose network access, metrics can still be sent, and monitoring is not interrupted until the very end.

Part of preparing the underlying node for shutdown is adding an admission callback with a singular purpose: reject anything and everything attempting to start on this specific node. The code is very simple and makes no attempt to handle different types of workloads; everything alike is rejected while the node is shutting down.

(an example taken from kubernetes/kubernetes/pkg/kubelet/nodeshutdown/nodeshutdown_manager_linux.go#L146-L157)

func (m *managerImpl) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
    nodeShuttingDown := m.ShutdownStatus() != nil
    if nodeShuttingDown {
        return lifecycle.PodAdmitResult{
            Admit:   false,
            Reason:  nodeShutdownNotAdmittedReason,
            Message: nodeShutdownNotAdmittedMessage,
        }
    }
    return lifecycle.PodAdmitResult{Admit: true}
}

This is where the DaemonSet controller and the Graceful Node Shutdown manager disagree: the former sees that its previously managed workloads were removed, assumes this is an invalid state, and attempts to reconcile it by starting the missing Pods again, whereas the latter sees a new Pod being scheduled on a node that is currently shutting down and rejects it (like any other workload of any type). This cycle repeats for as long as the workload's restart policy and error-budget tolerance allow, or indefinitely (for as long as the node is still pending its shutdown or restart).

(an example taken from kubernetes/kubernetes/pkg/controller/daemon/daemon_controller.go#L1278-L1301)

func NodeShouldRunDaemonPod(node *v1.Node, ds *apps.DaemonSet) (bool, bool) {
    pod := NewPod(ds, node.Name)
    // If the daemon set specifies a node name, check that it matches with node.Name.
    if !(ds.Spec.Template.Spec.NodeName == "" || ds.Spec.Template.Spec.NodeName == node.Name) {
        return false, false
    }
    taints := node.Spec.Taints
    fitsNodeName, fitsNodeAffinity, fitsTaints := predicates(pod, node, taints)
    if !fitsNodeName || !fitsNodeAffinity {
        return false, false
    }
    if !fitsTaints {
        // Scheduled daemon pods should continue running if they tolerate NoExecute taint.
        _, hasUntoleratedTaint := v1helper.FindMatchingUntoleratedTaint(taints, pod.Spec.Tolerations, func(t *v1.Taint) bool {
            return t.Effect == v1.TaintEffectNoExecute
        })
        return false, !hasUntoleratedTaint
    }
    return true, true
}

(an example taken from kubernetes/kubernetes/pkg/controller/daemon/daemon_controller.go#L1304-L1313)

func predicates(pod *v1.Pod, node *v1.Node, taints []v1.Taint) (fitsNodeName, fitsNodeAffinity, fitsTaints bool) {
    fitsNodeName = len(pod.Spec.NodeName) == 0 || pod.Spec.NodeName == node.Name
    // Ignore parsing errors for backwards compatibility.
    fitsNodeAffinity, _ = nodeaffinity.GetRequiredNodeAffinity(pod).Match(node)
    _, hasUntoleratedTaint := v1helper.FindMatchingUntoleratedTaint(taints, pod.Spec.Tolerations, func(t *v1.Taint) bool {
        return t.Effect == v1.TaintEffectNoExecute || t.Effect == v1.TaintEffectNoSchedule
    })
    fitsTaints = !hasUntoleratedTaint
    return
}

As the worker node can run workloads that take a long time to stop and terminate, the number of Pods with scheduling failures can keep increasing over time, leaving these failed attempts to be cleaned up manually, such as those with the "NodeShutdown" status (do these types of statuses need manual clean-up too?), or even permanently stuck in the "Pending" or "Terminating" states. Workloads terminated with the "Completed" status should, provided everything works as intended, be cleaned up and rescheduled automatically, especially once the node is back up and running again.

Some other workloads might report scheduling failures, leaving Pods in an "Error" state that would also require manual clean-up.
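For reference, one way to clean up such leftover Pods manually is by their phase; the command below is only an illustration and assumes the rejected Pods ended up in the Failed phase (as should be the case for the "NodeShutdown" status):

$ kubectl delete pods --all-namespaces --field-selector=status.phase=Failed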

An example of this taking place during a node shutdown:

Peek.2024-01-17.17-21.webm

What did you expect to happen?

Most workload types (Pods, Deployments, ReplicaSets, DaemonSets, etc.) should be correctly removed from the underlying node, and the remaining Pods then terminated.

How can we reproduce it (as minimally and precisely as possible)?

Only a few steps are needed to reproduce this behaviour:

  • Configure kubelet on a single worker node to enable Graceful Node Shutdown

This can be done using the following kubelet configuration properties:

shutdownGracePeriod: 3600s
shutdownGracePeriodCriticalPods: 1800s
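With these set to non-zero values, kubelet should register a systemd "delay" inhibitor lock via logind when it starts. One way to confirm the feature is active on the node (assuming systemd is in use) is to list the inhibitors and look for an entry owned by kubelet:

$ systemd-inhibit --list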

A very simple kubelet configuration file as an example:

apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/pki/ca.crt
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: 0s
    cacheUnauthorizedTTL: 0s
cgroupDriver: systemd
clusterDNS:
- 172.17.0.10
clusterDomain: cluster.local
containerRuntimeEndpoint: ""
cpuManagerReconcilePeriod: 0s
evictionPressureTransitionPeriod: 0s
fileCheckFrequency: 0s
healthzBindAddress: 127.0.0.1
healthzPort: 10248
httpCheckFrequency: 0s
imageMaximumGCAge: 0s
imageMinimumGCAge: 0s
kind: KubeletConfiguration
logging:
  flushFrequency: 0
  options:
    json:
      infoBufferSize: "0"
  verbosity: 0
memorySwap: {}
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
resolvConf: /run/systemd/resolve/resolv.conf
rotateCertificates: true
runtimeRequestTimeout: 0s
shutdownGracePeriod: 3600s
shutdownGracePeriodCriticalPods: 1800s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
volumeStatsAggPeriod: 0s
  • Prepare a simple test workload (a DaemonSet)

An example test.yaml:

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: test
  namespace: default
  labels:
    app.kubernetes.io/name: test
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: test
  template:
    metadata:
      labels:
        app.kubernetes.io/name: test
    spec:
      nodeSelector:
        kubernetes.io/os: linux
      tolerations:
        - operator: Exists
      containers:
        - name: test
          image: docker.io/library/bash:latest
          command:
            - /bin/sh
            - -c
            - |
              ( while true ; do date ; sleep 1 ; done ) &
              pid=$!
              echo "Starting!"
              _term () {
                  kill $pid
                  echo "Caught SIGTERM!"
              }
              trap _term TERM
              wait $pid
              trap - TERM
              wait $pid
              exit 0
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - |
                    echo BEGIN preStop > /proc/1/fd/1
                    sleep 10
                    echo END preStop > /proc/1/fd/1
      terminationGracePeriodSeconds: 30
      restartPolicy: Always
  • Simulate the node shutdown

Emit the org.freedesktop.login1.Manager.PrepareForShutdown signal to simulate the node being shut down:

An example using the gdbus utility:

# gdbus emit --system --object-path /org/freedesktop/login1 --signal 'org.freedesktop.login1.Manager.PrepareForShutdown' true

You should see the following entries in the kubelet log (provided a sufficient log level is set):

Jan 10 07:28:44 worker-node01 kubelet[6188]: I0110 07:28:44.397678    6188 nodeshutdown_manager_linux.go:265] "Shutdown manager detected new shutdown event, isNodeShuttingDownNow" event=true
Jan 10 07:28:44 worker-node01 kubelet[6188]: I0110 07:28:44.397692    6188 nodeshutdown_manager_linux.go:273] "Shutdown manager detected new shutdown event" event="shutdown"
Jan 10 07:28:44 worker-node01 kubelet[6188]: I0110 07:28:44.397703    6188 nodeshutdown_manager_linux.go:322] "Shutdown manager processing shutdown event"
Jan 10 07:28:44 worker-node01 kubelet[6188]: I0110 07:28:44.398563    6188 event.go:376] "Event occurred" object="worker-node01" fieldPath="" kind="Node" apiVersion="" type="Normal" reason="Shutdown" message="Shutdown manager detected shutdown event"

At this point, kubelet should already be working on terminating workloads on the given node.
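One way to observe the create-and-reject cycle described earlier is to watch the Pods bound to the affected node (the node name below is simply the one used in this reproduction):

$ kubectl get pods --all-namespaces --field-selector=spec.nodeName=worker-node01 --watch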

After a few moments, the DaemonSet controller will attempt to schedule workloads, including the test one, back onto the worker node that is currently shutting down.

Excerpts from the logs captured when this takes place:

(the affected Pod name is test-cj4n2)

kwilczynski/a194305c0b2324aa1d254cccc40da5b3

(the affected Pod name is node-exporter-9cvc2)

kwilczynski/b2baa49657501c2ab8ad98e09d853ac2

Anything else we need to know?

An important characteristic of a DaemonSet is its ability to ignore most default worker node statuses (such as "NotReady" and "SchedulingDisabled"), to ignore the "Unschedulable" flag set on a given node, and to ignore most of the default node taints. The basic idea is to allow workloads run as DaemonSets to run everywhere, including on nodes that should not schedule any other types of workloads.

This design decision, which is also the key feature of a DaemonSet, is something that can interfere with the Graceful Node Shutdown feature.
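For illustration, the DaemonSet controller automatically adds tolerations such as the following to every daemon Pod (abbreviated here; the test DaemonSet above additionally tolerates everything via operator: Exists), which is why node taints alone do not keep these Pods away:

tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
  - key: node.kubernetes.io/disk-pressure
    operator: Exists
    effect: NoSchedule
  - key: node.kubernetes.io/memory-pressure
    operator: Exists
    effect: NoSchedule
  - key: node.kubernetes.io/pid-pressure
    operator: Exists
    effect: NoSchedule
  - key: node.kubernetes.io/unschedulable
    operator: Exists
    effect: NoSchedule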

How to fix this problem? Some viable fixes would be:

  • Do not remove DaemonSets from the node that is due to be shut down (not too dissimilar to what cordoning and draining a node would do)
  • Teach the DaemonSet controller about "NodeShutdown" status for a node so that it avoids attempting to schedule Pods there
  • Add a new type of taint and teach the DaemonSet controller to include it as part of its scheduling decision making
  • Teach the Graceful Node Shutdown manager to ignore DaemonSets until the very last moment before these types of workloads are due to be evicted

Some of the above could be controlled via a new setting for kubelet, like the behaviour that controls whether to evict DaemonSets and when.

Additionally, the "Unschedulable" flag could be set to "true" for good measure, to ensure that absolutely no workloads attempt to schedule themselves, or at least to make it harder for them to do so.
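As a rough sketch of the taint-based option above (nothing here exists today; the taint key and the helper function are hypothetical), the DaemonSet controller could short-circuit on a dedicated shutdown taint that daemon Pods do not receive a default toleration for:

package daemon

import (
    v1 "k8s.io/api/core/v1"
)

// taintNodeShutdown is a hypothetical taint key that the Graceful Node Shutdown
// manager would apply; it is not an existing, well-known Kubernetes taint.
const taintNodeShutdown = "node.kubernetes.io/shutdown"

// nodeIsShuttingDown reports whether the node carries the hypothetical shutdown
// taint with the NoSchedule effect.
func nodeIsShuttingDown(node *v1.Node) bool {
    for _, taint := range node.Spec.Taints {
        if taint.Key == taintNodeShutdown && taint.Effect == v1.TaintEffectNoSchedule {
            return true
        }
    }
    return false
}

NodeShouldRunDaemonPod (shown earlier) could then return false for "should schedule" while still returning true for "should continue running", leaving the already-running daemon Pods to the shutdown manager's priority-ordered termination.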


Recently, there have been issues with status updates and synchronisation of different types of workloads when a node was shutting down. This appears to have since been fixed via the following Pull Requests:

However, some newer reports claim the contrary per:

Kubernetes version

  • Kubernetes
$ kubectl version
Client Version: v1.29.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.0
  • OpenShift
$ oc version 
Client Version: 4.14.4
Kustomize Version: v5.0.1
Server Version: 4.14.4
Kubernetes Version: v1.27.6+d548052

Cloud provider

N/A

OS version

  • Kubernetes
# On Linux:
$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
$ uname -a
Linux worker-node01 5.15.0-83-generic #92-Ubuntu SMP Mon Aug 14 09:30:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • OpenShift
# On Linux:
$ cat /etc/os-release 
NAME="Red Hat Enterprise Linux CoreOS"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION="414.92.202311222314-0"
VERSION_ID="4.14"
VARIANT="CoreOS"
VARIANT_ID=coreos
PLATFORM_ID="platform:el9"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 414.92.202311222314-0 (Plow)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:9::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.14/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.14"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.14"
OPENSHIFT_VERSION="4.14"
RHEL_VERSION="9.2"
OSTREE_VERSION="414.92.202311222314-0"

Install tools

  • Kubernetes
$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"29", GitVersion:"v1.29.0", GitCommit:"3f7a50f38688eb332e2a1b013678c6435d539ae6", GitTreeState:"clean", BuildDate:"2023-12-13T08:50:10Z", GoVersion:"go1.21.5", Compiler:"gc", Platform:"linux/amd64"}
  • OpenShift
$ openshift-install version
openshift-install 4.14.4
built from commit 82051a1c3d42e394be353bbe2eacf348ccb75970
release image registry.ci.openshift.org/ocp/release@sha256:e6e1d90b492d50438034e6edb46bdafa6c86ae3b80ef3328685912d89681fdee
release architecture amd64

Container runtime (CRI) and version (if applicable)

  • Kubernetes
$ crio --version
crio version 1.29.0
Version:        1.29.0
GitCommit:      d59bbdc252837107c9f5d235b8fb2650ff2b9d93
GitCommitDate:  2023-12-21T16:23:14Z
GitTreeState:   clean
BuildDate:      1970-01-01T00:00:00Z
GoVersion:      go1.21.1
Compiler:       gc
Platform:       linux/amd64
Linkmode:       static
BuildTags:      
  static
  netgo
  osusergo
  exclude_graphdriver_btrfs
  exclude_graphdriver_devicemapper
  seccomp
  apparmor
  selinux
LDFlags:          unknown
SeccompEnabled:   true
AppArmorEnabled:  false
$ conmon --version
conmon version 2.1.10
commit: 09bcded8e9c49cf1ff1fda403feac5a08f22535f-dirty
$ crun --version
crun version 1.12
commit: ce429cb2e277d001c2179df1ac66a470f00802ae
rundir: /run/user/1000/crun
spec: 1.0.0
+SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +YAJL
$ runc --version
runc version 1.1.11
commit: v1.1.11-0-g4bccb38c
spec: 1.0.2-dev
go: go1.20.12
libseccomp: 2.5.4
  • OpenShift
$ crio --version
crio version 1.27.2-2.rhaos4.14.git9d684e2.el9
Version:        1.27.2-2.rhaos4.14.git9d684e2.el9
GitCommit:      unknown
GitCommitDate:  unknown
GitTreeState:   clean
GoVersion:      go1.20.10
Compiler:       gc
Platform:       linux/amd64
Linkmode:       dynamic
BuildTags:      
  rpm_crashtraceback
  libtrust_openssl
  selinux
  seccomp
  exclude_graphdriver_devicemapper
  exclude_graphdriver_btrfs
  containers_image_ostree_stub
LDFlags:           -compressdwarf=false -B 0x8c7c7209f406ead112dcaacb10b18ba6826efc5e -extldflags '-Wl,-z,relro -Wl,--as-needed  -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 '
SeccompEnabled:   true
AppArmorEnabled:  false
$ conmon --version
conmon version 2.1.7
commit: a3013c55c8bfc2d0597ffe8beedadba8aab5bbf6
$ runc --version
runc version 1.1.10
spec: 1.0.2-dev
go: go1.20.10
libseccomp: 2.5.2

Related plugins (CNI, CSI, ...) and versions (if applicable)

  • Kubernetes
$ crictl exec -it ee8826f196d0a calicoctl version
Client Version:    v3.25.0
Git commit:        3f7fe4d29
Cluster Version:   v3.25.0
Cluster Type:      k8s,kdd,bgp,kubeadm
  • OpenShift

A lot of things. Probably too many to list. See the following for details:

@kwilczynski kwilczynski added the kind/bug Categorizes issue or PR as related to a bug. label Jan 22, 2024
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 22, 2024
@kwilczynski
Member Author

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 22, 2024
@pacoxu
Member

pacoxu commented Jan 23, 2024

/cc @wzshiming @bobbypage

@tuibeovince

Thank you for this @kwilczynski

This is very informative for me.

I was just wondering: since controllers are generally owned by sig-apps, if there are fixes that need to be made to the DaemonSet controller, should sig-apps be concerned about this issue, or at least about the PRs that end up working on the DaemonSet side of it?

@kwilczynski
Member Author

@tuibeovince, thank you!

We could use input from the other SIG on the best way to move forward here. Why? Because the DaemonSet controller works correctly and follows the expected behaviour.

There have been discussions in the past about whether the DaemonSet controller should or should not observe the current node status or any taints and tolerations set, and then either schedule workloads on a given node or not. However, the current behaviour is what has been agreed upon.

@kannon92
Contributor

/sig apps

@k8s-ci-robot k8s-ci-robot added the sig/apps Categorizes an issue or PR as relevant to SIG Apps. label Jan 30, 2024
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Apps Jan 30, 2024
@kannon92
Contributor

cc @mimowo

I was discussing this with @kwilczynski. Do you have any insight into this problem? I remember there was some work done on Statefulsets and pod transitions but wasn't sure what happened with DaemonSets.

@mimowo
Contributor

mimowo commented Jan 30, 2024

I was discussing this with @kwilczynski. Do you have any insight into this problem?

I don't know about this problem.

I remember there was some work done on Statefulsets and pod transitions but wasn't sure what happened with DaemonSets.

The closest I can think of was #118716, but I am not sure how related this is.

@kwilczynski
Member Author

kwilczynski commented Jan 31, 2024

@atiratree
Member

Part of the process of preparing the underlying node for shutdown is to add an admission callback that has a singular purpose—reject everything and anything attempting to start on this specific node. The code is very simple and makes no attempts to handle different types of workloads; everything in unison will be rejected during the time when the node is shutting down.

This part seems problematic. The DaemonSets may be running important services (critical priority) that should be available even during shutdown. So I would expect these pods to get admitted until their priority group is being shut down.

The DaemonSet controller's mission is to reconcile the current state (DaemonSet, Node, Pod) to achieve the desired state. A simple condition or taint does not carry enough information to instruct the DaemonSet controller. The controller needs to know which DaemonSets with what priorities should run on each node, and reconcile accordingly.

Teach the Graceful Node Shutdown manager to ignore DaemonSets until the very last moment before these types of workloads are due to be evicted

This might be surprising for users or you would need additional configuration for kubelet. And it could double the shutdown time.

I think, to fix this, there would have to be a signal about the progress of the graceful shutdown (by priority).

Do not remove DaemonSets from the node that is due to be shut down (not too dissimilar to what cordoning and draining a node would do)

This is under discussion, and it is possible that we will add DaemonSet removal to the node drain flow as part of the newly discussed Declarative Node Maintenance KEP.

https://github.com/kubernetes/enhancements/pull/4213/files#diff-a3337600d2065dcf21be514dcf3c3c62499168a12ae2e4f1aa2db55fddaddbecR223

Details
  1. Graceful deletion of DaemonSet pods is currently only supported as part of (Linux) graceful node
    shutdown. The length of the shutdown is again not application specific and is set cluster-wide
    (optionally by priority) by the cluster admin. This does not take into account
    .spec.terminationGracePeriodSeconds of each pod and may cause premature termination of
    the application. This has been discussed in issue kubernetes/kubernetes#75482
    and in issue kubernetes-sigs/cluster-api#6158.
  2. There are cases during a node shutdown, when data corruption can occur due to premature node
    shutdown. It would be great if applications could perform data migration and synchronization of
    cached writes to the underlying storage before the pod deletion occurs. This is not easy to
    quantify even with pod's .spec.shutdownGracePeriod, as the time depends on the size of the data
    and the speed of the storage. This has been discussed in issue kubernetes/kubernetes#116618
    and in issue kubernetes/kubernetes#115148.

It is also part of the non-goals section https://github.com/kubernetes/enhancements/blob/0e401d1e3fe2eae82b9e876b8bf30a6b7ef8ffba/keps/sig-apps/4212-declarative-node-maintenance/README.md#non-goals

Details
  • Introduce a field that could include non-critical daemon set pods (priority system-cluster-critical or system-node-critical) for node maintenance/drain request. The daemon set controller would then gracefully shut down these pods. Critical pods could be overridden by the priority list mentioned below.
  • NodeMaintenance could include a plan of which pods to target first. Similar to graceful node shutdown, we could include a list of priorities to decide which pods should be terminated first. This list could optionally include pod timeouts, but could also wait for all the pods of a given priority class to finish first without a timeout. This could also be used to target daemon set pods of certain priorities (see point above). We could also introduce drain profiles based on these lists. The cluster admin could then choose or create such a profile based on his/her needs. The logic for processing the decision list would be contained in the node maintenance controller, which would set an intent to selected pods to shut down via the EvacuationRequest condition.

Similar issues occur when also doing the node drain. Kubelet could offload this functionality to the NodeMaintenance in the future and just create such an object when graceful termination is observed. The node maintenance controller would sync the status of NodeMaintenance according to which priorities/types of pods are eligible to be running on each node. The daemon set controller could then react to this and coordinate pods for both node shutdown and node maintenance.

@kannon92
Contributor

kannon92 commented Feb 2, 2024

So my general understanding of the problem is that DaemonSets ignore the normal taints/tolerations (by design) and therefore keep running pods.

Kubelet does have the ability to set conditions on the node; this is how node pressure is reported. The eviction manager sets a few node conditions in eviction.go (really, it sets an array of conditions), and these do get populated to the node status.

I wonder if one possible solution would be to introduce a new node status condition for graceful shutdown. If the graceful shutdown manager sets a condition on the node, then we could potentially use that to enforce the desired behaviour on a DaemonSet.
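A rough sketch, purely for illustration, of what publishing such a condition could look like; the "GracefulShutdown" condition type, the helper name, and the reason string are all hypothetical and not part of the current kubelet or API:

package nodeshutdown

import (
    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NodeGracefulShutdown is a hypothetical node condition type.
const NodeGracefulShutdown v1.NodeConditionType = "GracefulShutdown"

// markNodeShuttingDown adds or refreshes the hypothetical condition on a node
// status; actually publishing it would go through the kubelet's regular node
// status update machinery.
func markNodeShuttingDown(status *v1.NodeStatus) {
    cond := v1.NodeCondition{
        Type:               NodeGracefulShutdown,
        Status:             v1.ConditionTrue,
        Reason:             "ShutdownSignalReceived",
        Message:            "node has begun a graceful shutdown",
        LastTransitionTime: metav1.Now(),
    }
    for i := range status.Conditions {
        if status.Conditions[i].Type == NodeGracefulShutdown {
            status.Conditions[i] = cond
            return
        }
    }
    status.Conditions = append(status.Conditions, cond)
}

The DaemonSet controller (or the scheduler) could then check for this condition before creating replacement daemon Pods on such a node.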

This part seems problematic. The DaemonSets may be running important services (critical priority) that should be available even during shutdown. So I would expect these pods to get admitted, until their priority is being shut down.

This is already possible with the gracefulshutdown by pod priority.

@atiratree
Member

This is already possible with the gracefulshutdown by pod priority.

I mean, it should be possible to accept a new critical priority workload even during a shutdown, if the low priority workloads are still running/shutting down. As mentioned above, the code does not seem to do this.

Details

func (m *managerImpl) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
    nodeShuttingDown := m.ShutdownStatus() != nil
    if nodeShuttingDown {
        return lifecycle.PodAdmitResult{
            Admit:   false,
            Reason:  nodeShutdownNotAdmittedReason,
            Message: nodeShutdownNotAdmittedMessage,
        }
    }
    return lifecycle.PodAdmitResult{Admit: true}
}

For example, you might have inconsistent behavior/bugs in the following case:

Normal flow

  1. daemonset application with critical priority is rolling out a new image on all nodes
  2. daemonset pod is terminated on node A
  3. new revision of daemonset pod is started on node A
  4. node A is being turned off and graceful shutdown is started
  5. the daemonset application runs properly during the graceful shutdown

Non-graceful flow

  1. daemonset application with critical priority is rolling out a new image on all nodes
  2. daemonset pod is terminated on node A
  3. node A is being turned off and graceful shutdown is started
  4. new revision of daemonset pod cannot be started on node A
  5. outage of the daemonset application on node A during the graceful shutdown

@kwilczynski
Member Author

This is already possible with the gracefulshutdown by pod priority.

I mean, it should be possible to accept a new critical priority workload even during a shutdown, if the low priority workloads are still running/shutting down. As mentioned above, the code does not seem to do this.

@atiratree, this is as intended. The code will terminate workloads in priority order (there is support for this), but it will not allow any new workloads, no matter what their status or priority might be, to be placed on the underlying node.

This simplicity avoids dealing with various edge cases, such as the one you outlined.

@atiratree
Member

atiratree commented Feb 6, 2024

This simplicity avoids dealing with various edge cases, such as the one you outlined.

I think the simplicity can hurt some of the workloads. It also goes against the declarative nature of DaemonSets and the Kubelet configuration.

It would be good to get some feedback/use cases from people running these DaemonSets, before graduating this feature to GA.

@kubernetes/sig-storage-leads are you okay with critical priority DaemonSets being unavailable/terminated before the normal priority user workloads are terminated during a graceful shutdown? (please see #122912 (comment))

@kwilczynski
Member Author

[...]

@kubernetes/sig-storage-leads are you okay with critical priority DaemonSets being unavailable/terminated before the normal priority user workloads are terminated during a graceful shutdown?

I believe we are terminating critical and highest-priority workloads last, starting with the lowest ones, per:

func groupByPriority(shutdownGracePeriodByPodPriority []kubeletconfig.ShutdownGracePeriodByPodPriority, pods []*v1.Pod) []podShutdownGroup {
    groups := make([]podShutdownGroup, 0, len(shutdownGracePeriodByPodPriority))
    for _, period := range shutdownGracePeriodByPodPriority {
        groups = append(groups, podShutdownGroup{
            ShutdownGracePeriodByPodPriority: period,
        })
    }
    for _, pod := range pods {
        var priority int32
        if pod.Spec.Priority != nil {
            priority = *pod.Spec.Priority
        }
        // Find the group index according to the priority.
        index := sort.Search(len(groups), func(i int) bool {
            return groups[i].Priority >= priority
        })
        // 1. Those higher than the highest priority default to the highest priority
        // 2. Those lower than the lowest priority default to the lowest priority
        // 3. Those boundary priority default to the lower priority
        // if priority of pod is:
        // groups[index-1].Priority <= pod priority < groups[index].Priority
        // in which case we want to pick lower one (i.e index-1)
        if index == len(groups) {
            index = len(groups) - 1
        } else if index < 0 {
            index = 0
        } else if index > 0 && groups[index].Priority > priority {
            index--
        }
        groups[index].Pods = append(groups[index].Pods, pod)
    }
    return groups
}

See:

Unless you had something else in mind?

@atiratree
Member

I meant this as a continuation of our thread about the edge cases, not the main/standard flow.

@ndixita
Contributor

ndixita commented Feb 7, 2024

/cc @ndixita

@ndixita
Contributor

ndixita commented Feb 7, 2024

/remove-kind bug
/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. and removed kind/bug Categorizes issue or PR as related to a bug. labels Feb 7, 2024
@tuibeovince

I believe we are terminating critical and highest-priority workloads last, starting with the lowest ones, per:

func groupByPriority(shutdownGracePeriodByPodPriority []kubeletconfig.ShutdownGracePeriodByPodPriority, pods []*v1.Pod) []podShutdownGroup {
    groups := make([]podShutdownGroup, 0, len(shutdownGracePeriodByPodPriority))
    for _, period := range shutdownGracePeriodByPodPriority {
        groups = append(groups, podShutdownGroup{
            ShutdownGracePeriodByPodPriority: period,
        })
    }
    for _, pod := range pods {
        var priority int32
        if pod.Spec.Priority != nil {
            priority = *pod.Spec.Priority
        }
        // Find the group index according to the priority.
        index := sort.Search(len(groups), func(i int) bool {
            return groups[i].Priority >= priority
        })
        // 1. Those higher than the highest priority default to the highest priority
        // 2. Those lower than the lowest priority default to the lowest priority
        // 3. Those boundary priority default to the lower priority
        // if priority of pod is:
        // groups[index-1].Priority <= pod priority < groups[index].Priority
        // in which case we want to pick lower one (i.e index-1)
        if index == len(groups) {
            index = len(groups) - 1
        } else if index < 0 {
            index = 0
        } else if index > 0 && groups[index].Priority > priority {
            index--
        }
        groups[index].Pods = append(groups[index].Pods, pod)
    }
    return groups
}

A point of curiosity, but does this mean that the normal GracefulNodeShutdown does not necessarily look into pod's priority? Shouldn't something like that be a more default behavior and a non-priority based Graceful shutdown be an option?

@kwilczynski
Member Author

kwilczynski commented Feb 8, 2024

I believe we are terminating critical and highest-priority workloads last, starting with the lowest ones, per:

func groupByPriority(shutdownGracePeriodByPodPriority []kubeletconfig.ShutdownGracePeriodByPodPriority, pods []*v1.Pod) []podShutdownGroup {
    groups := make([]podShutdownGroup, 0, len(shutdownGracePeriodByPodPriority))
    for _, period := range shutdownGracePeriodByPodPriority {
        groups = append(groups, podShutdownGroup{
            ShutdownGracePeriodByPodPriority: period,
        })
    }
    for _, pod := range pods {
        var priority int32
        if pod.Spec.Priority != nil {
            priority = *pod.Spec.Priority
        }
        // Find the group index according to the priority.
        index := sort.Search(len(groups), func(i int) bool {
            return groups[i].Priority >= priority
        })
        // 1. Those higher than the highest priority default to the highest priority
        // 2. Those lower than the lowest priority default to the lowest priority
        // 3. Those boundary priority default to the lower priority
        // if priority of pod is:
        // groups[index-1].Priority <= pod priority < groups[index].Priority
        // in which case we want to pick lower one (i.e index-1)
        if index == len(groups) {
            index = len(groups) - 1
        } else if index < 0 {
            index = 0
        } else if index > 0 && groups[index].Priority > priority {
            index--
        }
        groups[index].Pods = append(groups[index].Pods, pod)
    }
    return groups
}

A point of curiosity, but does this mean that the normal GracefulNodeShutdown does not necessarily look into pod's priority? Shouldn't something like that be a more default behavior and a non-priority based Graceful shutdown be an option?

This is the default behaviour. There are two concepts here:

  • The ability to set different graceful shutdown periods for Pods of different priorities (this is not the default)
  • Shutting down Pods based on their priority, from lowest to highest

If no dedicated graceful shutdown periods are provided for Pods of different priorities, then the same graceful shutdown period is used for everything (something that the user also configures in order to enable this feature).
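For completeness, the priority-based variant is configured through shutdownGracePeriodByPodPriority in the kubelet configuration; the priority values and periods below are only an example:

shutdownGracePeriodByPodPriority:
  # Pods with priority between 0 (inclusive) and 2000000000 get up to 120 seconds.
  - priority: 0
    shutdownGracePeriodSeconds: 120
  # Pods at or above priority 2000000000 (e.g. system-cluster-critical) are
  # terminated last, with up to 60 seconds.
  - priority: 2000000000
    shutdownGracePeriodSeconds: 60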

@tuibeovince

If no dedicated graceful shutdown periods are provided for Pods of different priorities, then the same graceful shutdown period is used for everything (something that the user also configures in order to enable this feature).

I am still trying to understand, but are pod priorities only dependent on the individual grace-period settings of each pod?

@kwilczynski
Member Author

If no dedicated graceful shutdown periods are provided for Pods of different priorities, then the same graceful shutdown period is used for everything (something that the user also configures in order to enable this feature).

I am still trying to understand, but are pod priorities only dependent on the individual grace-period settings of each pod?

Feel free to peruse the following KEP that was dedicated to this feature:

@kannon92
Contributor

/triage accepted
/priority important-longterm

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Feb 13, 2024
@kwilczynski
Member Author

To clarify:

This issue does not impact DaemonSets or any other workloads post-startup (or after the node has been brought back up following a shutdown). The problems described here pertain only to the window in which the Graceful Node Shutdown code has detected a restart or shutdown signal and the node is in the process of changing its state (to either restart or power off).

@tuibeovince

Normal flow

  1. daemonset application with critical priority is rolling out a new image on all nodes
  2. daemonset pod is terminated on node A
  3. new revision of daemonset pod is started on node A
  4. node A is being turned off and graceful shutdown is started
  5. the daemonset application runs properly during the graceful shutdown

Non-graceful flow

  1. daemonset application with critical priority is rolling out a new image on all nodes
  2. daemonset pod is terminated on node A
  3. node A is being turned off and graceful shutdown is started
  4. new revision of daemonset pod cannot be started on node A
  5. outage of the daemonset application on node A during the graceful shutdown

Is there an ideal graceful flow for handling Daemonset pods at this moment?

Would it be something like:

Pattern A:

  1. daemonset application with critical priority is rolling out a new image on all nodes
  2. daemonset pods are given a critical priority to be terminated last
  3. node A is being turned off and graceful shutdown is started
  4. node A drains as normal, ignoring DaemonSets; daemonset pods are not terminated and nothing new starts on node A
  5. the daemonset application runs properly during the graceful shutdown until the very last moment

or

Pattern B:

  1. daemonset application with critical priority is rolling out a new image on all nodes
  2. daemonset pod is terminated on node A
  3. nodeshutdown manager freely allows new revision of daemonset pod to start on node A
  4. node A is being turned off and graceful shutdown is started; no other attempts of deleting daemonset pods in node A
  5. the daemonset application runs properly during the graceful shutdown until the node shuts down

@kwilczynski
Member Author

kwilczynski commented May 7, 2024

@tuibeovince, DaemonSet-scheduled pods should not be running on a node that is shutting down and has the Graceful Node Shutdown feature enabled.

There was never an option, at least as per the KEP, to allow certain types of workloads to remain running while the node is shutting down. So, no, we aren't handling DaemonSet pods specially; we treat them like any other pods, subject to the same approach to termination.

@kwilczynski
Member Author

/unassign @kwilczynski
