
Support for DaemonSet eviction when draining nodes #6158


Open
ailurarctos opened this issue Feb 17, 2022 · 14 comments
Labels
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • kind/proposal: Issues or PRs related to proposals.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • priority/backlog: Higher priority than priority/awaiting-more-evidence.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

@ailurarctos

(I'm not sure if this feature request is large enough to require the CAEP process. If it is please let me know.)

User Story

As a user, I would like some mechanism to have my DaemonSet pods gracefully terminated when draining nodes for deletion, so that those pods can complete their shutdown process.

Detailed Description

Currently Cluster API uses the standard kubectl drain behavior, which ignores all DaemonSets (link). I would like some way to have my DaemonSet pods also be gracefully terminated as part of the node deletion process.
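For context, here is a minimal sketch of what that drain looks like when built on the kubectl drain library (k8s.io/kubectl/pkg/drain), with the DaemonSet-ignoring behavior this issue is about. This is illustrative only, not Cluster API's actual code:

```go
package drainutil

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

// drainNode cordons a node and drains it, skipping DaemonSet Pods the same
// way `kubectl drain --ignore-daemonsets` does.
func drainNode(ctx context.Context, client kubernetes.Interface, nodeName string) error {
	helper := &drain.Helper{
		Ctx:                 ctx,
		Client:              client,
		IgnoreAllDaemonSets: true, // DaemonSet Pods are left running on the node
		DeleteEmptyDirData:  true,
		GracePeriodSeconds:  -1, // respect each Pod's own termination grace period
		Timeout:             10 * time.Minute,
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if err := drain.RunCordonOrUncordon(helper, node, true); err != nil {
		return err
	}
	return drain.RunNodeDrain(helper, nodeName)
}
```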

Anything else you would like to add:

While investigating whether this is currently possible, I saw that Cluster Autoscaler provides a mechanism to control DaemonSet draining. I'm planning to use that in the interim, but it would be nice for draining to also happen when nodes are not drained by Cluster Autoscaler (e.g. during cluster upgrades).
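For reference, the Cluster Autoscaler mechanism I mean is (if I'm reading the autoscaler FAQ right) a per-Pod annotation that opts DaemonSet Pods into eviction during scale-down. Roughly, in Go:

```go
package example

import appsv1 "k8s.io/api/apps/v1"

// enableDSEviction opts a DaemonSet's Pods into Cluster Autoscaler's
// DaemonSet eviction during node scale-down, via the Pod annotation
// documented in the autoscaler FAQ.
func enableDSEviction(ds *appsv1.DaemonSet) {
	if ds.Spec.Template.Annotations == nil {
		ds.Spec.Template.Annotations = map[string]string{}
	}
	ds.Spec.Template.Annotations["cluster-autoscaler.kubernetes.io/enable-ds-eviction"] = "true"
}
```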

I also looked into the graceful node shutdown feature, but in my case the pod drain time is quite long (could be 30 minutes or longer) and I'm not sure the feature would work for such long termination times, especially on EC2. I don't think EC2 will let you stall instance termination for that long. It's hard to find documentation on how long an EC2 instance can inhibit shutdown, but I did see this saying 10 minutes is typically the maximum.
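For anyone else considering that route: graceful node shutdown is configured through the kubelet configuration's shutdown grace period fields. Expressed with the KubeletConfiguration Go types, my 30-minute case would look roughly like this (the split between the two values is just an illustration):

```go
package config

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kubeletv1beta1 "k8s.io/kubelet/config/v1beta1"
)

// Graceful node shutdown settings. The kubelet delays node shutdown for
// ShutdownGracePeriod in total, reserving ShutdownGracePeriodCriticalPods
// of that window for critical Pods.
var kubeletCfg = kubeletv1beta1.KubeletConfiguration{
	ShutdownGracePeriod:             metav1.Duration{Duration: 30 * time.Minute},
	ShutdownGracePeriodCriticalPods: metav1.Duration{Duration: 5 * time.Minute},
}
```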

The other thing I saw while investigating is that Cluster API machine deletion has a pre-terminate hook. It seems like it might be possible to implement DaemonSet pod eviction with a custom Hook Implementing Controller (HIC); see the sketch below. Is that the preferred way to implement something like this? If so, I can close this feature request and look into writing the HIC.
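To make the idea concrete, here is a rough sketch of what such a HIC could look like, assuming a hook named daemonset-drain registered via the pre-terminate annotation prefix. The hook name and the evictDaemonSetPods helper are hypothetical; only the annotation prefix comes from the Cluster API docs:

```go
package hic

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Hypothetical hook name under the documented pre-terminate annotation
// prefix for Machine deletion phase hooks.
const hookAnnotation = "pre-terminate.delete.hook.machine.cluster.x-k8s.io/daemonset-drain"

type Reconciler struct {
	client.Client
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	machine := &clusterv1.Machine{}
	if err := r.Get(ctx, req.NamespacedName, machine); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Only act while the Machine is being deleted, still has a Node,
	// and our hook annotation is still present.
	if machine.DeletionTimestamp.IsZero() || machine.Status.NodeRef == nil {
		return ctrl.Result{}, nil
	}
	if _, ok := machine.Annotations[hookAnnotation]; !ok {
		return ctrl.Result{}, nil
	}
	// Gracefully evict the DaemonSet Pods on the Machine's Node.
	if err := r.evictDaemonSetPods(ctx, machine.Status.NodeRef); err != nil {
		return ctrl.Result{}, err
	}
	// Remove our annotation so the Machine controller can proceed with
	// infrastructure deletion.
	delete(machine.Annotations, hookAnnotation)
	return ctrl.Result{}, r.Update(ctx, machine)
}

// evictDaemonSetPods is a placeholder for the actual eviction logic,
// e.g. calling the policy/v1 Eviction API for each DaemonSet-owned Pod.
func (r *Reconciler) evictDaemonSetPods(ctx context.Context, nodeRef *corev1.ObjectReference) error {
	return nil
}
```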

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Feb 17, 2022
@kfox1111
Copy link

Kind of sounds like kubernetes/kubernetes#75482 :/

@ailurarctos
Author

> Kind of sounds like kubernetes/kubernetes#75482 :/

Yes, I think if kubernetes/kubernetes#75482 were implemented it could potentially be used to implement this feature request.

@vincepri
Member

/milestone Next
/kind proposal

@k8s-ci-robot k8s-ci-robot added the kind/proposal Issues or PRs related to proposals. label Feb 17, 2022
@k8s-ci-robot k8s-ci-robot added this to the Next milestone Feb 17, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 18, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 17, 2022
@ailurarctos
Author

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 8, 2022
@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini fabriziopandini removed this from the Next milestone Jul 29, 2022
@fabriziopandini fabriziopandini removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 27, 2022
@fabriziopandini
Member

/lifecycle frozen
Based on experience, we are slowly surfacing knobs for machine deletion/drain, and this falls into that category. As documented above, this could require a small proposal.

/help

@k8s-ci-robot
Contributor

@fabriziopandini:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

> /lifecycle frozen
> Based on experience, we are slowly surfacing knobs for machine deletion/drain, and this falls into that category. As documented above, this could require a small proposal.
>
> /help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 3, 2022
@fabriziopandini
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Nov 30, 2022
@atiratree

This feature might eventually be supported by Declarative Node Maintenance: kubernetes/enhancements#4213

@fabriziopandini
Member

/priority backlog

@sbueringer
Member

sbueringer commented Sep 30, 2024

Do we know how cluster-autoscaler implemented this feature?

In general, the DaemonSet controller adds a toleration for the unschedulable taint to all DaemonSet Pods (https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#taints-and-tolerations).
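Concretely, every DaemonSet Pod gets roughly this toleration added by the controller, expressed here with the core/v1 Go types:

```go
package example

import corev1 "k8s.io/api/core/v1"

// The toleration the DaemonSet controller adds to DaemonSet Pods, which
// lets them schedule onto (and stay on) cordoned, i.e. unschedulable, nodes.
var unschedulableToleration = corev1.Toleration{
	Key:      "node.kubernetes.io/unschedulable",
	Operator: corev1.TolerationOpExists,
	Effect:   corev1.TaintEffectNoSchedule,
}
```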

So while it's possible to evict DaemonSet Pods, they will just be immediately re-created ("cordon" effectively doesn't work for them because of that toleration).

I would guess they added a cluster-autoscaler-specific taint to the Node?

In general, it would be better if evicting DaemonSet Pods were cleanly supported in core Kubernetes first.

@sbueringer sbueringer removed their assignment Oct 1, 2024
@chrischdi
Member

chrischdi commented Oct 1, 2024

Took a quick look at the autoscaler's code.

It looks to me like they don't handle the fact that the DaemonSet controller will schedule a replacement Pod. They seem to ignore that, evict the currently running Pods once, and accept the resulting race:

  • if the eviction was successful, deletion just continues (Pods are not listed again; it only waits for the Pods that existed before the DaemonSet Pods were evicted to be gone)
  • if it was not successful, it tries again, possibly with a different set of Pods

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/scaledown/actuation/group_deletion_scheduler.go#L100-L116
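For illustration, here is an evict-once sketch in the same spirit, using the policy/v1 Eviction API via client-go. This is my reading of the approach, not the autoscaler's actual code:

```go
package evictonce

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictDaemonSetPodsOnce evicts the given Pods exactly once. Replacement
// Pods scheduled by the DaemonSet controller afterwards are deliberately
// not handled, which is the race described in the comment above.
func evictDaemonSetPodsOnce(ctx context.Context, client kubernetes.Interface, pods []corev1.Pod) error {
	for _, pod := range pods {
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		if err := client.PolicyV1().Evictions(pod.Namespace).Evict(ctx, eviction); err != nil {
			return err
		}
	}
	return nil
}
```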
