WIP: Introduce Node Lifecycle WG #8396

Open · wants to merge 1 commit into base: master
5 changes: 5 additions & 0 deletions OWNERS_ALIASES
@@ -142,6 +142,11 @@ aliases:
- jeremyrickard
- liggitt
- micahhausler
wg-node-lifecycle-leads:
- atiratree
- fabriziopandini
- humblec
- rthallisey
wg-policy-leads:
- JimBugwadia
- poonam-lamba
1 change: 1 addition & 0 deletions communication/slack-config/channels.yaml
@@ -584,6 +584,7 @@ channels:
- name: wg-multitenancy
- name: wg-naming
archived: true
- name: wg-node-lifecycle
- name: wg-onprem
archived: true
- name: wg-policy
1 change: 1 addition & 0 deletions liaisons.md
@@ -59,6 +59,7 @@ members will assume one of the departing members groups.
| [WG Device Management](wg-device-management/README.md) | Patrick Ohly (**[@pohly](https://github.com/pohly)**) |
| [WG etcd Operator](wg-etcd-operator/README.md) | Maciej Szulik (**[@soltysh](https://github.com/soltysh)**) |
| [WG LTS](wg-lts/README.md) | Sascha Grunert (**[@saschagrunert](https://github.com/saschagrunert)**) |
| [WG Node Lifecycle](wg-node-lifecycle/README.md) | TBD (**[@TBD](https://github.com/TBD)**) |
| [WG Policy](wg-policy/README.md) | Patrick Ohly (**[@pohly](https://github.com/pohly)**) |
| [WG Serving](wg-serving/README.md) | Maciej Szulik (**[@soltysh](https://github.com/soltysh)**) |
| [WG Structured Logging](wg-structured-logging/README.md) | Sascha Grunert (**[@saschagrunert](https://github.com/saschagrunert)**) |
1 change: 1 addition & 0 deletions sig-apps/README.md
@@ -59,6 +59,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.
The following [working groups][working-group-definition] are sponsored by sig-apps:
* [WG Batch](/wg-batch)
* [WG Data Protection](/wg-data-protection)
* [WG Node Lifecycle](/wg-node-lifecycle)
* [WG Serving](/wg-serving)


1 change: 1 addition & 0 deletions sig-architecture/README.md
@@ -58,6 +58,7 @@ The Chairs of the SIG run operations and processes governing the SIG.
The following [working groups][working-group-definition] are sponsored by sig-architecture:
* [WG Device Management](/wg-device-management)
* [WG LTS](/wg-lts)
* [WG Node Lifecycle](/wg-node-lifecycle)
* [WG Policy](/wg-policy)
* [WG Serving](/wg-serving)
* [WG Structured Logging](/wg-structured-logging)
1 change: 1 addition & 0 deletions sig-autoscaling/README.md
@@ -48,6 +48,7 @@ The Chairs of the SIG run operations and processes governing the SIG.
The following [working groups][working-group-definition] are sponsored by sig-autoscaling:
* [WG Batch](/wg-batch)
* [WG Device Management](/wg-device-management)
* [WG Node Lifecycle](/wg-node-lifecycle)
* [WG Serving](/wg-serving)


6 changes: 6 additions & 0 deletions sig-cli/README.md
@@ -60,6 +60,12 @@ subprojects, and resolve cross-subproject technical issues and decisions.
- [@kubernetes/sig-cli-test-failures](https://github.com/orgs/kubernetes/teams/sig-cli-test-failures) - Test Failures and Triage
- Steering Committee Liaison: Paco Xu 徐俊杰 (**[@pacoxu](https://github.com/pacoxu)**)

## Working Groups

The following [working groups][working-group-definition] are sponsored by sig-cli:
* [WG Node Lifecycle](/wg-node-lifecycle)


## Subprojects

The following [subprojects][subproject-definition] are owned by sig-cli:
1 change: 1 addition & 0 deletions sig-cloud-provider/README.md
@@ -58,6 +58,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.
## Working Groups

The following [working groups][working-group-definition] are sponsored by sig-cloud-provider:
* [WG Node Lifecycle](/wg-node-lifecycle)
* [WG Structured Logging](/wg-structured-logging)


1 change: 1 addition & 0 deletions sig-cluster-lifecycle/README.md
@@ -52,6 +52,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.

The following [working groups][working-group-definition] are sponsored by sig-cluster-lifecycle:
* [WG LTS](/wg-lts)
* [WG Node Lifecycle](/wg-node-lifecycle)
* [WG etcd Operator](/wg-etcd-operator)


1 change: 1 addition & 0 deletions sig-list.md
@@ -66,6 +66,7 @@ When the need arises, a [new SIG can be created](sig-wg-lifecycle.md)
|[Device Management](wg-device-management/README.md)|[device-management](https://github.com/kubernetes/kubernetes/labels/wg%2Fdevice-management)|* Architecture<br>* Autoscaling<br>* Network<br>* Node<br>* Scheduling<br>|* [John Belamaric](https://github.com/johnbelamaric), Google<br>* [Kevin Klues](https://github.com/klueska), NVIDIA<br>* [Patrick Ohly](https://github.com/pohly), Intel<br>|* [Slack](https://kubernetes.slack.com/messages/wg-device-management)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-device-management)|* Regular WG Meeting: [Tuesdays at 8:30 PT (Pacific Time) (biweekly)](TBD)<br>
|[etcd Operator](wg-etcd-operator/README.md)|[etcd-operator](https://github.com/kubernetes/kubernetes/labels/wg%2Fetcd-operator)|* Cluster Lifecycle<br>* etcd<br>|* [Benjamin Wang](https://github.com/ahrtr), VMware<br>* [Ciprian Hacman](https://github.com/hakman), Microsoft<br>* [Josh Berkus](https://github.com/jberkus), Red Hat<br>* [James Blair](https://github.com/jmhbnz), Red Hat<br>* [Justin Santa Barbara](https://github.com/justinsb), Google<br>|* [Slack](https://kubernetes.slack.com/messages/wg-etcd-operator)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-etcd-operator)|* Regular WG Meeting: [Tuesdays at 11:00 PT (Pacific Time) (bi-weekly)](https://zoom.us/my/cncfetcdproject)<br>
|[LTS](wg-lts/README.md)|[lts](https://github.com/kubernetes/kubernetes/labels/wg%2Flts)|* Architecture<br>* Cluster Lifecycle<br>* K8s Infra<br>* Release<br>* Security<br>* Testing<br>|* [Jeremy Rickard](https://github.com/jeremyrickard), Microsoft<br>* [Jordan Liggitt](https://github.com/liggitt), Google<br>* [Micah Hausler](https://github.com/micahhausler), Amazon<br>|* [Slack](https://kubernetes.slack.com/messages/wg-lts)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-lts)|* Regular WG Meeting: [Tuesdays at 07:00 PT (Pacific Time) (biweekly)](https://zoom.us/j/92480197536?pwd=dmtSMGJRQmNYYTIyZkFlQ25JRngrdz09)<br>
|[Node Lifecycle](wg-node-lifecycle/README.md)|[node-lifecycle](https://github.com/kubernetes/kubernetes/labels/wg%2Fnode-lifecycle)|* Apps<br>* Architecture<br>* Autoscaling<br>* CLI<br>* Cloud Provider<br>* Cluster Lifecycle<br>* Network<br>* Node<br>* Scheduling<br>* Storage<br>|* [Filip Křepinský](https://github.com/atiratree), Red Hat<br>* [Fabrizio Pandini](https://github.com/fabriziopandini), VMware<br>* [Humble Chirammal](https://github.com/humblec), VMware<br>* [Ryan Hallisey](https://github.com/rthallisey), NVIDIA<br>|* [Slack](https://kubernetes.slack.com/messages/wg-node-lifecycle)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-node-lifecycle)|* WG Node Lifecycle Weekly Meeting: [TBDs at TBD TBD (weekly)]()<br>
|[Policy](wg-policy/README.md)|[policy](https://github.com/kubernetes/kubernetes/labels/wg%2Fpolicy)|* Architecture<br>* Auth<br>* Multicluster<br>* Network<br>* Node<br>* Scheduling<br>* Storage<br>|* [Jim Bugwadia](https://github.com/JimBugwadia), Kyverno/Nirmata<br>* [Poonam Lamba](https://github.com/poonam-lamba), Google<br>* [Andy Suderman](https://github.com/sudermanjr), Fairwinds<br>|* [Slack](https://kubernetes.slack.com/messages/wg-policy)<br>* [Mailing List](https://groups.google.com/forum/#!forum/kubernetes-wg-policy)|* Regular WG Meeting: [Wednesdays at 8:00 PT (Pacific Time) (semimonthly)](https://zoom.us/j/7375677271)<br>
|[Serving](wg-serving/README.md)|[serving](https://github.com/kubernetes/kubernetes/labels/wg%2Fserving)|* Apps<br>* Architecture<br>* Autoscaling<br>* Instrumentation<br>* Network<br>* Node<br>* Scheduling<br>* Storage<br>|* [Eduardo Arango](https://github.com/ArangoGutierrez), NVIDIA<br>* [Jiaxin Shan](https://github.com/Jeffwan), Bytedance<br>* [Sergey Kanzhelev](https://github.com/SergeyKanzhelev), Google<br>* [Yuan Tang](https://github.com/terrytangyuan), Red Hat<br>|* [Slack](https://kubernetes.slack.com/messages/wg-serving)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-serving)|* WG Serving Weekly Meeting ([calendar](https://calendar.google.com/calendar/embed?src=e896b769743f3877edfab2d4c6a14132b2aa53287021e9bbf113cab676da54ba%40group.calendar.google.com)): [Wednesdays at 9:00 PT (Pacific Time) (weekly)](https://zoom.us/j/92615874244?pwd=VGhxZlJjRTNRWTZIS0dQV2MrZUJ5dz09)<br>
|[Structured Logging](wg-structured-logging/README.md)|[structured-logging](https://github.com/kubernetes/kubernetes/labels/wg%2Fstructured-logging)|* API Machinery<br>* Architecture<br>* Cloud Provider<br>* Instrumentation<br>* Network<br>* Node<br>* Scheduling<br>* Storage<br>|* [Mengjiao Liu](https://github.com/mengjiao-liu), Independent<br>* [Patrick Ohly](https://github.com/pohly), Intel<br>|* [Slack](https://kubernetes.slack.com/messages/wg-structured-logging)<br>* [Mailing List](https://groups.google.com/forum/#!forum/kubernetes-wg-structured-logging)|
1 change: 1 addition & 0 deletions sig-network/README.md
@@ -70,6 +70,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.

The following [working groups][working-group-definition] are sponsored by sig-network:
* [WG Device Management](/wg-device-management)
* [WG Node Lifecycle](/wg-node-lifecycle)
* [WG Policy](/wg-policy)
* [WG Serving](/wg-serving)
* [WG Structured Logging](/wg-structured-logging)
1 change: 1 addition & 0 deletions sig-node/README.md
@@ -55,6 +55,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.
The following [working groups][working-group-definition] are sponsored by sig-node:
* [WG Batch](/wg-batch)
* [WG Device Management](/wg-device-management)
* [WG Node Lifecycle](/wg-node-lifecycle)
* [WG Policy](/wg-policy)
* [WG Serving](/wg-serving)
* [WG Structured Logging](/wg-structured-logging)
1 change: 1 addition & 0 deletions sig-scheduling/README.md
@@ -67,6 +67,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.
The following [working groups][working-group-definition] are sponsored by sig-scheduling:
* [WG Batch](/wg-batch)
* [WG Device Management](/wg-device-management)
* [WG Node Lifecycle](/wg-node-lifecycle)
* [WG Policy](/wg-policy)
* [WG Serving](/wg-serving)
* [WG Structured Logging](/wg-structured-logging)
1 change: 1 addition & 0 deletions sig-storage/README.md
@@ -59,6 +59,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.

The following [working groups][working-group-definition] are sponsored by sig-storage:
* [WG Data Protection](/wg-data-protection)
* [WG Node Lifecycle](/wg-node-lifecycle)
* [WG Policy](/wg-policy)
* [WG Serving](/wg-serving)
* [WG Structured Logging](/wg-structured-logging)
52 changes: 52 additions & 0 deletions sigs.yaml
@@ -3697,6 +3697,58 @@ workinggroups:
liaison:
github: saschagrunert
name: Sascha Grunert
- dir: wg-node-lifecycle
name: Node Lifecycle
mission_statement: >
Explore and improve node and pod lifecycle in Kubernetes. This should result in
better node drain/maintenance support and better pod disruption/termination. It
should also improve node and pod autoscaling, application migration and availability,
load balancing, de/scheduling, node shutdown, and cloud provider integrations, and
support other new scenarios and integrations.
charter_link: charter.md
stakeholder_sigs:
Member: You are missing network, which owns load balancing, endpoints, ... and already has serious problems because of known issues like https://docs.google.com/document/d/1t25jgO_-LRHhjRXf4KJ5xY_t8BZYdapv7MDAxVGY6R8/edit?tab=t.0#heading=h.i4lwa7rdng7y

Member Author: +1, added the SIG and the doc link; we need to address the load balancing and endpoints as well.

- Apps
- Architecture
- Autoscaling
- CLI
- Cloud Provider
- Cluster Lifecycle
- Network
- Node
- Scheduling
Contributor: Don't we need to add other infra SIGs like Storage as stakeholders too?

Member Author: +1, added. This list is not final, and we will have conversations with each SIG after KubeCon about whether they want to be part of this WG.

Member Author: I know SIG Storage is impacted as well, I am just not sure how much cooperation is needed.

Contributor: Thanks @atiratree 👍. I would like to help on this effort with anything related to storage. Tagging leads from storage here: @xing-yang @msau42 @jsafrane @saad-ali

Member Author: Nice, thanks for the help @humblec! Added.

- Storage
label: node-lifecycle
leadership:
chairs:
- github: atiratree
name: Filip Křepinský
company: Red Hat
email: [email protected]
- github: fabriziopandini
name: Fabrizio Pandini
company: VMware
email: [email protected]
- github: humblec
name: Humble Chirammal
company: VMware
email: [email protected]
- github: rthallisey
name: Ryan Hallisey
company: NVIDIA
email: [email protected]
meetings:
- description: WG Node Lifecycle Weekly Meeting
day: TBD
time: TBD
tz: TBD
frequency: weekly
contact:
slack: wg-node-lifecycle
mailing_list: https://groups.google.com/a/kubernetes.io/g/wg-node-lifecycle
liaison:
github: TBD
name: TBD
- dir: wg-policy
name: Policy
mission_statement: >
8 changes: 8 additions & 0 deletions wg-node-lifecycle/OWNERS
@@ -0,0 +1,8 @@
# See the OWNERS docs at https://go.k8s.io/owners

reviewers:
- wg-node-lifecycle-leads
approvers:
- wg-node-lifecycle-leads
labels:
- wg/node-lifecycle
45 changes: 45 additions & 0 deletions wg-node-lifecycle/README.md
@@ -0,0 +1,45 @@
<!---
This is an autogenerated file!
Please do not edit this file directly, but instead make changes to the
sigs.yaml file in the project root.
To understand how this file is generated, see https://git.k8s.io/community/generator/README.md
--->
# Node Lifecycle Working Group

Explore and improve node and pod lifecycle in Kubernetes. This should result in better node drain/maintenance support and better pod disruption/termination. It should also improve node and pod autoscaling, application migration and availability, load balancing, de/scheduling, node shutdown, and cloud provider integrations, and support other new scenarios and integrations.

The [charter](charter.md) defines the scope and governance of the Node Lifecycle Working Group.

## Stakeholder SIGs
* [SIG Apps](/sig-apps)
* [SIG Architecture](/sig-architecture)
* [SIG Autoscaling](/sig-autoscaling)
* [SIG CLI](/sig-cli)
* [SIG Cloud Provider](/sig-cloud-provider)
* [SIG Cluster Lifecycle](/sig-cluster-lifecycle)
* [SIG Network](/sig-network)
* [SIG Node](/sig-node)
* [SIG Scheduling](/sig-scheduling)
* [SIG Storage](/sig-storage)

## Meetings
*Joining the [mailing list](https://groups.google.com/a/kubernetes.io/g/wg-node-lifecycle) for the group will typically add invites for the following meetings to your calendar.*
* WG Node Lifecycle Weekly Meeting: [TBDs at TBD TBD]() (weekly). [Convert to your timezone](http://www.thetimezoneconverter.com/?t=TBD&tz=TBD).

## Organizers

* Filip Křepinský (**[@atiratree](https://github.com/atiratree)**), Red Hat
* Fabrizio Pandini (**[@fabriziopandini](https://github.com/fabriziopandini)**), VMware
* Humble Chirammal (**[@humblec](https://github.com/humblec)**), VMware
* Ryan Hallisey (**[@rthallisey](https://github.com/rthallisey)**), NVIDIA

## Contact
- Slack: [#wg-node-lifecycle](https://kubernetes.slack.com/messages/wg-node-lifecycle)
- [Mailing list](https://groups.google.com/a/kubernetes.io/g/wg-node-lifecycle)
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/wg%2Fnode-lifecycle)
- Steering Committee Liaison: TBD (**[@TBD](https://github.com/TBD)**)
<!-- BEGIN CUSTOM CONTENT -->

<!-- END CUSTOM CONTENT -->
160 changes: 160 additions & 0 deletions wg-node-lifecycle/charter.md
@@ -0,0 +1,160 @@
# WG Node Lifecycle Charter

This charter adheres to the conventions described in the [Kubernetes Charter README] and uses
the Roles and Organization Management outlined in [wg-governance].

[Kubernetes Charter README]: /committee-steering/governance/README.md

## Scope

The Kubernetes ecosystem currently faces challenges in node maintenance scenarios, with multiple
projects independently addressing similar issues. The goal of this working group is to develop
Contributor: Can we list out references to the projects, if possible?

Member Author: This might be too heavy for the WG declaration. We have a partial list in https://github.com/atiratree/kube-enhancements/blob/improve-node-maintenance/keps/sig-apps/4212-declarative-node-maintenance/README.md#motivation and I expect other projects to be discussed/added in the future. I would prefer to track it separately once the WG is established. Let's also see what others think; if there is demand, I can add it.

Contributor: +1 to maybe listing this out, or at least a link to where we can find more information.

Reply: We have built an internal version; some others are draino and medik8s.

Member Author: OK, let's add the list here as a starting point then.

Member Author: I have added a new Relevant Projects section. Please let me know if you know of others, or want to somehow reference your internal ones.

unified APIs that the entire ecosystem can depend on, reducing the maintenance burden across
projects and addressing scenarios that impede node drain or cause improper pod termination. Our
Contributor: Is the plan to add a unified API that controls the Kubernetes features in this area (like priority classes, PDBs, etc.)? If pod scaling etc. are under consideration, do we also need to consider ResourceQuota and autoscalers like HPA here?

Member Author: Yes, autoscaling (HPA), scheduling and the API objects you have listed are in scope. Currently it is not yet clear which existing APIs will be affected, but we will take them into account.

By the way, I am not sure why the thread about the project references is not shown? Here is the link: https://github.com/kubernetes/community/pull/8396/files#r2016483839

objective is to create easily configurable, out-of-the-box solutions that seamlessly integrate with
Member: So is the goal to provide APIs for solutions to use, or to implement a solution? These two sentences seem to be at odds. Maybe mention that Kubernetes has no plans to block customers from implementing advanced use cases.

Member Author: We want to do both :) I have added another sentence there to explain it better.

existing APIs and behaviors. We will strive to make these solutions minimalistic and extensible to
support advanced use cases across the ecosystem.

To properly solve the node drain, we must first understand the node lifecycle. This includes
provisioning/sunsetting of the nodes, PodDisruptionBudgets, API-initiated eviction and node
Contributor: Should we specifically include topology spread constraints in this list as well?

Member Author: Scheduling is certainly an important part as well. I have added a mention of scheduling constraints to our goals.

shutdown. This then impacts both the node and pod autoscaling, de/scheduling, load balancing, and
the applications running in the cluster. All of these areas have issues and would benefit from a
unified approach.

### In scope
Member: Another goal should include making pods work reliably while terminating. This is important since, with the proliferation of non-live-migratable VMs with accelerators, we see more and more situations where maintenance-caused termination can take hours if not days.

Member Author: Good idea, I have added two more stories. All in all, the In scope section covers this in general, I hope.


- Explore a unified way of draining the nodes and managing node maintenance by introducing new APIs
Member: Can you please add a goal of migrating existing scenarios to the new API, so the group will be tasked with not breaking users when they upgrade?

Member Author: We do have that in scope under:

> Migrate users of the eviction based kubectl-like drain (kubectl, cluster autoscaler, karpenter), and other scenarios to use the new approach.

So far it is pretty generic until we have a clearer vision. Please let me know if you would like to see something more specific.

Reply: Is it worth including something about DRA device taints/drains?

Member Author: Seems relevant to me, as this affects the pod and device/node lifecycle. @pohly, what do you think about including and discussing kubernetes/enhancements#5055 in the WG?

Contributor: Part of the motivation for 5055 is to not have to drain entire nodes when doing maintenance that only affects one device or hardware managed by one driver, unless of course that maintenance ends up requiring a node reboot. So I guess it depends?

Reply: From what I understand about device taints, they are a way to make device health scheduler-aware. This fits into our scope because we need a way to decide which Node should be prioritized for maintenance and a plan to drain that Node; e.g. a Node with all its devices tainted is a great target for Node maintenance. However, I think we should hold onto the DRA device taints feature for when we discuss Node Maintenance and Node Drain designs. I don't think it needs to be called out in the scope, as Node Maintenance and Node Drain should cover it.

Member Author: We would also like to handle the pod lifecycle better in any descheduling scenario (not just Node Drain/Maintenance). One option is to use the EvictionRequest API, which should give more power to the applications that are being disrupted. So it might be interesting to see if we can make the disruption more graceful in 5055.

Reply:

> Part of the motivation for 5055 is to not have to drain entire nodes when doing some maintenance which only affects one device or hardware managed by one driver - unless of course that maintenance ends up with having to reboot the node.

The other end of this to consider, though, is devices that span multiple nodes.

Member Author: I have added it to the WG. I think it would be good to discuss this feature and its impact. Also, it might be better to hold off beta for some time.

and extending the current ones. This includes exploring extensions to, or interactions with, the
Node object.
- Analyze the node lifecycle, the Node API, and possible interactions. We want to explore augmenting
the Node API to expose additional state or status in order to coalesce other core Kubernetes and
community APIs around node lifecycle management.
- Improve the disruption model that is currently implemented by the API-initiated Eviction API and
PDBs. Improve the descheduling, availability and migration capabilities of today's application
workloads. Also explore the interactions with other eviction mechanisms.
- Improve Graceful/Non-Graceful Node Shutdown and consider how this affects the node lifecycle.
Graduate the [Graceful Node Shutdown](https://github.com/kubernetes/enhancements/issues/2000)
feature to GA and resolve the associated node shutdown issues.
- Improve the scheduling and pod/node autoscaling to take into account ongoing node maintenance and
the new disruption model/evictions. This includes balancing of the pods according to scheduling
constraints.
- Consider improving the pod lifecycle of DaemonSets and static pods during node maintenance.
- Explore the cloud provider use cases and how they can hook into the node lifecycle, so that
users can use the same APIs or configurations across the board.
- Migrate users of the eviction-based kubectl-like drain (kubectl, cluster autoscaler, karpenter,
...) and other scenarios to use the new approach, i.e. the unified node drain/maintenance APIs
introduced above (today's eviction-based flow is sketched after this list).
Member: Nit: make it clearer that "new approach" refers to the "unified way of draining the nodes" mentioned in the first bullet point. Either move this sentence up or clarify it here.

- Explore the possible scenarios behind why a node was terminated/drained/killed and how to
@ivelichkovich (Apr 2, 2025): It seems like everyone is solving the problem of node maintenance independently and building private in-house solutions. Improving the drain behavior is one aspect of maintenance (generally the first step after detection). There are additional steps once a node is ready to be acted on that everyone seems to have an in-house solution for (especially for people serving accelerated infra).

An example might be a system that drains the node and then reboots it when a GPU fault is detected. That's just one example; the system should be able to take arbitrary actions based on various signals after waiting for a signal that the node is safe to work on. Maybe some controller like "when you see state X, create arbitrary CR Y", so users can extend the controller for Y to take whatever remediation action they want, such as rebooting, resetting GPU drivers, resetting NICs, etc.

It seems like it would be good to come up with a community solution for how to take these actions after a node is drained and ready to be worked on. Thoughts on including this in the WG?

Member Author: Yeah, we want to include these considerations in the WG. We imply them in our goals, but I have added your suggestion as an additional user story to make it clearer.

track and react to each of them. Consider past discussions and the historical perspective
(e.g. node "tombstones").

### Out of scope

- Implementing cloud-provider-specific logic; the goal is to have a high-level API that providers
can use, hook into, or extend.
Contributor: +1

- Infrastructure provisioning/deprovisioning solutions or physical infrastructure lifecycle
management solutions.

## Stakeholders

- SIG Apps
- SIG Architecture
- SIG Autoscaling
- SIG CLI
- SIG Cloud Provider
- SIG Cluster Lifecycle
- SIG Network
- SIG Node
- SIG Scheduling
- SIG Storage

Stakeholders span from multiple SIGs to a broad set of end users,
public and private cloud providers, Kubernetes distribution providers,
and cloud provider end-users. Here are some user stories:

- As a cluster admin I want to have a simple interface to initiate a node drain/maintenance without
Member: Can we add a goal to explore the scenario of getting a historical perspective on why a node was terminated/drained/killed? This comes up very often, and maybe we can help those scenarios in this WG. Various ideas like Node object "tombstones" were discussed in the past.

Member Author: I am not really sure I fully understand this. I have added a new point to the In scope section that mentions it. Feel free to write a GitHub suggestion.

any required manual interventions. I also want to be able to observe the node drain via the API
and check on its progress. I also want to be able to discover workloads that are blocking the node
drain.
- To support the new features, node maintenance, scheduler, descheduler, pod autoscaling, kubelet
and other actors should use a new eviction API to gracefully remove pods. This should enable new
Member: Nit: change "should" to something like "I want to", given that the KEP hasn't been accepted yet.

migration strategies that prefer to surge (upscale) pods first rather than downscale them. It
should also allow other users/components to monitor pods that are gracefully removed/terminated
and provide better behaviour in terms of de/scheduling, scaling and availability.
- As a cluster admin, I want to be able to perform arbitrary actions after the node drain is
complete, such as resetting GPU drivers, resetting NICs, performing software updates or shutting
down the machine.
- As an end user, I would like more alternatives to blue-green upgrades, which are far too
expensive with special hardware accelerators. I would like to choose a strategy for how to
coordinate the node drain and the upgrade to achieve better cost-effectiveness.
- As a cloud provider, I need to perform regular maintenance on the hardware in my fleet. Enhancing
Kubernetes to help CSPs safely remove hardware will reduce operational costs.
- The cost of accelerator maintenance in today's world can be massive, and since hardware
accelerators tend to need more care, having software support to coordinate maintenance will
reduce operational costs.
- As a cluster admin, I would like to use a mixture of on-demand and temporary spot instances in my
clusters to reduce cloud expenditure. Having more reliable lifecycle and drain mechanisms for
nodes will improve cluster stability in scenarios where instances may be terminated by the cloud
provider due to cost-related thresholds.
- As a user, I want to prevent any disruption to my pet or expensive workloads (VMs, ML with
accelerators) and either prevent termination altogether or have a reliable migration path.
Features like `terminationGracePeriodSeconds` are not sufficient as the termination/migration can
take hours if not days.
Comment on lines +98 to +101:

Contributor: Can users not already do this with a PDB? Are we suggesting that node maintenance would override blocking PDBs if they block for some extended period of time?

I'm aware of k8s-shredder, an Adobe project that puts nodes into maintenance and then gives them a week to clear before removing them. I'm wondering if this case is to say, even in that scenario, don't kill my workload?

Reply: I think PDBs have a different use case, so we may need to reword. The PodDisruptionBudget protects the availability of the application. What we're saying is that there's no API that protects both the availability of the infrastructure and the availability of the application. E.g. an accelerator is degraded on a Node, so I don't want to run future workloads there, but it's OK for the current one to finish. It's in the best interest of the application and the infrastructure provider that an admin remediates the accelerator, so admin and user mutually agree on when that can occur.

Reply: Yeah, I agree there's a problem in that the eviction API / drain doesn't guarantee it will finish within a reasonable time, especially if the node is having issues (things get stuck terminating, etc.). But this at least we can do today, right?

> an accelerator is degraded on a Node, so I don't want to run future workloads there but it's ok for the current one to finish

You can just taint the node or the devices with NoSchedule.

Reply:

> You can just taint the node or the devices with NoSchedule.

Yes, that is a solution assuming that:

1. workloads will eventually drain if we wait long enough
2. all termination steps will be successful

However, 1) can theoretically always work but it is the slowest possible solution, and 2) is not guaranteed to work.

Member Author:

> I'm aware of k8s-shredder, an Adobe project that puts nodes into maintenance and then gives them a week to clear before removing them. I'm wondering if this case is to say, even in that scenario, don't kill my workload?

This is a good example. We want applications/admins to be aware of upcoming maintenance, and also pods in most descheduling scenarios, so that they are given the opportunity to migrate or clean up before termination, which is hard to do with PDBs. The goal is not to override PDBs (that is also hard to do without breaking someone); the goal is to have a smarter layer above the PDBs.

Contributor: If the eviction API has been used and the terminationGracePeriod has passed, isn't a force kill executed? Is there a scenario where the pod gets stuck for days/weeks in this case?

Member Author: Just the eviction API alone can cause pods to get stuck. All in all, I would prefer we do not dive deep into the topic and focus mostly on the scope in this PR.

Contributor @humblec (Apr 11, 2025): Hmm, out of curiosity, can you please share any reference issues where the eviction API itself gets stuck? While I understand the scope of this WG to an extent, I feel the need for a new WG itself arose from the lack of coordination around, and the complexity of, the primitives/features we have in k/k around scheduling, preemption and eviction. IMO, if the new design still does not address half of the issues in this area, it won't serve the purpose.
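
For readers following this thread, a minimal PDB looks like the sketch below (names are illustrative). It only bounds how many pods voluntary evictions may disrupt; it says nothing about upcoming maintenance, termination ordering, or post-drain actions, which is the gap the thread circles around:

```yaml
# Illustrative PodDisruptionBudget: protects application availability during
# voluntary disruptions (e.g. eviction-based drain), nothing more.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb        # hypothetical name
  namespace: default
spec:
  minAvailable: 2         # at least 2 matching pods must stay available
  selector:
    matchLabels:
      app: my-app         # hypothetical label
```

Similarly, a NoSchedule taint on the node or its devices only blocks new scheduling; neither mechanism gives the application or the admin a way to negotiate when and how the running pods are removed.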

- As a user, I want my application to finish all network and storage operations before terminating a
pod. This includes closing pod connections, removing pods from endpoints, flushing cached writes
to the underlying storage, and completing storage cleanup routines.
Comment on lines +102 to +104:

Contributor: In some scenarios, evicting/removing certain pods would prevent some of these operations. Consider if we were also evicting DaemonSet pods as an option (is that a goal? I wasn't sure); then we might need some ordering to make sure the CSI driver or CNI driver isn't removed until certain other cleanup has happened.

Reply: The way we handle this is that our internal drain API has a label selector for things to ignore, and by default it just ignores DaemonSets instead of worrying about the ordering here. DaemonSets aren't really supported under drain anyway. That should handle most cases for system-level things like CSI/CNI/etc.

Member Author @atiratree (Apr 11, 2025): Yes, DaemonSet pods should be considered as part of the maintenance scenarios, especially in cases when the node is going to shut down. I have added it to the goals to make that clear.

We also had discussions with SIG Node about static pod termination in the past, and they were generally not against it. But we lack use cases for it so far.

Contributor: IIUC, this proposal aims to define an order for static pod termination, DaemonSet pods, or other system-node-critical priority class pod termination in the node drain scenario. Is that a correct assumption?

Member Author: I am not saying that we will solve static pod termination, just that we will look into it :) And yes, I think there should definitely be an ordering for both DaemonSet and static pods.

Contributor: I think another user story is around the use of ephemeral low-cost instances on cloud providers, e.g.:

> As a cluster admin, I would like to use a mixture of on-demand and temporary spot instances in my clusters to reduce cloud expenditure. Having more reliable lifecycle and drain mechanisms for nodes will improve cluster stability in scenarios where instances may be terminated by the cloud provider due to cost-related thresholds.

Member Author: Agree, this story is also important to have. Thanks!
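
As background for the termination-related stories above: the main lever applications have today is a preStop hook bounded by `terminationGracePeriodSeconds`, roughly as below (the image and commands are placeholders). As the charter notes, this caps graceful cleanup at a fixed timeout and gives the workload no say in when the disruption starts:

```yaml
# Illustrative pod spec showing today's graceful-termination knobs.
apiVersion: v1
kind: Pod
metadata:
  name: cache-writer                      # hypothetical workload
spec:
  terminationGracePeriodSeconds: 120      # hard upper bound on graceful shutdown
  containers:
  - name: app
    image: registry.example.com/cache-writer:1.0   # placeholder image
    lifecycle:
      preStop:
        exec:
          # hypothetical cleanup: flush cached writes, close connections
          command: ["/bin/sh", "-c", "flush-cache && drain-connections"]
```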

## Deliverables

The WG will coordinate requirement gathering and design, eventually leading to
KEP(s) and code associated with the ideas.

Areas we expect to explore:
Contributor: Node lifecycle around accelerators always comes up. Is there consideration in this group to explore those areas?

Member Author: Yes, although I am not sure yet whether there will be deliverables specifically targeting accelerators. We mention them in the user stories and will consider them when creating the APIs.


- An API to express node drain/maintenance.
Member: One problem I feel we will need to address is how to transition existing drain logic in various components to this new API. Having a new API without migrating the old ways to it creates "yet another" way to do it and requires end users to understand more draining logic.

Member Author: There is still no upgrade path, since we have not even agreed on the solution. We don't want to break people using the current approaches; the main incentive to switch should be painless upgrades/maintenance and other benefits. I expect that the main components/users that use the kubectl(-like) drain should not have a hard time adopting the new solution(s). However, I am not sure what it will look like for GNS (Graceful Node Shutdown), for example.

Member: Why is the graceful termination mentioned above not listed here? Mostly curious.

Member Author: Ah, sorry, I have fixed that. I am also open to other KEPs/documents that people would like to include here.

Currently tracked in https://github.com/kubernetes/enhancements/issues/4212.
- An API to solve the problems with the API-initiated Eviction API and PDBs.
Currently tracked in https://github.com/kubernetes/enhancements/issues/4563.
- An API/mechanism to gracefully terminate pods during a node shutdown.
Graceful node shutdown feature tracked in https://github.com/kubernetes/enhancements/issues/2000.
- An API to deschedule pods that use DRA devices.
DRA: device taints and tolerations feature tracked in https://github.com/kubernetes/enhancements/issues/5055.
- An API to remove pods from endpoints before they terminate.
Currently tracked in https://docs.google.com/document/d/1t25jgO_-LRHhjRXf4KJ5xY_t8BZYdapv7MDAxVGY6R8/edit?tab=t.0#heading=h.i4lwa7rdng7y.
- Introduce enhancements across multiple Kubernetes SIGs to add support for the new APIs and solve a
wide range of issues.

We expect to provide reference implementations of the new APIs including but not limited to
controllers, API validation, integration with existing core components and extension points for the
ecosystem. This should be accompanied by E2E / Conformance tests.
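
For reference, the Graceful Node Shutdown behavior listed above is configured today through the kubelet configuration; a minimal sketch with illustrative values:

```yaml
# Illustrative KubeletConfiguration excerpt enabling Graceful Node Shutdown.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: 60s              # total time the node delays shutdown for pod termination
shutdownGracePeriodCriticalPods: 15s  # portion of that time reserved for critical pods
```

Extending and graduating this mechanism, and tying it into the broader node maintenance flow, is part of the node-shutdown deliverable tracked in kubernetes/enhancements#2000.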

## Relevant Projects

This is a list of known projects that solve similar problems in the ecosystem or would benefit from
the efforts of this WG:

- https://github.com/aws/aws-node-termination-handler
- https://github.com/foriequal0/pod-graceful-drain
- https://github.com/kubereboot/kured
- https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
- https://github.com/kubernetes-sigs/karpenter
- https://github.com/kubevirt/kubevirt
- https://github.com/medik8s/node-maintenance-operator
- https://github.com/Mellanox/maintenance-operator
- https://github.com/openshift/machine-config-operator
- https://github.com/planetlabs/draino
- https://github.com/strimzi/drain-cleaner

There are also internal custom solutions that companies use.

## Roles and Organization Management

This WG adheres to the Roles and Organization Management outlined in [wg-governance]
and opts-in to updates and modifications to [wg-governance].

[wg-governance]: /committee-steering/governance/wg-governance.md

## Timelines and Disbanding

The working group will disband when the KEPs we create are completed. We will
Member (suggested change): Replace

> The working group will disband when the KEPs we create are completed. We will

with

> The working group will disband once the core APIs defined in the KEPs have reached a stable state (GA) and ongoing maintenance ownership is established within the relevant SIGs. We will

review whether the working group should disband if appropriate SIG ownership
can't be reached.