Commit 75e1096

atiratree and rthallisey committed
Add Node Lifecycle WG
Co-authored-by: Ryan Hallisey <[email protected]>
1 parent f49183a commit 75e1096

15 files changed

+212
-0
lines changed

OWNERS_ALIASES

+3
@@ -142,6 +142,9 @@ aliases:
142142
- jeremyrickard
143143
- liggitt
144144
- micahhausler
145+
wg-node-lifecycle-leads:
146+
- atiratree
147+
- rthallisey
145148
wg-policy-leads:
146149
- JimBugwadia
147150
- poonam-lamba

communication/slack-config/channels.yaml

+1
@@ -584,6 +584,7 @@ channels:
584584
- name: wg-multitenancy
585585
- name: wg-naming
586586
archived: true
587+
- name: wg-node-lifecycle
587588
- name: wg-onprem
588589
archived: true
589590
- name: wg-policy

sig-apps/README.md

+1
@@ -59,6 +59,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.
5959
The following [working groups][working-group-definition] are sponsored by sig-apps:
6060
* [WG Batch](/wg-batch)
6161
* [WG Data Protection](/wg-data-protection)
62+
* [WG Node Lifecycle](/wg-node-lifecycle)
6263
* [WG Serving](/wg-serving)
6364

6465

sig-architecture/README.md

+1
@@ -58,6 +58,7 @@ The Chairs of the SIG run operations and processes governing the SIG.
5858
The following [working groups][working-group-definition] are sponsored by sig-architecture:
5959
* [WG Device Management](/wg-device-management)
6060
* [WG LTS](/wg-lts)
61+
* [WG Node Lifecycle](/wg-node-lifecycle)
6162
* [WG Policy](/wg-policy)
6263
* [WG Serving](/wg-serving)
6364
* [WG Structured Logging](/wg-structured-logging)

sig-autoscaling/README.md

+1
@@ -48,6 +48,7 @@ The Chairs of the SIG run operations and processes governing the SIG.
4848
The following [working groups][working-group-definition] are sponsored by sig-autoscaling:
4949
* [WG Batch](/wg-batch)
5050
* [WG Device Management](/wg-device-management)
51+
* [WG Node Lifecycle](/wg-node-lifecycle)
5152
* [WG Serving](/wg-serving)
5253

5354

sig-cli/README.md

+6
@@ -60,6 +60,12 @@ subprojects, and resolve cross-subproject technical issues and decisions.
6060
- [@kubernetes/sig-cli-test-failures](https://github.com/orgs/kubernetes/teams/sig-cli-test-failures) - Test Failures and Triage
6161
- Steering Committee Liaison: Paco Xu 徐俊杰 (**[@pacoxu](https://github.com/pacoxu)**)
6262

63+
## Working Groups
64+
65+
The following [working groups][working-group-definition] are sponsored by sig-cli:
66+
* [WG Node Lifecycle](/wg-node-lifecycle)
67+
68+
6369
## Subprojects
6470

6571
The following [subprojects][subproject-definition] are owned by sig-cli:

sig-cloud-provider/README.md

+1
@@ -58,6 +58,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.
5858
## Working Groups
5959

6060
The following [working groups][working-group-definition] are sponsored by sig-cloud-provider:
61+
* [WG Node Lifecycle](/wg-node-lifecycle)
6162
* [WG Structured Logging](/wg-structured-logging)
6263

6364

sig-cluster-lifecycle/README.md

+1
@@ -52,6 +52,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.
5252

5353
The following [working groups][working-group-definition] are sponsored by sig-cluster-lifecycle:
5454
* [WG LTS](/wg-lts)
55+
* [WG Node Lifecycle](/wg-node-lifecycle)
5556
* [WG etcd Operator](/wg-etcd-operator)
5657

5758

sig-list.md

+1
@@ -66,6 +66,7 @@ When the need arises, a [new SIG can be created](sig-wg-lifecycle.md)
6666
|[Device Management](wg-device-management/README.md)|[device-management](https://github.com/kubernetes/kubernetes/labels/wg%2Fdevice-management)|* Architecture<br>* Autoscaling<br>* Network<br>* Node<br>* Scheduling<br>|* [John Belamaric](https://github.com/johnbelamaric), Google<br>* [Kevin Klues](https://github.com/klueska), NVIDIA<br>* [Patrick Ohly](https://github.com/pohly), Intel<br>|* [Slack](https://kubernetes.slack.com/messages/wg-device-management)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-device-management)|* Regular WG Meeting: [Tuesdays at 8:30 PT (Pacific Time) (biweekly)](TBD)<br>
6767
|[etcd Operator](wg-etcd-operator/README.md)|[etcd-operator](https://github.com/kubernetes/kubernetes/labels/wg%2Fetcd-operator)|* Cluster Lifecycle<br>* etcd<br>|* [Benjamin Wang](https://github.com/ahrtr), VMware<br>* [Ciprian Hacman](https://github.com/hakman), Microsoft<br>* [Josh Berkus](https://github.com/jberkus), Red Hat<br>* [James Blair](https://github.com/jmhbnz), Red Hat<br>* [Justin Santa Barbara](https://github.com/justinsb), Google<br>|* [Slack](https://kubernetes.slack.com/messages/wg-etcd-operator)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-etcd-operator)|* Regular WG Meeting: [Tuesdays at 11:00 PT (Pacific Time) (bi-weekly)](https://zoom.us/my/cncfetcdproject)<br>
6868
|[LTS](wg-lts/README.md)|[lts](https://github.com/kubernetes/kubernetes/labels/wg%2Flts)|* Architecture<br>* Cluster Lifecycle<br>* K8s Infra<br>* Release<br>* Security<br>* Testing<br>|* [Jeremy Rickard](https://github.com/jeremyrickard), Microsoft<br>* [Jordan Liggitt](https://github.com/liggitt), Google<br>* [Micah Hausler](https://github.com/micahhausler), Amazon<br>|* [Slack](https://kubernetes.slack.com/messages/wg-lts)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-lts)|* Regular WG Meeting: [Tuesdays at 07:00 PT (Pacific Time) (biweekly)](https://zoom.us/j/92480197536?pwd=dmtSMGJRQmNYYTIyZkFlQ25JRngrdz09)<br>
69+
|[Node Lifecycle](wg-node-lifecycle/README.md)|[node-lifecycle](https://github.com/kubernetes/kubernetes/labels/wg%2Fnode-lifecycle)|* Apps<br>* Architecture<br>* Autoscaling<br>* CLI<br>* Cloud Provider<br>* Cluster Lifecycle<br>* Node<br>* Scheduling<br>|* [Filip Křepinský](https://github.com/atiratree), Red Hat<br>* [Ryan Hallisey](https://github.com/rthallisey), NVIDIA<br>|* [Slack](https://kubernetes.slack.com/messages/wg-node-lifecycle)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-node-lifecycle)|* WG Node Lifecycle Weekly Meeting: [s at (weekly)]()<br>
6970
|[Policy](wg-policy/README.md)|[policy](https://github.com/kubernetes/kubernetes/labels/wg%2Fpolicy)|* Architecture<br>* Auth<br>* Multicluster<br>* Network<br>* Node<br>* Scheduling<br>* Storage<br>|* [Jim Bugwadia](https://github.com/JimBugwadia), Kyverno/Nirmata<br>* [Poonam Lamba](https://github.com/poonam-lamba), Google<br>* [Andy Suderman](https://github.com/sudermanjr), Fairwinds<br>|* [Slack](https://kubernetes.slack.com/messages/wg-policy)<br>* [Mailing List](https://groups.google.com/forum/#!forum/kubernetes-wg-policy)|* Regular WG Meeting: [Wednesdays at 8:00 PT (Pacific Time) (semimonthly)](https://zoom.us/j/7375677271)<br>
7071
|[Serving](wg-serving/README.md)|[serving](https://github.com/kubernetes/kubernetes/labels/wg%2Fserving)|* Apps<br>* Architecture<br>* Autoscaling<br>* Instrumentation<br>* Network<br>* Node<br>* Scheduling<br>* Storage<br>|* [Eduardo Arango](https://github.com/ArangoGutierrez), NVIDIA<br>* [Jiaxin Shan](https://github.com/Jeffwan), Bytedance<br>* [Sergey Kanzhelev](https://github.com/SergeyKanzhelev), Google<br>* [Yuan Tang](https://github.com/terrytangyuan), Red Hat<br>|* [Slack](https://kubernetes.slack.com/messages/wg-serving)<br>* [Mailing List](https://groups.google.com/a/kubernetes.io/g/wg-serving)|* WG Serving Weekly Meeting ([calendar](https://calendar.google.com/calendar/embed?src=e896b769743f3877edfab2d4c6a14132b2aa53287021e9bbf113cab676da54ba%40group.calendar.google.com)): [Wednesdays at 9:00 PT (Pacific Time) (weekly)](https://zoom.us/j/92615874244?pwd=VGhxZlJjRTNRWTZIS0dQV2MrZUJ5dz09)<br>
7172
|[Structured Logging](wg-structured-logging/README.md)|[structured-logging](https://github.com/kubernetes/kubernetes/labels/wg%2Fstructured-logging)|* API Machinery<br>* Architecture<br>* Cloud Provider<br>* Instrumentation<br>* Network<br>* Node<br>* Scheduling<br>* Storage<br>|* [Mengjiao Liu](https://github.com/mengjiao-liu), Independent<br>* [Patrick Ohly](https://github.com/pohly), Intel<br>|* [Slack](https://kubernetes.slack.com/messages/wg-structured-logging)<br>* [Mailing List](https://groups.google.com/forum/#!forum/kubernetes-wg-structured-logging)|

sig-node/README.md

+1
@@ -55,6 +55,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.
5555
The following [working groups][working-group-definition] are sponsored by sig-node:
5656
* [WG Batch](/wg-batch)
5757
* [WG Device Management](/wg-device-management)
58+
* [WG Node Lifecycle](/wg-node-lifecycle)
5859
* [WG Policy](/wg-policy)
5960
* [WG Serving](/wg-serving)
6061
* [WG Structured Logging](/wg-structured-logging)

sig-scheduling/README.md

+1
@@ -67,6 +67,7 @@ subprojects, and resolve cross-subproject technical issues and decisions.
6767
The following [working groups][working-group-definition] are sponsored by sig-scheduling:
6868
* [WG Batch](/wg-batch)
6969
* [WG Device Management](/wg-device-management)
70+
* [WG Node Lifecycle](/wg-node-lifecycle)
7071
* [WG Policy](/wg-policy)
7172
* [WG Serving](/wg-serving)
7273
* [WG Structured Logging](/wg-structured-logging)

sigs.yaml

+39
@@ -3697,6 +3697,45 @@ workinggroups:
36973697
liaison:
36983698
github: saschagrunert
36993699
name: Sascha Grunert
3700+
- dir: wg-node-lifecycle
3701+
name: Node Lifecycle
3702+
mission_statement: >
3703+
Explore and improve node and pod lifecycle in Kubernetes. This should result in
3704+
better node drain/maintenance support and better pod disruption/termination. It
3705+
should also improve node and pod autoscaling, better application migration and
3706+
availability, load balancing, de/scheduling, node shutdown, cloud provider integrations,
3707+
and support other new scenarios and integrations.
3708+
3709+
charter_link: charter.md
3710+
stakeholder_sigs:
3711+
- Apps
3712+
- Architecture
3713+
- Autoscaling
3714+
- CLI
3715+
- Cloud Provider
3716+
- Cluster Lifecycle
3717+
- Node
3718+
- Scheduling
3719+
label: node-lifecycle
3720+
leadership:
3721+
chairs:
3722+
- github: atiratree
3723+
name: Filip Křepinský
3724+
company: Red Hat
3725+
3726+
- github: rthallisey
3727+
name: Ryan Hallisey
3728+
company: NVIDIA
3729+
3730+
meetings:
3731+
- description: WG Node Lifecycle Weekly Meeting
3732+
day: ""
3733+
time: ""
3734+
tz: ""
3735+
frequency: weekly
3736+
contact:
3737+
slack: wg-node-lifecycle
3738+
mailing_list: https://groups.google.com/a/kubernetes.io/g/wg-node-lifecycle
37003739
- dir: wg-policy
37013740
name: Policy
37023741
mission_statement: >
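
The `meetings` entry added for `wg-node-lifecycle` leaves `day`, `time`, and `tz` as empty strings, which is consistent with the `[s at ]()` placeholder in the generated README further down. The following is a minimal sketch of a check that would flag such an entry. The function name `missing_meeting_fields` and the list of required fields are assumptions made for illustration; they are not part of the kubernetes/community generator.

```python
# Hypothetical check, not part of the kubernetes/community tooling.
# The dict mirrors the `meetings` entry added to sigs.yaml in this commit.
REQUIRED_FIELDS = ("day", "time", "tz", "frequency")  # assumed set of fields


def missing_meeting_fields(meeting):
    """Return the names of required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not meeting.get(name)]


wg_node_lifecycle_meeting = {
    "description": "WG Node Lifecycle Weekly Meeting",
    "day": "",
    "time": "",
    "tz": "",
    "frequency": "weekly",
}

print(missing_meeting_fields(wg_node_lifecycle_meeting))
# → ['day', 'time', 'tz']
```

A check like this could run before the README is regenerated, so that a placeholder meeting is caught while the schedule is still undecided.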

wg-node-lifecycle/OWNERS

+8
@@ -0,0 +1,8 @@
1+
# See the OWNERS docs at https://go.k8s.io/owners
2+
3+
reviewers:
4+
- wg-node-lifecycle-leads
5+
approvers:
6+
- wg-node-lifecycle-leads
7+
labels:
8+
- wg/node-lifecycle
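
The new OWNERS file names the alias `wg-node-lifecycle-leads` instead of individual users, and the alias itself is defined in the OWNERS_ALIASES hunk at the top of this commit. The sketch below shows how the two files relate, using plain dictionaries that copy the values from the diff. The `expand` helper is hypothetical; the actual resolution is done by Kubernetes project tooling, which this sketch does not attempt to reproduce.

```python
# Hypothetical illustration of OWNERS alias expansion.
# Values are copied from the OWNERS_ALIASES and wg-node-lifecycle/OWNERS hunks.
OWNERS_ALIASES = {
    "wg-node-lifecycle-leads": ["atiratree", "rthallisey"],
}

OWNERS = {
    "reviewers": ["wg-node-lifecycle-leads"],
    "approvers": ["wg-node-lifecycle-leads"],
    "labels": ["wg/node-lifecycle"],
}


def expand(entries, aliases):
    """Replace each alias with its members; keep plain usernames as they are."""
    users = []
    for entry in entries:
        for user in aliases.get(entry, [entry]):
            if user not in users:  # keep order, drop duplicates
                users.append(user)
    return users


print(expand(OWNERS["approvers"], OWNERS_ALIASES))
# → ['atiratree', 'rthallisey']
```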

wg-node-lifecycle/README.md

+40
@@ -0,0 +1,40 @@
1+
<!---
2+
This is an autogenerated file!
3+
4+
Please do not edit this file directly, but instead make changes to the
5+
sigs.yaml file in the project root.
6+
7+
To understand how this file is generated, see https://git.k8s.io/community/generator/README.md
8+
--->
9+
# Node Lifecycle Working Group
10+
11+
Explore and improve node and pod lifecycle in Kubernetes. This should result in better node drain/maintenance support and better pod disruption/termination. It should also improve node and pod autoscaling, better application migration and availability, load balancing, de/scheduling, node shutdown, cloud provider integrations, and support other new scenarios and integrations.
12+
13+
The [charter](charter.md) defines the scope and governance of the Node Lifecycle Working Group.
14+
15+
## Stakeholder SIGs
16+
* [SIG Apps](/sig-apps)
17+
* [SIG Architecture](/sig-architecture)
18+
* [SIG Autoscaling](/sig-autoscaling)
19+
* [SIG CLI](/sig-cli)
20+
* [SIG Cloud Provider](/sig-cloud-provider)
21+
* [SIG Cluster Lifecycle](/sig-cluster-lifecycle)
22+
* [SIG Node](/sig-node)
23+
* [SIG Scheduling](/sig-scheduling)
24+
25+
## Meetings
26+
*Joining the [mailing list](https://groups.google.com/a/kubernetes.io/g/wg-node-lifecycle) for the group will typically add invites for the following meetings to your calendar.*
27+
* WG Node Lifecycle Weekly Meeting: [s at ]() (weekly). [Convert to your timezone](http://www.thetimezoneconverter.com/?t=&tz=).
28+
29+
## Organizers
30+
31+
* Filip Křepinský (**[@atiratree](https://github.com/atiratree)**), Red Hat
32+
* Ryan Hallisey (**[@rthallisey](https://github.com/rthallisey)**), NVIDIA
33+
34+
## Contact
35+
- Slack: [#wg-node-lifecycle](https://kubernetes.slack.com/messages/wg-node-lifecycle)
36+
- [Mailing list](https://groups.google.com/a/kubernetes.io/g/wg-node-lifecycle)
37+
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/wg%2Fnode-lifecycle)
38+
<!-- BEGIN CUSTOM CONTENT -->
39+
40+
<!-- END CUSTOM CONTENT -->

wg-node-lifecycle/charter.md

+107
@@ -0,0 +1,107 @@
1+
# WG Node Lifecycle Charter
2+
3+
This charter adheres to the conventions described in the [Kubernetes Charter README] and uses
4+
the Roles and Organization Management outlined in [wg-governance].
5+
6+
[Kubernetes Charter README]: /committee-steering/governance/README.md
7+
8+
## Scope
9+
10+
The Kubernetes ecosystem currently faces challenges in node maintenance scenarios, with multiple
11+
projects independently addressing similar issues. The goal of this working group is to develop
12+
unified APIs that the entire ecosystem can depend on, reducing the maintenance burden across
13+
projects and addressing scenarios that impede node drain or cause improper pod termination. Our
14+
objective is to create easily configurable, out-of-the-box solutions that seamlessly integrate with
15+
existing APIs and behaviors.
16+
17+
To properly solve the node drain, we must first understand the node lifecycle. This includes
18+
provisioning/sunsetting of the nodes, PodDisruptionBudgets, API-initiated eviction and node
19+
shutdown. This then impacts both the node and pod autoscaling, load balancing, and the applications
20+
running in the cluster. All of these areas have issues and would benefit from a unified approach.
21+
22+
### In scope
23+
24+
- Explore a unified way of draining the nodes and managing node maintenance by introducing new APIs
25+
and extending the current ones. This includes exploring extension to or interactions with the Node
26+
object.
27+
- Analyze the node lifecycle, the Node API, and possible interactions. We want to explore augmenting
28+
the Node API to expose additional state or status in order to coalesce other core Kubernetes and
29+
community APIs around node lifecycle management.
30+
- Improve the disruption model that is currently implemented by the API-initiated Eviction API and PDBs.
31+
Improve the descheduling, availability and migration capabilities of today's application
32+
workloads. Also explore the interactions with other eviction mechanisms.
33+
- Improve the Graceful/Non-Graceful Node Shutdown and consider how this affects the node lifecycle.
34+
- Improve the scheduling and pod/node autoscaling to take into account ongoing node maintenance and
35+
the new disruption model/evictions.
36+
- Explore the cloud provider use cases and how they can hook into the node lifecycle, so that
37+
users can use the same APIs or configurations across the board.
38+
- Migrate users of the eviction based kubectl-like drain (kubectl, cluster autoscaler, karpenter,
39+
...) to use the new approach.
40+
41+
42+
### Out of scope
43+
44+
- Implementing cloud-provider-specific logic. The goal is to have a high-level API that the providers
45+
can use, hook into, or extend.
46+
- Infrastructure provisioning and deprovisioning solutions, or physical infrastructure lifecycle
47+
management solutions.
48+
49+
## Stakeholders
50+
51+
- SIG Apps
52+
- SIG Architecture
53+
- SIG Autoscaling
54+
- SIG CLI
55+
- SIG Cloud Provider
56+
- SIG Cluster Lifecycle
57+
- SIG Node
58+
- SIG Scheduling
59+
60+
Stakeholders span from multiple SIGs to a broad set of end users,
61+
public and private cloud providers, Kubernetes distribution providers,
62+
and cloud provider end-users. Here are some user stories:
63+
64+
- As a cluster admin I want to have a simple interface to initiate a node drain/maintenance without
65+
any required manual interventions. I also want to be able to observe the node drain via the API
66+
and check on its progress. I also want to be able to discover workloads that are blocking the node
67+
drain.
68+
- To support the new features, node maintenance, scheduler, descheduler, pod autoscaling, kubelet
69+
and other actors should use a new eviction API to gracefully remove pods. This should enable new
70+
migration strategies that prefer to surge (upscale) pods first rather than downscale them. It
71+
should also allow other users/components to monitor pods that are gracefully removed/terminated
72+
and provide better behaviour in terms of de/scheduling, scaling and availability.
73+
- As an end user, I cannot bear the cost of blue-green upgrades, especially with special hardware
74+
accelerators; it's far too expensive. It is more cost-effective to coordinate a drain and then
75+
upgrade.
76+
- As a cloud provider, I need to perform regular maintenance on the hardware in my fleet. Enhancing
77+
Kubernetes to help CSPs safely remove hardware will reduce operational costs.
78+
- The cost of doing accelerator maintenance in today's world can be massive. And since
79+
hardware accelerators tend to need more love and care, having software support to coordinate
80+
maintenance will reduce operational costs.
81+
82+
## Deliverables
83+
84+
The WG will coordinate requirements gathering and design, eventually leading to
85+
KEP(s) and code associated with the ideas.
86+
87+
Areas we expect to explore:
88+
89+
- An API to express node drain/maintenance.
90+
Currently tracked in https://github.com/kubernetes/enhancements/issues/4563.
91+
- An API to solve the problems wrt the API-initiated Eviction API and PDBs.
92+
Currently tracked in https://github.com/kubernetes/enhancements/issues/4212.
93+
- Introduce enhancements across multiple Kubernetes SIGs to add support for the new APIs to solve
94+
a wide range of issues.
95+
96+
## Roles and Organization Management
97+
98+
This WG adheres to the Roles and Organization Management outlined in [wg-governance]
99+
and opts in to updates and modifications to [wg-governance].
100+
101+
[wg-governance]: /committee-steering/governance/wg-governance.md
102+
103+
## Timelines and Disbanding
104+
105+
The working group will disband when the KEPs we create are completed. We will
106+
review whether the working group should disband if appropriate SIG ownership
107+
can't be reached.
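
The charter's scope and user stories refer to PodDisruptionBudgets, API-initiated eviction, and the wish to discover workloads that are blocking a node drain. The sketch below is a deliberately simplified model of that situation, not the Kubernetes implementation and not the API the working group intends to design. The `Pod` and `Budget` classes, the `app` field used for matching, and the rule that every eviction consumes one allowed disruption are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    app: str          # stands in for a label selector match (assumption)
    node: str
    healthy: bool = True


@dataclass
class Budget:
    app: str
    min_available: int  # simplified stand-in for a PDB's minAvailable


def disruptions_allowed(budget, pods):
    """Healthy pods of the app beyond the minimum, never below zero."""
    healthy = sum(1 for p in pods if p.app == budget.app and p.healthy)
    return max(0, healthy - budget.min_available)


def blocking_pods(node, pods, budgets):
    """Names of pods on `node` whose eviction the budgets would refuse."""
    allowed = {b.app: disruptions_allowed(b, pods) for b in budgets}
    blocked = []
    for pod in pods:
        if pod.node != node:
            continue
        remaining = allowed.get(pod.app)
        if remaining is None:
            continue  # no budget covers this pod, so it can be evicted
        if remaining == 0:
            blocked.append(pod.name)
        else:
            allowed[pod.app] = remaining - 1  # this eviction uses one disruption
    return blocked


pods = [
    Pod("web-0", "web", "node-a"),
    Pod("web-1", "web", "node-b"),
    Pod("db-0", "db", "node-a"),
    Pod("cache-0", "cache", "node-a"),
    Pod("cache-1", "cache", "node-a"),
    Pod("cache-2", "cache", "node-b"),
]
budgets = [Budget("web", min_available=2), Budget("cache", min_available=2)]

print(blocking_pods("node-a", pods, budgets))
# → ['web-0', 'cache-1']
```

In this model `web-0` blocks the drain because both `web` replicas are needed to satisfy the budget, while `cache-1` blocks it only after `cache-0` has used the single allowed disruption. A surge-first strategy, as mentioned in the user stories, would add a replacement replica elsewhere before evicting, which raises the allowed disruptions instead of waiting.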
