
WIP: Introduce Node Lifecycle WG #8396

Open
wants to merge 1 commit into base: master

Conversation

atiratree
Member

No description provided.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: atiratree
Once this PR has been reviewed and has the lgtm label, please assign parispittman for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/community-management area/slack-management Issues or PRs related to the Slack Management subproject labels Mar 24, 2025
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and ardaguclu March 24, 2025 12:17
@k8s-ci-robot k8s-ci-robot added committee/steering Denotes an issue or PR intended to be handled by the steering committee. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/contributor-experience Categorizes an issue or PR as relevant to SIG Contributor Experience. do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Mar 24, 2025
@atiratree atiratree changed the title Introduce Node Lifecycle WG WIP: Introduce Node Lifecycle WG Mar 24, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 24, 2025
@atiratree
Member Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 24, 2025
@rthallisey

Looks like I'm not a member of kubernetes org anymore. I was a few years back, but didn't keep up with contributions recently. You can remove me as a lead and I can reapply after some contributions to this WG.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. label Mar 24, 2025
@atiratree
Member Author

We have had impactful conversations with Ryan about this group and its goals. He has experience with cluster maintenance and I look forward to his participation in the WG.

@marquiz
Contributor

marquiz commented Mar 25, 2025

/cc

@k8s-ci-robot k8s-ci-robot requested a review from marquiz March 25, 2025 17:09
@ajaysundark

/cc

@hakman
Member

hakman commented Mar 27, 2025

/cc

@elmiko
Contributor

elmiko commented Mar 27, 2025

/cc

@k8s-ci-robot k8s-ci-robot requested a review from elmiko March 27, 2025 18:19
@jackfrancis
Contributor

/cc

projects independently addressing similar issues. The goal of this working group is to develop
unified APIs that the entire ecosystem can depend on, reducing the maintenance burden across
projects and addressing scenarios that impede node drain or cause improper pod termination. Our
objective is to create easily configurable, out-of-the-box solutions that seamlessly integrate with
Member

So is the goal APIs for solutions to use, or implementing a solution? These two sentences seem to be at odds. Maybe mention that k8s has no plans to block customers from implementing advanced use cases.

Member Author

We want to do both :) I have added another sentence there to explain it better.


### In scope

- Explore a unified way of draining the nodes and managing node maintenance by introducing new APIs
Member

Can you please add a goal of migrating existing scenarios to the new API, so the group is tasked with not breaking users when they upgrade?

Member Author

We do have that in scope under

Migrate users of the eviction based kubectl-like drain (kubectl, cluster autoscaler, karpenter), and other scenarios to use the new approach.

So far it is pretty generic until we have a clearer vision. Please let me know if you would like to see something more specific.

shutdown. This then impacts both the node and pod autoscaling, load balancing, and the applications
running in the cluster. All of these areas have issues and would benefit from a unified approach.

### In scope
Member

Another goal should include making Pods work reliably while terminating. This is important since, with the proliferation of non-live-migratable VMs with accelerators, we see more and more situations where maintenance-caused termination may take hours, if not days.

Member Author

Good idea, I have added two more stories. All in all, the In scope section covers this in general, I hope.

public and private cloud providers, Kubernetes distribution providers,
and cloud provider end-users. Here are some user stories:

- As a cluster admin I want to have a simple interface to initiate a node drain/maintenance without
Member

Can we add a goal to explore the scenario of getting a historical perspective on why a node was terminated/drained/killed? This comes up very often and maybe we can help those scenarios in this WG. Various ideas like Node object "tombstones" were discussed in the past.

Member Author

I am not really sure I fully understand this. I have added a new point to the In scope section that mentions this. Feel free to write a GitHub suggestion.

Comment on lines 78 to 79
accelerators; it's far too expensive. It is more cost-effective to coordinate a drain and then
upgrade.
Member

this is a strong statement. Maybe we can say it more generically: Investigate the most cost-effective ways to upgrade nodes with expensive accelerators deployed on them.


The blue-green upgrade with accelerators use case is, I think, an important one to mention in some way. Drain and upgrade always have a cost: money, time, complexity, etc. We're trying to say that the current ecosystem of APIs and tools, or lack thereof, causes solutions to be more "expensive" than they should be. We can rephrase to emphasize this.

Perhaps it would be better to focus on some specific examples showing the "cost". E.g. some people want to keep the specific accelerators they have been using because of wear and tear.

Member

My concern is mostly about the wording. Even more cost-effective is to force-kill and recreate. Drain is not always the best path, depending on the workload.

Member Author

Ack, we have changed the wording.


Area we expect to explore:

- An API to express node drain/maintenance.
Member

One problem I feel we will need to address is how to transition existing drain logic in various components to this new API. Having a new API without migrating the old ways to it creates "yet another" way to do it and requires end users to understand even more draining logic.

Member Author

There is still no upgrade path, since we have not even agreed on the solution. We don't want to break people using the current approaches. The main incentive to switch should be painless upgrades/maintenance and other benefits.

I expect that the main components/users that use the kubectl(-like) drain should not have a hard time using the new solution(s). However, I am not sure what it will look like for the GNS, for example.


Area we expect to explore:

- An API to express node drain/maintenance.
Member

why is the graceful termination mentioned above not listed here? Mostly curious

Member Author

Ah, sorry, I have fixed that. I am also open to other KEPs/documents that people would like to include here.

@fabriziopandini
Member

@atiratree I'm Fabrizio Pandini from SIG Cluster Lifecycle and I just saw this proposal.
I was wondering if you or someone behind this effort will be in London so we can chat briefly in person about it.

@atiratree
Member Author

@fabriziopandini I will not be present at KubeCon, but feel free to connect with @rthallisey. We also plan to attend the SIG Cluster Lifecycle meeting after KubeCon.

@rthallisey

@fabriziopandini I will be there.

For anyone else who will be at KubeCon EU next week and wants to have a high-level discussion in person, please reach out to me on Slack (rhallisey) or email ([email protected]). I'll do my best to connect with anyone interested.

@hakman
Member

hakman commented Mar 29, 2025

@rthallisey by any chance will you also be at Maintainer Summit?

@rthallisey

@hakman, yes I'll be at the Maintainer Summit

@hakman
Member

hakman commented Mar 29, 2025

@hakman, yes I'll be at the Maintainer Summit

Awesome, this way maybe we can catch up more easily. I think @justinsb also wants to say hello.

users can use the same APIs or configurations across the board.
- Migrate users of the eviction based kubectl-like drain (kubectl, cluster autoscaler, karpenter,
...) and other scenarios to use the new approach.
- Explore possible scenarios behind the reason why the node was terminated/drained/killed and how to
@ivelichkovich commented Apr 2, 2025

It seems like everyone is solving the problem of node maintenance independently and building private in-house solutions. Improving the drain behavior is one aspect of maintenance (generally the first step after detection). There are additional steps once a node is ready to be acted on that everyone seems to have an in-house solution for (especially for people serving accelerated infra).

An example might be a system that drains the node and then reboots it when a GPU fault is detected. That's just one example; the system should be able to take arbitrary actions based on various signals after waiting for a signal that the node is all good to work on. Maybe some controller like "when you see state X, create arbitrary CR Y", so users can extend the controller for Y to take whatever remediation action they want, such as reboot / reset GPU drivers / reset NICs / etc.

It seems like it would be good to come up with a community solution for how to take these actions after a node is drained and ready to be worked on. Thoughts on including this in the wg?

Member Author

Yeah, we want to include these considerations in the WG. We imply these in our goals, but I have added your suggestion as an additional user story to make it clearer.

@dims
Member

dims commented Apr 3, 2025

@rthallisey please open an org membership request so we can add you here!


@elmiko (Contributor) left a comment

i really like the direction this is going, i have a question and a suggestion.

support advanced use cases across the ecosystem.

To properly solve the node drain, we must first understand the node lifecycle. This includes
provisioning/sunsetting of the nodes, PodDisruptionBudgets, API-initiated eviction and node
Contributor

should we specifically include topology spread constraints in this list as well?

Member Author

Scheduling is certainly an important part as well. I have added a mention of scheduling constraints to our goals.

### Out of scope

- Implementing cloud provider specific logic, the goal is to have high-level API that the providers
can use, hook into, or extend.
Contributor

+1

- As a user, I want my application to finish all network and storage operations before terminating a
pod. This includes closing pod connections, removing pods from endpoints, writing cached writes
to the underlying storage and completing storage cleanup routines.

Contributor

i think another user story is around the use of ephemeral low cost instances on cloud providers. eg

As a cluster admin, I would like to use a mixture of on-demand and temporary spot instances in my clusters to reduce cloud expenditure. Having more reliable lifecycle and drain mechanisms for nodes will improve cluster stability in scenarios where instances may be terminated by the cloud provider due to cost-related thresholds.

Member Author

Agree, this story is also important to have. Thanks!


### In scope

- Explore a unified way of draining the nodes and managing node maintenance by introducing new APIs


Is it worth including something about DRA device taints/drains?

Member Author

Seems relevant to me as this affects the pod and device/node lifecycle. @pohly what do you think about including and discussing kubernetes/enhancements#5055 in the WG?

Co-authored-by: Ryan Hallisey <[email protected]>
@atiratree atiratree force-pushed the wg-node-lifecycle branch from f1fe43f to 0d4e43a Compare April 4, 2025 11:21