
WIP: Introduce Node Lifecycle WG #8396

Open
wants to merge 1 commit into base: master

Conversation

atiratree
Member

No description provided.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: atiratree
Once this PR has been reviewed and has the lgtm label, please assign parispittman for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/community-management area/slack-management Issues or PRs related to the Slack Management subproject labels Mar 24, 2025
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and ardaguclu March 24, 2025 12:17
@k8s-ci-robot k8s-ci-robot added committee/steering Denotes an issue or PR intended to be handled by the steering committee. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/contributor-experience Categorizes an issue or PR as relevant to SIG Contributor Experience. do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Mar 24, 2025
@atiratree atiratree changed the title Introduce Node Lifecycle WG WIP: Introduce Node Lifecycle WG Mar 24, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 24, 2025
@atiratree
Member Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 24, 2025
@rthallisey

Looks like I'm not a member of kubernetes org anymore. I was a few years back, but didn't keep up with contributions recently. You can remove me as a lead and I can reapply after some contributions to this WG.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. label Mar 24, 2025
@atiratree
Member Author

We have had impactful conversations with Ryan about this group and its goals. He has experience with cluster maintenance and I look forward to his participation in the WG.

@marquiz
Contributor

marquiz commented Mar 25, 2025

/cc

@k8s-ci-robot k8s-ci-robot requested a review from marquiz March 25, 2025 17:09
@ajaysundark

/cc

@hakman
Member

hakman commented Mar 27, 2025

/cc

@elmiko
Contributor

elmiko commented Mar 27, 2025

/cc

@k8s-ci-robot k8s-ci-robot requested a review from elmiko March 27, 2025 18:19
@jackfrancis
Contributor

/cc

projects independently addressing similar issues. The goal of this working group is to develop
unified APIs that the entire ecosystem can depend on, reducing the maintenance burden across
projects and addressing scenarios that impede node drain or cause improper pod termination. Our
objective is to create easily configurable, out-of-the-box solutions that seamlessly integrate with
Member

So is the goal APIs for solutions to use, or implementing a solution? These two sentences seem to be at odds. Maybe mention that k8s has no plans to block customers from implementing advanced use cases.

Member Author

We want to do both :) I have added another sentence there to explain it better.


### In scope

- Explore a unified way of draining the nodes and managing node maintenance by introducing new APIs
Member

Can you please add a goal of migrating existing scenarios to the new API, so the group is tasked with not breaking users when they upgrade?

Member Author

We do have that in scope under

Migrate users of the eviction based kubectl-like drain (kubectl, cluster autoscaler, karpenter), and other scenarios to use the new approach.

So far it is pretty generic until we have a clearer vision. Please let me know if you would like to see something more specific.

shutdown. This then impacts both the node and pod autoscaling, load balancing, and the applications
running in the cluster. All of these areas have issues and would benefit from a unified approach.

### In scope
Member

Another goal should include making Pods work reliably while terminating. This is important since, with the proliferation of non-live-migratable VMs with accelerators, we see more and more situations where maintenance-caused termination may take hours, if not days.

Member Author

Good idea, I have added two more stories. All in all, the In scope section covers this in general, I hope.

public and private cloud providers, Kubernetes distribution providers,
and cloud provider end-users. Here are some user stories:

- As a cluster admin I want to have a simple interface to initiate a node drain/maintenance without
Member

Can we add a goal to explore the scenario of getting a historical perspective on why a node was terminated/drained/killed? This comes up very often and maybe we can help those scenarios in this WG. Various ideas like Node object "tombstones" were discussed in the past.

Member Author

I am not really sure I fully understand this. I have added a new point to the In scope section that mentions this. Feel free to write a GitHub suggestion.

Comment on lines 78 to 79
accelerators; it's far too expensive. It is more cost-effective to coordinate a drain and then
upgrade.
Member

this is a strong statement. Maybe we can say it more generically: Investigate the most cost-effective ways to upgrade nodes with expensive accelerators deployed on them.


The blue-green upgrade with accelerators use case is, I think, an important one to mention in some way. Drain and upgrade always have a cost: money, time, complexity, etc. We're trying to say that the current ecosystem of APIs and tools, or lack thereof, causes solutions to be more "expensive" than they should be. We can rephrase to emphasize this.

Perhaps it would be better to focus on some specific examples showing the "cost". E.g. some people want to keep the specific accelerators they have been using because of wear and tear.

Member

My concern is mostly about the wording. Even more cost-effective is to force-kill and recreate. Drain is not always the best path, depending on the workload.

Member Author

Ack, we have changed the wording.


Area we expect to explore:

- An API to express node drain/maintenance.
Member

One problem I feel we will need to address is how to transition existing drain logic in various components to this new API. Having a new API without migrating the old ways to it creates "yet another" way to do it and requires end users to understand even more draining logic.

Member Author

There is still no upgrade path, since we have not even agreed on the solution. We don't want to break people using the current approaches. The main incentive to switch should be painless upgrades/maintenance and other benefits.

I expect that the main components/users that use the kubectl(-like) drain should not have a hard time using the new solution(s). However, I am not sure what it will look like for the GNS, for example.


Area we expect to explore:

- An API to express node drain/maintenance.
Member

why is the graceful termination mentioned above not listed here? Mostly curious

Member Author

Ah, sorry, I have fixed that. I am also open to other KEPs/documents that people would like to include here.

@fabriziopandini
Member

@atiratree I'm Fabrizio Pandini from SIG Cluster Lifecycle and I just saw this proposal.
I was wondering if you or someone behind this effort will be in London so we can chat briefly in person about it.

@atiratree
Member Author

@fabriziopandini I will not be present at KubeCon, but feel free to connect with @rthallisey. We also plan to attend the SIG Cluster Lifecycle meeting after KubeCon.

@rthallisey

@fabriziopandini I will be there.

For anyone else who will be at KubeCon EU next week and wants to have a high-level discussion in person, please reach out to me on Slack (rhallisey) or email ([email protected]). I'll do my best to connect with anyone interested.

@hakman
Member

hakman commented Mar 29, 2025

@rthallisey by any chance will you also be at Maintainer Summit?

@rthallisey

@hakman, yes I'll be at the Maintainer Summit

@hakman
Member

hakman commented Mar 29, 2025

@hakman, yes I'll be at the Maintainer Summit

Awesome, this way maybe we can catch up more easily. I think @justinsb also wants to say hello.

users can use the same APIs or configurations across the board.
- Migrate users of the eviction based kubectl-like drain (kubectl, cluster autoscaler, karpenter,
...) and other scenarios to use the new approach.
- Explore possible scenarios behind the reason why the node was terminated/drained/killed and how to
@ivelichkovich commented Apr 2, 2025

It seems like everyone is solving the problem of node maintenance independently and building private in-house solutions. Improving the drain behavior is one aspect of maintenance (generally the first step after detection). There are additional steps once a node is ready to be acted on that everyone seems to have an in-house solution for (especially for people serving accelerated infra).

An example might be a system that drains the node and then reboots it when a GPU fault is detected. That's just one example; the system should be able to take arbitrary actions based on various signals after waiting for a signal that the node is all good to work on. Maybe some controller like "when you see state X, create arbitrary CR Y", so users can extend the controller for Y to take whatever remediation action they want, such as reboot / reset GPU drivers / reset NICs / etc.

It seems like it would be good to come up with a community solution for how to take these actions after a node is drained and ready to be worked on. Thoughts on including this in the wg?

Member Author

Yeah, we want to include these considerations in the WG. We imply these in our goals, but I have added your suggestion as an additional user story to make it clearer.

@dims
Member

dims commented Apr 3, 2025

@rthallisey please open an org membership request so we can add you here!


@elmiko (Contributor) left a comment

i really like the direction this is going, i have a question and a suggestion.

support advanced use cases across the ecosystem.

To properly solve the node drain, we must first understand the node lifecycle. This includes
provisioning/sunsetting of the nodes, PodDisruptionBudgets, API-initiated eviction and node
Contributor

should we specifically include topology spread constraints in this list as well?

Member Author

Scheduling is certainly an important part as well. I have added a mention of scheduling constraints to our goals.

### Out of scope

- Implementing cloud provider specific logic, the goal is to have high-level API that the providers
can use, hook into, or extend.
Contributor

+1

- As a user, I want my application to finish all network and storage operations before terminating a
pod. This includes closing pod connections, removing pods from endpoints, writing cached writes
to the underlying storage and completing storage cleanup routines.

Contributor

i think another user story is around the use of ephemeral low cost instances on cloud providers. eg

As a cluster admin, I would like to use a mixture of on-demand and temporary spot instances in my clusters to reduce cloud expenditure. Having more reliable lifecycle and drain mechanisms for nodes will improve cluster stability in scenarios where instances may be terminated by the cloud provider due to cost-related thresholds.

Member Author

Agree, this story is also important to have. Thanks!


### In scope

- Explore a unified way of draining the nodes and managing node maintenance by introducing new APIs


Is it worth including something about DRA device taints/drains?

Member Author

Seems relevant to me as this affects the pod and device/node lifecycle. @pohly what do you think about including and discussing kubernetes/enhancements#5055 in the WG?

Co-authored-by: Ryan Hallisey <[email protected]>
@atiratree atiratree force-pushed the wg-node-lifecycle branch from f1fe43f to 0d4e43a Compare April 4, 2025 11:21