WIP: Introduce Node Lifecycle WG #8396
`sigs.yaml`

@@ -3697,6 +3697,58 @@ workinggroups:
      liaison:
        github: saschagrunert
        name: Sascha Grunert
  - dir: wg-node-lifecycle
    name: Node Lifecycle
    mission_statement: >
      Explore and improve node and pod lifecycle in Kubernetes. This should result in
      better node drain/maintenance support and better pod disruption/termination. It
      should also improve node and pod autoscaling, application migration and availability,
      load balancing, de/scheduling, node shutdown, and cloud provider integrations, and
      support other new scenarios and integrations.
    charter_link: charter.md
    stakeholder_sigs:
      - Apps
      - Architecture
      - Autoscaling
      - CLI
      - Cloud Provider
      - Cluster Lifecycle
      - Network
      - Node
      - Scheduling
      - Storage

Review thread on the stakeholder SIGs list:

- Don't we need to add other infra SIGs like Storage as stakeholders too?
  - +1, added. This list is not final and we will have conversations with each SIG after KubeCon about whether they want to be part of this WG.
  - I know SIG Storage is impacted as well, I am just not sure how much cooperation is needed.
  - Thanks @atiratree 👍. I would like to help on this effort with anything related to storage. Tagging leads from Storage here (@xing-yang @msau42 @jsafrane @saad-ali).
  - Nice, thanks for the help @humblec! Added.

    label: node-lifecycle
    leadership:
      chairs:
        - github: atiratree
          name: Filip Křepinský
          company: Red Hat
          email: [email protected]
        - github: fabriziopandini
          name: Fabrizio Pandini
          company: VMware
          email: [email protected]
        - github: humblec
          name: Humble Chirammal
          company: VMware
          email: [email protected]
        - github: rthallisey
          name: Ryan Hallisey
          company: NVIDIA
          email: [email protected]
    meetings:
      - description: WG Node Lifecycle Weekly Meeting
        day: TBD
        time: TBD
        tz: TBD
        frequency: weekly
    contact:
      slack: wg-node-lifecycle
      mailing_list: https://groups.google.com/a/kubernetes.io/g/wg-node-lifecycle
      liaison:
        github: TBD
        name: TBD
  - dir: wg-policy
    name: Policy
    mission_statement: >
`wg-node-lifecycle/OWNERS` (new file)

@@ -0,0 +1,8 @@
# See the OWNERS docs at https://go.k8s.io/owners

reviewers:
  - wg-node-lifecycle-leads
approvers:
  - wg-node-lifecycle-leads
labels:
  - wg/node-lifecycle
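The `wg-node-lifecycle-leads` name above is an alias, not an individual GitHub handle. The corresponding entry would normally live in the repository's top-level `OWNERS_ALIASES` file, which is not part of this diff. A minimal sketch, assuming the alias simply mirrors the proposed chairs:

```yaml
# Hypothetical OWNERS_ALIASES entry (not included in this PR); membership is
# assumed to mirror the chairs declared in sigs.yaml above.
aliases:
  wg-node-lifecycle-leads:
    - atiratree
    - fabriziopandini
    - humblec
    - rthallisey
```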
`wg-node-lifecycle/README.md` (new file)

@@ -0,0 +1,45 @@
<!---
This is an autogenerated file!
Please do not edit this file directly, but instead make changes to the
sigs.yaml file in the project root.
To understand how this file is generated, see https://git.k8s.io/community/generator/README.md
--->
# Node Lifecycle Working Group

Explore and improve node and pod lifecycle in Kubernetes. This should result in better node drain/maintenance support and better pod disruption/termination. It should also improve node and pod autoscaling, application migration and availability, load balancing, de/scheduling, node shutdown, and cloud provider integrations, and support other new scenarios and integrations.

The [charter](charter.md) defines the scope and governance of the Node Lifecycle Working Group.

## Stakeholder SIGs
* [SIG Apps](/sig-apps)
* [SIG Architecture](/sig-architecture)
* [SIG Autoscaling](/sig-autoscaling)
* [SIG CLI](/sig-cli)
* [SIG Cloud Provider](/sig-cloud-provider)
* [SIG Cluster Lifecycle](/sig-cluster-lifecycle)
* [SIG Network](/sig-network)
* [SIG Node](/sig-node)
* [SIG Scheduling](/sig-scheduling)
* [SIG Storage](/sig-storage)

## Meetings
*Joining the [mailing list](https://groups.google.com/a/kubernetes.io/g/wg-node-lifecycle) for the group will typically add invites for the following meetings to your calendar.*
* WG Node Lifecycle Weekly Meeting: [TBDs at TBD TBD]() (weekly). [Convert to your timezone](http://www.thetimezoneconverter.com/?t=TBD&tz=TBD).

## Organizers

* Filip Křepinský (**[@atiratree](https://github.com/atiratree)**), Red Hat
* Fabrizio Pandini (**[@fabriziopandini](https://github.com/fabriziopandini)**), VMware
* Humble Chirammal (**[@humblec](https://github.com/humblec)**), VMware
* Ryan Hallisey (**[@rthallisey](https://github.com/rthallisey)**), NVIDIA

## Contact
- Slack: [#wg-node-lifecycle](https://kubernetes.slack.com/messages/wg-node-lifecycle)
- [Mailing list](https://groups.google.com/a/kubernetes.io/g/wg-node-lifecycle)
- [Open Community Issues/PRs](https://github.com/kubernetes/community/labels/wg%2Fnode-lifecycle)
- Steering Committee Liaison: TBD (**[@TBD](https://github.com/TBD)**)
<!-- BEGIN CUSTOM CONTENT -->

<!-- END CUSTOM CONTENT -->
`wg-node-lifecycle/charter.md` (new file)

@@ -0,0 +1,160 @@
# WG Node Lifecycle Charter

This charter adheres to the conventions described in the [Kubernetes Charter README] and uses
the Roles and Organization Management outlined in [wg-governance].

[Kubernetes Charter README]: /committee-steering/governance/README.md

## Scope

The Kubernetes ecosystem currently faces challenges in node maintenance scenarios, with multiple
projects independently addressing similar issues. The goal of this working group is to develop
unified APIs that the entire ecosystem can depend on, reducing the maintenance burden across
projects and addressing scenarios that impede node drain or cause improper pod termination. Our
objective is to create easily configurable, out-of-the-box solutions that seamlessly integrate with
existing APIs and behaviors. We will strive to make these solutions minimalistic and extensible to
support advanced use cases across the ecosystem.

To properly solve node drain, we must first understand the node lifecycle. This includes
provisioning/sunsetting of nodes, PodDisruptionBudgets, API-initiated eviction and node
shutdown. This then impacts node and pod autoscaling, de/scheduling, load balancing, and
the applications running in the cluster. All of these areas have issues and would benefit from a
unified approach.

Review threads on this section:

- Can we list out references to the projects, if possible?
  - This might be too heavy for the WG declaration. We have a partial list in https://github.com/atiratree/kube-enhancements/blob/improve-node-maintenance/keps/sig-apps/4212-declarative-node-maintenance/README.md#motivation and I expect other projects to be discussed/added in the future. I would prefer to track it separately once the WG is established. Let's also see what others think. If there is demand, I can add it.
  - +1 to maybe listing this out, or at least a link to where we can find more information.
  - We have built an internal version; some others are draino and medik8s.
  - OK, let's add the list here as a starting point then.
  - I have added a new `Relevant Projects` section.
- Is the plan to add a unified API which controls the Kubernetes features in this area (like priority classes, PDBs, etc.)?
  - Yes, autoscaling (HPA), scheduling and the API objects you have listed are in scope. Currently it is not yet clear which existing APIs will be affected, but we will take them into account. Btw, I am not sure why the thread about the references to the projects is not shown? Here is the link: https://github.com/kubernetes/community/pull/8396/files#r2016483839
- So is the goal APIs to be used by solutions, or to implement a solution? These two sentences seem to be at odds. Maybe mention that k8s has no plans to block customers implementing advanced use cases.
  - We want to do both :) I have added another sentence there to explain it better.
- Should we specifically include topology spread constraints in this list as well?
  - Scheduling is certainly an important part as well. I have added a mention of scheduling constraints to our goals.
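For orientation, the disruption primitives referenced in the scope above (PodDisruptionBudgets and API-initiated eviction) already exist in the `policy/v1` API group. A minimal sketch with placeholder names and values, not taken from this PR:

```yaml
# A PodDisruptionBudget caps voluntary disruptions for an application
# (names, labels and counts below are placeholders).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-app-pdb
  namespace: default
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: example-app
---
# An API-initiated eviction is an Eviction object POSTed to the target pod's
# `eviction` subresource; the apiserver admits it only if PDBs still allow
# the disruption.
apiVersion: policy/v1
kind: Eviction
metadata:
  name: example-app-7d4b9c-xyz   # placeholder pod name
  namespace: default
```

The charter's argument is that these building blocks protect application availability but do not express node-level intent such as an ongoing maintenance or drain, which is the gap the proposed APIs target.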
### In scope

- Explore a unified way of draining the nodes and managing node maintenance by introducing new APIs
  and extending the current ones. This includes exploring extensions to or interactions with the Node
  object.
- Analyze the node lifecycle, the Node API, and possible interactions. We want to explore augmenting
  the Node API to expose additional state or status in order to coalesce other core Kubernetes and
  community APIs around node lifecycle management.
- Improve the disruption model that is currently implemented by API-initiated eviction and PDBs.
  Improve the descheduling, availability and migration capabilities of today's application
  workloads. Also explore the interactions with other eviction mechanisms.
- Improve the Graceful/Non-Graceful Node Shutdown and consider how this affects the node lifecycle.
  Graduate the [Graceful Node Shutdown](https://github.com/kubernetes/enhancements/issues/2000)
  feature to GA and resolve the associated node shutdown issues (a sketch of the current kubelet
  configuration for this feature follows at the end of this section).
- Improve scheduling and pod/node autoscaling to take into account ongoing node maintenance and
  the new disruption model/evictions. This includes balancing of pods according to scheduling
  constraints.
- Consider improving the pod lifecycle of DaemonSets and static pods during node maintenance.
- Explore the cloud provider use cases and how providers can hook into the node lifecycle, so that
  users can use the same APIs or configurations across the board.
- Migrate users of the eviction-based kubectl-like drain (kubectl, cluster autoscaler, karpenter,
  ...) and other scenarios to the unified drain approach described in the first bullet.
- Explore possible scenarios behind the reason why a node was terminated/drained/killed and how to
  track and react to each of them. Consider past discussions/historical perspective
  (e.g. "tombstones").

Review threads on this section:

- Another goal should include making pods work reliably while terminating. This is important since, with the proliferation of non-live-migratable VMs with accelerators, we see more and more situations where maintenance-caused termination can take hours if not days.
  - Good idea, I have added two more stories. All in all, the In scope section covers this in general, I hope.
- Can you please add a goal of migrating existing scenarios to the new API, so the group will be tasked to not break users when they are upgrading?
  - We do have that in scope under the bullet about migrating users of the eviction-based kubectl-like drain. So far it is pretty generic until we have a clearer vision. Please let me know if you would like to see something more specific.
- Is it worth including something about DRA device taints/drains?
  - Seems relevant to me as this affects the pod and device/node lifecycle. @pohly, what do you think about including and discussing kubernetes/enhancements#5055 in the WG?
  - Part of the motivation for 5055 is to not have to drain entire nodes when doing maintenance which only affects one device or hardware managed by one driver - unless of course that maintenance ends up with having to reboot the node. So I guess it depends?
  - From what I understand about device taints, they are a way we can make device health scheduler-aware. This fits into our scope because we need a way to decide which Node should be prioritized for maintenance and a plan to drain that Node. E.g. a Node with all its devices tainted is a great target for Node maintenance. However, I think we should hold onto the DRA device taints feature for when we discuss Node Maintenance and Node Drain designs. I don't think we need it called out in the scope, as Node Maintenance and Node Drain should cover it.
  - We would also like to handle the pod lifecycle better in any descheduling scenario (not just Node Drain/Maintenance). One option is to use the EvictionRequest API, which should give more power to the applications that are being disrupted. So it might be interesting to see if we can make the disruption more graceful in 5055.
  - The other end of this to consider, though, is devices that span multiple nodes.
  - I have added it to the WG. I think it would be good to discuss this feature and its impact. Also, it might be better to hold off beta for some time.
- nit: Make it more clear that when you say "new approach" you're referring to the "unified way of draining the nodes" mentioned in the first bullet point. Either move this sentence up or clarify it here.
- It seems like everyone is solving the problem of node maintenance independently and building private in-house solutions. Improving the drain behavior is one aspect of maintenance (generally the first step after detection). There are additional steps once a node is ready to be acted on that everyone seems to have an in-house solution for (especially for people serving accelerated infra). An example might be a system that drains the node and then reboots it when a GPU fault is detected. That's just one example; the system should be able to take arbitrary actions based on various signals after waiting for a signal that the node is all good to work on. Maybe some controller like "when you see X state, create arbitrary CR Y", so users can extend the controller for Y to take whatever remediation action they want, such as reboot / reset GPU drivers / reset NICs / etc. It seems like it would be good to come up with a community solution for how to take these actions after a node is drained and ready to be worked on. Thoughts on including this in the WG?
  - Yeah, we want to include these considerations in the WG. We imply these in our goals, but I have added your suggestion as an additional user story to make it clearer.
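As referenced in the Graceful Node Shutdown bullet above, the feature the WG wants to graduate is configured today on the kubelet side. A minimal sketch of the relevant `KubeletConfiguration` fields; the durations are example values only, not recommendations:

```yaml
# Existing kubelet knobs for Graceful Node Shutdown (kubernetes/enhancements#2000).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriod: 60s              # total delay the kubelet requests before the node powers off
shutdownGracePeriodCriticalPods: 20s  # tail end of that window reserved for critical pods
```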
### Out of scope

- Implementing cloud-provider-specific logic; the goal is to have a high-level API that providers
  can use, hook into, or extend.
- Infrastructure provisioning/deprovisioning solutions or physical infrastructure lifecycle
  management solutions.

## Stakeholders

- SIG Apps
- SIG Architecture
- SIG Autoscaling
- SIG CLI
- SIG Cloud Provider
- SIG Cluster Lifecycle
- SIG Network
- SIG Node
- SIG Scheduling
- SIG Storage

Stakeholders range from multiple SIGs to a broad set of end users,
public and private cloud providers, Kubernetes distribution providers,
and cloud provider end-users. Here are some user stories:
- As a cluster admin, I want to have a simple interface to initiate a node drain/maintenance without
  any required manual interventions. I also want to be able to observe the node drain via the API
  and check on its progress. I also want to be able to discover workloads that are blocking the node
  drain.
- To support the new features, node maintenance, scheduler, descheduler, pod autoscaling, kubelet
  and other actors should use a new eviction API to gracefully remove pods. This should enable new
  migration strategies that prefer to surge (upscale) pods first rather than downscale them. It
  should also allow other users/components to monitor pods that are gracefully removed/terminated
  and provide better behaviour in terms of de/scheduling, scaling and availability.
- As a cluster admin, I want to be able to perform arbitrary actions after the node drain is
  complete, such as resetting GPU drivers, resetting NICs, performing software updates or shutting
  down the machine.
- As an end user, I would like more alternatives to blue-green upgrades, especially with special
  hardware accelerators; it's far too expensive. I would like to choose a strategy on how to
  coordinate the node drain and the upgrade to achieve better cost-effectiveness.
- As a cloud provider, I need to perform regular maintenance on the hardware in my fleet. Enhancing
  Kubernetes to help CSPs safely remove hardware will reduce operational costs.
- The cost of doing accelerator maintenance in today's world can be massive. And since hardware
  accelerators tend to need more love and care, having software support to coordinate maintenance
  will reduce operational costs.
- As a cluster admin, I would like to use a mixture of on-demand and temporary spot instances in my
  clusters to reduce cloud expenditure. Having more reliable lifecycle and drain mechanisms for
  nodes will improve cluster stability in scenarios where instances may be terminated by the cloud
  provider due to cost-related thresholds.
- As a user, I want to prevent any disruption to my pet or expensive workloads (VMs, ML with
  accelerators) and either prevent termination altogether or have a reliable migration path.
  Features like `terminationGracePeriodSeconds` are not sufficient as the termination/migration can
  take hours if not days.
- As a user, I want my application to finish all network and storage operations before terminating a
  pod. This includes closing pod connections, removing pods from endpoints, writing cached writes
  to the underlying storage and completing storage cleanup routines.

Review threads on the user stories:

- Can we add a goal to explore the scenario of getting the historical perspective on why a node was terminated/drained/killed? This comes up very often and maybe we can help those scenarios in this WG. Various ideas like Node object "tombstones" were discussed in the past.
  - I am not really sure I fully understand this. I have added a new point to the In scope section that mentions this. Feel free to write a GitHub suggestion.
- nit (on the new-eviction-API story): change "should" to something like "I want to", given that the KEP hasn't been accepted yet.
- Can users not already do this with a PDB? Are we suggesting that node maintenance would override blocking PDBs if they block for some extended period of time? I'm aware of k8s-shredder, an Adobe project that puts nodes into maintenance and then gives them a week to be clear before removing them. I'm wondering if this case is to say, even in that scenario, don't kill my workload?
  - I think PDBs have a different use case, so we may need to reword. The PodDisruptionBudget protects the availability of the application. What we're saying is that there's no API that protects both the availability of the infrastructure and the availability of the application. E.g. an accelerator is degraded on a Node, so I don't want to run future workloads there, but it's OK for the current one to finish. It's in the best interest of the application and infrastructure provider that an admin remediates the accelerator, so admin and user mutually agree on when that can occur.
  - Yeah, I agree there's a problem that the eviction API / drain doesn't guarantee it will finish within a reasonable time, especially if the node is having issues (things get stuck terminating etc.). But this at least we can do today, right? You can just taint the node or the devices with NoSchedule.
  - Yes, that is a solution assuming that: […] However, 1) can theoretically always work but it is the slowest possible solution, and 2) is not guaranteed to work.
  - This is a good example. We want the applications/admins to be aware of upcoming maintenances, but also pods in most descheduling scenarios, so that they are given an opportunity to migrate or clean up before the termination, which is hard to do with PDBs. The goal is not to override PDBs (that is also hard to do without breaking someone); the goal is to have a smarter layer above the PDBs.
  - Just the eviction API alone can cause pods to get stuck. All in all, I would prefer we do not dive deep into the topic and focus mostly on the scope in this PR.
  - Hmm, out of curiosity, can you please share any reference issues where […]
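One of the comments above points at today's workaround of tainting a node (or, with DRA, its devices) with `NoSchedule` so that no new pods land there while existing ones finish. A minimal sketch of such a node taint; the node name and taint key are made up for illustration:

```yaml
# Illustrative only: keep new pods off a node that is awaiting maintenance,
# without evicting the pods already running there. `kubectl cordon <node>`
# achieves a similar effect by setting spec.unschedulable on the Node.
apiVersion: v1
kind: Node
metadata:
  name: worker-1                              # placeholder node name
spec:
  taints:
    - key: example.com/maintenance-pending    # hypothetical taint key
      effect: NoSchedule
```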
Further review threads on the user stories:

- In some scenarios, evicting/removing certain pods would prevent some of these operations. Consider if we were also evicting DaemonSet pods as an option (is that a goal? I wasn't sure); then we might need ordering somehow to make sure the CSI driver or CNI driver aren't removed until certain other cleanup has happened.
  - The way we handle this is that our internal drain API has a label selector for things to ignore, and by default it just ignores DaemonSets instead of worrying about the ordering here. DaemonSets aren't really supported under drain anyway. That should handle most cases for system-level things like CSI/CNI/etc.
  - Yes, DaemonSet pods should be considered as part of the maintenance scenarios, especially in cases when the node is going to shut down. I have added it to the goals to make it clear. We also had discussions with SIG Node about static pod termination in the past, and they were generally not against it. But we lack use cases for it so far.
  - IIUC, this proposal aims to define an order for static pod termination, DaemonSet pods, or other […]
  - I am not saying that we will solve static pod termination, just that we will look into that :) And yes, I think there should definitely be an ordering for both DaemonSet and static pods.
- I think another user story is around the use of ephemeral low-cost instances on cloud providers, e.g. […]
  - Agree, this story is also important to have. Thanks!
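Several of the stories above revolve around the pod-level termination knobs that exist today: `terminationGracePeriodSeconds` and the `preStop` lifecycle hook. A minimal sketch of how an application currently tries to flush connections and cached writes before termination; the name, image, command and timing below are placeholders:

```yaml
# Today's mechanism for "finish work before terminating": a preStop hook plus a
# bounded grace period. The charter's point is that a bounded period is not
# enough for migrations that can take hours or days.
apiVersion: v1
kind: Pod
metadata:
  name: example-app                            # placeholder name
spec:
  terminationGracePeriodSeconds: 3600          # example value
  containers:
    - name: app
      image: registry.example.com/app:latest   # placeholder image
      lifecycle:
        preStop:
          exec:
            # placeholder script: drain connections, flush caches, deregister
            command: ["/bin/sh", "-c", "/opt/app/drain-and-flush.sh"]
```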
## Deliverables

The WG will coordinate requirement gathering and design, eventually leading to
KEP(s) and code associated with the ideas.

Areas we expect to explore:

- An API to express node drain/maintenance.
  Currently tracked in https://github.com/kubernetes/enhancements/issues/4212.
- An API to solve the problems with the API-initiated Eviction API and PDBs.
  Currently tracked in https://github.com/kubernetes/enhancements/issues/4563.
- An API/mechanism to gracefully terminate pods during a node shutdown.
  Graceful Node Shutdown feature tracked in https://github.com/kubernetes/enhancements/issues/2000.
- An API to deschedule pods that use DRA devices.
  DRA: device taints and tolerations feature tracked in https://github.com/kubernetes/enhancements/issues/5055.
- An API to remove pods from endpoints before they terminate.
  Currently tracked in https://docs.google.com/document/d/1t25jgO_-LRHhjRXf4KJ5xY_t8BZYdapv7MDAxVGY6R8/edit?tab=t.0#heading=h.i4lwa7rdng7y.
- Introduce enhancements across multiple Kubernetes SIGs to add support for the new APIs to solve a
  wide range of issues.

We expect to provide reference implementations of the new APIs, including but not limited to
controllers, API validation, integration with existing core components, and extension points for the
ecosystem. This should be accompanied by E2E / conformance tests.

Review threads on this section:

- Node lifecycle around accelerators always comes up. Is there consideration in this group to explore these areas?
  - Yes, although I am not sure if there will be deliverables specifically targeting accelerators yet. We mention them in the user stories and will consider them when creating the APIs.
- One problem I feel we will need to address is how to transition existing drain logic in various components to this new API. Having a new API without migrating old ways to it creates "yet another" way to do it and requires end users to understand more draining logics.
  - There is still no upgrade path, since we have not even agreed on the solution. We don't want to break people using the current approaches. The main incentive to switch should be painless upgrades/maintenance and other benefits. I expect that the main components/users that use the kubectl(-like) drain should not have a hard time using the new solution(s). However, I am not sure what it will look like for the GNS, for example.
- Why is the graceful termination mentioned above not listed here? Mostly curious.
  - Ah, sorry, I have fixed that. I am also open to other KEPs/documents that people would like to include here.
## Relevant Projects

This is a list of known projects that solve similar problems in the ecosystem or would benefit from
the efforts of this WG:

- https://github.com/aws/aws-node-termination-handler
- https://github.com/foriequal0/pod-graceful-drain
- https://github.com/kubereboot/kured
- https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
- https://github.com/kubernetes-sigs/karpenter
- https://github.com/kubevirt/kubevirt
- https://github.com/medik8s/node-maintenance-operator
- https://github.com/Mellanox/maintenance-operator
- https://github.com/openshift/machine-config-operator
- https://github.com/planetlabs/draino
- https://github.com/strimzi/drain-cleaner

There are also internal custom solutions that companies use.
## Roles and Organization Management

This WG adheres to the Roles and Organization Management outlined in [wg-governance]
and opts in to updates and modifications to [wg-governance].

[wg-governance]: /committee-steering/governance/wg-governance.md

## Timelines and Disbanding

The working group will disband when the KEPs we create are completed. We will
review whether the working group should disband if appropriate SIG ownership
can't be reached.
Review thread on the stakeholder SIGs:

- You are missing Network, which owns load balancing, endpoints, ... and already has serious problems because of known issues like https://docs.google.com/document/d/1t25jgO_-LRHhjRXf4KJ5xY_t8BZYdapv7MDAxVGY6R8/edit?tab=t.0#heading=h.i4lwa7rdng7y
  - +1, added the SIG and the doc link; we need to address the load balancing and endpoints as well.