KEP-5328: Node Capabilities #5347
Welcome @pravk03!
Hi @pravk03. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: pravk03 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
59e7e54 to 4719180
@dom4ha @sanposhiho @macsko - FYI
### API Changes

Add a `NodeCapabilities` field of type `map[string]string` to the `NodeSpec.NodeStatus` structure.
Should this really be string=>string?
To be useful, each entry needs to carry semantics that the scheduler understands in exactly the same way.
I don't have a counterproposal for it, but maybe we can somehow couple that with certain pod features?
The string=>string format was chosen for its flexibility, similar to node labels and annotations. This allows us to represent diverse information types and easily add new capabilities without requiring API schema changes.
I agree that clear semantic understanding by the scheduler is necessary. Could this be achieved through consistent, DNS-style keys, where the naming convention itself (e.g. `kubernetes.io/feature/featureName`) effectively defines each capability?
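To make the string=>string shape concrete, here is a minimal Go sketch. It is only an illustration under stated assumptions: the `NodeStatus` struct is a trimmed stand-in rather than the real core type, `FeatureName` is a hypothetical helper, and the `kubernetes.io/feature/` prefix is taken from the example key above, not from a settled convention.

```go
package main

import (
	"fmt"
	"strings"
)

// NodeStatus is a trimmed, hypothetical stand-in for the real API type;
// only the field proposed in this KEP is shown.
type NodeStatus struct {
	// NodeCapabilities maps a DNS-style key to an opaque string value.
	NodeCapabilities map[string]string
}

// featurePrefix mirrors the example key from the discussion.
// It is an assumption, not an agreed naming convention.
const featurePrefix = "kubernetes.io/feature/"

// FeatureName returns the feature part of a capability key and reports
// whether the key uses the assumed prefix with a non-empty name.
func FeatureName(key string) (string, bool) {
	if !strings.HasPrefix(key, featurePrefix) {
		return "", false
	}
	name := strings.TrimPrefix(key, featurePrefix)
	return name, name != ""
}

func main() {
	status := NodeStatus{NodeCapabilities: map[string]string{
		"kubernetes.io/feature/featureName": "true",
	}}
	for key, value := range status.NodeCapabilities {
		if name, ok := FeatureName(key); ok {
			fmt.Printf("feature %s advertised with value %q\n", name, value)
		}
	}
}
```

The value stays an opaque string here, which is exactly the concern raised above: the key convention names the capability, but both the kubelet and the scheduler still have to agree on what each value means.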
* Requires modifying a core Kubernetes API object, leading to complexities in versioning, upgrades, and maintenance. Extending the types of capabilities might require further core API changes.
* Would make the `NodeStatus` object larger and less focused on just the operational status.
* Scalability
  * Updating `NodeCapabilities` with every `NodeStatus` update is a waste of network resources and API server processing because the information in NodeCapabilities doesn't change frequently. A large NodeCapabilities field, especially with many features or resources, significantly increases the size of the NodeStatus object.
- How many capabilities do we expect here?
- Once we decide to publish capability X in the status, are we effectively committing to publishing it forever? Or do we plan to stop publishing it after Y happens (whether Y is an event or time-based)? If we never trim capabilities, we risk growing the object forever.
Some capabilities, like those reflecting stable kernel or runtime features, are expected to be long-lived and would persist as long as they remain relevant on the node. Capabilities tied to Kubelet features in alpha or beta stages can be automatically deprecated after the feature becomes GA.
It's challenging to estimate the number of capabilities, but the proposal's restriction to publishing only information actionable by control plane components (scheduler or admission controllers) is designed to keep this number manageable.
  - "@pravk03"
owning-sig: sig-node
participating-sigs:
  - sig-node
You should put sig-scheduling here and have a reviewer from that sig.
This proposal introduces **"Node Capabilities"** as a scheduling mechanism in Kubernetes, ensuring that pods can run reliably on the nodes where they are scheduled while reducing the operational burden. It provides a standardized way for Kubelet to advertise specific node features and configurations, decreasing reliance on manual taints and labels for scheduling decisions.

NodeCapabilities aims to prevent pods from being scheduled on incompatible nodes: those missing necessary features because of version skew between the control plane and the node, or because of unsupported runtime/kernel configurations ([slack discussion](https://kubernetes.slack.com/archives/C5P3FE08M/p1741867194258139)). Making the scheduler aware of specific node capabilities will enable more reliable pod placement and ensure that incompatibilities are proactively identified as scheduling failures.
This isn't dynamic - so we don't expect this to change anytime soon. So it should prevent scheduling.
IIUC some capabilities are dynamic: the ones that signify the presence of necessary daemons. Newly turned-up nodes would always start with such capabilities "missing", but only temporarily, which should at least block binding.
To avoid race conditions, the key requirement for NodeCapabilities is that it only includes static configuration available during kubelet bootstrap. We should not really have temporarily missing capabilities.
### Non-Goals

1. Replace taints/tolerations or node labels to aid with the scheduling decisions.
2. This KEP focuses on introducing the NodeCapabilities API. The exact details of how specific Node Capabilities should be mapped to workload requirements is use case specific and out of scope for this KEP.
How could the scheduling logic be constructed without a specification of which pod needs which capabilities? I think without it, the proposal is incomplete.
I added some details on the kube-scheduler changes necessary and included example workflows in the Design section. PTAL.
### Goals

1. Define a standard mechanism for Nodes to expose Kubelet, Runtime, and Kernel configurations that are pertinent to workload scheduling and/or improve API Request validation.
2. Enhance the kube-scheduler to understand pod requirements and match them against Node capabilities and place pods on compatible nodes.
Can you give some examples in User Stories of how pod requirements may map to specific capabilities?
Thanks for the suggestion. Added some examples.
4c11e06 to 9254f9b
/cc @tallclair @yujuhong
9254f9b to f8291a4
/sig scheduling
a1bd12b to 26e03c8
26e03c8 to 9dc58f7
`NodeCapabilityFilter` plugin would

* Inspect the PodSpec to infer a set of required NodeCapability key-value pairs.
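The matching half of such a filter might look like the Go sketch below. It is an illustration only: `MissingCapabilities` and the example key are hypothetical, exact string equality is an assumed matching rule, and the step that infers the required pairs from the PodSpec is deliberately left out because this excerpt does not define it.

```go
package main

import (
	"fmt"
	"sort"
)

// MissingCapabilities returns the required keys that the node either does
// not advertise or advertises with a different value. Exact string equality
// is an assumption; the KEP excerpt does not define the comparison rule.
func MissingCapabilities(required, advertised map[string]string) []string {
	var missing []string
	for key, want := range required {
		if got, ok := advertised[key]; !ok || got != want {
			missing = append(missing, key)
		}
	}
	// Sort so that callers get a stable order despite random map iteration.
	sort.Strings(missing)
	return missing
}

func main() {
	// Hypothetical requirement, standing in for whatever the plugin infers.
	required := map[string]string{"kubernetes.io/feature/featureName": "true"}
	// A node that advertises nothing yet.
	advertised := map[string]string{}

	if missing := MissingCapabilities(required, advertised); len(missing) > 0 {
		fmt.Println("node filtered out, missing capabilities:", missing)
	} else {
		fmt.Println("node is compatible")
	}
}
```

A real plugin would surface the missing keys through the scheduler's filter status so the incompatibility shows up as a scheduling failure; that wiring is omitted here.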
This inference is confusing to me. How will the scheduler know which features in a pod map to which capability on a node? Is it going to be declarative within Kubernetes, or imperative?