
KEP-5328: Node Capabilities #5347


Open
wants to merge 1 commit into
base: master
from

Conversation

pravk03

@pravk03 pravk03 commented May 28, 2025

  • One-line PR description: Add the initial KEP for KEP 5328: Node Capabilities
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory labels May 28, 2025
@k8s-ci-robot
Contributor

Welcome @pravk03!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 28, 2025
@k8s-ci-robot
Contributor

Hi @pravk03. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pravk03
Once this PR has been reviewed and has the lgtm label, please assign dchen1107 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 28, 2025
@pravk03 pravk03 marked this pull request as draft May 28, 2025 00:47
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 28, 2025
@pravk03 pravk03 force-pushed the node-capabilities branch 2 times, most recently from 59e7e54 to 4719180 Compare May 28, 2025 00:59
Member

@wojtek-t wojtek-t left a comment


### API Changes

Add a `NodeCapabilities` field of type `map[string]string` to the `NodeSpec.NodeStatus` structure.
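
As a rough illustration of the field described in this excerpt, here is a minimal Go sketch. The `NodeStatus` struct below is a trimmed-down, hypothetical stand-in rather than the real `k8s.io/api/core/v1` type, and the JSON tag and the example key and value are assumptions, not something the KEP text specifies.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// NodeStatus is a hypothetical stand-in for the real core/v1 NodeStatus.
// Only the proposed field is shown; all existing fields are omitted.
type NodeStatus struct {
	// NodeCapabilities maps a capability key to its value.
	// The name and type follow the KEP excerpt; the JSON tag is an assumption.
	NodeCapabilities map[string]string `json:"nodeCapabilities,omitempty"`
}

func main() {
	status := NodeStatus{
		NodeCapabilities: map[string]string{
			// Hypothetical key and value, for illustration only.
			"kubernetes.io/feature/exampleFeature": "true",
		},
	}

	// Serialize the status the way an API object would be sent on the wire.
	out, err := json.Marshal(status)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Because the map is untyped beyond string to string, any meaning has to come from the key and value conventions, which is the concern raised in the review comment that follows.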
Member

Should this really be string=>string?
To make that useful, it needs to carry semantics and be understood in exactly the same way by the scheduler.

I don't have a counterproposal for it - but maybe we can somehow couple that with certain pod features?

Author

The string=>string format was chosen for its flexibility, similar to node labels and annotations. This allows us to represent diverse information types and easily add new capabilities without requiring API schema changes.

I agree that clear semantic understanding by the scheduler is necessary. Could this be achieved through consistent, DNS-style keys, where the naming convention itself (e.g. kubernetes.io/feature/featureName) effectively defines each capability?
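
To make the DNS-style key idea concrete, here is a small Go sketch that splits such a key into its domain prefix and the remaining path. The function name `parseCapabilityKey` and the validation rules are hypothetical; the KEP excerpt only gives `kubernetes.io/feature/featureName` as an example and does not define a grammar.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// parseCapabilityKey splits a key such as "kubernetes.io/feature/featureName"
// into its domain prefix and the rest of the path.
// The checks below are illustrative assumptions, not rules from the KEP.
func parseCapabilityKey(key string) (string, string, error) {
	domain, name, found := strings.Cut(key, "/")
	if !found || domain == "" || name == "" {
		return "", "", errors.New("capability key must look like <domain>/<name>")
	}
	if !strings.Contains(domain, ".") {
		return "", "", errors.New("capability key prefix must be a DNS-style domain")
	}
	return domain, name, nil
}

func main() {
	domain, name, err := parseCapabilityKey("kubernetes.io/feature/featureName")
	if err != nil {
		panic(err)
	}
	fmt.Println("domain:", domain)
	fmt.Println("name:", name)
}
```

A convention like this only tells a consumer who owns a key; it does not by itself say how the value should be interpreted, which is the open question in this thread.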

* Requires modifying a core Kubernetes API object, leading to complexities in versioning, upgrades, and maintenance. Extending the types of capabilities might require further core API changes.
* Would make the `NodeStatus` object larger and less focused on just the operational status.
* Scalability
  * Updating `NodeCapabilities` with every `NodeStatus` update is a waste of network resources and API server processing because the information in `NodeCapabilities` doesn't change frequently. A large `NodeCapabilities` field, especially with many features or resources, significantly increases the size of the `NodeStatus` object.
Member

  1. How many capabilities do we expect here?
  2. Once we decide to publish capability X in the status, are we effectively committing to publishing it forever? Or do we plan to stop publishing it after Y happens (whether Y is an event or time-based)? If we never trim capabilities, we risk growing the object forever.

Author

Some capabilities, like those reflecting stable kernel or runtime features, are expected to be long-lived and would persist as long as they remain relevant on the node. Capabilities tied to Kubelet features in alpha or beta stages can be automatically deprecated after the feature becomes GA.

It's challenging to estimate the number of capabilities, but the proposal's restriction to publishing only information actionable by control plane components (scheduler or admission controllers) is designed to keep this number manageable.

- "@pravk03"
owning-sig: sig-node
participating-sigs:
- sig-node
Member

You should put sig-scheduling here and have a reviewer from that sig.


This proposal introduces **"Node Capabilities"** as a scheduling mechanism in Kubernetes, ensuring that pods run reliably on the nodes where they are scheduled while reducing operational burden. It provides a standardized way for Kubelet to advertise specific node features and configurations, decreasing reliance on manual taints and labels for scheduling decisions.

NodeCapabilities aims to prevent pods from being scheduled on incompatible nodes - those missing necessary features because of version skew between control plane and the Node or unsupported runtime/kernel configurations ([slack discussion](https://kubernetes.slack.com/archives/C5P3FE08M/p1741867194258139)). Making the scheduler aware of specific node capabilities will enable more reliable pod placement and ensure that incompatibilities are proactively identified as scheduling failures.
Member

Should it prevent pods from being scheduled, or from being bound? We plan to separate workload scheduling from binding, and scheduling may start considering pods that do not exist yet, and therefore Nodes that are not ready yet as well.

/cc @wojtek-t @x13n WDYT?

Member

This isn't dynamic - so we don't expect this to change anytime soon. So it should prevent scheduling.

Member

IIUC some capabilities are dynamic: the ones that signify the presence of necessary daemons. Newly turned-up nodes would always start with such capabilities "missing", but that would be temporary, which should at least block binding.

Author

To avoid any race conditions, the key requirement for NodeCapabilities is to include static configurations available during kubelet bootstrap. We should not really have temporarily missing capabilities.

### Non-Goals

1. Replace taints/tolerations or node labels to aid with the scheduling decisions.
2. This KEP focuses on introducing the NodeCapabilities API. The exact details of how specific Node Capabilities should be mapped to workload requirements is use case specific and out of scope for this KEP.
Member

How can the scheduling logic be constructed without a specification of which pod needs which capabilities? I think without it, the proposal is incomplete.

Author

I added some details on the kube-scheduler changes necessary and included example workflows in the Design section. PTAL.

### Goals

1. Define a standard mechanism for Nodes to expose Kubelet, Runtime, and Kernel configurations that are pertinent to workload scheduling and/or improve API Request validation.
2. Enhance the kube-scheduler to understand pod requirements, match them against Node capabilities, and place pods on compatible nodes.
Member

Can you give some examples in the User Stories of how pod requirements may map to specific capabilities?

Author

Thanks for the suggestion. Added some examples.

@pravk03 pravk03 force-pushed the node-capabilities branch 3 times, most recently from 4c11e06 to 9254f9b Compare May 28, 2025 23:11
@pravk03 pravk03 changed the title KEP-5328: Node Capability Aware Scheduling KEP-5328: Node Capabilities May 28, 2025
@pravk03 pravk03 marked this pull request as ready for review May 28, 2025 23:14
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 28, 2025
@k8s-ci-robot k8s-ci-robot requested a review from mrunalp May 28, 2025 23:14
@pravk03
Author

pravk03 commented May 29, 2025

/cc @tallclair @yujuhong

@pravk03 pravk03 force-pushed the node-capabilities branch from 9254f9b to f8291a4 Compare May 29, 2025 01:06
@sanposhiho
Member

/sig scheduling

@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label May 29, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling May 29, 2025
@pravk03 pravk03 force-pushed the node-capabilities branch 2 times, most recently from a1bd12b to 26e03c8 Compare May 30, 2025 00:44
@pravk03 pravk03 force-pushed the node-capabilities branch from 26e03c8 to 9dc58f7 Compare May 30, 2025 00:55

`NodeCapabilityFilter` plugin would

* Inspect the PodSpec to infer a set of required NodeCapability key-value pairs.
Contributor

This inference is confusing to me. How will the scheduler know which features in a pod map to one on a node? Is it going to be declarative within Kubernetes, or imperative?
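
For reference, the matching half of such a filter can be sketched in a few lines of Go. This is only an illustration under stated assumptions: the function name `missingCapabilities` is hypothetical, it does not use the real scheduler framework interfaces, it assumes exact string equality on values, and it takes the required set as an input, so it leaves open the question above of how that set is derived from the PodSpec.

```go
package main

import (
	"fmt"
	"sort"
)

// missingCapabilities returns the required keys that the node either does not
// advertise or advertises with a different value, sorted for stable output.
// Exact-match semantics are an assumption, not something the KEP excerpt states.
func missingCapabilities(required, node map[string]string) []string {
	var missing []string
	for key, want := range required {
		if got, ok := node[key]; !ok || got != want {
			missing = append(missing, key)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Hypothetical capability keys, for illustration only.
	node := map[string]string{
		"kubernetes.io/feature/exampleFeature": "true",
	}
	required := map[string]string{
		"kubernetes.io/feature/exampleFeature": "true",
		"kubernetes.io/feature/otherFeature":   "true",
	}

	if missing := missingCapabilities(required, node); len(missing) > 0 {
		fmt.Println("node filtered out; missing capabilities:", missing)
	} else {
		fmt.Println("node is feasible")
	}
}
```

Returning the list of missing keys rather than a bare boolean would let a real plugin surface the incompatibility as a readable scheduling failure, in line with the KEP's stated aim of reporting incompatibilities at scheduling time.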

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
Status: Needs Triage

6 participants