-
Notifications
You must be signed in to change notification settings - Fork 1.4k
📖 Update autoscaling from zero enhancement proposal with support for platform-aware autoscale from zero #11962
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -107,8 +107,8 @@ node group. But, during a scale from zero situation (ie when a node group has ze | |
autoscaler needs to acquire this information from the infrastructure provider. | ||
|
||
An optional status field is proposed on the Infrastructure Machine Template which will be populated | ||
by infrastructure providers to contain the CPU, memory, and GPU capacities for machines described by that | ||
template. The cluster autoscaler will then utilize this information by reading the appropriate | ||
by infrastructure providers to contain the CPU, CPU architecture, memory, and GPU capacities for machines | ||
described by that template. The cluster autoscaler will then utilize this information by reading the appropriate | ||
infrastructure reference from the resource it is scaling (MachineSet or MachineDeployment). | ||
|
||
A user may override the field in the associated infrastructure template by applying annotations to the | ||
|
@@ -160,6 +160,10 @@ the template. Internally, this field will be represented by a Go `map` type uti | |
for the keys and `k8s.io/apimachinery/pkg/api/resource.Quantity` as the values (similar to how resource | ||
limits and requests are handled for pods). | ||
|
||
Additionally, the status field could contain information about the node, such as the architecture and | ||
operating system. This information is not required for the autoscaler to function, but it can be useful in | ||
scenarios where the autoscaler needs to make decisions for clusters with heterogeneous node groups in architecture, OS, or both. | ||
|
||
It is worth mentioning that the Infrastructure Machine Templates are not usually reconciled by themselves. | ||
Each infrastructure provider will be responsible for determining the best implementation for adding the | ||
status field based on the information available on their platform. | ||
|
@@ -175,6 +179,9 @@ const ( | |
// DockerMachineTemplateStatus defines the observed state of a DockerMachineTemplate | ||
type DockerMachineTemplateStatus struct { | ||
Capacity corev1.ResourceList `json:"capacity,omitempty"` | ||
|
||
// +optional | ||
NodeInfo *platform.NodeInfo `json:"nodeInfo,omitempty"` | ||
} | ||
|
||
// DockerMachineTemplate is the Schema for the dockermachinetemplates API. | ||
|
@@ -188,6 +195,40 @@ type DockerMachineTemplate struct { | |
``` | ||
_Note: the `ResourceList` and `ResourceName` referenced are from k8s.io/api/core/v1`_ | ||
|
||
`platform.NodeInfo` is a struct that contains the architecture and operating system information of the node, to implement | ||
in the providers integration code. | ||
Its definition would look like this: | ||
|
||
```go | ||
package platform | ||
|
||
// Architecture represents the CPU architecture of the node. Its underlying type is a string. | ||
// Its underlying type is a string and its value can be any of amd64, arm64, s390x, ppc64le. | ||
// +kubebuilder:validation:Enum=amd64;arm64;s390x;ppc64le | ||
// +enum | ||
aleskandro marked this conversation as resolved.
Show resolved
Hide resolved
|
||
type Architecture string | ||
sbueringer marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
// Architecture constants defined for clarity. | ||
const ( | ||
ArchitectureAmd64 Architecture = "amd64" | ||
ArchitectureArm64 Architecture = "arm64" | ||
ArchitectureS390x Architecture = "s390x" | ||
ArchitecturePpc64le Architecture = "ppc64le" | ||
aleskandro marked this conversation as resolved.
Show resolved
Hide resolved
|
||
) | ||
|
||
// NodeInfo contains information about the node's architecture and operating system. | ||
type NodeInfo struct { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would suggest to use NodeSystemInfo, so there is a more direct link with the corresponding struct in corev1 types. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hi @fabriziopandini, the previous version was using NodeSystemInfo. I changed it due to #11962 (comment). cc @sbueringer I'm ok with either way as far as we have consensus :). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would favor NodeInfo because I think the field and struct name should be the same if there is no good reason to diverge. And I think the field name should be the same as in upstream There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm ok with both options, but the consistency with corev1.Node in this case IMO could help There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree with both your points, but leaning a bit more into @fabriziopandini's option to be consistent with corev1.Node. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I was thinking about this a little bit more and considering those are type implemented by each provider, I think we should not be prescriptive about the type name (nor host those types in CAPI at all). Instead we should carefully formalise only what the autoscaler implementation will rely on when dealing with unstructured types, like we did for capacity in https://github.com/kubernetes-sigs/cluster-api/blob/d0cb4af3c5f2f82b2dcfbe2a89c08661e6afe84b/docs/proposals/20210310-opt-in-autoscaling-from-zero.md#infrastructure-machine-template-status-updates Please note that we also have more recent examples of this approach, e.g. https://main.cluster-api.sigs.k8s.io/developer/providers/contracts/infra-machine#inframachine-conditions Once we settle clear expectation from a contract PoV, then what we put into the "Example implementation in Docker provider" is less relevant, so I don't care if how we call the struct. Instead I would prefer we avoid suggesting providers having this type in a separate package, because this can lead to issues when dealing with versioned APIs |
||
// architecture is the CPU architecture of the node. | ||
// Its underlying type is a string and its value can be any of amd64, arm64, s390x, ppc64le. | ||
// +optional | ||
Architecture Architecture `json:"architecture,omitempty"` | ||
// operatingSystem is a string representing the operating system of the node. | ||
// This may be a string like 'linux' or 'windows'. | ||
// +optional | ||
OperatingSystem string `json:"operatingSystem,omitempty"` | ||
} | ||
``` | ||
|
||
When used as a manifest, it would look like this: | ||
|
||
``` | ||
|
@@ -204,8 +245,13 @@ status: | |
memory: 500mb | ||
cpu: "1" | ||
nvidia.com/gpu: "1" | ||
nodeInfo: | ||
aleskandro marked this conversation as resolved.
Show resolved
Hide resolved
|
||
architecture: arm64 | ||
operatingSystem: linux | ||
``` | ||
|
||
The information stored in the `status.nodeInfo` field will be used by the cluster autoscaler's scheduler simulator to determine the simulated node's labels `kubernetes.io/arch` and `kubernetes.io/os`. This logic will be implemented in the cluster autoscaler's ClusterAPI cloud provider code. | ||
|
||
#### MachineSet and MachineDeployment Annotations | ||
|
||
In cases where a user needs to provide specific resource information for a | ||
|
@@ -229,6 +275,8 @@ metadata: | |
capacity.cluster-autoscaler.kubernetes.io/memory: "500mb" | ||
capacity.cluster-autoscaler.kubernetes.io/cpu: "1" | ||
capacity.cluster-autoscaler.kubernetes.io/ephemeral-disk: "100Gi" | ||
node-info.cluster-autoscaler.kubernetes.io/architecture: "arm64" | ||
node-info.cluster-autoscaler.kubernetes.io/operating-system: "linux" | ||
``` | ||
_Note: the annotations will be defined in the cluster autoscaler, not in cluster-api._ | ||
|
||
|
@@ -246,6 +294,9 @@ metadata: | |
capacity.cluster-autoscaler.kubernetes.io/taints: "key1=value1:NoSchedule,key2=value2:NoExecute" | ||
``` | ||
|
||
The `capacity.cluster-autoscaler.kubernetes.io/labels` annotation takes precedence over other sources when conflicts occur. | ||
For instance, if the `kubernetes.io/arch` label is specified in `capacity.cluster-autoscaler.kubernetes.io/labels`, its value supersedes that specified by `node-info.cluster-autoscaler.kubernetes.io/architecture`. | ||
Comment on lines
+297
to
+298
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm getting back to this because I'm still a little bit confused. I was assuming that capacity.cluster-autoscaler.kubernetes.io and node-info.cluster-autoscaler.kubernetes.io serve a different pourpuse, so value cannot overlap (the example with both reporting arch is confusing). Instead what can happen IMO is that each annotation overlaps with the corresponding struct in templates (capacity.cluster-autoscaler.kubernetes.io with status.capacity and node-info.cluster-autoscaler.kubernetes.io with status.nodeIndfo), and I was expecting this paragraph to define which of the two takes precedence There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The issue here is that node-info and status.node(System)?Info will render as labels for the scheduler simulator in the autoscaler side. So either we deny capacity labels to specify labels like kubernetes.io/arch or we have to define which one takes precedence over the other. The autoscaler and scheduler do not use status.nodeSystemInfo directly, but obey labels for pods that specify predicates for them There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Generally speaking I would prefer to prevent users to make configuration that doesn't make sense instead of defining rules to handle those configurations. In this case, considering those labels are defined by the autoscaler and not by CAPI, I think that the most pragmatic approach is to document that If possilble, I would also suggest to move all considerations about labels for the scheduler simulator in the autoscaler side on a separated paragraphs, making it clear this is not something affecting CAPI users directly. Going back to what confuses me. Even if we remove the issue of mixing up values, from node-info and capacity, it still remain the fact that in CAPI it is possible to define the same setting in three places. I would also suggest to add an example clarifying what should happen if a user sets capacity.cluster-autoscaler.kubernetes.io/memory: "500mb" on MachineSet and capacity.cluster-autoscaler.kubernetes.io/memory: "200mb" on MachineDeployment, and status.capacity.memory = something else But ideally this example should go somewhere else, not in Node Labels and Taints paragraph. |
||
|
||
### Security Model | ||
|
||
This feature will require the service account associated with the cluster autoscaler to have | ||
|
@@ -318,6 +369,7 @@ office hours meeting: | |
|
||
## Implementation History | ||
|
||
- [X] 05/08/2025: Updated proposal to enable architecture- and OS- aware auto-scale from 0 | ||
- [X] 09/12/2024: Added section on Implementation Status | ||
- [X] 01/31/2023: Updated proposal to include annotation changes | ||
- [X] 06/10/2021: Proposed idea in an issue or [community meeting] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to double check, are we sure we want to expose those types in a non-versioned package?
Also, given this is going to be part of the contract, shouldn't we use a package name that enforce this link (and that we can eventually expand with other contract related stuff)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We were considering not to have these types in the cluster api repo, but delegating implementations to define their based on the contract defined here
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see #11962 (comment) (I think we should not host those type in CAPI, nor be prescriptive on how they are defined in go)