Skip to content

DNM: Kueue Operator API Review Part 2 #2250

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 42 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 24 commits
Commits
Show all changes
42 commits
Select commit Hold shift + click to select a range
291568b
add kueue api for api-review
kannon92 Mar 7, 2025
1ae6862
kueue api review
kannon92 Mar 7, 2025
7d9e95f
update api with linter
kannon92 Mar 7, 2025
00a45be
api updates
kannon92 Mar 7, 2025
03bd30c
make linter pass for kueue
kannon92 Mar 10, 2025
6bd4e6f
update apis
kannon92 Mar 11, 2025
cab346d
resolving api changes
kannon92 Mar 12, 2025
5c036c0
option 3
kannon92 Mar 13, 2025
5d4942c
address comments
kannon92 Mar 13, 2025
4a5e0d0
api updates for unsupported
kannon92 Mar 14, 2025
23e19e4
drop metrics from api as we can detect without api
kannon92 Mar 14, 2025
84aeac6
Drop experimental options
kannon92 Mar 17, 2025
d515c66
shorten framework names
kannon92 Mar 17, 2025
c14027a
add listmapkey
kannon92 Mar 17, 2025
fd627fb
add godocs and fix schema generation
kannon92 Mar 18, 2025
0aba7b7
address final comments
kannon92 Mar 19, 2025
30a20f1
joel comments
kannon92 Mar 19, 2025
b295d9b
address joel feedback
kannon92 Mar 20, 2025
89f746c
add two apis, one left
kannon92 Mar 26, 2025
ededf97
update api
kannon92 Mar 28, 2025
929d011
api comment updates
kannon92 Mar 28, 2025
1acd09a
fair sharing classical
kannon92 Mar 28, 2025
6f95ae7
update and lint
kannon92 Mar 31, 2025
64c3bd7
api typo updates
kannon92 Apr 1, 2025
b367dd2
use pointers everywhere
kannon92 Apr 2, 2025
49b7815
api updates
kannon92 Apr 2, 2025
b17e4c5
updates
kannon92 Apr 2, 2025
cb9cc85
update api with bryce comments
kannon92 Apr 4, 2025
fb3b660
fix linter
kannon92 Apr 7, 2025
8741f7b
add bryce comments
kannon92 Apr 7, 2025
b8d2210
updates with apis
kannon92 Apr 9, 2025
40b549a
update docs
kannon92 Apr 9, 2025
e1d7706
bryce godoc comments
kannon92 Apr 9, 2025
12d66d5
bryce minor comments
kannon92 Apr 9, 2025
1b672ee
godocs
kannon92 Apr 10, 2025
bedbe63
consistent api
kannon92 Apr 10, 2025
2bf301a
more comments
kannon92 Apr 10, 2025
bde4a92
remove omit empty
kannon92 Apr 13, 2025
185dd84
fix issues with discoverability
kannon92 Apr 14, 2025
9ca52b8
address feedback
kannon92 Apr 14, 2025
405ba5e
update with latest godocs
kannon92 Apr 15, 2025
e92caa9
default godocs
kannon92 Apr 15, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
504 changes: 504 additions & 0 deletions openapi/generated_openapi/zz_generated.openapi.go

Large diffs are not rendered by default.

288 changes: 288 additions & 0 deletions openapi/openapi.json

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions operator/v1alpha1/register.go
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,8 @@ func addKnownTypes(scheme *runtime.Scheme) error {
&EtcdBackupList{},
&ClusterVersionOperator{},
&ClusterVersionOperatorList{},
&Kueue{},
&KueueList{},
)

return nil
Expand Down
277 changes: 277 additions & 0 deletions operator/v1alpha1/types_kueue.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,277 @@
package v1alpha1

import (
operatorv1 "github.com/openshift/api/operator/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// Kueue is the CRD to represent the Kueue operator
// This CRD defines the configuration that the Kueue
// Compatibility level 4: No compatibility is provided, the API can change at any point for any reason. These capabilities should not be used by applications needing long term support.
// +openshift:compatibility-gen:level=4
// +kubebuilder:object:root=true
// +kubebuilder:resource:path=kueue,scope=Cluster
// +k8s:openapi-gen=true
// +genclient
// +genclient:nonNamespaced
// +kubebuilder:storageversion
// +kubebuilder:subresource:status
// +kubebuilder:validation:XValidation:rule="self.metadata.name == 'cluster'",message="Kueue is a singleton, .metadata.name must be 'cluster'"
type Kueue struct {
metav1.TypeMeta `json:",inline"`
// metadata for kueue
// +optional
metav1.ObjectMeta `json:"metadata,omitempty"`

// spec holds user settable values for configuration
// +required
Spec KueueOperandSpec `json:"spec"`
// status holds observed values from the cluster. They may not be overridden.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is KueueSpec an option? It has an operator object in it, and also we use KueueStatus

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JoelSpeed @everettraven
WDYT of this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is a stutter..

So should we just make it Spec and Status actually?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the user perspective, for the Kueue configuration it's spec.config right? Where's the stutter?

// +optional
Status KueueStatus `json:"status,omitempty"`
}

type KueueOperandSpec struct {
operatorv1.OperatorSpec `json:",inline"`
// config is the desired configuration
// for the Kueue operator.
// +required
Config KueueConfiguration `json:"config"`
}

type KueueConfiguration struct {
// integrations is a required field that configures the Kueue's workload integrations.
// Kueue has both standard integrations, known as job frameworks, and external integrations known as external frameworks.
// Kueue will only manage workloads that correspond to the specified integrations.
// +required
Integrations Integrations `json:"integrations"`
// queueLabelPolicy controls how kueue manages workloads
// The default behavior of Kueue will manage workloads that have a queue-name label.
// This field is optional.
// +optional
QueueLabelPolicy QueueLabelPolicy `json:"queueLabelPolicy,omitempty"`
// kueueGangSchedulingPolicy controls how Kueue admits workloads.
// Gang Scheduling is the act of all or nothing scheduling.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to expand what this means, in terms of eviction.

Suggested change
// Gang Scheduling is the act of all or nothing scheduling.
// Gang Scheduling is the act of all or nothing scheduling,
// where workloads do not become ready within a certain period, they may be evicted and later retried.

// Kueue provides this ability.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this line adds particularly much for an end user, I think it's kinda implicit?

// This field is optional.
// +optional
KueueGangSchedulingPolicy KueueGangSchedulingPolicy `json:"kueueGangSchedulingPolicy,omitempty"`
// preemption is the process of evicting one or more admitted Workloads to accommodate another Workload.
// Kueue has classical premption and preemption via fair sharing.
// +optional
Preemption Preemption `json:"preemption,omitempty"`
}

// KueueStatus defines the observed state of Kueue
type KueueStatus struct {
operatorv1.OperatorStatus `json:",inline"`
}

// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// KueueList contains a list of Kueue
//
// Compatibility level 4: No compatibility is provided, the API can change at any point for any reason. These capabilities should not be used by applications needing long term support.
// +openshift:compatibility-gen:level=4
type KueueList struct {
metav1.TypeMeta `json:",inline"`
// metadata for the list
// +optional
metav1.ListMeta `json:"metadata,omitempty"`
// items is a slice of Kueue
// this is a cluster scoped resource and there can only be 1 Kueue
// +kubebuilder:validation:MaxItems=1
// +required
Items []Kueue `json:"items"`
}

// +kubebuilder:validation:Enum=BatchJob;RayJob;RayCluster;JobSet;MPIJob;PaddleJob;PytorchJob;TFJob;XGBoostJob;AppWrapper;Pod;Deployment;StatefulSet;LeaderWorkerSet
type KueueIntegration string

const (
KueueIntegrationBatchJob KueueIntegration = "BatchJob"
KueueIntegrationRayJob KueueIntegration = "RayJob"
KueueIntegrationRayCluster KueueIntegration = "RayCluster"
KueueIntegrationJobSet KueueIntegration = "JobSet"
KueueIntegrationMPIJob KueueIntegration = "MPIJob"
KueueIntegrationPaddeJob KueueIntegration = "PaddeJob"
KueueIntegrationPyTorchJob KueueIntegration = "PyTorchJob"
KueueIntegrationTFJob KueueIntegration = "TFJob"
KueueIntegrationXGBoostJob KueueIntegration = "XGBoostJob"
KueueIntegrationAppWrapper KueueIntegration = "AppWrapper"
KueueIntegrationPod KueueIntegration = "Pod"
KueueIntegrationDeployment KueueIntegration = "Deployment"
KueueIntegrationStatefulSet KueueIntegration = "StatefulSet"
KueueIntegrationLeaderWorkerSet KueueIntegration = "LeaderWorkerSet"
)

// This is the GVR for an external framework.
// Controller runtime requires this in this format
// for api discoverability.
type ExternalFramework struct {
// group is the API group of the externalFramework.
// Must be a valid DNS 1123 subdomain consisting of of lower-case alphanumeric characters,
// hyphens and periods, of at most 253 characters in length.
// Each period separated segment within the subdomain must start and end with an alphanumeric character.
// +kubebuilder:validation:MaxLength=253
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:XValidation:rule="self.size() == 0 || !format.dns1123Label().validate(self).hasValue()",message="a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty sure you won't need the size == 0 check, since you have min length 1

// +required
Group string `json:"group"`
// resource is the Resource type of the external framework.
// Resource types are lowercase and plural (e.g. pods, deployments).
// Must be a valid DNS 1123 label consisting of a lower-case alphanumeric string
// and hyphens of at most 63 characters in length.
// The value must start and end with an alphanumeric character.
// +kubebuilder:validation:MaxLength=63
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:XValidation:rule="self.size() == 0 || !format.dns1123Label().validate(self).hasValue()",message="a lowercase RFC 1123 label must consist of lower case alphanumeric characters and '-', and must start and end with an alphanumeric character."
// +required
Resource string `json:"resource"`
// version is the version of the api (e.g. v1alpha1, v1beta1, v1).
// Must be a valid DNS 1035 label consisting of a lower-case alphanumeric string
// and hyphens of at most 63 characters in length.
// The value must start with an alphabetic character and end with an alphanumeric character.
// +kubebuilder:validation:MaxLength=63
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:XValidation:rule="self.size() == 0 || !format.dns1035Label().validate(self).hasValue()",message="a lowercase RFC 1035 label must consist of lower case alphanumeric characters, '-' or '.', and must start with an alphabetic character and end with an alphanumeric character."
// +required
Version string `json:"version"`
}

// This is the integrations for Kueue.
// Kueue uses these apis to determine
// which jobs will be managed by Kueue.
type Integrations struct {
// frameworks are a unique list of names to be enabled.
// This is required and must have at least one element.
// Each framework represents a type of job that Kueue will manage.
// Frameworks are a list of frameworks that Kueue has support for.
// The allowed values are BatchJob, RayJob, RayCluster, JobSet, MPIJob, PaddleJob, PytorchJob, TFJob, XGBoostJob, AppWrapper, Pod, Deployment, StatefulSet and LeaderWorkerSet.
// +kubebuilder:validation:MaxItems=14
// +kubebuilder:validation:MinItems=1
// +kubebuilder:validation:XValidation:rule="self.all(x, self.exists_one(y, x == y))",message="each item in frameworks must be unique"
// +listType=atomic
// +required
Frameworks []KueueIntegration `json:"frameworks"`
// externalFrameworks are a list of GroupVersionResources
// that are managed for Kueue by external controllers.
// These are optional and should only be used if you have an external controller
// that integrates with Kueue.
// +listType=atomic
// +kubebuilder:validation:MaxItems=32
// +optional
ExternalFrameworks []ExternalFramework `json:"externalFrameworks,omitempty"`
// labelKeysToCopy are a list of label keys that are copied once a workload is created.
// These keys are persisted to the internal Kueue workload object.
// If not specified, only the Kueue labels will be copied.
// +kubebuilder:validation:MaxItems=64
// +listType=atomic
// +optional
LabelKeysToCopy []LabelKeys `json:"labelKeysToCopy,omitempty"`
}

type LabelKeys struct {
// key is the label key
// A label key must be a valid qualified name consisting of a lower-case alphanumeric string,
// and hyphens of at most 63 characters in length.
// The name must start and end with an alphanumeric character.
// The name may be optionally prefixed with a subdomain consisting of lower-case alphanumeric characters,
// hyphens and periods, of at most 253 characters in length.
// Each period separated segment within the subdomain must start and end with an alphanumeric character.
// The optional prefix and the name are separate by a forward slash (/).
// +kubebuilder:validation:MaxLength=317
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:XValidation:rule="!format.qualifiedName().validate(self).hasValue()",message="a qualified name must consist of a lower-case alphanumeric and hyphenated string of at most 63 characters in length, starting and ending with alphanumeric chracters. The name may be optionally prefixed with a subdomain consisting of lower-case alphanumeric characters, hyphens and periods, of at most 253 characters in length. Each period separated segment within the subdomain must start and end with an alphanumeric character."
// +optional
Key string `json:"key,omitempty"`
}

// +kubebuilder:validation:Enum=ByWorkload;Disabled
type KueueGangSchedulingPolicyOptions string

const (
KueueGangSchedulingPolicyEvictNotReadyWorkloads KueueGangSchedulingPolicyOptions = "ByWorkload"
KueueGangSchedulingPolicyDisabled KueueGangSchedulingPolicyOptions = "Disabled"
)

// +kubebuilder:validation:Enum=Parallel;Sequential
type KueueGangSchedulingAdmissionOptions string

const (
KueueGangSchedulingAdmissionOptionsSequential KueueGangSchedulingAdmissionOptions = "Sequential"
KueueGangSchedulingAdmissionOptionsParallel KueueGangSchedulingAdmissionOptions = "Parallel"
)

// Kueue provides the ability to admit workloads all in one (gang admission)
// and evicts workloads if they are not ready within a specific time.
type KueueGangSchedulingPolicy struct {
// policy allows for changing the kinds of gang scheduling Kueue does.
// This is an optional field.
// The allowed values are ByWorkload and Disabled.
// The default value will be Disabled.
// ByWorkload allows for configuration how admission is performed
// for Kueue.
// +optional
Policy KueueGangSchedulingPolicyOptions `json:"policy"`
// byWorkload controls how admission is done.
// When admission is set to Sequential, only pods from the currently processing workload will be admitted.
// Once all pods from the current workload are admitted, and ready, Kueue will process the next workload.
// Sequential processing may slow down admission when the cluster has sufficient capacity for multiple workloads,
// but provides a higher guarantee of workloads scheduling all pods together successfully.
// When set to Parallel, pods from any workload will be admitted at any time.
// This may lead to a deadlock where workloads are in contention for cluster capacity and
// pods from another workload having successfully scheduled prevent pods from the current workload scheduling.
// +kubebuilder:validation:XValidation:rule="self.policy==ByWorkload",message="byWorkload is only valid if policy equals ByWorkload"
// +optional
ByWorkload KueueGangSchedulingAdmissionOptions `json:"byWorkload"`
}

// +kubebuilder:validation:Enum=QueueNameRequired;QueueNameOptional
type QueueLabelNamePolicy string

const (
QueueLabelNamePolicyRequired QueueLabelNamePolicy = "QueueNameRequired"
QueueLabelNamePolicyOptional QueueLabelNamePolicy = "QueueNameOptional"
)

type QueueLabelPolicy struct {
// queueLabelPolicy controls whether or not Kueue reconciles
// jobs that don't set the label kueue.x-k8s.io/queue-name.
// The allowed values are QueueNameRequired and QueueNameOptional.
// If set to QueueNameRequired, then those jobs will be suspended and never started unless
// they are assigned a queue and eventually admitted. This also applies to
// jobs created before starting the kueue controller.
// Defaults to QueueNameRequired; therefore, those jobs are not managed and if they are created
// unsuspended, they will start immediately.
// +optional
QueueLabelPolicy QueueLabelNamePolicy `json:"queueLabelPolicy"`
}

// +kubebuilder:validation:Enum=Classical;FairSharing
type PreemptionStrategy string

const (
PreemeptionStrategyClassical PreemptionStrategy = "Classical"
PreemeptionStrategyFairsharing PreemptionStrategy = "FairSharing"
)

type Preemption struct {
// preemptionStrategy are the types of preemption kueue allows.
// Kueue has two types of preemption: classical and fair sharing.
// Classical means that an incoming workload, which does
// not fit within the unusued quota, is eligible to issue preemptions
// when the requests of the workload are below the
// resource flavor's nominal quota or borrowWithinCohort is enabled
// on the Cluster Queue.
// FairSharing is a more heavy weight algorithm.
// ClusterQueues with pending Workloads can preempt other Workloads
// in their cohort until the preempting ClusterQueue
// obtains an equal or weighted share of the borrowable resources.
// The borrowable resources are the unused nominal quota
// of all the ClusterQueues in the cohort.
// +optional
PreemptionStrategy PreemptionStrategy `json:"preemptionStrategy"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the upstream configuration allows configuring multiple preemption strategies that run in a chain. Are we not planning to allow that and forcing the choice of one or the other?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TLDR it is really hard to explain those policies and the customer ask right now is to enable the kueue defaults for FairSharing. Not to provide all possible items.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make sure that we have a structure that allows us in the future to add those options in a non-breaking way.

If you have to expose those options in the future, how would you add that into this? Would this become a discriminated union?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking I can add a new enum and set the values I need without having to explain the kueue APIs as is.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have 1 ask for enabling FairSharing from IBM and they are happy with the defaults. If someone really wants to tune it and figure out a better strategy I think we should push them to engage with upstream. And we can always default to a setting in our controller.

I really want to avoid having to explain and document a very complicated and quickly changing part of the API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we are saying you have to support it now. We are looking to get an idea of how, if a request for this comes in, you plan to enable that functionality in a non-breaking way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If someone really wants to tune it and figure out a better strategy I think we should push them to engage with upstream

Am I correct in understanding this as essentially boiling down to "if you don't like our opinions, use the upstream version"?

Copy link
Author

@kannon92 kannon92 Apr 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite. So upstream has this to say about this Fair sharing policies.

Different preemptionStrategies can lead to less or more preemptions under specific scenarios. These are the factors you should consider when configuring preemptionStrategies:

  • Tolerance to disruptions, in particular when single Workloads use a significant amount of the borrowable resources.
  • Speed of convergence, in other words, how important is it to reach a steady fair state as soon as possible.
  • Overall utilization, because certain strategies might reduce the utilization of the cluster in the pursue of fairness.

I was thinking that if a user really needs to configure this, they can figure out how they want it configured via upstream, bring up a RFE and we can support their request. If we expose the enums from Kueue, we are hoping that these fair sharing enums are not going to be deprecated. I wanted a high level API over this so we can default this over releases rather than having a direct api dependency.

I can replicate https://github.com/kubernetes-sigs/kueue/blob/main/apis/config/v1beta1/configuration_types.go#L445 exactly in this API if you think that would be best.

We can add a fair sharing struct here and copy these enums from upstream. And leave it to the cluster admin to figure out the best way to configure this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other complication is https://github.com/kubernetes-sigs/kueue/pull/4252/files.

Looking at this, it actually seems that maybe our Preemption Policy may not be the best name as this would be fair sharing without any prememption.

}
Loading