Commit b1a2002

Add Autopilot documentation to website (#204)
* add MLBatch and Autopilot to main AppWrapper README
* document Autopilot configuration
1 parent 90d5af8 commit b1a2002

3 files changed, +88 -8 lines changed

README.md

Lines changed: 19 additions & 1 deletion
@@ -26,11 +26,29 @@ are not taken by the primary resource controllers within specified deadlines,
2626
the AppWrapper controller will orchestrate workload-level retries and
2727
resource deletion to ensure that either the workload returns to a
2828
healthy state or is cleanly removed from the cluster and its quota
29-
freed for use by other workloads. For details on customizing and
29+
freed for use by other workloads. If [Autopilot](https://github.com/ibm/autopilot)
30+
is also being used on the cluster, the AppWrapper controller can be configured
31+
to automatically inject Node anti-affinities into Pods and to trigger
32+
retries when Pods in already running workloads are using resources
33+
that Autopilot has tagged as unhealthy. For details on customizing and
3034
configuring these fault tolerance capabilities, please see the
3135
[Fault Tolerance](https://project-codeflare.github.io/appwrapper/arch-controller/)
3236
section of our website.
3337

38+
AppWrappers are designed to be used as part of a fully open source software stack
39+
to run production batch workloads on Kubernetes and OpenShift. The [MLBatch](https://github.com/project-codeflare/mlbatch)
40+
project leverages [Kueue](https://kueue.sigs.k8s.io), the [Kubeflow Training
41+
Operator](https://www.kubeflow.org/docs/components/training/),
42+
[KubeRay](https://docs.ray.io/en/latest/cluster/kubernetes/index.html), and the
43+
[Codeflare Operator](https://github.com/project-codeflare/codeflare-operator)
44+
from [Red Hat OpenShift
45+
AI](https://www.redhat.com/en/technologies/cloud-computing/openshift/openshift-ai).
46+
MLBatch enables [AppWrappers](https://project-codeflare.github.io/appwrapper/)
47+
and adds
48+
[Coscheduler](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/coscheduling/README.md).
49+
MLBatch includes a number of configuration steps to help these components work
50+
in harmony and support large workloads on large clusters.
51+
3452
## Installation
3553

3654
To install the latest release of AppWrapper in a Kubernetes cluster with Kueue already installed

site/_pages/arch-fault-tolerance.md

Lines changed: 45 additions & 4 deletions
@@ -8,7 +8,11 @@ classes: wide
88

99
The AppWrapper controller is designed to enhance and extend the fault
1010
tolerance capabilities provided by the controllers of its wrapped
11-
resources. Throughout the execution of a workload, the AppWrapper
11+
resources. If [Autopilot](https://github.com/ibm/autopilot) is deployed on the
12+
cluster, the AppWrapper controller can automate both the injection of
13+
Node anti-affinities to avoid scheduling workloads on unhealthy Nodes
14+
and the migration of running workloads away from unhealthy Nodes.
15+
Throughout the execution of a workload, the AppWrapper
1216
controller monitors both the status of the contained top-level
1317
resources and the status of all Pods created by the workload. If a
1418
workload is determined to be *unhealthy*, the AppWrapper controller
@@ -21,7 +25,6 @@ engineered to ensure that it will always make progress and eventually
2125
succeed in completely removing all Pods and other resources created by
2226
a failed workload.
2327

24-
2528
```mermaid!
2629
---
2730
title: Overview of AppWrapper Fault Tolerance Phase Transitions
@@ -89,6 +92,8 @@ following conditions are true:
8992
number of Pods to reach the `Pending` state.
9093
+ It takes longer than the `WarmupGracePeriod` for the expected
9194
number of Pods to reach the `Running` state.
95+
+ A non-zero number of `Running` Pods are using resources
96+
that Autopilot has tagged as unhealthy.
9297
+ A top-level resource is missing.
9398
+ The status information of a batch/v1 Job or PyTorchJob indicates
9499
that it has failed.
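
The Autopilot-related condition in the list above can be illustrated with a minimal Python sketch. It is not the controller's actual implementation: the function name, the dictionary shapes for Pods and Nodes, and the use of integer resource quantities are all assumptions made for illustration. The configuration map mirrors the default `resourceUnhealthyConfig` documented later in this file.

```python
# Hypothetical sketch: resource name -> {Node label key: value marking it unhealthy}.
# Mirrors the default `resourceUnhealthyConfig`; the data shapes are assumptions.
RESOURCE_UNHEALTHY_CONFIG = {
    "nvidia.com/gpu": {"autopilot.ibm.com/gpuhealth": "ERR"},
}


def uses_unhealthy_resources(pods, node_labels, config):
    """Return True if any Running Pod requests a resource tagged unhealthy.

    pods:        list of dicts with "phase", "node", and "requests" keys
                 (requests map resource names to integer quantities).
    node_labels: dict mapping Node name -> dict of that Node's labels.
    config:      dict shaped like RESOURCE_UNHEALTHY_CONFIG.
    """
    for pod in pods:
        # Only Pods that are already Running count toward this condition.
        if pod["phase"] != "Running":
            continue
        labels = node_labels.get(pod["node"], {})
        for resource, unhealthy_labels in config.items():
            # Skip resources the Pod does not actually request.
            if pod["requests"].get(resource, 0) <= 0:
                continue
            # The Node carries a label value that marks this resource unhealthy.
            if any(labels.get(k) == v for k, v in unhealthy_labels.items()):
                return True
    return False
```

Under these assumptions, a `Running` Pod that requests `nvidia.com/gpu` on a Node labeled `autopilot.ibm.com/gpuhealth: ERR` makes the workload unhealthy, while a CPU-only Pod on the same Node does not.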
@@ -97,7 +102,7 @@ If a workload is determined to be unhealthy by one of the first three
97102
Pod-level conditions above, the AppWrapper controller first waits for
98103
a `FailureGracePeriod` to allow the primary resource controller an
99104
opportunity to react and return the workload to a healthy state. The
100-
`FailureGracePeriod` is elided by the last two conditions because the
105+
`FailureGracePeriod` is elided for the remaining conditions because the
101106
primary resource controller is not expected to take any further
102107
action. If the `FailureGracePeriod` passes and the workload is still
103108
unhealthy, the AppWrapper controller will *reset* the workload by
@@ -112,7 +117,8 @@ then the AppWrapper moves into a `Failed` state and its resources are deleted
112117
(thus finally releasing its quota). If at any time during this retry loop,
113118
an AppWrapper is suspended (i.e., Kueue decides to preempt the AppWrapper),
114119
the AppWrapper controller will respect this request by proceeding to delete
115-
the resources.
120+
the resources. Workload resets that are initiated in response to Autopilot
121+
are subject to the `RetryLimit` but do not increment the `retryCount`.
116122

117123
To support debugging `Failed` workloads, an annotation can be added to an
118124
AppWrapper that adds a `DeletionOnFailureGracePeriod` between the time the
@@ -121,6 +127,13 @@ begins. Since the AppWrapper continues to consume quota during this delayed dele
121127
this annotation should be used sparingly and only when interactive debugging of
122128
the failed workload is being actively pursued.
123129

130+
An AppWrapper can be annotated as `autopilotExempt` to disable the
131+
injection of Autopilot Node anti-affinities into its Pods and the
132+
automatic migration of its Pods away from Nodes with Autopilot tagged
133+
unhealthy resources. This annotation should only be used for workloads
134+
that will be closely monitored by other means to identify and recover from
135+
unhealthy Nodes in the cluster.
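
As a rough illustration of how the exemption could be read from an AppWrapper's metadata, here is a minimal Python sketch. The annotation key is the one listed in the table of this page; the function name, the dictionary representation of the AppWrapper, and the assumption that the annotation value is the string `"true"` are hypothetical and not confirmed by the source.

```python
# Annotation key taken from the configuration table in this document.
AUTOPILOT_EXEMPT_ANNOTATION = "workload.codeflare.dev.appwrapper/autopilotExempt"


def is_autopilot_exempt(appwrapper: dict) -> bool:
    """Return True if the AppWrapper opts out of Autopilot handling.

    Assumption: the annotation value is the string "true" (case-insensitive).
    A missing annotation means the default, which the table lists as false.
    """
    annotations = appwrapper.get("metadata", {}).get("annotations") or {}
    value = annotations.get(AUTOPILOT_EXEMPT_ANNOTATION, "false")
    return value.strip().lower() == "true"
```

With this sketch, an AppWrapper without the annotation is treated as not exempt, which matches the documented default.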
136+
124137
All child resources for an AppWrapper that successfully completed will be automatically
125138
deleted after a `SuccessTTL` after the AppWrapper entered the `Succeeded` state.
126139

@@ -141,7 +154,35 @@ can be used to customize them.
141154
| DeletionOnFailureGracePeriod | 0 Seconds | workload.codeflare.dev.appwrapper/deletionOnFailureGracePeriodDuration |
142155
| ForcefulDeletionGracePeriod | 10 Minutes | workload.codeflare.dev.appwrapper/forcefulDeletionGracePeriodDuration |
143156
| SuccessTTL | 7 Days | workload.codeflare.dev.appwrapper/successTTLDuration |
157+
| AutopilotExempt | false | workload.codeflare.dev.appwrapper/autopilotExempt |
144158
| GracePeriodMaximum | 24 Hours | Not Applicable |
145159

146160
The `GracePeriodMaximum` imposes a system-wide upper limit on all other grace periods to
147161
limit the potential impact of user-added annotations on overall system utilization.
162+
163+
The set of resources monitored by Autopilot and the associated labels that identify unhealthy
164+
resources can be customized as part of the AppWrapper operator's configuration. The default
165+
Autopilot configuration used by the controller is:
```yaml
autopilot:
  injectAntiAffinities: true
  migrateImpactedWorkloads: true
  resourceUnhealthyConfig:
    nvidia.com/gpu:
      autopilot.ibm.com/gpuhealth: ERR
```
174+
175+
The `resourceUnhealthyConfig` is a map from resource names to labels. For this example
176+
configuration, for exactly those Pods that have a non-zero resource request for
177+
`nvidia.com/gpu`, the AppWrapper controller will automatically inject the stanza below
178+
into the `affinity` portion of their Spec.
```yaml
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
    - matchExpressions:
      - key: autopilot.ibm.com/gpuhealth
        operator: NotIn
        values:
        - ERR
```
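
The mapping from `resourceUnhealthyConfig` to the injected stanza can be sketched in Python as follows. This is only an illustration under stated assumptions, not the controller's code: the function name is hypothetical, resource requests are simplified to integers, and merging with any affinity already present on the Pod is not handled.

```python
from typing import Optional

# Mirrors the default `resourceUnhealthyConfig` shown above:
# resource name -> {Node label key: label value that marks it unhealthy}.
RESOURCE_UNHEALTHY_CONFIG = {
    "nvidia.com/gpu": {"autopilot.ibm.com/gpuhealth": "ERR"},
}


def build_node_anti_affinity(requests: dict, config: dict) -> Optional[dict]:
    """Return a nodeAffinity stanza for a Pod, or None if none applies.

    `requests` maps resource names to integer quantities. This is a
    simplification: real Kubernetes quantities are strings such as "1".
    """
    expressions = []
    for resource, labels in config.items():
        # Only Pods with a non-zero request for the resource are affected.
        if requests.get(resource, 0) > 0:
            for key, value in sorted(labels.items()):
                expressions.append(
                    {"key": key, "operator": "NotIn", "values": [value]}
                )
    if not expressions:
        return None
    # Expressions within one nodeSelectorTerm are ANDed, so the Node must
    # avoid every unhealthy label that applies to the Pod.
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{"matchExpressions": expressions}]
            }
        }
    }
```

Calling `build_node_anti_affinity({"nvidia.com/gpu": 1}, RESOURCE_UNHEALTHY_CONFIG)` yields a dictionary with the same structure as the YAML stanza above, while a Pod that requests no GPUs gets `None`.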

site/_pages/overview.md

Lines changed: 24 additions & 3 deletions
@@ -21,6 +21,27 @@ additional level of automatic fault detection and recovery. The AppWrapper
2121
controller monitors the health of the workload and if corrective actions
2222
are not taken by the primary resource controllers within specified deadlines,
2323
the AppWrapper controller will orchestrate workload-level retries and
24-
resource deletions to ensure that either the workload returns to a
25-
healthy state or it is cleanly removed from the cluster and its quota
26-
freed for use by other workloads.
24+
resource deletion to ensure that either the workload returns to a
25+
healthy state or is cleanly removed from the cluster and its quota
26+
freed for use by other workloads. If [Autopilot](https://github.com/ibm/autopilot)
27+
is also being used on the cluster, the AppWrapper controller can be configured
28+
to automatically inject Node anti-affinities into Pods and to trigger
29+
retries when Pods in already running workloads are using resources
30+
that Autopilot has tagged as unhealthy. For details on customizing and
31+
configuring these fault tolerance capabilities, please see the
32+
[Fault Tolerance](https://project-codeflare.github.io/appwrapper/arch-controller/)
33+
section of our website.
34+
35+
AppWrappers are designed to be used as part of a fully open source software stack
36+
to run production batch workloads on Kubernetes and OpenShift. The [MLBatch](https://github.com/project-codeflare/mlbatch)
37+
project leverages [Kueue](https://kueue.sigs.k8s.io), the [Kubeflow Training
38+
Operator](https://www.kubeflow.org/docs/components/training/),
39+
[KubeRay](https://docs.ray.io/en/latest/cluster/kubernetes/index.html), and the
40+
[Codeflare Operator](https://github.com/project-codeflare/codeflare-operator)
41+
from [Red Hat OpenShift
42+
AI](https://www.redhat.com/en/technologies/cloud-computing/openshift/openshift-ai).
43+
MLBatch enables [AppWrappers](https://project-codeflare.github.io/appwrapper/)
44+
and adds
45+
[Coscheduler](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/coscheduling/README.md).
46+
MLBatch includes a number of configuration steps to help these components work
47+
in harmony and support large workloads on large clusters.
