@@ -8,7 +8,11 @@ classes: wide
  The AppWrapper controller is designed to enhance and extend the fault
  tolerance capabilities provided by the controllers of its wrapped
- resources. Throughout the execution of a workload, the AppWrapper
+ resources. If [Autopilot](https://github.com/ibm/autopilot) is deployed on the
+ cluster, the AppWrapper controller can automate both the injection of
+ Node anti-affinities to avoid scheduling workloads on unhealthy Nodes
+ and the migration of running workloads away from unhealthy Nodes.
+ Throughout the execution of a workload, the AppWrapper
  controller monitors both the status of the contained top-level
  resources and the status of all Pods created by the workload. If a
  workload is determined to be *unhealthy*, the AppWrapper controller
@@ -21,7 +25,6 @@ engineered to ensure that it will always make progress and eventually
  succeed in completely removing all Pods and other resources created by
  a failed workload.

-
  ```mermaid!
  ---
  title: Overview of AppWrapper Fault Tolerance Phase Transitions
@@ -89,6 +92,8 @@ following conditions are true:
  number of Pods to reach the `Pending` state.
  + It takes longer than the `WarmupGracePeriod` for the expected
  number of Pods to reach the `Running` state.
+ + A non-zero number of `Running` Pods are using resources
+ that Autopilot has tagged as unhealthy.
  + A top-level resource is missing.
  + The status information of a batch/v1 Job or PyTorchJob indicates
  that it has failed.
@@ -97,7 +102,7 @@ If a workload is determined to be unhealthy by one of the first three
  Pod-level conditions above, the AppWrapper controller first waits for
  a `FailureGracePeriod` to allow the primary resource controller an
  opportunity to react and return the workload to a healthy state. The
- `FailureGracePeriod` is elided by the last two conditions because the
+ `FailureGracePeriod` is elided for the remaining conditions because the
  primary resource controller is not expected to take any further
  action. If the `FailureGracePeriod` passes and the workload is still
  unhealthy, the AppWrapper controller will *reset* the workload by
@@ -112,7 +117,8 @@ then the AppWrapper moves into a `Failed` state and its resources are deleted
  (thus finally releasing its quota). If at any time during this retry loop,
  an AppWrapper is suspended (i.e., Kueue decides to preempt the AppWrapper),
  the AppWrapper controller will respect this request by proceeding to delete
- the resources.
+ the resources. Workload resets that are initiated in response to Autopilot
+ are subject to the `RetryLimit` but do not increment the `retryCount`.

  To support debugging `Failed` workloads, an annotation can be added to an
  AppWrapper that adds a `DeletionOnFailureGracePeriod` between the time the
@@ -121,6 +127,13 @@ begins. Since the AppWrapper continues to consume quota during this delayed dele
  this annotation should be used sparingly and only when interactive debugging of
  the failed workload is being actively pursued.

+ An AppWrapper can be annotated as `autopilotExempt` to disable the
+ injection of Autopilot Node anti-affinities into its Pods and the
+ automatic migration of its Pods away from Nodes with Autopilot-tagged
+ unhealthy resources. This annotation should only be used for workloads
+ that will be closely monitored by other means to identify and recover from
+ unhealthy Nodes in the cluster.
+
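As a minimal sketch of the exemption described above, the fragment below shows the annotation placed on an AppWrapper's metadata. The annotation key is the one this document lists for `AutopilotExempt`; the `apiVersion`, the name `sample-job`, and the quoted value `"true"` are assumptions made for illustration and should be checked against the deployed AppWrapper version.

```yaml
# Hypothetical AppWrapper; only the annotation key is taken from this document.
apiVersion: workload.codeflare.dev/v1beta2  # assumed API version
kind: AppWrapper
metadata:
  name: sample-job  # hypothetical name
  annotations:
    # Assumed value format: annotation values are strings, so the boolean is quoted.
    workload.codeflare.dev.appwrapper/autopilotExempt: "true"
```

Because the exemption turns off both anti-affinity injection and automatic migration, a workload annotated this way relies entirely on whatever external monitoring its owner provides.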
  All child resources for an AppWrapper that successfully completed will be automatically
  deleted once the `SuccessTTL` has elapsed after the AppWrapper entered the `Succeeded` state.

@@ -141,7 +154,35 @@ can be used to customize them.
  | DeletionOnFailureGracePeriod | 0 Seconds | workload.codeflare.dev.appwrapper/deletionOnFailureGracePeriodDuration |
  | ForcefulDeletionGracePeriod | 10 Minutes | workload.codeflare.dev.appwrapper/forcefulDeletionGracePeriodDuration |
  | SuccessTTL | 7 Days | workload.codeflare.dev.appwrapper/successTTLDuration |
+ | AutopilotExempt | false | workload.codeflare.dev.appwrapper/autopilotExempt |
  | GracePeriodMaximum | 24 Hours | Not Applicable |

  The `GracePeriodMaximum` imposes a system-wide upper limit on all other grace periods to
  limit the potential impact of user-added annotations on overall system utilization.
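To illustrate how the annotations in the table might be applied, the fragment below overrides two of the defaults on a single AppWrapper. The annotation keys come from the table; the value format (Go-style duration strings such as `"10m"`) is an assumption, not something this document confirms.

```yaml
# Hypothetical metadata fragment for an AppWrapper.
metadata:
  annotations:
    # Assumed value format: a Go-style duration string.
    # Keep failed resources around for 10 minutes to allow interactive debugging.
    workload.codeflare.dev.appwrapper/deletionOnFailureGracePeriodDuration: "10m"
    # Delete child resources 24 hours after success instead of the 7-day default.
    workload.codeflare.dev.appwrapper/successTTLDuration: "24h"
```

Under the rule stated above, a grace period requested through an annotation would be limited by the `GracePeriodMaximum` (24 hours by default).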
+
+ The set of resources monitored by Autopilot and the associated labels that identify unhealthy
+ resources can be customized as part of the AppWrapper operator's configuration. The default
+ Autopilot configuration used by the controller is:
+ ```yaml
+ autopilot:
+   injectAntiAffinities: true
+   migrateImpactedWorkloads: true
+   resourceUnhealthyConfig:
+     nvidia.com/gpu:
+       autopilot.ibm.com/gpuhealth: ERR
+ ```
+
+ The `resourceUnhealthyConfig` is a map from resource names to labels. For this example
+ configuration, for exactly those Pods that have a non-zero resource request for
+ `nvidia.com/gpu`, the AppWrapper controller will automatically inject the stanza below
+ into the `affinity` portion of their Spec.
+ ```yaml
+       nodeAffinity:
+         requiredDuringSchedulingIgnoredDuringExecution:
+           nodeSelectorTerms:
+           - matchExpressions:
+             - key: autopilot.ibm.com/gpuhealth
+               operator: NotIn
+               values:
+               - ERR
+ ```
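For context, the sketch below shows where the injected stanza would sit in a complete Pod that requests `nvidia.com/gpu`. The Pod name, container name, and image are hypothetical, and the Pod is assumed to have had no `affinity` of its own before injection; how the controller combines the stanza with a pre-existing `affinity` is not covered here.

```yaml
# Hypothetical Pod after anti-affinity injection under the default configuration.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-worker  # hypothetical name
spec:
  affinity:
    nodeAffinity:  # stanza injected by the AppWrapper controller
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: autopilot.ibm.com/gpuhealth
            operator: NotIn
            values:
            - ERR
  containers:
  - name: trainer  # hypothetical container
    image: example.com/trainer:latest  # hypothetical image
    resources:
      # The non-zero nvidia.com/gpu request is what makes this Pod eligible for injection.
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
```

With `NotIn`, the scheduler will only place the Pod on Nodes whose `autopilot.ibm.com/gpuhealth` label is absent or has a value other than `ERR`.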