Commit 031b037

authored
expand and update fault tolerance documentation (#188)
1 parent 1e9b1bb commit 031b037

site/_pages/arch-fault-tolerance.md

Lines changed: 93 additions & 29 deletions
@@ -4,28 +4,106 @@ title: "Fault Tolerance"
44
classes: wide
55
---
66

7-
### Overall Design
7+
### Overview of Capabilities
88

9-
The `podSets` contained in the AppWrapper specification enable the AppWrapper
10-
controller to inject labels into every Pod that is created by
11-
the workload during its execution. Throughout the execution of the
9+
The AppWrapper controller is designed to enhance and extend the fault
10+
tolerance capabilities provided by the controllers of its wrapped
11+
resources. Throughout the execution of a workload, the AppWrapper
12+
controller monitors both the status of the contained top-level
13+
resources and the status of all Pods created by the workload. If a
14+
workload is determined to be *unhealthy*, the AppWrapper controller
15+
first waits for a bounded time period to allow the underlying
16+
controllers to correct the problem. If they fail to do so, then the
17+
AppWrapper controller will *reset* the workload by removing all
18+
created resources, and then, if the maximum number of retries has not
19+
been exceeded, recreating the workload. This reset process is carefully
20+
engineered to ensure that it will always make progress and eventually
21+
succeed in completely removing all Pods and other resources created by
22+
a failed workload.
23+
24+
25+
```mermaid!
26+
---
27+
title: Overview of AppWrapper Fault Tolerance Phase Transitions
28+
---
29+
stateDiagram-v2
30+
31+
rn : Running
32+
s : Succeeded
33+
f : Failed
34+
rt : Resetting
35+
rs : Resuming
36+
37+
%% Happy Path
38+
rn --> s
39+
40+
%% Requeuing
41+
rn --> f : Retries Exceeded
42+
rn --> rt : Workload Unhealthy
43+
rt --> rs : All Resources Removed
44+
rs --> rn : All Resources Recreated
45+
46+
classDef quota fill:lightblue
47+
class rs quota
48+
class rn quota
49+
class rt quota
50+
51+
classDef failed fill:pink
52+
class f failed
53+
54+
classDef succeeded fill:lightgreen
55+
class s succeeded
56+
```
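
The transitions in the diagram above can also be read as a lookup table. The following is a minimal sketch in Python, not the controller's actual implementation: the `Phase` enum, the `TRANSITIONS` table, and `next_phase` are hypothetical names, and the `"Workload Succeeded"` event label is an assumption because the diagram leaves the Running to Succeeded edge unlabeled.

```python
from enum import Enum


class Phase(Enum):
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"
    RESETTING = "Resetting"
    RESUMING = "Resuming"


# Edges of the state diagram, keyed by (current phase, event label).
# "Workload Succeeded" is an assumed label; the diagram does not name that edge.
TRANSITIONS = {
    (Phase.RUNNING, "Workload Succeeded"): Phase.SUCCEEDED,
    (Phase.RUNNING, "Retries Exceeded"): Phase.FAILED,
    (Phase.RUNNING, "Workload Unhealthy"): Phase.RESETTING,
    (Phase.RESETTING, "All Resources Removed"): Phase.RESUMING,
    (Phase.RESUMING, "All Resources Recreated"): Phase.RUNNING,
}


def next_phase(current: Phase, event: str) -> Phase:
    """Follow an edge of the diagram; stay in the current phase if no edge matches."""
    return TRANSITIONS.get((current, event), current)
```

In this sketch, `Succeeded` and `Failed` have no outgoing edges, so any event observed in those phases leaves the phase unchanged.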
57+
58+
### Progress Guarantees
59+
60+
When the AppWrapper controller decides to delete the resources for a
61+
workload, it proceeds through several phases. First it does a normal
62+
delete of the top-level resources, allowing the primary resource
63+
controllers time to cascade the deletion through all child resources.
64+
If they are not able to successfully delete all of the workload's Pods
65+
and resources within a `ForcefulDeletionGracePeriod`, the AppWrapper
66+
controller then initiates a *forceful* deletion of all remaining Pods
67+
and resources by deleting them with a `GracePeriod` of `0`. An
68+
AppWrapper's `ResourcesDeployed` condition will continue to
69+
be `True` until all resources and Pods are successfully deleted.
70+
71+
This process ensures that when `ResourcesDeployed` becomes `False`,
72+
which indicates to Kueue that the quota has been released, all
73+
resources created by a failed workload will have been totally removed
74+
from the cluster.
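
The deletion phases described above can be sketched as a single decision made on each reconcile pass. This is a simplified illustration in Python, not the controller's code: `deletion_step`, its parameters, and the action strings are hypothetical, and the sketch assumes the normal delete of the top-level resources has already been issued when `elapsed` starts counting.

```python
from datetime import timedelta


def deletion_step(remaining_resources: int,
                  elapsed: timedelta,
                  forceful_deletion_grace_period: timedelta):
    """Decide the next deletion action.

    Returns (action, resources_deployed), where resources_deployed mirrors the
    `ResourcesDeployed` condition: it stays True until nothing remains.
    """
    if remaining_resources == 0:
        # Everything is gone, so the condition can become False and quota is released.
        return "done", False
    if elapsed <= forceful_deletion_grace_period:
        # Still within the grace period: let the primary controllers cascade the delete.
        return "wait_for_cascade", True
    # Grace period exceeded: delete whatever remains with a GracePeriod of 0.
    return "force_delete", True
```

Because `resources_deployed` only becomes `False` in the branch where nothing remains, the sketch preserves the property that quota is never reported as released while Pods or resources still exist.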
75+
76+
### Detailed Description
77+
78+
The `podSets` contained in the AppWrapper specification enable the
79+
AppWrapper controller to inject labels into every Pod that is created
80+
by the workload during its execution. Throughout the execution of the
1281
workload, the AppWrapper controller monitors the number and health of
13-
all labeled Pods and uses this information to determine if a
14-
workload is unhealthy. A workload can be deemed *unhealthy* if any of
15-
the following conditions are true:
82+
all labeled Pods. It also watches the top-level created resources and
83+
for selected resource types understands how to interpret their status
84+
information. This information is combined to determine if a workload
85+
is unhealthy. A workload can be deemed *unhealthy* if any of the
86+
following conditions are true:
1687
+ There are a non-zero number of `Failed` Pods.
1788
+ It takes longer than `AdmissionGracePeriod` for the expected
1889
number of Pods to reach the `Pending` state.
1990
+ It takes longer than the `WarmupGracePeriod` for the expected
2091
number of Pods to reach the `Running` state.
92+
+ A top-level resource is missing.
93+
+ The status information of a batch/v1 Job or PyTorchJob indicates
94+
that it has failed.
95+
96+
If a workload is determined to be unhealthy by one of the first three
97+
Pod-level conditions above, the AppWrapper controller first waits for
98+
a `FailureGracePeriod` to allow the primary resource controller an
99+
opportunity to react and return the workload to a healthy state. The
100+
`FailureGracePeriod` is skipped for the last two conditions because
101+
primary resource controller is not expected to take any further
102+
action. If the `FailureGracePeriod` passes and the workload is still
103+
unhealthy, the AppWrapper controller will *reset* the workload by
104+
deleting its resources, waiting for a `RetryPausePeriod`, and then
105+
creating new instances of the resources.
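
The five conditions and the `FailureGracePeriod` rule above can be sketched as one check. This is a minimal illustration in Python under stated assumptions, not the controller's implementation: `WorkloadStatus`, its fields, and `check_health` are hypothetical names, and the sketch assumes the Pod counts include Pods that have already moved past the named state.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class WorkloadStatus:
    expected_pods: int
    pods_reached_pending: int   # assumed to count Pods in Pending or any later state
    pods_reached_running: int   # assumed to count Pods in Running or any later state
    failed_pods: int
    top_level_missing: bool     # a top-level resource no longer exists
    job_failed: bool            # a batch/v1 Job or PyTorchJob reports failure
    elapsed: timedelta          # time since the resources were deployed


def check_health(s: WorkloadStatus,
                 admission_grace_period: timedelta,
                 warmup_grace_period: timedelta):
    """Return (unhealthy, wait_failure_grace_period)."""
    # Resource-level conditions: the primary controller is not expected to
    # recover, so the FailureGracePeriod is skipped.
    if s.top_level_missing or s.job_failed:
        return True, False
    # Pod-level conditions: wait a FailureGracePeriod before resetting.
    if s.failed_pods > 0:
        return True, True
    if s.elapsed > admission_grace_period and s.pods_reached_pending < s.expected_pods:
        return True, True
    if s.elapsed > warmup_grace_period and s.pods_reached_running < s.expected_pods:
        return True, True
    return False, False
```

Checking the resource-level conditions first is a design choice of this sketch: when both kinds of condition hold at once, it prefers the path that does not wait.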
21106

22-
If a workload is determined to be unhealthy, the AppWrapper controller
23-
first waits for a `FailureGracePeriod` to allow the primary resource
24-
controller an opportunity to react and return the workload to a
25-
healthy state. If the `FailureGracePeriod` passes and the workload
26-
is still unhealthy, the AppWrapper controller will *reset* the workload by
27-
deleting its resources, waiting for a `RetryPausePeriod`, and then creating
28-
new instances of the resources.
29107
During this retry pause, the AppWrapper **does not** release the workload's
30108
quota; this ensures that when the resources are recreated they will still
31109
have sufficient quota to execute. The number of times an AppWrapper is reset
@@ -44,21 +122,7 @@ this annotation should be used sparingly and only when interactive debugging of
44122
the failed workload is being actively pursued.
45123

46124
All child resources for an AppWrapper that successfully completed will be automatically
47-
deleted after a `SuccessTTLPeriod` after the AppWrapper entered the `Succeeded` state.
48-
49-
When the AppWrapper controller decides to delete the resources for a workload,
50-
it proceeds through several phases. First it does a normal delete of the
51-
resources, allowing the primary resource controllers time to cascade the deletion
52-
through all child resources. If they are not able to successfully delete
53-
all of the workload's Pods and resources within a `ForcefulDeletionGracePeriod`,
54-
the AppWrapper controller then initiates a *forceful*
55-
deletion of all remaining Pods and resources by deleting them with a `GracePeriod` of `0`.
56-
An AppWrapper will continue to have its `ResourcesDeployed` condition to be
57-
`True` until all resources and Pods are successfully deleted.
58-
59-
This process ensures that when `ResourcesDeployed` becomes `False`, which
60-
indicates to Kueue that the quota has been released, all resources created by
61-
a failed workload will have been totally removed from the cluster.
125+
deleted once a `SuccessTTL` has elapsed after the AppWrapper entered the `Succeeded` state.
62126

63127
### Configuration Details
64128
