Add Autopilot documentation to website (#204)

dgrove-oss · web-flow · commit b1a2002a19da · 2024-07-19T14:40:27.000-04:00
* add MLBatch and Autopilot to main AppWrapper README
* document Autopilot configuration
diff --git a/README.md b/README.md
@@ -26,11 +26,29 @@ are not taken by the primary resource controllers within specified deadlines,
 the AppWrapper controller will orchestrate workload-level retries and
 resource deletion to ensure that either the workload returns to a
 healthy state or is cleanly removed from the cluster and its quota
-freed for use by other workloads.  For details on customizing and
+freed for use by other workloads. If [Autopilot](https://github.com/ibm/autopilot)
+is also being used on the cluster, the AppWrapper controller can be configured
+to automatically inject Node anti-affinities into Pods and to trigger
+retries when Pods in already running workloads are using resources
+that Autopilot has tagged as unhealthy. For details on customizing and
 configuring these fault tolerance capabilities, please see the
 [Fault Tolerance](https://project-codeflare.github.io/appwrapper/arch-controller/)
 section of our website.
 
+AppWrappers are designed to be used as part of a fully open source software stack
+to run production batch workloads on Kubernetes and OpenShift. The [MLBatch](https://github.com/project-codeflare/mlbatch)
+project leverages [Kueue](https://kueue.sigs.k8s.io), the [Kubeflow Training
+Operator](https://www.kubeflow.org/docs/components/training/),
+[KubeRay](https://docs.ray.io/en/latest/cluster/kubernetes/index.html), and the
+[CodeFlare Operator](https://github.com/project-codeflare/codeflare-operator)
+from [Red Hat OpenShift
+AI](https://www.redhat.com/en/technologies/cloud-computing/openshift/openshift-ai).
+MLBatch enables [AppWrappers](https://project-codeflare.github.io/appwrapper/)
+and adds
+[Coscheduler](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/coscheduling/README.md).
+MLBatch includes a number of configuration steps to help these components work
+in harmony and support large workloads on large clusters.
+
 ## Installation
 
 To install the latest release of AppWrapper in a Kubernetes cluster with Kueue already installed
diff --git a/site/_pages/arch-fault-tolerance.md b/site/_pages/arch-fault-tolerance.md
@@ -8,7 +8,11 @@ classes: wide
 
 The AppWrapper controller is designed to enhance and extend the fault
 tolerance capabilities provided by the controllers of its wrapped
-resources. Throughout the execution of a workload, the AppWrapper
+resources. If [Autopilot](https://github.com/ibm/autopilot) is deployed on the
+cluster, the AppWrapper controller can automate both the injection of
+Node anti-affinities to avoid scheduling workloads on unhealthy Nodes
+and the migration of running workloads away from unhealthy Nodes.
+Throughout the execution of a workload, the AppWrapper
 controller monitors both the status of the contained top-level
 resources and the status of all Pods created by the workload. If a
 workload is determined to be *unhealthy*, the AppWrapper controller
@@ -21,7 +25,6 @@ engineered to ensure that it will always make progress and eventually
 succeed in completely removing all Pods and other resources created by
 a failed workload.
 
-
 ```mermaid!
 ---
 title: Overview of AppWrapper Fault Tolerance Phase Transitions
@@ -89,6 +92,8 @@ following conditions are true:
      number of Pods to reach the `Pending` state.
    + It takes longer than the `WarmupGracePeriod` for the expected
      number of Pods to reach the `Running` state.
+   + A non-zero number of `Running` Pods are using resources
+     that Autopilot has tagged as unhealthy.
    + A top-level resource is missing.
    + The status information of a batch/v1 Job or PyTorchJob indicates
      that it has failed.
@@ -97,7 +102,7 @@ If a workload is determined to be unhealthy by one of the first three
 Pod-level conditions above, the AppWrapper controller first waits for
 a `FailureGracePeriod` to allow the primary resource controller an
 opportunity to react and return the workload to a healthy state. The
-`FailureGracePeriod` is elided by the last two conditions because the
+`FailureGracePeriod` is elided by the remaining conditions because the
 primary resource controller is not expected to take any further
 action. If the `FailureGracePeriod` passes and the workload is still
 unhealthy, the AppWrapper controller will *reset* the workload by
@@ -112,7 +117,8 @@ then the AppWrapper moves into a `Failed` state and its resources are deleted
 (thus finally releasing its quota).  If at any time during this retry loop,
 an AppWrapper is suspended (ie, Kueue decides to preempt the AppWrapper),
 the AppWrapper controller will respect this request by proceeding to delete
-the resources.
+the resources. Workload resets that are initiated in response to Autopilot
+are subject to the `RetryLimit` but do not increment the `retryCount`.
 
 To support debugging `Failed` workloads, an annotation can be added to an
 AppWrapper that adds a `DeletionOnFailureGracePeriod` between the time the
@@ -121,6 +127,13 @@ begins. Since the AppWrapper continues to consume quota during this delayed dele
 this annotation should be used sparingly and only when interactive debugging of
 the failed workload is being actively pursued.
 
+An AppWrapper can be annotated as `autopilotExempt` to disable the
+injection of Autopilot Node anti-affinities into its Pods and the
+automatic migration of its Pods away from Nodes with resources that
+Autopilot has tagged as unhealthy. This annotation should only be used for workloads
+that will be closely monitored by other means to identify and recover from
+unhealthy Nodes in the cluster.
+
 All child resources for an AppWrapper that successfully completed will be automatically
 deleted after a `SuccessTTL` after the AppWrapper entered the `Succeeded` state.
 
@@ -141,7 +154,35 @@ can be used to customize them.
 | DeletionOnFailureGracePeriod |     0 Seconds | workload.codeflare.dev.appwrapper/deletionOnFailureGracePeriodDuration |
 | ForcefulDeletionGracePeriod  |    10 Minutes | workload.codeflare.dev.appwrapper/forcefulDeletionGracePeriodDuration  |
 | SuccessTTL                   |        7 Days | workload.codeflare.dev.appwrapper/successTTLDuration                   |
+| AutopilotExempt              |         false | workload.codeflare.dev.appwrapper/autopilotExempt                      |
 | GracePeriodMaximum           |      24 Hours | Not Applicable                                                         |
 
 The `GracePeriodMaximum` imposes a system-wide upper limit on all other grace periods to
 limit the potential impact of user-added annotations on overall system utilization.
+
+The set of resources monitored by Autopilot and the associated labels that identify unhealthy
+resources can be customized as part of the AppWrapper operator's configuration.  The default
+Autopilot configuration used by the controller is:
+```yaml
+autopilot:
+  injectAntiAffinities: true
+  migrateImpactedWorkloads: true
+  resourceUnhealthyConfig:
+    nvidia.com/gpu:
+      autopilot.ibm.com/gpuhealth: ERR
+```
+
+The `resourceUnhealthyConfig` is a map from resource names to labels. With this example
+configuration, the AppWrapper controller will automatically inject the stanza below into
+the `affinity` portion of the Spec of exactly those Pods that have a non-zero resource
+request for `nvidia.com/gpu`.
+```yaml
+      nodeAffinity:
+        requiredDuringSchedulingIgnoredDuringExecution:
+          nodeSelectorTerms:
+          - matchExpressions:
+            - key: autopilot.ibm.com/gpuhealth
+              operator: NotIn
+              values:
+              - ERR
+```
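
To illustrate the `AutopilotExempt` row added to the table in this hunk, a minimal sketch of an exempted AppWrapper's metadata could look like the fragment below. The annotation key is taken from the table. The `apiVersion`, the resource name, and the quoted `"true"` value are assumptions made for illustration (Kubernetes annotation values are strings), and the `spec` is omitted.

```yaml
# Illustrative fragment only: opts a single AppWrapper out of Autopilot handling.
apiVersion: workload.codeflare.dev/v1beta2   # assumed API version
kind: AppWrapper
metadata:
  name: sample-exempt-workload               # hypothetical name
  annotations:
    # Key from the annotation table; the default is false, so only set it to opt out.
    workload.codeflare.dev.appwrapper/autopilotExempt: "true"
```

With this annotation in place, the documentation above implies that the controller would neither inject the `autopilot.ibm.com/gpuhealth` anti-affinity into the workload's Pods nor reset the workload when Autopilot tags a resource it uses as unhealthy.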
diff --git a/site/_pages/overview.md b/site/_pages/overview.md
@@ -21,6 +21,27 @@ additional level of automatic fault detection and recovery. The AppWrapper
 controller monitors the health of the workload and if corrective actions
 are not taken by the primary resource controllers within specified deadlines,
 the AppWrapper controller will orchestrate workload-level retries and
-resource deletions to ensure that either the workload returns to a
-healthy state or it is cleanly removed from the cluster and its quota
-freed for use by other workloads.
+resource deletion to ensure that either the workload returns to a
+healthy state or is cleanly removed from the cluster and its quota
+freed for use by other workloads. If [Autopilot](https://github.com/ibm/autopilot)
+is also being used on the cluster, the AppWrapper controller can be configured
+to automatically inject Node anti-affinities into Pods and to trigger
+retries when Pods in already running workloads are using resources
+that Autopilot has tagged as unhealthy. For details on customizing and
+configuring these fault tolerance capabilities, please see the
+[Fault Tolerance](https://project-codeflare.github.io/appwrapper/arch-controller/)
+section of our website.
+
+AppWrappers are designed to be used as part of a fully open source software stack
+to run production batch workloads on Kubernetes and OpenShift. The [MLBatch](https://github.com/project-codeflare/mlbatch)
+project leverages [Kueue](https://kueue.sigs.k8s.io), the [Kubeflow Training
+Operator](https://www.kubeflow.org/docs/components/training/),
+[KubeRay](https://docs.ray.io/en/latest/cluster/kubernetes/index.html), and the
+[CodeFlare Operator](https://github.com/project-codeflare/codeflare-operator)
+from [Red Hat OpenShift
+AI](https://www.redhat.com/en/technologies/cloud-computing/openshift/openshift-ai).
+MLBatch enables [AppWrappers](https://project-codeflare.github.io/appwrapper/)
+and adds
+[Coscheduler](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/coscheduling/README.md).
+MLBatch includes a number of configuration steps to help these components work
+in harmony and support large workloads on large clusters.