Commit ab02db3

📖 Document failureReason and Message are considered terminal errors (#10561)
* Document failureReason and Message are considered terminal errors
* Address comments
* Clarify what cannot be restored anymore means
1 parent 8e72a0e commit ab02db3

8 files changed: +32 -2 lines changed

docs/book/src/developer/architecture/controllers/cluster.md

Lines changed: 3 additions & 0 deletions
@@ -50,6 +50,9 @@ is a map, defined as `map[string]FailureDomainSpec`. A unique key must be used f
 - `controlPlane` (bool): indicates if failure domain is appropriate for running control plane instances.
 - `attributes` (`map[string]string`): arbitrary attributes for users to apply to a failure domain.

+Note: once any of `failureReason` or `failureMessage` surface on the cluster who is referencing the infrastructureCluster object,
+they cannot be restored anymore (it is considered a terminal error; the only way to recover is to delete and recreate the cluster).
+
 Example:
 ```yaml
 kind: MyProviderCluster

docs/book/src/developer/architecture/controllers/control-plane.md

Lines changed: 3 additions & 0 deletions
@@ -234,6 +234,9 @@ The `status` object **may** define several fields:
 exist in the cluster. For example, managed control plane providers for AKS, EKS, GKE, etc, should
 set this to `true`. Leaving the field undefined is equivalent to setting the value to `false`.

+Note: once any of `failureReason` or `failureMessage` surface on the cluster who is referencing the control plane object,
+they cannot be restored anymore (it is considered a terminal error; the only way to recover is to delete and recreate the cluster).
+
 ## Example usage

 ```yaml

docs/book/src/developer/architecture/controllers/machine-pool.md

Lines changed: 7 additions & 1 deletion
@@ -61,6 +61,9 @@ The `status` object **may** define several fields that do not affect functionali
 * `failureReason` - a string field explaining why a fatal error has occurred, if possible.
 * `failureMessage` - a string field that holds the message contained by the error.

+Note: once any of `failureReason` or `failureMessage` surface on the machine pool who is referencing the bootstrap config object,
+they cannot be restored anymore (it is considered a terminal error; the only way to recover is to delete and recreate the machine pool).
+
 Example:

 ```yaml
@@ -97,7 +100,10 @@ The `status` object **may** define several fields that do not affect functionali
 * `failureMessage` - is a string that holds the message contained by the error.
 * `infrastructureMachineKind` - the kind of the InfraMachines. This should be set if the InfrastructureMachinePool plans to support MachinePool Machines.

-**Note:** Infrastructure providers can support MachinePool Machines by having the InfraMachinePool set the `infrastructureMachineKind` to the kind of their InfrastructureMachines. The InfrastructureMachinePool will be responsible for creating InfrastructureMachines as the MachinePool is scaled up, and the MachinePool controller will create Machines for each InfrastructureMachine and set the ownerRef. The InfrastructureMachinePool will be responsible for deleting the Machines as the MachinePool is scaled down in order for the Machine deletion workflow to function properly. In addition, the InfrastructureMachines must also have the following labels set by the InfrastructureMachinePool: `cluster.x-k8s.io/cluster-name` and `cluster.x-k8s.io/pool-name`. The `MachinePoolNameLabel` must also be formatted with `capilabels.MustFormatValue()` so that it will not exceed character limits.
+Note: once any of `failureReason` or `failureMessage` surface on the machine pool who is referencing the InfrastructureMachinePool object,
+they cannot be restored anymore (it is considered a terminal error; the only way to recover is to delete and recreate the machine pool).
+
+Note: Infrastructure providers can support MachinePool Machines by having the InfraMachinePool set the `infrastructureMachineKind` to the kind of their InfrastructureMachines. The InfrastructureMachinePool will be responsible for creating InfrastructureMachines as the MachinePool is scaled up, and the MachinePool controller will create Machines for each InfrastructureMachine and set the ownerRef. The InfrastructureMachinePool will be responsible for deleting the Machines as the MachinePool is scaled down in order for the Machine deletion workflow to function properly. In addition, the InfrastructureMachines must also have the following labels set by the InfrastructureMachinePool: `cluster.x-k8s.io/cluster-name` and `cluster.x-k8s.io/pool-name`. The `MachinePoolNameLabel` must also be formatted with `capilabels.MustFormatValue()` so that it will not exceed character limits.

 Example
 ```yaml

docs/book/src/developer/architecture/controllers/machine.md

Lines changed: 8 additions & 0 deletions
@@ -61,6 +61,10 @@ The `status` object **may** define several fields that do not affect functionali
 * `failureReason` - a string field explaining why a fatal error has occurred, if possible.
 * `failureMessage` - a string field that holds the message contained by the error.

+Note: once any of `failureReason` or `failureMessage` surface on the machine who is referencing the bootstrap config object,
+they cannot be restored anymore (it is considered a terminal error; the only way to recover is to delete and recreate the machine).
+Also, if the machine is under control of a MachineHealthCheck instance, the machine will be automatically remediated.
+
 Example:

 ```yaml
@@ -105,6 +109,10 @@ defined as:
 - `type` (string): one of `Hostname`, `ExternalIP`, `InternalIP`, `ExternalDNS`, `InternalDNS`
 - `address` (string)

+Note: once any of `failureReason` or `failureMessage` surface on the machine who is referencing the infrastructureMachine object,
+they cannot be restored anymore (it is considered a terminal error; the only way to recover is to delete and recreate the machine).
+Also, if the machine is under control of a MachineHealthCheck instance, the machine will be automatically remediated.
+
 Example:
 ```yaml
 kind: MyMachine

docs/book/src/developer/providers/bootstrap.md

Lines changed: 4 additions & 0 deletions
@@ -27,6 +27,10 @@ A bootstrap provider must define an API type for bootstrap resources. The type:
 2. `failureMessage` (string): indicates there is a fatal problem reconciling the bootstrap data;
 meant to be a more descriptive value than `failureReason`

+Note: once any of `failureReason` or `failureMessage` surface on the machine/machine pool who is referencing the bootstrap config object,
+they cannot be restored anymore (it is considered a terminal error; the only way to recover is to delete and recreate the machine/machine pool).
+Also, if the machine is under control of a MachineHealthCheck instance, the machine will be automatically remediated.
+
 Note: because the `dataSecretName` is part of `status`, this value must be deterministically recreatable from the data in the
 `Cluster`, `Machine`, and/or bootstrap resource. If the name is randomly generated, it is not always possible to move
 the resource and its associated secret from one management cluster to another.

docs/book/src/developer/providers/cluster-infrastructure.md

Lines changed: 3 additions & 0 deletions
@@ -36,6 +36,9 @@ A cluster infrastructure provider must define an API type for "infrastructure cl
 - `controlPlane` (bool): indicates if failure domain is appropriate for running control plane instances.
 - `attributes` (`map[string]string`): arbitrary attributes for users to apply to a failure domain.

+Note: once any of `failureReason` or `failureMessage` surface on the cluster who is referencing the infrastructureCluster object,
+they cannot be restored anymore (it is considered a terminal error; the only way to recover is to delete and recreate the cluster).
+
 ### InfraClusterTemplate Resources

 For a given InfraCluster resource, you should also add a corresponding InfraClusterTemplate resources:

docs/book/src/developer/providers/machine-infrastructure.md

Lines changed: 3 additions & 0 deletions
@@ -45,6 +45,9 @@ A machine infrastructure provider must define an API type for "infrastructure ma
 7. Should have a conditions field with the following:
 1. A Ready condition to represent the overall operational state of the component. It can be based on the summary of more detailed conditions existing on the same object, e.g. instanceReady, SecurityGroupsReady conditions.

+Note: once any of `failureReason` or `failureMessage` surface on the machine who is referencing the infrastructureMachine object,
+they cannot be restored anymore (it is considered a terminal error; the only way to recover is to delete and recreate the machine).
+Also, if the machine is under control of a MachineHealthCheck instance, the machine will be automatically remediated.

 ### InfraMachineTemplate Resources

docs/book/src/tasks/automated-machine-management/healthchecking.md

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ A MachineHealthCheck is a resource within the Cluster API which allows users to
 A MachineHealthCheck is defined on a management cluster and scoped to a particular workload cluster.

 When defining a MachineHealthCheck, users specify a timeout for each of the conditions that they define to check on the Machine's Node.
-If any of these conditions are met for the duration of the timeout, the Machine will be remediated.
+If any of these conditions are met for the duration of the timeout, the Machine will be remediated. Also, Machines with `failureReason` or `failureMessage` (terminal failures) are automatically remediated.
 By default, the action of remediating a Machine should trigger a new Machine to be created to replace the failed one, but providers are allowed to plug in more sophisticated external remediation solutions.

 ## Creating a MachineHealthCheck
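
Reading aid (not part of the commit): the hunk above describes per-condition timeouts on a MachineHealthCheck. A minimal sketch of such a resource is shown here, assuming the `cluster.x-k8s.io/v1beta1` API version; the resource name, cluster name, label selector, and timeout values are illustrative placeholders. Nothing in this manifest is configured for the terminal-failure case, since the changed line states that Machines with `failureReason` or `failureMessage` are remediated automatically.

```yaml
# Illustrative sketch only: names, labels, and timeouts are assumptions.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example-node-unhealthy-5m      # hypothetical name
spec:
  clusterName: example-cluster         # hypothetical workload cluster
  selector:
    matchLabels:
      nodepool: nodepool-0             # hypothetical label on the Machines to check
  # Each entry pairs a Node condition with the timeout after which
  # the Machine is considered unhealthy and remediated.
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
```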
