Commit 189a30c

CAEP: machine deletion phase hooks
Defines a set of annotations that can be applied to a machine which affect the linear progress of a machine’s lifecycle after a machine has been marked for deletion. These annotations are optional and may be applied during machine creation, sometime after machine creation by a user, or sometime after machine creation by another controller or application.
1 parent 3ac3514 commit 189a30c

1 file changed

+297
-0
lines changed

@@ -0,0 +1,297 @@
1+
---
2+
title: Machine Deletion Phase Hooks
3+
authors:
4+
- "@michaelgugino"
5+
reviewers:
6+
- "@enxebre"
7+
- "@vincepri"
8+
- "@detiber"
9+
- "@ncdc"
10+
creation-date: 2020-06-02
11+
last-updated: 2020-06-02
12+
status: implementable
13+
---
14+
15+
# Machine Deletion Phase Hooks
16+
17+
## Table of Contents
18+
<!-- toc -->
19+
- [Machine Deletion Phase Hooks](#machine-deletion-phase-hooks)
20+
- [Table of Contents](#table-of-contents)
21+
- [Glossary](#glossary)
22+
- [lifecycle hook](#lifecycle-hook)
23+
- [deletion phase](#deletion-phase)
24+
- [Summary](#summary)
25+
- [Motivation](#motivation)
26+
- [Goals](#goals)
27+
- [Non-Goals/Future Work](#non-goalsfuture-work)
28+
- [Proposal](#proposal)
29+
- [User Stories](#user-stories)
30+
- [Story 1](#story-1)
31+
- [Story 2](#story-2)
32+
- [Story 3](#story-3)
33+
- [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
34+
- [Lifecycle Points](#lifecycle-points)
35+
- [pre-drain](#pre-drain)
36+
- [pre-term](#pre-term)
37+
- [Annotation Form](#annotation-form)
38+
- [lifecycle-point](#lifecycle-point)
39+
- [hook-name](#hook-name)
40+
- [owner (Optional)](#owner-optional)
41+
- [Annotation Examples](#annotation-examples)
42+
- [Changes to machine-controller](#changes-to-machine-controller)
43+
- [Reconciliation](#reconciliation)
44+
- [Hook failure](#hook-failure)
45+
- [Hook ordering](#hook-ordering)
46+
- [Owning Controller Design](#owning-controller-design)
47+
- [Owning Controllers must](#owning-controllers-must)
48+
- [Owning Controllers may](#owning-controllers-may)
49+
- [Determining when to take action](#determining-when-to-take-action)
50+
- [Failure Mode](#failure-mode)
51+
- [Risks and Mitigations](#risks-and-mitigations)
52+
- [Alternatives](#alternatives)
53+
- [Custom Machine Controller](#custom-machine-controller)
54+
- [Finalizers](#finalizers)
55+
- [Status Field](#status-field)
56+
- [Spec Field](#spec-field)
57+
- [CRDs](#crds)
58+
- [Upgrade Strategy](#upgrade-strategy)
59+
- [Additional Details](#additional-details)
60+
<!-- /toc -->
61+
62+
## Glossary
63+
64+
Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html).
65+
66+
### lifecycle hook
67+
A specific point in a machine's reconciliation lifecycle where execution of
68+
normal machine-controller behavior is paused or modified.
69+
70+
### deletion phase
71+
Describes when a machine has been marked for deletion but is still present
72+
in the API. Various actions happen during this phase, such as draining a node,
73+
deleting an instance from a cloud provider, and deleting a node object.
74+
75+
## Summary
76+
77+
Defines a set of annotations that can be applied to a machine which affect the
78+
linear progress of a machine’s lifecycle after a machine has been marked for
79+
deletion. These annotations are optional and may be applied during machine
80+
creation, sometime after machine creation by a user, or sometime after machine
81+
creation by another controller or application.
82+
83+
## Motivation
84+
85+
Allow custom and third-party components to easily interact with a machine or
86+
related resources while that machine's reconciliation is temporarily paused.
87+
This pause in reconciliation will allow these custom components to take action
88+
after a machine has been marked for deletion, but prior to the machine being
89+
drained and/or the associated instance being terminated.
90+
91+
### Goals
92+
93+
- Define an initial set of hook points for the deletion phase.
94+
- Define an initial set and form of related annotations.
95+
- Define basic expectations for a controller or process that responds to a
96+
lifecycle hook.
97+
98+
99+
### Non-Goals/Future Work
100+
101+
- Create an exhaustive list of hooks; we can add more over time.
102+
- Create new machine phases.
103+
- Dictate implementation of controllers that respond to the hooks.
104+
- Implement ordering in the machine-controller.
105+
- Require anyone to use these hooks for normal machine operations; these are
106+
strictly optional and for custom integrations only.
107+
108+
109+
## Proposal
110+
111+
- Utilize annotations to implement lifecycle hooks.
112+
- Each lifecycle point can have 0 or more hooks.
113+
- Hooks do not enforce ordering.
114+
- Hooks found during machine reconciliation effectively pause reconciliation
115+
until all hooks for that lifecycle point are removed from a machine's annotations.
116+
117+
118+
### User Stories
119+
#### Story 1
120+
(pre-term) As an operator, I would like to have the ability to perform
121+
different actions between the time a machine is marked for deletion in the API and
122+
the time the machine is deleted from the cloud.
123+
124+
For example, when replacing a control plane machine, ensure a new control
125+
plane machine has been successfully created and joined to the cluster before
126+
removing the instance of the deleted machine. This might be useful in case
127+
there are disruptions during replacement and we need the disk of the existing
128+
instance to perform some disaster recovery operation. This will also prevent
129+
prolonged periods of having one fewer control plane host in the event the
130+
replacement instance does not come up in a timely manner.
131+
132+
#### Story 2
133+
(pre-drain) As an operator, I want the ability to utilize my own draining
134+
controller instead of the logic built into the machine-controller. This will
135+
allow me better flexibility and control over the lifecycle of workloads on each
136+
node.
137+
138+
#### Story 3
139+
(pre-drain) As an operator, when I am deleting a control plane machine for
140+
maintenance reasons, I want to ensure my existing control plane machine is not
141+
drained until after my replacement comes online. This will prevent protracted
142+
periods of a missing control plane member if the replacement machine cannot be
143+
created in a timely manner.
144+
145+
146+
### Implementation Details/Notes/Constraints
147+
148+
For each defined lifecycle point, one or more hooks may be applied as an annotation to the machine object. These annotations will pause reconciliation of a machine object until all hooks are resolved for that lifecycle point. The hooks should be managed by an Owning Controller or other external application, or manually created and removed by an administrator.
149+
150+
#### Lifecycle Points
151+
##### pre-drain
152+
Hooks defined at this point will prevent the machine-controller from draining a node after the machine object has been marked for deletion until the hooks are removed.
153+
##### pre-term
154+
Hooks defined at this point will prevent the machine-controller from
155+
removing/terminating the instance in the cloud provider until the hooks are
156+
removed.
157+
158+
"pre-term" has been chosen over "pre-delete" because "terminate" is more
159+
easily associated with an instance being removed from the cloud or
160+
infrastructure, whereas "delete" is ambiguous as to the actual state of the
161+
machine in its lifecycle.
162+
163+
164+
#### Annotation Form
165+
```
166+
<lifecycle-point>.hook.machine.cluster-api.x-k8s.io/<hook-name>: <owner/creator>
167+
```
168+
169+
##### lifecycle-point
170+
This is the point in a machine's reconciliation lifecycle at which the annotation takes effect and pauses the machine-controller.
171+
172+
##### hook-name
173+
Each hook should have a unique, descriptive name that states the intent of or reason for the hook in 1-3 words. Each hook name should be managed by a single entity.
174+
175+
##### owner (Optional)
176+
Some information about who created or is otherwise in charge of managing the annotation. This might be a controller or a username to indicate an administrator applied the hook directly.
177+
178+
##### Annotation Examples
179+
pre-drain.hook.machine.cluster-api.x-k8s.io/migrate-important-app: my-app-migration-controller
180+
181+
pre-term.hook.machine.cluster-api.x-k8s.io/preserve-instance: my-instance-replacement-controller
182+
183+
pre-term.hook.machine.cluster-api.x-k8s.io/never-delete: cluster-admin
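
The annotation form above can be split back into its parts mechanically. The following is a minimal sketch, written in Python with plain strings rather than the cluster-api types; the names `Hook` and `parse_hook` are illustrative only and are not part of this proposal.

```python
from typing import NamedTuple, Optional

# Suffix shared by every hook annotation prefix, taken from the annotation form.
HOOK_SUFFIX = ".hook.machine.cluster-api.x-k8s.io"


class Hook(NamedTuple):
    lifecycle_point: str  # e.g. "pre-drain" or "pre-term"
    name: str             # the <hook-name> part of the key
    owner: str            # the annotation value; optional, so it may be empty


def parse_hook(key: str, value: str) -> Optional[Hook]:
    """Return a Hook if the annotation key follows the hook form, else None."""
    prefix, sep, name = key.partition("/")
    if not sep or not name or not prefix.endswith(HOOK_SUFFIX):
        return None
    lifecycle_point = prefix[: -len(HOOK_SUFFIX)]
    if not lifecycle_point:
        return None
    return Hook(lifecycle_point, name, value)
```

Annotations that do not follow the form yield `None`, so unrelated annotations on the machine can be passed through the same function and ignored.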
184+
185+
#### Changes to machine-controller
186+
The machine-controller should check for the existence of 1 or more hooks at
187+
specific points (lifecycle-points) during reconciliation. If a hook matching
188+
the lifecycle-point is discovered, the machine-controller should stop
189+
reconciling the machine.
190+
191+
An example of where the pre-drain lifecycle-point might be implemented:
192+
https://github.com/kubernetes-sigs/cluster-api/blob/30c377c0964efc789ab2f3f7361eb323003a7759/controllers/machine_controller.go#L270
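
As a rough illustration of the check described above, the sketch below decides the next step for a machine that is already marked for deletion. It is Python over a plain annotations dictionary, not the actual machine-controller code; `has_hook`, `next_deletion_step`, and the returned step names are assumptions made for this example.

```python
from typing import Dict

HOOK_DOMAIN = "hook.machine.cluster-api.x-k8s.io"


def has_hook(annotations: Dict[str, str], lifecycle_point: str) -> bool:
    """True if at least one hook annotation exists for the lifecycle point."""
    prefix = f"{lifecycle_point}.{HOOK_DOMAIN}/"
    return any(key.startswith(prefix) for key in annotations)


def next_deletion_step(annotations: Dict[str, str], drained: bool) -> str:
    """Decide what to do next for a machine that is marked for deletion.

    "wait" means a hook blocks the upcoming step: the controller stops
    reconciling and is triggered again when the machine is next updated.
    """
    if not drained:
        return "wait" if has_hook(annotations, "pre-drain") else "drain"
    return "wait" if has_hook(annotations, "pre-term") else "terminate"
```

Note that in this sketch a pre-term hook does not hold up draining; it only blocks termination once the drain has completed.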
193+
194+
##### Reconciliation
195+
When an Owning Controller updates the machine, reconciliation will be
196+
triggered, and the machine will continue reconciling as normal, unless another
197+
hook is still present; there is no need to 'fail' the reconciliation to
198+
enforce requeuing.
199+
200+
When all hooks for a given lifecycle-point are removed, reconciliation
201+
will continue as normal.
202+
203+
##### Hook failure
204+
The machine-controller should not time out or otherwise consider the lifecycle
205+
hook as 'failed.' Only the Owning Controller may decide to remove a
206+
particular lifecycle hook to allow the machine-controller to progress past the
207+
corresponding lifecycle-point.
208+
209+
##### Hook ordering
210+
The machine-controller will not attempt to enforce any ordering of hooks.
211+
212+
Owning Controllers may devise a dependency-based enforcement mechanism
213+
via whatever means Owning Controllers determine. Examples could be
214+
using CRDs external to the machine-api, gRPC communications, or
215+
additional annotations on the machine or other objects.
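
One possible shape for the "additional annotations" approach is a completion marker that a dependent Owning Controller waits for. This is only a sketch: the `hooks.example.com/done-` prefix and the `prerequisites_done` helper are hypothetical, and nothing in this proposal defines them.

```python
from typing import Dict, Iterable

# Hypothetical marker prefix; Owning Controllers would have to agree on one.
DONE_PREFIX = "hooks.example.com/done-"


def prerequisites_done(annotations: Dict[str, str], depends_on: Iterable[str]) -> bool:
    """True once every hook this controller depends on has left its marker."""
    return all(DONE_PREFIX + name in annotations for name in depends_on)
```

A controller that must run after `migrate-important-app` would call `prerequisites_done(annotations, ["migrate-important-app"])` on each update and do nothing until it returns `True`.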
216+
217+
#### Owning Controller Design
218+
An Owning Controller is the component that manages a particular lifecycle hook.
219+
220+
##### Owning Controllers must
221+
* Watch machine objects and determine when an appropriate action must be taken.
222+
* After completing the desired hook action, remove the hook annotation.
223+
##### Owning Controllers may
224+
* Watch machine objects and add a hook annotation as desired by the operator.
225+
* After completing a hook, set an additional annotation that indicates this
226+
controller has finished for dependency ordering, or perform some other action
227+
to signal other Owning Controllers that they may proceed.
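
The "must" items above amount to: perform the action, then remove only your own annotation. A minimal sketch under that reading follows, using a plain dictionary for the machine's annotations; `resolve_hook` and the `action` callback are illustrative names, not an API defined by this proposal.

```python
from typing import Callable, Dict


def resolve_hook(
    annotations: Dict[str, str],
    hook_key: str,
    action: Callable[[], bool],
) -> Dict[str, str]:
    """Run the hook action and return the annotations to write back.

    The hook is removed only if the action reports success; on failure the
    annotations are returned unchanged so the machine stays paused.
    """
    if hook_key not in annotations:
        return dict(annotations)  # nothing to do, the hook is not set
    if not action():
        return dict(annotations)  # keep blocking and retry on a later update
    return {k: v for k, v in annotations.items() if k != hook_key}
```

Annotations owned by other controllers are left untouched, which keeps each hook managed by a single entity.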
228+
229+
#### Determining when to take action
230+
231+
An Owning Controller should watch machines and determine the
232+
best time to take action.
233+
234+
For example, if an Owning Controller manages a lifecycle hook at the
235+
pre-drain lifecycle-point, then that controller should take action
236+
immediately after a machine has a DeletionTimestamp or enters the
237+
"Deleting" phase.
238+
239+
Fine-tuned coordination is not possible at this time; e.g., it's not
240+
possible to execute a pre-term hook only after a node has been drained.
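
For the pre-drain example above, the trigger condition could be sketched as follows. The machine is represented as the dictionary form of its API object, where `metadata.deletionTimestamp` and `metadata.annotations` are the standard Kubernetes metadata fields; `should_act` itself is a hypothetical helper.

```python
from typing import Any, Dict


def should_act(machine: Dict[str, Any], hook_key: str) -> bool:
    """True if the machine is marked for deletion and still carries our hook."""
    meta = machine.get("metadata") or {}
    marked_for_deletion = bool(meta.get("deletionTimestamp"))
    annotations = meta.get("annotations") or {}
    return marked_for_deletion and hook_key in annotations
```

Because only the deletion timestamp and the hook annotation are inspected, this check cannot tell whether the node has already been drained, which matches the limitation noted above.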
241+
242+
##### Failure Mode
243+
It is entirely up to the Owning Controller to determine when it is prudent to
244+
remove a particular lifecycle hook. Some controllers may want to 'give up'
245+
after a certain time period, and others may want to block indefinitely.
246+
Cluster operators should consider the characteristics of each controller before
247+
utilizing them in their clusters.
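
The two behaviors described above ('give up' after a period versus block indefinitely) could be expressed as one policy check. This is a sketch only: `should_give_up` and its `max_wait` parameter are assumptions for illustration, and the timestamp format is the RFC 3339 UTC form that Kubernetes uses for `deletionTimestamp`.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def should_give_up(
    deletion_timestamp: str,
    max_wait: Optional[timedelta],
    now: datetime,
) -> bool:
    """Decide whether the Owning Controller should remove its hook unfinished.

    max_wait=None models a controller that blocks indefinitely.
    """
    if max_wait is None:
        return False
    marked_at = datetime.strptime(
        deletion_timestamp, "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return now - marked_at >= max_wait
```

Passing `now` explicitly (for example `datetime.now(timezone.utc)`) keeps the policy easy to test.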
248+
249+
250+
### Risks and Mitigations
251+
252+
* Annotation keys must conform to length limits: https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/#syntax-and-character-set
253+
* Requires well-behaved controllers and admins to keep things running
254+
smoothly. It would be easy to disrupt machines with a poor configuration.
255+
* Troubleshooting problems may increase in complexity, but this is
256+
mitigated mostly by the fact that these hooks are opt-in. Operators
257+
will or should know they are consuming these hooks, but a future proliferation
258+
of the cluster-api could result in these components being bundled as a
259+
complete solution that operators just consume. To this end, we should
260+
update any troubleshooting guides to check these hook points where possible.
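
The length limits in the first risk can be checked before a hook is applied. The sketch below covers only part of the Kubernetes annotation key rules (a prefix of at most 253 characters, a name of at most 63 characters that starts and ends with an alphanumeric character); it does not validate that the prefix is a DNS subdomain, and `key_problems` is a hypothetical helper.

```python
import re
from typing import List

# Name segment: alphanumeric at both ends; '-', '_', '.' allowed in between.
_NAME_RE = re.compile(r"[A-Za-z0-9]([A-Za-z0-9_.-]*[A-Za-z0-9])?")


def key_problems(key: str) -> List[str]:
    """Return a list of problems found in an annotation key (empty if none)."""
    prefix, sep, name = key.rpartition("/")
    problems = []
    if sep and len(prefix) > 253:
        problems.append("prefix longer than 253 characters")
    if len(name) > 63:
        problems.append("name longer than 63 characters")
    if not _NAME_RE.fullmatch(name):
        problems.append("name must start and end with an alphanumeric character")
    return problems
```

Since the hook prefix is fixed apart from the lifecycle point, the practical constraint falls on the hook name, which fits the 1-3 word guidance for hook names.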
261+
262+
263+
## Alternatives
264+
265+
### Custom Machine Controller
266+
Require advanced users to fork and customize. This can already be done if someone chooses, so it is not much of a solution.
267+
268+
### Finalizers
269+
We define additional finalizers, but this really only implies the deletion lifecycle point. A misbehaving controller that
270+
accidentally removes finalizers could have undesirable
271+
effects.
272+
273+
### Status Field
274+
Harder for users to modify or set hooks during machine creation. How would a user remove a hook if a controller that is supposed to remove it is misbehaving? We’d probably need an annotation like ‘skip-hook-xyz’ or similar, which seems redundant compared to just using annotations in the first place.
275+
276+
### Spec Field
277+
We probably don’t want other controllers dynamically adding and removing spec fields on an object. It’s not very declarative to utilize spec fields in that way.
278+
279+
### CRDs
280+
Seems like we’d need to sync information to and from a CR. There are different approaches to CRDs (1-to-1 mapping machine to CR, match labels, present/absent vs status fields) that each have their own drawbacks and are more complex to define and configure.
281+
282+
283+
## Upgrade Strategy
284+
285+
Nothing defined here should directly impact upgrades other than defining hooks that impact creation/deletion of a machine, generally.
286+
287+
## Additional Details
288+
289+
Fine-tuned timing of hooks is not possible at this time.
290+
291+
In the future, it is possible to implement this timing via additional
292+
machine phases, or possibly "sub-phases" or some other mechanism
293+
that might be appropriate. As stated in the non-goals, that is
294+
not in scope at this time, and could be future work.
295+
296+
<!-- Links -->
297+
[community meeting]: https://docs.google.com/document/d/1Ys-DOR5UsgbMEeciuG0HOgDQc8kZsaWIWJeKJ1-UfbY
