Commit a00748f

CAEP: machine deletion phase hooks
Defines a set of annotations that can be applied to a machine which affect the linear progress of a machine’s lifecycle after a machine has been marked for deletion. These annotations are optional and may be applied during machine creation, sometime after machine creation by a user, or sometime after machine creation by another controller or application.
1 parent 3ac3514 commit a00748f

1 file changed

+315
-0
lines changed

@@ -0,0 +1,315 @@
1+
---
2+
title: Machine Deletion Phase Hooks
3+
authors:
4+
- "@michaelgugino"
5+
reviewers:
6+
- "@enxebre"
7+
- "@vincepri"
8+
- "@detiber"
9+
- "@ncdc"
10+
creation-date: 2020-06-02
11+
last-updated: 2020-06-02
12+
status: implementable
13+
---
14+
15+
# Machine Deletion Phase Hooks
16+
17+
## Table of Contents
18+
<!-- toc -->
19+
- [Machine Deletion Phase Hooks](#machine-deletion-phase-hooks)
20+
- [Table of Contents](#table-of-contents)
21+
- [Glossary](#glossary)
22+
- [lifecycle hook](#lifecycle-hook)
23+
- [deletion phase](#deletion-phase)
24+
- [Summary](#summary)
25+
- [Motivation](#motivation)
26+
- [Goals](#goals)
27+
- [Non-Goals/Future Work](#non-goalsfuture-work)
28+
- [Proposal](#proposal)
29+
- [User Stories](#user-stories)
30+
- [Story 1](#story-1)
31+
- [Story 2](#story-2)
32+
- [Story 3](#story-3)
33+
- [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
34+
- [Lifecycle Points](#lifecycle-points)
35+
- [pre-drain](#pre-drain)
36+
- [pre-terminate](#pre-terminate)
37+
- [Annotation Form](#annotation-form)
38+
- [lifecycle-point](#lifecycle-point)
39+
- [hook-name](#hook-name)
40+
- [owner (Optional)](#owner-optional)
41+
- [Annotation Examples](#annotation-examples)
42+
- [Changes to machine-controller](#changes-to-machine-controller)
43+
- [Reconciliation](#reconciliation)
44+
- [Hook failure](#hook-failure)
45+
- [Hook ordering](#hook-ordering)
46+
- [Hook Implementing Controller Design](#hook-implementing-controller-design)
47+
- [Hook Implementing Controllers must](#hook-implementing-controllers-must)
48+
- [Hook Implementing Controllers may](#hook-implementing-controllers-may)
49+
- [Determining when to take action](#determining-when-to-take-action)
50+
- [Failure Mode](#failure-mode)
51+
- [Risks and Mitigations](#risks-and-mitigations)
52+
- [Alternatives](#alternatives)
53+
- [Custom Machine Controller](#custom-machine-controller)
54+
- [Finalizers](#finalizers)
55+
- [Status Field](#status-field)
56+
- [Spec Field](#spec-field)
57+
- [CRDs](#crds)
58+
- [Upgrade Strategy](#upgrade-strategy)
59+
- [Additional Details](#additional-details)
60+
<!-- /toc -->
61+
62+
## Glossary
63+
64+
Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html).
65+
66+
### lifecycle hook
67+
A specific point in a machine's reconciliation lifecycle where execution of
68+
normal machine-controller behavior is paused or modified.
69+
70+
### deletion phase
71+
Describes when a machine has been marked for deletion but is still present
72+
in the API. Various actions happen during this phase, such as draining a node,
73+
deleting an instance from a cloud provider, and deleting a node object.
74+
75+
### Hook Implementing Controller (HIC)
76+
A Hook Implementing Controller is a controller, other than the
77+
machine-controller, that adds, removes, and/or responds to a particular
78+
lifecycle hook. Each lifecycle hook should have a single HIC, but an HIC
79+
can optionally manage one or more hooks.
80+
81+
## Summary
82+
83+
Defines a set of annotations that can be applied to a machine which affect the
84+
linear progress of a machine’s lifecycle after a machine has been marked for
85+
deletion. These annotations are optional and may be applied during machine
86+
creation, sometime after machine creation by a user, or sometime after machine
87+
creation by another controller or application.
88+
89+
## Motivation
90+
91+
Allow custom and third-party components to easily interact with a machine or
92+
related resources while that machine's reconciliation is temporarily paused.
93+
This pause in reconciliation will allow these custom components to take action
94+
after a machine has been marked for deletion, but prior to the machine being
95+
drained and/or associated instance terminated.
96+
97+
### Goals
98+
99+
- Define an initial set of hook points for the deletion phase.
100+
- Define an initial set and form of related annotations.
101+
- Define basic expectations for a controller or process that responds to a
102+
lifecycle hook.
103+
104+
105+
### Non-Goals/Future Work
106+
107+
- Create an exhaustive list of hooks; we can add more over time.
108+
- Create new machine phases.
109+
- Create a mechanism to signal what lifecycle point a machine is at currently.
110+
- Dictate implementation of controllers that respond to the hooks.
111+
- Implement ordering in the machine-controller.
112+
- Require anyone to use these hooks for normal machine operations; these are
113+
strictly optional and for custom integrations only.
114+
115+
116+
## Proposal
117+
118+
- Utilize annotations to implement lifecycle hooks.
119+
- Each lifecycle point can have 0 or more hooks.
120+
- Hooks do not enforce ordering.
121+
- Hooks found during machine reconciliation effectively pause reconciliation
122+
until all hooks for that lifecycle point are removed from a machine's annotations.
123+
124+
125+
### User Stories
126+
#### Story 1
127+
(pre-terminate) As an operator, I would like to have the ability to perform
128+
different actions between the time a machine is marked deleted in the API and
129+
the time the machine is deleted from the cloud.
130+
131+
For example, when replacing a control plane machine, ensure a new control
132+
plane machine has been successfully created and joined to the cluster before
133+
removing the instance of the deleted machine. This might be useful in case
134+
there are disruptions during replacement and we need the disk of the existing
135+
instance to perform some disaster recovery operation. This will also prevent
136+
prolonged periods of having one fewer control plane host in the event the
137+
replacement instance does not come up in a timely manner.
138+
139+
#### Story 2
140+
(pre-drain) As an operator, I want the ability to utilize my own draining
141+
controller instead of the logic built into the machine-controller. This will
142+
allow me better flexibility and control over the lifecycle of workloads on each
143+
node.
144+
145+
#### Story 3
146+
(pre-drain) As an operator, when I am deleting a control plane machine for
147+
maintenance reasons, I want to ensure my existing control plane machine is not
148+
drained until after my replacement comes online. This will prevent protracted
149+
periods of a missing control plane member if the replacement machine cannot be
150+
created in a timely manner.
151+
152+
153+
### Implementation Details/Notes/Constraints
154+
155+
For each defined lifecycle point, one or more hooks may be applied as annotations to the machine object. These annotations will pause reconciliation of a machine object until all hooks are resolved for that lifecycle point. The hooks should be managed by a Hook Implementing Controller or other external application, or
156+
manually created and removed by an administrator.
157+
158+
#### Lifecycle Points
159+
##### pre-drain
160+
`pre-drain.delete.hook.machine.cluster.x-k8s.io`
161+
162+
Hooks defined at this point will prevent the machine-controller from draining a node after the machine object has been marked for deletion, until the hooks are removed.
163+
##### pre-terminate
164+
`pre-terminate.delete.hook.machine.cluster.x-k8s.io`
165+
166+
Hooks defined at this point will prevent the machine-controller from
167+
removing/terminating the instance in the cloud provider until the hooks are
168+
removed.
169+
170+
"pre-terminate" has been chosen over "pre-delete" because "terminate" is more
171+
easily associated with an instance being removed from the cloud or
172+
infrastructure, whereas "delete" is ambiguous as to the actual state of the
173+
machine in its lifecycle.
174+
175+
176+
#### Annotation Form
177+
```
178+
<lifecycle-point>.delete.hook.machine.cluster.x-k8s.io/<hook-name>: <owner/creator>
179+
```
180+
181+
##### lifecycle-point
182+
This is the point in a machine's reconciliation lifecycle at which the annotation takes effect and pauses the machine-controller.
183+
184+
##### hook-name
185+
Each hook should have a descriptive name that states the intent of the hook in 1-3 words. Each hook name should be unique and managed by a single entity.
186+
187+
##### owner (Optional)
188+
Some information about who created or is otherwise in charge of managing the annotation. This might be a controller or a username to indicate an administrator applied the hook directly.
189+
190+
##### Annotation Examples
191+
192+
These examples are all hypothetical to illustrate what form annotations should
193+
take. The names of each hook and the respective controllers are fictional.
194+
195+
pre-drain.delete.hook.machine.cluster.x-k8s.io/migrate-important-app: my-app-migration-controller
196+
197+
pre-terminate.delete.hook.machine.cluster.x-k8s.io/backup-files: my-backup-controller
198+
199+
pre-terminate.delete.hook.machine.cluster.x-k8s.io/wait-for-storage-detach: my-custom-storage-detach-controller
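
As a rough illustration of the annotation form, the Go sketch below splits a key into its lifecycle point and hook name, and keeps the value as the owner. The `Hook` type and `ParseHook` function are hypothetical names that the proposal does not define, and the key suffix is assumed to match the keys listed under Lifecycle Points.

```go
package main

import (
	"fmt"
	"strings"
)

// hookSuffix is assumed from the keys listed under "Lifecycle Points".
const hookSuffix = ".delete.hook.machine.cluster.x-k8s.io"

// Hook is a hypothetical parsed form of one hook annotation.
type Hook struct {
	LifecyclePoint string // e.g. "pre-drain" or "pre-terminate"
	Name           string // e.g. "migrate-important-app"
	Owner          string // annotation value; optional, may be empty
}

// ParseHook splits an annotation key/value pair into its parts.
// It returns false when the key is not a deletion hook annotation.
func ParseHook(key, value string) (Hook, bool) {
	prefix, name, found := strings.Cut(key, "/")
	if !found || name == "" {
		return Hook{}, false
	}
	if !strings.HasSuffix(prefix, hookSuffix) {
		return Hook{}, false
	}
	point := strings.TrimSuffix(prefix, hookSuffix)
	if point == "" {
		return Hook{}, false
	}
	return Hook{LifecyclePoint: point, Name: name, Owner: value}, true
}

func main() {
	h, ok := ParseHook(
		"pre-drain.delete.hook.machine.cluster.x-k8s.io/migrate-important-app",
		"my-app-migration-controller",
	)
	fmt.Println(h.LifecyclePoint, h.Name, h.Owner, ok)
}
```

The sketch does not restrict the lifecycle point to `pre-drain` or `pre-terminate`, so it would also accept points added later.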
200+
201+
#### Changes to machine-controller
202+
The machine-controller should check for the existence of 1 or more hooks at
203+
specific points (lifecycle-points) during reconciliation. If a hook matching
204+
the lifecycle-point is discovered, the machine-controller should stop
205+
reconciling the machine.
206+
207+
An example of where the pre-drain lifecycle-point might be implemented:
208+
https://github.com/kubernetes-sigs/cluster-api/blob/30c377c0964efc789ab2f3f7361eb323003a7759/controllers/machine_controller.go#L270
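
The check described above could look roughly like the following Go sketch. It is not the machine-controller's actual code: `hasHook` and `nextStep` are hypothetical helpers, the machine is reduced to its annotation map, and the `drained` flag stands in for whatever state the real controller tracks.

```go
package main

import (
	"fmt"
	"strings"
)

// Annotation key prefixes as listed under "Lifecycle Points".
const (
	preDrainPrefix     = "pre-drain.delete.hook.machine.cluster.x-k8s.io"
	preTerminatePrefix = "pre-terminate.delete.hook.machine.cluster.x-k8s.io"
)

// hasHook reports whether any annotation key belongs to the given
// lifecycle point, regardless of hook name or owner.
func hasHook(annotations map[string]string, prefix string) bool {
	for key := range annotations {
		if strings.HasPrefix(key, prefix+"/") {
			return true
		}
	}
	return false
}

// nextStep is a hypothetical summary of what a deleting machine's
// reconcile loop would do next. It never fails or times out on a hook;
// it only declines to progress while one is present.
func nextStep(annotations map[string]string, drained bool) string {
	if !drained {
		if hasHook(annotations, preDrainPrefix) {
			return "wait: pre-drain hooks present"
		}
		return "drain node"
	}
	if hasHook(annotations, preTerminatePrefix) {
		return "wait: pre-terminate hooks present"
	}
	return "terminate instance"
}

func main() {
	annotations := map[string]string{
		preTerminatePrefix + "/backup-files": "my-backup-controller",
	}
	fmt.Println(nextStep(annotations, false))
	fmt.Println(nextStep(annotations, true))
}
```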
209+
210+
##### Reconciliation
211+
When a Hook Implementing Controller updates the machine, reconciliation will be
212+
triggered, and the machine will continue reconciling as normal, unless another
213+
hook is still present; there is no need to 'fail' the reconciliation to
214+
enforce requeuing.
215+
216+
When all hooks for a given lifecycle-point are removed, reconciliation
217+
will continue as normal.
218+
219+
##### Hook failure
220+
The machine-controller should not time out or otherwise consider the lifecycle
221+
hook as 'failed.' Only the Hook Implementing Controller may decide to remove a
222+
particular lifecycle hook to allow the machine-controller to progress past the
223+
corresponding lifecycle-point.
224+
225+
##### Hook ordering
226+
The machine-controller will not attempt to enforce any ordering of hooks.
227+
228+
Hook Implementing Controllers may devise a dependency-based enforcement mechanism
229+
via whatever means HICs determine. Examples could be
230+
using CRDs external to the machine-api, gRPC communications, or
231+
additional annotations on the machine or other objects.
232+
233+
#### Hook Implementing Controller Design
234+
Hook Implementing Controller is the component that manages a particular
235+
lifecycle hook.
236+
237+
##### Hook Implementing Controllers must
238+
* Watch machine objects and determine when an appropriate action must be taken.
239+
* After completing the desired hook action, remove the hook annotation.
240+
##### Hook Implementing Controllers may
241+
* Watch machine objects and add a hook annotation as desired by the cluster
242+
administrator.
243+
* After completing a hook, set an additional annotation that indicates this
244+
controller has finished for dependency ordering, or perform some other action
245+
to signal other Hook Implementing Controllers that they may proceed.
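
A minimal Go sketch of the final step, assuming the controller works on a copy of the machine's annotations before sending an update. The `finishHook` helper and the "done" annotation key are hypothetical; the proposal leaves the signalling mechanism to each Hook Implementing Controller.

```go
package main

import "fmt"

// finishHook returns a copy of annotations with hookKey removed.
// If doneKey is non-empty, it is added as a marker that other
// controllers may watch for dependency ordering (optional behavior).
func finishHook(annotations map[string]string, hookKey, doneKey string) map[string]string {
	out := make(map[string]string, len(annotations))
	for k, v := range annotations {
		if k != hookKey {
			out[k] = v
		}
	}
	if doneKey != "" {
		out[doneKey] = "true"
	}
	return out
}

func main() {
	hookKey := "pre-terminate.delete.hook.machine.cluster.x-k8s.io/backup-files"
	before := map[string]string{hookKey: "my-backup-controller"}

	// "example.com/backup-files-done" is a made-up marker key.
	after := finishHook(before, hookKey, "example.com/backup-files-done")
	fmt.Println(len(before), len(after), after["example.com/backup-files-done"])
}
```

Returning a copy rather than mutating the input keeps the original object intact in case the update to the API is rejected and has to be retried.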
246+
247+
#### Determining when to take action
248+
249+
A Hook Implementing Controller should watch machines and determine the
250+
best time to take action.
251+
252+
For example, if an HIC manages a lifecycle hook at the pre-drain lifecycle-point,
253+
then that controller should take action immediately after a machine has a
254+
DeletionTimestamp or enters the "Deleting" phase.
255+
256+
Fine-tuned coordination is not possible at this time; e.g., it's not
257+
possible to execute a pre-terminate hook only after a node has been drained.
258+
This is reserved for future work.
259+
260+
##### Failure Mode
261+
It is entirely up to the Hook Implementing Controller to determine when it is
262+
prudent to remove a particular lifecycle hook. Some controllers may want to
263+
'give up' after a certain time period, and others may want to block indefinitely.
264+
Cluster operators should consider the characteristics of each controller before
265+
utilizing them in their clusters.
266+
267+
268+
### Risks and Mitigations
269+
270+
* Annotation keys must conform to length limits: https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/#syntax-and-character-set
271+
* Requires well-behaved controllers and admins to keep things running
272+
smoothly. It would be easy to disrupt machines with poor configuration.
273+
* Troubleshooting problems may increase in complexity, but this is
274+
mitigated mostly by the fact that these hooks are opt-in. Operators
275+
will or should know they are consuming these hooks, but a future proliferation
276+
of the cluster-api could result in these components being bundled as a
277+
complete solution that operators just consume. To this end, we should
278+
update any troubleshooting guides to check these hook points where possible.
279+
280+
281+
## Alternatives
282+
283+
### Custom Machine Controller
284+
Require advanced users to fork and customize. This can already be done if someone chooses, so it is not much of a solution.
285+
286+
### Finalizers
287+
We define additional finalizers, but this really only implies the deletion lifecycle point. A misbehaving controller that
288+
accidentally removes finalizers could have undesirable
289+
effects.
290+
291+
### Status Field
292+
Harder for users to modify or set hooks during machine creation. How would a user remove a hook if a controller that is supposed to remove it is misbehaving? We’d probably need an annotation like ‘skip-hook-xyz’ or similar, and that seems redundant to just using annotations in the first place.
293+
294+
### Spec Field
295+
We probably don’t want other controllers dynamically adding and removing spec fields on an object. It’s not very declarative to utilize spec fields in that way.
296+
297+
### CRDs
298+
Seems like we’d need to sync information to and from a CR. There are different approaches to CRDs (1-to-1 mapping machine to CR, match labels, present/absent vs status fields) that each have their own drawbacks and are more complex to define and configure.
299+
300+
301+
## Upgrade Strategy
302+
303+
Nothing defined here should directly impact upgrades, other than hooks that generally affect the creation or deletion of a machine.
304+
305+
## Additional Details
306+
307+
Fine-tuned timing of hooks is not possible at this time.
308+
309+
In the future, it is possible to implement this timing via additional
310+
machine phases, or possibly "sub-phases" or some other mechanism
311+
that might be appropriate. As stated in the non-goals, that is
312+
not in scope at this time, and could be future work.
313+
314+
<!-- Links -->
315+
[community meeting]: https://docs.google.com/document/d/1Ys-DOR5UsgbMEeciuG0HOgDQc8kZsaWIWJeKJ1-UfbY
