---
title: Machine Deletion Phase Hooks
authors:
- "@michaelgugino"
reviewers:
- "@enxebre"
- "@vincepri"
- "@detiber"
- "@ncdc"
creation-date: 2020-06-02
last-updated: 2020-06-02
status: implementable
---

# Machine Deletion Phase Hooks

## Table of Contents

<!-- toc -->
- [Machine Deletion Phase Hooks](#machine-deletion-phase-hooks)
  - [Table of Contents](#table-of-contents)
  - [Glossary](#glossary)
    - [lifecycle hook](#lifecycle-hook)
    - [deletion phase](#deletion-phase)
    - [Hook Implementing Controller (HIC)](#hook-implementing-controller-hic)
  - [Summary](#summary)
  - [Motivation](#motivation)
    - [Goals](#goals)
    - [Non-Goals/Future Work](#non-goalsfuture-work)
  - [Proposal](#proposal)
    - [User Stories](#user-stories)
      - [Story 1](#story-1)
      - [Story 2](#story-2)
    - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
      - [Lifecycle Points](#lifecycle-points)
        - [pre-drain](#pre-drain)
        - [pre-terminate](#pre-terminate)
      - [Annotation Form](#annotation-form)
        - [lifecycle-point](#lifecycle-point)
        - [hook-name](#hook-name)
        - [owner (Optional)](#owner-optional)
        - [Annotation Examples](#annotation-examples)
      - [Changes to machine-controller](#changes-to-machine-controller)
        - [Reconciliation](#reconciliation)
        - [Hook failure](#hook-failure)
        - [Hook ordering](#hook-ordering)
      - [Hook Implementing Controller Design](#hook-implementing-controller-design)
        - [Hook Implementing Controllers must](#hook-implementing-controllers-must)
        - [Hook Implementing Controllers may](#hook-implementing-controllers-may)
      - [Determining when to take action](#determining-when-to-take-action)
        - [Failure Mode](#failure-mode)
    - [Risks and Mitigations](#risks-and-mitigations)
  - [Alternatives](#alternatives)
    - [Custom Machine Controller](#custom-machine-controller)
    - [Finalizers](#finalizers)
    - [Status Field](#status-field)
    - [Spec Field](#spec-field)
    - [CRDs](#crds)
  - [Upgrade Strategy](#upgrade-strategy)
  - [Additional Details](#additional-details)
<!-- /toc -->

## Glossary

Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html).

### lifecycle hook
A specific point in a machine's reconciliation lifecycle where execution of
normal machine-controller behavior is paused or modified.

### deletion phase
Describes when a machine has been marked for deletion but is still present
in the API. Various actions happen during this phase, such as draining a node,
deleting an instance from a cloud provider, and deleting a node object.

### Hook Implementing Controller (HIC)
The Hook Implementing Controller describes a controller, other than the
machine-controller, that adds, removes, and/or responds to a particular
lifecycle hook. Each lifecycle hook should have a single HIC, but an HIC
can optionally manage one or more hooks.

## Summary

Defines a set of annotations that can be applied to a machine which affect the
linear progress of a machine's lifecycle after a machine has been marked for
deletion. These annotations are optional and may be applied during machine
creation, sometime after machine creation by a user, or sometime after machine
creation by another controller or application.

## Motivation

Allow custom and 3rd party components to easily interact with a machine or
related resources while that machine's reconciliation is temporarily paused.
This pause in reconciliation will allow these custom components to take action
after a machine has been marked for deletion, but prior to the machine being
drained and/or its associated instance being terminated.

### Goals

- Define an initial set of hook points for the deletion phase.
- Define an initial set and form of related annotations.
- Define basic expectations for a controller or process that responds to a
  lifecycle hook.

### Non-Goals/Future Work

- Create an exhaustive list of hooks; we can add more over time.
- Create new machine phases.
- Create a mechanism to signal what lifecycle point a machine is at currently.
- Dictate implementation of controllers that respond to the hooks.
- Implement ordering in the machine-controller.
- Require anyone to use these hooks for normal machine operations; they are
  strictly optional and intended for custom integrations only.

## Proposal

- Utilize annotations to implement lifecycle hooks.
- Each lifecycle point can have 0 or more hooks.
- Hooks do not enforce ordering.
- Hooks found during machine reconciliation effectively pause reconciliation
  until all hooks for that lifecycle point are removed from a machine's
  annotations.

### User Stories

#### Story 1
(pre-terminate) As an operator, I would like to have the ability to perform
different actions between the time a machine is marked deleted in the API and
the time the machine is deleted from the cloud.

For example, when replacing a control plane machine, ensure a new control
plane machine has been successfully created and joined to the cluster before
removing the instance of the deleted machine. This might be useful in case
there are disruptions during replacement and we need the disk of the existing
instance to perform some disaster recovery operation. This will also prevent
prolonged periods of having one fewer control plane host in the event the
replacement instance does not come up in a timely manner.

#### Story 2
(pre-drain) As an operator, I want the ability to utilize my own draining
controller instead of the logic built into the machine-controller. This will
allow me better flexibility and control over the lifecycle of workloads on each
node.

### Implementation Details/Notes/Constraints

For each defined lifecycle point, one or more hooks may be applied as an
annotation to the machine object. These annotations will pause reconciliation
of a machine object until all hooks are resolved for that lifecycle point. The
hooks should be managed by a Hook Implementing Controller or other external
application, or manually created and removed by an administrator.

#### Lifecycle Points

##### pre-drain
`pre-drain.delete.hook.machine.cluster.x-k8s.io`

Hooks defined at this point will prevent the machine-controller from draining
a node after the machine object has been marked for deletion until the hooks
are removed.

##### pre-terminate
`pre-terminate.delete.hook.machine.cluster.x-k8s.io`

Hooks defined at this point will prevent the machine-controller from
removing/terminating the instance in the cloud provider until the hooks are
removed.

"pre-terminate" has been chosen over "pre-delete" because "terminate" is more
easily associated with an instance being removed from the cloud or
infrastructure, whereas "delete" is ambiguous as to the actual state of the
machine in its lifecycle.

#### Annotation Form

```
<lifecycle-point>.delete.hook.machine.cluster.x-k8s.io/<hook-name>: <owner/creator>
```

##### lifecycle-point
This is the point in a machine's reconciliation lifecycle at which the
annotation takes effect and pauses the machine-controller.

##### hook-name
Each hook should have a unique, descriptive name that conveys in 1-3 words the
intent or reason for the hook. Each hook name should be managed by a single
entity.

##### owner (Optional)
Some information about who created or is otherwise in charge of managing the
annotation. This might be a controller or a username to indicate an
administrator applied the hook directly.

##### Annotation Examples

These examples are all hypothetical to illustrate what form annotations should
take. The names of each hook and the respective controllers are fictional.

pre-drain.delete.hook.machine.cluster.x-k8s.io/migrate-important-app: my-app-migration-controller

pre-terminate.delete.hook.machine.cluster.x-k8s.io/backup-files: my-backup-controller

pre-terminate.delete.hook.machine.cluster.x-k8s.io/wait-for-storage-detach: my-custom-storage-detach-controller

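As an illustration only (not part of this proposal's API surface), the
annotation form above lends itself to simple prefix matching. The following Go
sketch shows one way a controller could list the hooks present for a lifecycle
point; the package, constant, and function names are hypothetical.

```go
package hooks

import "strings"

// Annotation key prefixes for the lifecycle points defined above.
const (
	PreDrainHookPrefix     = "pre-drain.delete.hook.machine.cluster.x-k8s.io/"
	PreTerminateHookPrefix = "pre-terminate.delete.hook.machine.cluster.x-k8s.io/"
)

// hookNames returns the hook names present in a machine's annotations for a
// given lifecycle-point prefix, e.g. "migrate-important-app".
func hookNames(annotations map[string]string, prefix string) []string {
	var names []string
	for key := range annotations {
		if strings.HasPrefix(key, prefix) {
			names = append(names, strings.TrimPrefix(key, prefix))
		}
	}
	return names
}

// hasHooks reports whether any hook exists for the given lifecycle point.
func hasHooks(annotations map[string]string, prefix string) bool {
	return len(hookNames(annotations, prefix)) > 0
}
```

Note that the annotation value (the owner) is not needed to detect that a hook
exists; only the key prefix and hook name matter.
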
#### Changes to machine-controller
The machine-controller should check for the existence of one or more hooks at
specific points (lifecycle-points) during reconciliation. If a hook matching
the lifecycle-point is discovered, the machine-controller should stop
reconciling the machine.

An example of where the pre-drain lifecycle-point might be implemented:
https://github.com/kubernetes-sigs/cluster-api/blob/30c377c0964efc789ab2f3f7361eb323003a7759/controllers/machine_controller.go#L270

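A minimal sketch of how such a pause could look in the delete flow, reusing the
hypothetical `hasHooks` helper and prefixes from the earlier sketch. The types
and function names below are simplified stand-ins, not the actual cluster-api
code.

```go
package hooks

import "context"

// Result mirrors the shape of a reconcile result; a hypothetical stand-in
// used to keep this sketch self-contained.
type Result struct{ Requeue bool }

// machine is a minimal stand-in for the Machine object; only its annotations
// matter for this sketch.
type machine struct {
	annotations map[string]string
}

// reconcileDelete sketches how the pre-drain and pre-terminate checks could
// short-circuit the delete flow. drain and terminate are placeholders for the
// machine-controller's existing behavior.
func reconcileDelete(ctx context.Context, m *machine,
	drain, terminate func(context.Context, *machine) error) (Result, error) {

	// Pause before draining: any pre-drain hook halts progress here. Removing
	// the hook updates the machine and triggers a fresh reconcile, so there is
	// no need to requeue or return an error.
	if hasHooks(m.annotations, PreDrainHookPrefix) {
		return Result{}, nil
	}
	if err := drain(ctx, m); err != nil {
		return Result{}, err
	}

	// Pause before terminating the instance: the same pattern applies for the
	// pre-terminate lifecycle point.
	if hasHooks(m.annotations, PreTerminateHookPrefix) {
		return Result{}, nil
	}
	return Result{}, terminate(ctx, m)
}
```
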
##### Reconciliation
When a Hook Implementing Controller updates the machine, reconciliation will be
triggered, and the machine will continue reconciling as normal, unless another
hook is still present; there is no need to 'fail' the reconciliation to
enforce requeuing.

When all hooks for a given lifecycle-point are removed, reconciliation
will continue as normal.

##### Hook failure
The machine-controller should not time out or otherwise consider the lifecycle
hook as 'failed.' Only the Hook Implementing Controller may decide to remove a
particular lifecycle hook to allow the machine-controller to progress past the
corresponding lifecycle-point.

##### Hook ordering
The machine-controller will not attempt to enforce any ordering of hooks. No
ordering should be expected from the machine-controller.

Hook Implementing Controllers may choose to provide a mechanism to allow
ordering amongst themselves via whatever means HICs determine. Examples could
be using CRDs external to the machine-api, gRPC communications, or
additional annotations on the machine or other objects, as sketched below.

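As one hypothetical example of annotation-based ordering, a downstream HIC
could wait for a "finished" marker left by an upstream HIC before starting its
own work. The annotation key below is illustrative only and is not defined by
this proposal.

```go
package hooks

// doneAnnotation is a hypothetical coordination key; it is not defined by
// this proposal. The upstream controller sets it once its hook action has
// completed.
const doneAnnotation = "example.my-org.io/migrate-important-app-done"

// readyToProceed reports whether a downstream HIC may begin its own hook
// action, based on the upstream controller's marker annotation.
func readyToProceed(annotations map[string]string) bool {
	_, done := annotations[doneAnnotation]
	return done
}
```
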
#### Hook Implementing Controller Design
A Hook Implementing Controller is the component that manages a particular
lifecycle hook.

##### Hook Implementing Controllers must
* Watch machine objects and determine when an appropriate action must be taken.
* After completing the desired hook action, remove the hook annotation (see
the sketch after these lists).

##### Hook Implementing Controllers may
* Watch machine objects and add a hook annotation as desired by the cluster
administrator.
* Coordinate with other Hook Implementing Controllers through any means
possible, such as using common annotations, CRDs, etc. For example, one hook
controller could set an annotation indicating it has finished its work, and
another hook controller could wait for the presence of the annotation before
proceeding.

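A minimal sketch of the removal requirement, reusing the hypothetical `machine`
stand-in from the earlier sketch: the controller performs its hook action and,
only on success, deletes its own hook annotation. The hook name is illustrative
only, and the caller is expected to persist the updated annotations back to the
API server (for example with a patch).

```go
package hooks

import "context"

// myHookKey is a hypothetical hook annotation owned by this controller.
const myHookKey = "pre-terminate.delete.hook.machine.cluster.x-k8s.io/backup-files"

// completeHook runs the controller's hook action and, only on success,
// removes the controller's own hook annotation so the machine-controller can
// proceed past the corresponding lifecycle-point.
func completeHook(ctx context.Context, m *machine,
	action func(context.Context, *machine) error) error {

	if _, present := m.annotations[myHookKey]; !present {
		return nil // hook not set or already removed; nothing to do
	}
	if err := action(ctx, m); err != nil {
		return err // keep the hook in place; the machine remains paused
	}
	delete(m.annotations, myHookKey)
	return nil
}
```
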
#### Determining when to take action

A Hook Implementing Controller should watch machines and determine when the
best time to take action is.

For example, if an HIC manages a lifecycle hook at the pre-drain lifecycle-point,
then that controller should take action immediately after a machine has a
DeletionTimestamp or enters the "Deleting" phase.

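A hypothetical predicate capturing that rule for a pre-drain HIC, again reusing
the `hasHooks` helper from the earlier sketch; the function name is illustrative
only.

```go
package hooks

import "time"

// shouldActOnPreDrain reports whether a pre-drain HIC should act: as soon as
// the machine has a deletion timestamp while a pre-drain hook is still present
// in the machine's annotations.
func shouldActOnPreDrain(deletionTimestamp *time.Time, annotations map[string]string) bool {
	markedForDeletion := deletionTimestamp != nil && !deletionTimestamp.IsZero()
	return markedForDeletion && hasHooks(annotations, PreDrainHookPrefix)
}
```
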
Fine-tuned coordination is not possible at this time; e.g., it is not
possible to execute a pre-terminate hook only after a node has been drained.
This is reserved for future work.

##### Failure Mode
It is entirely up to the Hook Implementing Controller to determine when it is
prudent to remove a particular lifecycle hook. Some controllers may want to
'give up' after a certain time period, and others may want to block indefinitely.
Cluster operators should consider the characteristics of each controller before
utilizing them in their clusters.

### Risks and Mitigations

* Annotation keys must conform to length limits: https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/#syntax-and-character-set
* Requires well-behaved controllers and admins to keep things running
smoothly. It would be easy to disrupt machines with poor configuration.
* Troubleshooting problems may increase in complexity, but this is
mitigated mostly by the fact that these hooks are opt-in. Operators
will or should know they are consuming these hooks, but a future proliferation
of the cluster-api could result in these components being bundled as a
complete solution that operators just consume. To this end, we should
update any troubleshooting guides to check these hook points where possible.

## Alternatives

### Custom Machine Controller
Require advanced users to fork and customize the machine-controller. This can
already be done if someone chooses, so it is not much of a solution.

### Finalizers
We could define additional finalizers, but a finalizer really only implies the
deletion lifecycle point. A misbehaving controller that accidentally removes
finalizers could have undesirable effects.

### Status Field
Harder for users to modify or set hooks during machine creation. How would a
user remove a hook if the controller that is supposed to remove it is
misbehaving? We'd probably need an annotation like 'skip-hook-xyz' or similar,
and that seems redundant to just using annotations in the first place.

### Spec Field
We probably don't want other controllers dynamically adding and removing spec
fields on an object. It's not very declarative to utilize spec fields in that
way.

### CRDs
It seems like we'd need to sync information to and from a CR. There are
different approaches to CRDs (1-to-1 mapping of machine to CR, match labels,
present/absent vs. status fields) that each have their own drawbacks and are
more complex to define and configure.

## Upgrade Strategy

Nothing defined here should directly impact upgrades, other than defining
hooks that impact the creation/deletion of a machine generally.

## Additional Details

Fine-tuned timing of hooks is not possible at this time.

In the future, it may be possible to implement this timing via additional
machine phases, possibly "sub-phases", or some other mechanism
that might be appropriate. As stated in the non-goals, that is
not in scope at this time and could be future work. This is currently
being discussed in [issue 3365].

<!-- Links -->
[community meeting]: https://docs.google.com/document/d/1Ys-DOR5UsgbMEeciuG0HOgDQc8kZsaWIWJeKJ1-UfbY
[issue 3365]: https://github.com/kubernetes-sigs/cluster-api/issues/3365