Commit 9805a93
Merge pull request #2552 from janetkuo/job-ttl-cleanup
KEP for TTL-after-finished controller
2 parents 06e7e7b + 80e8c2b

File tree: 2 files changed (+297 -1 lines changed)

keps/NEXT_KEP_NUMBER (+1 -1)

@@ -1 +1 @@
-26
+27

New file (+296 lines):

---
kep-number: 26
title: TTL After Finished
authors:
  - "@janetkuo"
owning-sig: sig-apps
participating-sigs:
  - sig-api-machinery
reviewers:
  - "@enisoc"
  - "@tnozicka"
approvers:
  - "@kow3ns"
editor: TBD
creation-date: 2018-08-16
last-updated: 2018-08-16
status: provisional
see-also:
  - n/a
replaces:
  - n/a
superseded-by:
  - n/a
---

# TTL After Finished Controller

## Table of Contents

A table of contents is helpful for quickly jumping to sections of a KEP and for
highlighting any additional information provided beyond the standard KEP
template. [Tools for generating][] a table of contents from markdown are
available.

* [TTL After Finished Controller](#ttl-after-finished-controller)
  * [Table of Contents](#table-of-contents)
  * [Summary](#summary)
  * [Motivation](#motivation)
    * [Goals](#goals)
  * [Proposal](#proposal)
    * [Concrete Use Cases](#concrete-use-cases)
    * [Detailed Design](#detailed-design)
      * [Feature Gate](#feature-gate)
      * [API Object](#api-object)
        * [Validation](#validation)
    * [User Stories](#user-stories)
    * [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
      * [TTL Controller](#ttl-controller)
      * [Finished Jobs](#finished-jobs)
      * [Finished Pods](#finished-pods)
      * [Owner References](#owner-references)
    * [Risks and Mitigations](#risks-and-mitigations)
  * [Graduation Criteria](#graduation-criteria)
  * [Implementation History](#implementation-history)

[Tools for generating]: https://github.com/ekalinin/github-markdown-toc

## Summary

We propose a TTL mechanism to limit the lifetime of finished resource objects,
including Jobs and Pods, to make it easy for users to clean up old Jobs/Pods
after they finish. The TTL timer starts when the Job/Pod finishes, and the
finished Job/Pod will be cleaned up after the TTL expires.

## Motivation

In Kubernetes, finishable resources, such as Jobs and Pods, are often created
frequently and are short-lived. If a Job or Pod isn't controlled by a
higher-level resource (e.g. CronJob for Jobs or Job for Pods), or owned by some
other resource, it's difficult for users to clean them up automatically, and
those Jobs and Pods can accumulate and overload a Kubernetes cluster very
easily. Even if we can avoid the overload issue by implementing a cluster-wide
(global) resource quota, users won't be able to create new resources without
cleaning up old ones first. See [#64470][].

The design of this proposal can later be generalized to other finishable,
frequently-created, short-lived resources, such as completed Pods or finished
custom resources.

[#64470]: https://github.com/kubernetes/kubernetes/issues/64470

### Goals

Make it easy for users to specify a time-based cleanup mechanism for finished
resource objects.
* It's configurable at resource creation time and after the resource is
  created.

## Proposal

[K8s Proposal: TTL controller for finished Jobs and Pods][]

[K8s Proposal: TTL controller for finished Jobs and Pods]: https://docs.google.com/document/d/1U6h1DrRJNuQlL2_FYY_FdkQhgtTRn1kEylEOHRoESTc/edit

### Concrete Use Cases

* [Kubeflow][] needs to clean up old finished Jobs (K8s Jobs, TF Jobs, Argo
  workflows, etc.), see [#718][].

* [Prow][] needs to clean up old completed Pods and finished Jobs. This is
  currently implemented with the Prow sinker.

* [Apache Spark on Kubernetes][] needs proper cleanup of terminated Spark
  executor Pods.

* The Jenkins Kubernetes plugin creates slave Pods that execute builds. It
  needs a better way to clean up old completed Pods.

[Kubeflow]: https://github.com/kubeflow
[#718]: https://github.com/kubeflow/tf-operator/issues/718
[Prow]: https://github.com/kubernetes/test-infra/tree/master/prow
[Apache Spark on Kubernetes]: http://spark.apache.org/docs/latest/running-on-kubernetes.html

### Detailed Design

#### Feature Gate

This will be launched as an alpha feature first, with feature gate
`TTLAfterFinished`.

#### API Object

We will add the following API field to `JobSpec` (`Job`'s `.spec`).

```go
type JobSpec struct {
    // ttlSecondsAfterFinished limits the lifetime of a Job that has finished
    // execution (either Complete or Failed). If this field is set, once the Job
    // finishes, it will be deleted after ttlSecondsAfterFinished expires. When
    // the Job is being deleted, its lifecycle guarantees (e.g. finalizers) will
    // be honored. If this field is unset, ttlSecondsAfterFinished will not
    // expire. If this field is set to zero, ttlSecondsAfterFinished expires
    // immediately after the Job finishes.
    // This field is alpha-level and is only honored by servers that enable the
    // TTLAfterFinished feature.
    // +optional
    TTLSecondsAfterFinished *int32
}
```

This allows Jobs to be cleaned up after they finish and provides time for
asynchronous clients to observe Jobs' final states before they are deleted.

Similarly, we will add the following API field to `PodSpec` (`Pod`'s `.spec`).

```go
type PodSpec struct {
    // ttlSecondsAfterFinished limits the lifetime of a Pod that has finished
    // execution (either Succeeded or Failed). If this field is set, once the Pod
    // finishes, it will be deleted after ttlSecondsAfterFinished expires. When
    // the Pod is being deleted, its lifecycle guarantees (e.g. finalizers) will
    // be honored. If this field is unset, ttlSecondsAfterFinished will not
    // expire. If this field is set to zero, ttlSecondsAfterFinished expires
    // immediately after the Pod finishes.
    // This field is alpha-level and is only honored by servers that enable the
    // TTLAfterFinished feature.
    // +optional
    TTLSecondsAfterFinished *int32
}
```

##### Validation

Because the Job controller depends on Pods existing in order to work correctly,
Job validation will reject a Job whose pod template sets
`ttlSecondsAfterFinished`, to prevent users from breaking their Jobs. Users
should set TTL seconds on a Job, not on the Pods owned by a Job.

It is common for higher-level resources to call generic PodSpec validation;
therefore, in PodSpec validation, `ttlSecondsAfterFinished` is only allowed to
be set on a PodSpec with a `restartPolicy` that is either `OnFailure` or
`Never` (i.e. not `Always`).

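The sketch below illustrates these two rules. It is not the actual Kubernetes
validation code: the stub types, the `isJobPodTemplate` parameter, and the
function name are hypothetical.

```go
package validation

import "fmt"

// Stub types standing in for the real API types; only the fields used by
// the rules above are included.
type RestartPolicy string

const (
    RestartPolicyAlways    RestartPolicy = "Always"
    RestartPolicyOnFailure RestartPolicy = "OnFailure"
    RestartPolicyNever     RestartPolicy = "Never"
)

type PodSpec struct {
    RestartPolicy           RestartPolicy
    TTLSecondsAfterFinished *int32
}

// validateTTLSecondsAfterFinished applies the rules above: the field must not
// be set on a Job's pod template, and generic PodSpec validation only allows
// it when the Pod can actually finish (restartPolicy OnFailure or Never).
func validateTTLSecondsAfterFinished(spec *PodSpec, isJobPodTemplate bool) []error {
    var errs []error
    if spec.TTLSecondsAfterFinished == nil {
        return errs
    }
    if isJobPodTemplate {
        // TTL seconds belong on the Job itself, not on the Pods it owns.
        errs = append(errs, fmt.Errorf("ttlSecondsAfterFinished must not be set on a Job's pod template"))
    }
    if spec.RestartPolicy == RestartPolicyAlways {
        // A Pod that always restarts never reaches a finished state.
        errs = append(errs, fmt.Errorf("ttlSecondsAfterFinished requires restartPolicy OnFailure or Never"))
    }
    return errs
}
```
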
### User Stories

A user keeps creating Jobs in a small Kubernetes cluster with 4 nodes. The Jobs
accumulate over time; a year later, the cluster ends up with more than 100k old
Jobs. This causes etcd hiccups and high-latency etcd requests, and eventually
makes the cluster unavailable.

The problem could have been avoided easily with the TTL controller for Jobs.

The steps are as easy as:

1. When creating Jobs, the user sets Jobs' `.spec.ttlSecondsAfterFinished` to
   3600 (i.e. 1 hour), as in the sketch after this list.
1. The user deploys Jobs as usual.
1. After a Job finishes, the result is observed asynchronously within an hour
   and stored elsewhere.
1. The TTL collector cleans up Jobs 1 hour after they complete.

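For illustration, a minimal sketch of step 1 using the Go API types. The Job
name, image, and command are placeholders, and the cluster is assumed to have
the `TTLAfterFinished` feature gate enabled.

```go
package example

import (
    batchv1 "k8s.io/api/batch/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// exampleJob returns a Job that the TTL controller deletes one hour after it
// finishes.
func exampleJob() *batchv1.Job {
    ttl := int32(3600) // 1 hour, per the user story above
    return &batchv1.Job{
        ObjectMeta: metav1.ObjectMeta{Name: "example-job"}, // placeholder name
        Spec: batchv1.JobSpec{
            TTLSecondsAfterFinished: &ttl,
            Template: corev1.PodTemplateSpec{
                Spec: corev1.PodSpec{
                    RestartPolicy: corev1.RestartPolicyNever,
                    Containers: []corev1.Container{{
                        Name:    "main",    // placeholder
                        Image:   "busybox", // placeholder
                        Command: []string{"sh", "-c", "echo done"},
                    }},
                },
            },
        },
    }
}
```
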
### Implementation Details/Notes/Constraints

#### TTL Controller

We will add a TTL controller for finished Jobs and finished Pods. We considered
adding it in the Job controller, but decided not to, for the following reasons:

1. The Job controller should focus on managing Pods based on the Job's spec and
   pod template, not on cleaning up Jobs.
1. We also need the TTL controller to clean up finished Pods, and we are
   considering generalizing the TTL controller later for custom resources.

The TTL controller utilizes the informer framework, watches all Jobs and Pods,
and reads Jobs and Pods from a local cache.

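A sketch of that wiring, assuming client-go's shared informer factory; the
function and parameter names are hypothetical, and this is not the actual
controller code.

```go
package ttlafterfinished

import (
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
)

// newInformers shows how the controller could watch all Jobs and Pods via the
// shared informer framework. Reads then go through the informers' listers
// (local caches) instead of hitting the API server directly.
func newInformers(client kubernetes.Interface, enqueue func(obj interface{})) informers.SharedInformerFactory {
    factory := informers.NewSharedInformerFactory(client, 30*time.Second)

    handler := cache.ResourceEventHandlerFuncs{
        AddFunc:    enqueue,
        UpdateFunc: func(oldObj, newObj interface{}) { enqueue(newObj) },
    }
    factory.Batch().V1().Jobs().Informer().AddEventHandler(handler)
    factory.Core().V1().Pods().Informer().AddEventHandler(handler)
    return factory
}
```
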
#### Finished Jobs

When a Job is created or updated:

1. Check its `.status.conditions` to see if it has finished (`Complete` or
   `Failed`). If it hasn't finished, do nothing.
1. Otherwise, if the Job has finished, check if the Job's
   `.spec.ttlSecondsAfterFinished` field is set. Do nothing if the TTL field is
   not set.
1. Otherwise, if the TTL field is set, check if the TTL has expired, i.e.
   whether the time when the Job finished
   (`.status.conditions.lastTransitionTime`) + `.spec.ttlSecondsAfterFinished`
   <= now.
1. If the TTL hasn't expired, re-enqueue the Job after a computed delay, i.e.
   the time remaining until it expires:
   (`.status.conditions.lastTransitionTime` + `.spec.ttlSecondsAfterFinished` -
   now); see the sketch after this list.
1. If the TTL has expired, `GET` the Job from the API server to do final sanity
   checks before deleting it.
1. Check if the freshly fetched Job's TTL has expired. This field may be
   updated before the TTL controller observes the new value in its local cache.
   * If it hasn't expired, it is not safe to delete the Job. Re-enqueue the Job
     after the newly computed delay.
1. Delete the Job if it passes the sanity checks.

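A minimal sketch of the expiry check in steps 3 and 4; the helper name is
hypothetical.

```go
package ttlafterfinished

import "time"

// timeLeft computes how long until a finished Job's TTL expires, given the
// Job's finish time (.status.conditions.lastTransitionTime) and its
// .spec.ttlSecondsAfterFinished. A result <= 0 means the TTL has expired and
// the Job can be deleted; a positive result is the delay after which the
// controller re-enqueues the Job.
func timeLeft(finishedAt time.Time, ttlSeconds int32, now time.Time) time.Duration {
    expireAt := finishedAt.Add(time.Duration(ttlSeconds) * time.Second)
    return expireAt.Sub(now)
}
```
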
#### Finished Pods

When a Pod is created or updated:

1. Check its `.status.phase` to see if it has finished (`Succeeded` or
   `Failed`). If it hasn't finished, do nothing.
1. Otherwise, if the Pod has finished, check if the Pod's
   `.spec.ttlSecondsAfterFinished` field is set. Do nothing if the TTL field is
   not set.
1. Otherwise, if the TTL field is set, check if the TTL has expired, i.e.
   whether the time when the Pod finished (the max of all of its containers'
   termination times,
   `.status.containerStatuses[*].state.terminated.finishedAt`) +
   `.spec.ttlSecondsAfterFinished` <= now. A sketch of the finish-time
   computation follows this list.
1. If the TTL hasn't expired, re-enqueue the Pod after a computed delay, i.e.
   the time remaining until it expires:
   (the time when the Pod finished + `.spec.ttlSecondsAfterFinished` - now).
1. If the TTL has expired, `GET` the Pod from the API server to do final sanity
   checks before deleting it.
1. Check if the freshly fetched Pod's TTL has expired. This field may be
   updated before the TTL controller observes the new value in its local cache.
   * If it hasn't expired, it is not safe to delete the Pod. Re-enqueue the Pod
     after the newly computed delay.
1. Delete the Pod if it passes the sanity checks.

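A sketch of the finish-time computation in step 3, assuming the Pod's phase has
already been checked to be `Succeeded` or `Failed`; the helper name is
hypothetical.

```go
package ttlafterfinished

import (
    "time"

    corev1 "k8s.io/api/core/v1"
)

// podFinishTime returns when a finished Pod completed: the latest
// terminated.finishedAt across all of its containers. The same timeLeft
// helper shown for Jobs can then be applied to this value.
func podFinishTime(pod *corev1.Pod) time.Time {
    var finishedAt time.Time
    for _, cs := range pod.Status.ContainerStatuses {
        if t := cs.State.Terminated; t != nil && t.FinishedAt.Time.After(finishedAt) {
            finishedAt = t.FinishedAt.Time
        }
    }
    return finishedAt
}
```
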
#### Owner References

We have considered making the TTL controller leave a Job/Pod around even after
its TTL expires, if the Job/Pod has any owner specified in its
`.metadata.ownerReferences`.

We decided not to block deletion on owners, because the purpose of
`.metadata.ownerReferences` is cascading deletion, not keeping an owner's
dependents alive. If the Job is owned by a CronJob, the Job can be cleaned up
based on the CronJob's history limit (i.e. the number of dependent Jobs to
keep), or the CronJob can choose not to set a history limit but instead set the
TTL in its Job template, so that Jobs are cleaned up after the TTL expires
rather than based on the history limit capacity.

Therefore, a Job/Pod can be deleted after its TTL expires, even if it still has
owners.

Similarly, the TTL won't block deletion by the generic garbage collector. This
means that when a Job's or Pod's owners are gone, the generic garbage collector
will delete it, even if it hasn't finished or its TTL hasn't expired.

### Risks and Mitigations

Risks:
* Time skew may cause the TTL controller to clean up resource objects at the
  wrong time.

Mitigations:
* In Kubernetes, it's required to run NTP on all nodes ([#6159][]) to avoid
  time skew. We will also document this risk.

[#6159]: https://github.com/kubernetes/kubernetes/issues/6159#issuecomment-93844058

## Graduation Criteria

We want to implement this feature for Pods/Jobs first to gather feedback, and
then decide whether to generalize it to custom resources. This feature can be
promoted to beta after we finalize the decision on whether to generalize it,
and when it satisfies users' needs for cleaning up finished resource objects
without regressions.

This will be promoted to GA once it has spent a sufficient amount of time in
beta with no changes.

## Implementation History

TBD