---
kep-number: 26
title: TTL After Finished
authors:
  - "@janetkuo"
owning-sig: sig-apps
participating-sigs:
  - sig-api-machinery
reviewers:
  - "@enisoc"
  - "@tnozicka"
approvers:
  - "@kow3ns"
editor: TBD
creation-date: 2018-08-16
last-updated: 2018-08-16
status: provisional
see-also:
  - n/a
replaces:
  - n/a
superseded-by:
  - n/a
---

# TTL After Finished Controller

## Table of Contents

A table of contents is helpful for quickly jumping to sections of a KEP and
for highlighting any additional information provided beyond the standard KEP
template. [Tools for generating][] a table of contents from markdown are
available.

   * [TTL After Finished Controller](#ttl-after-finished-controller)
      * [Table of Contents](#table-of-contents)
      * [Summary](#summary)
      * [Motivation](#motivation)
         * [Goals](#goals)
      * [Proposal](#proposal)
         * [Concrete Use Cases](#concrete-use-cases)
         * [Detailed Design](#detailed-design)
            * [Feature Gate](#feature-gate)
            * [API Object](#api-object)
               * [Validation](#validation)
         * [User Stories](#user-stories)
         * [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
            * [TTL Controller](#ttl-controller)
            * [Finished Jobs](#finished-jobs)
            * [Finished Pods](#finished-pods)
            * [Owner References](#owner-references)
         * [Risks and Mitigations](#risks-and-mitigations)
      * [Graduation Criteria](#graduation-criteria)
      * [Implementation History](#implementation-history)

[Tools for generating]: https://github.com/ekalinin/github-markdown-toc

## Summary

We propose a TTL mechanism to limit the lifetime of finished resource objects,
including Jobs and Pods, to make it easy for users to clean up old Jobs/Pods
after they finish. The TTL timer starts when the Job/Pod finishes, and the
finished Job/Pod will be cleaned up after the TTL expires.

## Motivation

In Kubernetes, finishable resources, such as Jobs and Pods, are frequently
created and short-lived. If a Job or Pod isn't controlled by a higher-level
resource (e.g. a CronJob for Jobs or a Job for Pods), or owned by some other
resource, it's difficult for users to clean them up automatically, and those
Jobs and Pods can easily accumulate and overload a Kubernetes cluster. Even if
we can avoid the overload issue by implementing a cluster-wide (global)
resource quota, users won't be able to create new resources without cleaning
up old ones first. See [#64470][].

The design of this proposal can later be generalized to other finishable,
frequently-created, short-lived resources, such as completed Pods or finished
custom resources.

[#64470]: https://github.com/kubernetes/kubernetes/issues/64470

### Goals

Make it easy for users to specify a time-based clean-up mechanism for finished
resource objects:
* It's configurable at resource creation time and after the resource is
  created.

## Proposal

[K8s Proposal: TTL controller for finished Jobs and Pods][]

[K8s Proposal: TTL controller for finished Jobs and Pods]: https://docs.google.com/document/d/1U6h1DrRJNuQlL2_FYY_FdkQhgtTRn1kEylEOHRoESTc/edit

### Concrete Use Cases

* [Kubeflow][] needs to clean up old finished Jobs (K8s Jobs, TF Jobs, Argo
  workflows, etc.); see [#718][].

* [Prow][] needs to clean up old completed Pods and finished Jobs. This is
  currently implemented with Prow's sinker component.

* [Apache Spark on Kubernetes][] needs proper cleanup of terminated Spark
  executor Pods.

* The Jenkins Kubernetes plugin creates slave Pods that execute builds. It
  needs a better way to clean up old completed Pods.

[Kubeflow]: https://github.com/kubeflow
[#718]: https://github.com/kubeflow/tf-operator/issues/718
[Prow]: https://github.com/kubernetes/test-infra/tree/master/prow
[Apache Spark on Kubernetes]: http://spark.apache.org/docs/latest/running-on-kubernetes.html

### Detailed Design

#### Feature Gate

This will be launched as an alpha feature first, gated by the
`TTLAfterFinished` feature gate.

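For illustration only, here is a minimal sketch of how the gate might be
consumed, assuming a `TTLAfterFinished` key registered in
`k8s.io/kubernetes/pkg/features`; the `startController` hook is a placeholder,
not the final implementation:

```go
package main

import (
	utilfeature "k8s.io/apiserver/pkg/util/feature"

	"k8s.io/kubernetes/pkg/features"
)

// startTTLControllerIfEnabled starts the TTL-after-finished controller only
// when the alpha feature gate is turned on. startController stands in for
// the real controller start-up logic.
func startTTLControllerIfEnabled(startController func()) {
	if utilfeature.DefaultFeatureGate.Enabled(features.TTLAfterFinished) {
		startController()
	}
}
```
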
#### API Object

We will add the following API field to `JobSpec` (`Job`'s `.spec`):

```go
type JobSpec struct {
	// ttlSecondsAfterFinished limits the lifetime of a Job that has finished
	// execution (either Complete or Failed). If this field is set, once the Job
	// finishes, it will be deleted after ttlSecondsAfterFinished expires. When
	// the Job is being deleted, its lifecycle guarantees (e.g. finalizers) will
	// be honored. If this field is unset, ttlSecondsAfterFinished will not
	// expire. If this field is set to zero, ttlSecondsAfterFinished expires
	// immediately after the Job finishes.
	// This field is alpha-level and is only honored by servers that enable the
	// TTLAfterFinished feature.
	// +optional
	TTLSecondsAfterFinished *int32
}
```

This allows Jobs to be cleaned up after they finish and provides time for
asynchronous clients to observe Jobs' final states before they are deleted.

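As a usage sketch under the proposed API (the Job name, image, and the
`int32Ptr` helper are illustrative, not part of the proposal):

```go
package main

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

// newJobWithTTL returns a Job that the TTL controller would delete one hour
// after it finishes, whether it ends Complete or Failed.
func newJobWithTTL() *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "pi"},
		Spec: batchv1.JobSpec{
			// Clean this Job up 3600 seconds (1 hour) after it finishes.
			TTLSecondsAfterFinished: int32Ptr(3600),
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:    "pi",
						Image:   "perl",
						Command: []string{"perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"},
					}},
				},
			},
		},
	}
}
```
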
Similarly, we will add the following API field to `PodSpec` (`Pod`'s `.spec`):

```go
type PodSpec struct {
	// ttlSecondsAfterFinished limits the lifetime of a Pod that has finished
	// execution (either Succeeded or Failed). If this field is set, once the Pod
	// finishes, it will be deleted after ttlSecondsAfterFinished expires. When
	// the Pod is being deleted, its lifecycle guarantees (e.g. finalizers) will
	// be honored. If this field is unset, ttlSecondsAfterFinished will not
	// expire. If this field is set to zero, ttlSecondsAfterFinished expires
	// immediately after the Pod finishes.
	// This field is alpha-level and is only honored by servers that enable the
	// TTLAfterFinished feature.
	// +optional
	TTLSecondsAfterFinished *int32
}
```

##### Validation

Because the Job controller depends on its Pods existing in order to work
correctly, Job validation will forbid setting `ttlSecondsAfterFinished` in a
Job's pod template, to prevent users from breaking their Jobs. Users should
set TTL seconds on a Job, not on the Pods owned by a Job.

It is common for higher-level resources to call generic PodSpec validation;
therefore, in PodSpec validation, `ttlSecondsAfterFinished` is only allowed
on a PodSpec whose `restartPolicy` is either `OnFailure` or `Never` (i.e. not
`Always`). A sketch of both rules follows.

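This is a minimal sketch of the two rules, assuming the proposed
`TTLSecondsAfterFinished` field on the internal `PodSpec`; the function name
and the `isJobPodTemplate` flag are illustrative, not the actual validation
code:

```go
package validation

import (
	"k8s.io/apimachinery/pkg/util/validation/field"

	api "k8s.io/kubernetes/pkg/apis/core"
)

// validateTTLSecondsAfterFinished enforces the rules above: a Job's pod
// template must not set the TTL field, and a bare PodSpec may only set it
// when restartPolicy is OnFailure or Never.
func validateTTLSecondsAfterFinished(spec *api.PodSpec, isJobPodTemplate bool, fldPath *field.Path) field.ErrorList {
	allErrs := field.ErrorList{}
	if spec.TTLSecondsAfterFinished == nil {
		return allErrs
	}
	ttlPath := fldPath.Child("ttlSecondsAfterFinished")
	if isJobPodTemplate {
		allErrs = append(allErrs, field.Forbidden(ttlPath, "may not be set on a Job's pod template; set it on the Job instead"))
	}
	if spec.RestartPolicy == api.RestartPolicyAlways {
		allErrs = append(allErrs, field.Forbidden(ttlPath, "may only be set when restartPolicy is OnFailure or Never"))
	}
	return allErrs
}
```
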
### User Stories

A user kept creating Jobs in a small Kubernetes cluster with 4 nodes. The
Jobs accumulated over time, and a year later the cluster ended up with more
than 100k old Jobs. This caused etcd hiccups, high-latency etcd requests, and
eventually made the cluster unavailable.

The problem could have been avoided easily with the TTL controller for Jobs.

The steps are as easy as:

1. When creating Jobs, the user sets the Jobs'
   `.spec.ttlSecondsAfterFinished` to 3600 (i.e. 1 hour).
1. The user deploys Jobs as usual.
1. After a Job finishes, the result is observed asynchronously within an hour
   and stored elsewhere.
1. The TTL controller cleans up each Job 1 hour after it completes.

### Implementation Details/Notes/Constraints

#### TTL Controller

We will add a TTL controller for finished Jobs and finished Pods. We
considered adding it to the Job controller, but decided not to, for the
following reasons:

1. The Job controller should focus on managing Pods based on the Job's spec
   and pod template, not on cleaning up Jobs.
1. We also need the TTL controller to clean up finished Pods, and we are
   considering generalizing the TTL controller to custom resources later.

The TTL controller uses the informer framework, watches all Jobs and Pods,
and reads Jobs and Pods from a local cache, as in the sketch below.

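A minimal sketch of that wiring, assuming standard client-go shared informers;
the `Controller` type and its single queue are illustrative (a real
implementation would keep separate queues for Jobs and Pods, or include the
kind in the key), and the worker loop is elided:

```go
package ttlafterfinished

import (
	batchinformers "k8s.io/client-go/informers/batch/v1"
	coreinformers "k8s.io/client-go/informers/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// Controller watches Jobs and Pods and deletes them once their TTL expires.
type Controller struct {
	client kubernetes.Interface
	queue  workqueue.RateLimitingInterface
}

// New wires the controller into shared Job and Pod informers. Reads are
// served from the informers' local caches; only the final pre-delete sanity
// check hits the API server.
func New(client kubernetes.Interface, jobInformer batchinformers.JobInformer, podInformer coreinformers.PodInformer) *Controller {
	c := &Controller{
		client: client,
		queue:  workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter()),
	}
	handler := cache.ResourceEventHandlerFuncs{
		AddFunc:    c.enqueue,
		UpdateFunc: func(old, cur interface{}) { c.enqueue(cur) },
	}
	jobInformer.Informer().AddEventHandler(handler)
	podInformer.Informer().AddEventHandler(handler)
	return c
}

// enqueue adds the object's namespace/name key to the work queue.
func (c *Controller) enqueue(obj interface{}) {
	if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
		c.queue.Add(key)
	}
}
```
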
#### Finished Jobs

When a Job is created or updated:

1. Check its `.status.conditions` to see if it has finished (`Complete` or
   `Failed`). If it hasn't finished, do nothing.
1. Otherwise, if the Job has finished, check if the Job's
   `.spec.ttlSecondsAfterFinished` field is set. Do nothing if the TTL field
   is not set.
1. Otherwise, if the TTL field is set, check if the TTL has expired, i.e.
   whether the time when the Job finished
   (`.status.conditions.lastTransitionTime`) +
   `.spec.ttlSecondsAfterFinished` is earlier than now.
1. If the TTL hasn't expired, re-enqueue the Job to be processed again after
   the computed delay until it expires, i.e.
   (`.status.conditions.lastTransitionTime` +
   `.spec.ttlSecondsAfterFinished` - now).
1. If the TTL has expired, `GET` the Job from the API server to perform final
   sanity checks before deleting it.
1. Check if the freshly fetched Job's TTL has expired, since the field may
   have been updated before the TTL controller observed the new value in its
   local cache.
   * If it hasn't expired, it is not safe to delete the Job; re-enqueue the
     Job after the newly computed delay.
1. Delete the Job if it passes the sanity checks. (A sketch of the expiry
   computation follows this list.)

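The expiry check in steps 3–4 could look like the following sketch; the
helper names are illustrative, and a real controller would handle missing
conditions more defensively:

```go
package ttlafterfinished

import (
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// jobFinishTime returns the time the Job finished, taken from the
// lastTransitionTime of its Complete or Failed condition, and whether the
// Job has finished at all.
func jobFinishTime(job *batchv1.Job) (time.Time, bool) {
	for _, c := range job.Status.Conditions {
		if (c.Type == batchv1.JobComplete || c.Type == batchv1.JobFailed) && c.Status == corev1.ConditionTrue {
			return c.LastTransitionTime.Time, true
		}
	}
	return time.Time{}, false
}

// timeUntilJobExpiry returns how long until the Job's TTL expires (zero or
// negative means it has already expired), and false when the Job hasn't
// finished or has no TTL set.
func timeUntilJobExpiry(job *batchv1.Job, now time.Time) (time.Duration, bool) {
	ttl := job.Spec.TTLSecondsAfterFinished
	if ttl == nil {
		return 0, false
	}
	finishedAt, finished := jobFinishTime(job)
	if !finished {
		return 0, false
	}
	expireAt := finishedAt.Add(time.Duration(*ttl) * time.Second)
	return expireAt.Sub(now), true
}
```
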
#### Finished Pods

When a Pod is created or updated:

1. Check its `.status.phase` to see if it has finished (`Succeeded` or
   `Failed`). If it hasn't finished, do nothing.
1. Otherwise, if the Pod has finished, check if the Pod's
   `.spec.ttlSecondsAfterFinished` field is set. Do nothing if the TTL field
   is not set.
1. Otherwise, if the TTL field is set, check if the TTL has expired, i.e.
   whether the time when the Pod finished (the max of all of its containers'
   termination times,
   `.status.containerStatuses[*].state.terminated.finishedAt`) +
   `.spec.ttlSecondsAfterFinished` is earlier than now.
1. If the TTL hasn't expired, re-enqueue the Pod to be processed again after
   the computed delay until it expires, i.e. (the time when the Pod finished
   + `.spec.ttlSecondsAfterFinished` - now).
1. If the TTL has expired, `GET` the Pod from the API server to perform final
   sanity checks before deleting it.
1. Check if the freshly fetched Pod's TTL has expired, since the field may
   have been updated before the TTL controller observed the new value in its
   local cache.
   * If it hasn't expired, it is not safe to delete the Pod; re-enqueue the
     Pod after the newly computed delay.
1. Delete the Pod if it passes the sanity checks. (A sketch of the
   finish-time computation follows this list.)

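The Pod finish time in step 3 (the max of all containers' termination times)
could be computed as in this sketch; the helper name is illustrative:

```go
package ttlafterfinished

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// podFinishTime returns the time the Pod finished, i.e. the latest
// finishedAt across all of its terminated containers, and whether the Pod
// has reached a terminal phase at all.
func podFinishTime(pod *corev1.Pod) (time.Time, bool) {
	if pod.Status.Phase != corev1.PodSucceeded && pod.Status.Phase != corev1.PodFailed {
		return time.Time{}, false
	}
	var latest time.Time
	for _, cs := range pod.Status.ContainerStatuses {
		if t := cs.State.Terminated; t != nil && t.FinishedAt.Time.After(latest) {
			latest = t.FinishedAt.Time
		}
	}
	return latest, !latest.IsZero()
}
```
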
#### Owner References

We considered making the TTL controller leave a Job/Pod around even after its
TTL expires, if the Job/Pod has any owner specified in its
`.metadata.ownerReferences`.

We decided not to block deletion on owners, because the purpose of
`.metadata.ownerReferences` is cascading deletion, not keeping an owner's
dependents alive. If a Job is owned by a CronJob, the Job can be cleaned up
based on the CronJob's history limit (i.e. the number of dependent Jobs to
keep), or the CronJob can choose not to set a history limit and instead set
the TTL in its Job template, so that Jobs are cleaned up after the TTL
expires rather than by history-limit capacity.

Therefore, a Job/Pod can be deleted after its TTL expires, even if it still
has owners.

Similarly, the TTL won't block deletion by the generic garbage collector.
This means that when a Job's or Pod's owners are gone, the generic garbage
collector will delete it, even if it hasn't finished or its TTL hasn't
expired.

### Risks and Mitigations

Risks:
* Time skew may cause the TTL controller to clean up resource objects at the
  wrong time.

Mitigations:
* In Kubernetes, it's required to run NTP on all nodes ([#6159][]) to avoid
  time skew. We will also document this risk.

[#6159]: https://github.com/kubernetes/kubernetes/issues/6159#issuecomment-93844058

## Graduation Criteria

We want to implement this feature for Pods/Jobs first to gather feedback, and
then decide whether to generalize it to custom resources. This feature can be
promoted to beta after we finalize the decision on whether to generalize it,
and once it satisfies users' need for cleaning up finished resource objects
without regressions.

This will be promoted to GA once it has been in beta for a sufficient amount
of time with no changes.

## Implementation History

TBD