 - [Proposal](#proposal)
   - [User Stories (Optional)](#user-stories-optional)
     - [Story 1](#story-1)
+    - [Story 2](#story-2)
   - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
   - [Risks and Mitigations](#risks-and-mitigations)
 - [Design Details](#design-details)
@@ -58,19 +59,21 @@ Items marked with (R) are required *prior to targeting to a milestone / release*

 ## Summary

-This KEP extends Kubernetes with user-friendly support for running
-embarrassingly parallel jobs.
+This KEP extends Kubernetes with user-friendly support for running parallel jobs.

-Here, parallel means multiple pods. By embarrassingly parallel, it means that
-the pods have no dependencies between each other.
-In particular, neither ordering between pods nor gang scheduling are supported.
+Here, parallel means multiple pods per Job. Jobs can be:
+- Embarrassingly parallel, where the pods have no dependencies between each other.
+- Tightly coupled, where the Pods communicate among themselves to make progress
+  ([kubernetes/kubernetes#99497](https://github.com/kubernetes/kubernetes/issues/99497)).

 We propose the addition of completion indexes into the Pods of a *Job
 [with fixed completion count]* to support running embarrassingly parallel
-programs, with a focus on ease of use.
+programs, with a focus on ease of use for workload partitioning.
 We call this new Job pattern an *Indexed Job*, because each Pod of the Job
 specializes to work on a particular index, as if the Pods were elements of an
 array.
+With the addition of a headless Service, Pods can address another Pod with a
+specific index with a DNS lookup, because the index is part of the hostname.

 [with fixed completion count]: https://kubernetes.io/docs/concepts/workloads/controllers/job/#parallel-jobs

@@ -94,18 +97,45 @@ own APIs and controllers or adopt third party implementations. Each
 implementation splits the ecosystem, making it harder for higher level systems
 for Job queueing or workflows to support all of them.

+Additionally, the Pods within a Job can't easily address and communicate with
+each other, making it hard to run tightly coupled parallel Jobs using the Job
+API.
+
+Third-party operators cover these use cases by defining their own APIs, leading
+to fragmentation of the ecosystem. The operators use mainly two networking
+patterns: (1) fronting each index with a Service or (2) creating Pods with
+stable hostnames based on their index.
+
+Using a Service per index has scalability problems. In addition to the Service
+objects themselves, the control plane creates an Endpoints object for each Service.
+
+Creating Pods with stable hostnames mitigates this problem. The control plane
+requires only one headless Service and one Endpoints object (or a few EndpointSlices) to
+inform the DNS programming. Pods can address each other with a DNS lookup and
+communicate directly using Pod IPs.
+
+A popular operator chose to use a StatefulSet to handle Pod creation and
+management with these characteristics. Due to limitations, the operator now
+manages plain pods. These limitations of StatefulSet were:
+- Pods are created serially.
+- Pods can be replaced without leaving notice of failures.
+- Pods cannot run to completion (containers restart on success or failure).
+
 [Job patterns]: https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-patterns

 ### Goals

 - Support the *indexed Job* pattern by adding completion indexes to each Pod
   of a Job in *fixed completion count* mode.
+- Add stable hostnames to Pods based on the index to simplify communication
+  among themselves.

 ### Non-Goals

 - Support for work lists, where each Pod receives a different element of a
   static list. This can be implemented by users from completion indexes.
 - Support for completion index in non-parallel Jobs or Jobs with a work queue.
+- Network programming for indexed Jobs. This is left to headless Services.
 - All-or-nothing scheduling.

 ## Proposal
@@ -114,29 +144,62 @@ for Job queueing or workflows to support all of them.

 #### Story 1

-As a Job author, I can create an array Job where each Pod receives an ordered
+As a Job author, I can create an Indexed Job where each Pod receives an ordered
 completion index. I can use the index in my binary through an environment
 variable or a file to statically select the load the Pod should work on.

 ```yaml
 apiVersion: batch/v1
 kind: Job
 metadata:
-  name: parallel-work
+  name: my-job
 spec:
   completions: 100
   parallelism: 100
+  completionMode: Indexed
   template:
     spec:
       containers:
       - name: task
         image: registry.example.com/processing-image
-        command: ["./process", "--index", "$INDEX"]
-        env:
-        - name: INDEX
-          valueFrom:
-            fieldRef:
-              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
+        command: ["./process", "--index", "$(JOB_COMPLETION_INDEX)"]
+```
+
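Story 1 describes a binary that uses the completion index to statically select its load. A minimal Go sketch of that idea follows, assuming the index arrives in the `JOB_COMPLETION_INDEX` environment variable. The `shardBounds` helper, the item count, and the hard-coded completion count are illustrative assumptions, not part of the KEP.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// minInt returns the smaller of two ints.
func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// shardBounds returns the half-open range [start, end) of work items owned by
// the given completion index when total items are spread as evenly as
// possible over completions Pods. Hypothetical helper for illustration.
func shardBounds(index, completions, total int) (start, end int) {
	base, extra := total/completions, total%completions
	start = index*base + minInt(index, extra)
	end = start + base
	if index < extra {
		end++ // the first `extra` indexes take one additional item
	}
	return start, end
}

func main() {
	// The Job controller sets JOB_COMPLETION_INDEX for Pods of Indexed Jobs.
	index, err := strconv.Atoi(os.Getenv("JOB_COMPLETION_INDEX"))
	if err != nil {
		// A real worker would likely exit non-zero here.
		fmt.Fprintln(os.Stderr, "JOB_COMPLETION_INDEX is missing or not a number:", err)
		return
	}
	// Assumed values: 100 completions (as in the Job above) and 1050 items.
	start, end := shardBounds(index, 100, 1050)
	fmt.Printf("index %d processes items [%d, %d)\n", index, start, end)
}
```

Because the range depends only on the index and two constants, every Pod can compute its share without coordinating with the others.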
+#### Story 2
+
+As a Job author, I can create an Indexed Job where pods can address each other
+by the hostname that can be built from the index.
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: my-job
+spec:
+  completions: 100
+  parallelism: 100
+  completionMode: Indexed
+  template:
+    metadata:
+      labels:
+        job: my-job
+    spec:
+      subdomain: my-job-svc
+      containers:
+      - name: task
+        image: registry.example.com/processing-image
+        command: ["./process", "--index", "$(JOB_COMPLETION_INDEX)", "--hosts-pattern", "my-job-{{.id}}.my-job-svc"]
+```
+
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: my-job-svc
+spec:
+  clusterIP: None
+  selector:
+    job: my-job
 ```
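To illustrate how a workload might expand the hosts pattern from Story 2, the Go sketch below builds peer names of the form `$(job-name)-$(index).$(subdomain)`. The function names are hypothetical, and the short names are assumed to be resolved from a Pod in the same namespace as the headless Service.

```go
package main

import "fmt"

// peerHostname builds the DNS name of the Pod that owns a completion index:
// the $(job-name)-$(index) hostname followed by the Pod subdomain, which is
// the name of the headless Service. Hypothetical helper for illustration.
func peerHostname(jobName string, index int, subdomain string) string {
	return fmt.Sprintf("%s-%d.%s", jobName, index, subdomain)
}

// peerHostnames lists the names for all indexes from 0 to completions-1.
func peerHostnames(jobName, subdomain string, completions int) []string {
	hosts := make([]string, 0, completions)
	for i := 0; i < completions; i++ {
		hosts = append(hosts, peerHostname(jobName, i, subdomain))
	}
	return hosts
}

func main() {
	// Values taken from the Job and Service manifests in Story 2.
	for _, h := range peerHostnames("my-job", "my-job-svc", 3) {
		fmt.Println(h)
	}
}
```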

 ### Notes/Constraints/Caveats (Optional)
@@ -148,14 +211,18 @@ because work lists can be implemented in a startup script using the completion
   index as building block.
 * The semantics of an indexed Job are similar to a StatefulSet, in the sense
   that Pods have an associated index.
-  However, the APIs have a major difference: a StatefulSet doesn't have completion
-  semantics, as opposed to Jobs.
+  However, the APIs have major differences:
+  - a StatefulSet doesn't have completion semantics, as opposed to Jobs.
+  - a StatefulSet creates pods serially, whereas Job creates all Pods in
+    parallel.
+  - a StatefulSet gives Pods stable hostnames, a Job doesn't.

 [indexed Job]: https://github.com/kubernetes/community/blob/b21d1b27c8c748bf81283c2d89cde2becb5f2709/contributors/design-proposals/apps/indexed-job.md

 ### Risks and Mitigations

 - More than one pod per index
+
   Jobs have a known issue in which more than one Pod can be started even if
   parallelism and completion are set to 1 ([reference]). In the case of indexed
   Jobs, this translates to more than one Pod having the same index.
@@ -172,6 +239,29 @@ semantics, as opposed to Jobs.
   Pods. The controller processes the remaining operations in subsequent syncs,
   which it schedules with no delay.

+- Scalability and latency of DNS programming, if users choose to pair the
+  Indexed Job with a headless Service.
+
+  DNS programming requires the update of Endpoints or EndpointSlices by the
+  control plane and updating DNS records by the DNS provider.
+  This might not scale well for short-lived Jobs with a high level of
+  parallelism.
+
+  Thus, Pods need to be prepared to:
+  - Retry lookups, when the control plane didn't have time to update the records.
+  - Handle the IPs for a hostname to change, in the case of a Pod failure.
+  - Handle more than one IP for the hostname. This might happen temporarily when
+    the job controller creates more than one pod per index. The controller
+    corrects this in the next sync, deleting the Pod that started last, which
+    should correspond to the last IP added to the record.
+  In short, Pods are ephemeral and resolutions might change, so users shouldn't
+  rely on DNS caches.
+
+  However, DNS programming is opt-in (users need to create a matching
+  headless Service). Moreover, workloads have other means of obtaining IPs,
+  such as querying/watching the API server. Vendors can also choose to implement
+  alternate DNS programming tailored for Jobs.
+
 [reference]: https://kubernetes.io/docs/concepts/workloads/controllers/job/#handling-pod-and-container-failures

 ## Design Details
@@ -207,6 +297,8 @@ type JobSpec struct {
 	// `Indexed` means that the Pods of a
 	// Job get an associated completion index from 0 to (.spec.completions - 1),
 	// available in the annotation batch.kubernetes.io/job-completion-index.
+	// The Pod hostnames are set to $(job-name)-$(index) and the names to
+	// $(job-name)-$(index)-$(random-suffix).
 	// The Job is considered complete when there is one successfully completed Pod
 	// for each index.
 	// When value is `Indexed`, .spec.completions must be specified and
@@ -269,6 +361,14 @@ The Job controller doesn't add the environment variable if there is a name
 conflict with an existing environment variable. Users can specify other
 environment variables for the same annotation.

+The Pod name takes the form `$(job-name)-$(index)-$(random-string)`,
+which can be used for quickly identifying Pods for a specific index when listing
+pods or looking at logs.
+
+The Pod hostname takes the form `$(job-name)-$(index)`, which can be used to
+address the Pod from other Pods when the Job is used in combination with a headless
+Service.
+
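Given the two name forms above, tooling that lists Pods or reads logs might recover the index from a name. The Go sketch below shows one way to do that. Both helpers are hypothetical, and they assume the random string contains no dash.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// indexFromHostname extracts the completion index from a hostname of the form
// $(job-name)-$(index). The Job name may contain dashes itself, so only the
// text after the last dash is parsed.
func indexFromHostname(hostname string) (int, error) {
	i := strings.LastIndex(hostname, "-")
	if i < 0 {
		return 0, fmt.Errorf("hostname %q has no index suffix", hostname)
	}
	idx, err := strconv.Atoi(hostname[i+1:])
	if err != nil {
		return 0, fmt.Errorf("hostname %q has a non-numeric index: %w", hostname, err)
	}
	return idx, nil
}

// indexFromPodName handles the $(job-name)-$(index)-$(random-string) form by
// dropping the random string and parsing the remainder as a hostname.
func indexFromPodName(podName string) (int, error) {
	i := strings.LastIndex(podName, "-")
	if i < 0 {
		return 0, fmt.Errorf("pod name %q has no random suffix", podName)
	}
	return indexFromHostname(podName[:i])
}

func main() {
	// The random string "x2k9q" is made up for this example.
	fmt.Println(indexFromHostname("my-job-42"))
	fmt.Println(indexFromPodName("my-job-42-x2k9q"))
}
```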
 ### Job completion and restart policy

 When dealing with Indexed Jobs, the Job controller keeps track of Pod
@@ -327,7 +427,7 @@ Reducing parallelism is unaffected by completion index.

 Unit, integration and E2E tests cover the following Indexed Job mechanics:

-- Creation with indexed Pod names and index annotations.
+- Creation with index annotations and indexed pod hostnames.
 - Scale up and down.
 - Pod failures.

@@ -345,6 +445,7 @@ gate enabled and disabled.
 #### Alpha -> Beta Graduation

 - Complete features:
+  - Index as part of the pod name and hostname.
   - Indexed Jobs when tracking completion with finalizers.
     [kubernetes/enhancements#2307](https://github.com/kubernetes/enhancements/issues/2307).

@@ -439,9 +540,10 @@ _This section must be completed when targeting beta graduation to a release._
 * **What specific metrics should inform a rollback?**

   - job_sync_duration_seconds shows significantly more latency for label
-    mode=Indexed Jobs than mode=NonIndexed.
-  - job_sync_total shows more errors for mode=Indexed than mode=NonIndexed.
-  - job_finished_total shows that Jobs with mode=Indexed don't finish.
+    completion_mode=Indexed Jobs than completion_mode=NonIndexed.
+  - job_sync_total shows more errors for completion_mode=Indexed than
+    completion_mode=NonIndexed.
+  - job_finished_total shows that Jobs with completion_mode=Indexed don't finish.

 * **Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?**

@@ -464,7 +566,7 @@ _This section must be completed when targeting beta graduation to a release._

 * **How can an operator determine if the feature is in use by workloads?**

-  - job_sync_total has values for the label mode=Indexed.
+  - job_sync_total has values for the label completion_mode=Indexed.

 * **What are the SLIs (Service Level Indicators) an operator can use to determine
   the health of the service?**
@@ -534,7 +636,8 @@ the existing API objects?**
   than 1MB.

   - API type(s): Pod, only when created with the new completion mode.
-  - Estimated increase in size: new annotation of about 50 bytes.
+  - Estimated increase in size: new annotation of about 50 bytes and hostname
+    which includes the index.

 * **Will enabling / using this feature result in increasing time taken by any
   operations covered by [existing SLIs/SLOs]?**
@@ -606,11 +709,9 @@ _This section must be completed when targeting beta graduation to a release._

 Completion indexes could also be part of the Pod name, leading to stable Pod
 names. This allows 2 things:
-- Uniqueness for each completion index, freeing applications from having to
-  handle duplicated indexes.
-- Predictable hostnames, which benefits applications that need to communicate
-  to Pods of a Job (or among Pods of the same Job) without having to do
-  discovery.
+- Uniqueness for each completion index. This frees applications from having to
+  handle duplicated indexes. When used along with a headless Service, a DNS
+  record is less likely to refer to more than one Pod.

 Stable pod names require the Job controller to remove failed Pods before
 creating a new one with the same index. This has some downsides:
@@ -620,11 +721,9 @@ _This section must be completed when targeting beta graduation to a release._
   the status of the Job, affecting retry backoffs and backoff limit. This
   needs to change before stable Pod names can be implemented
   [#28486](https://github.com/kubernetes/kubernetes/issues/28486).
-- Reduced availability of Job Pods per completion index. This happens when
-  a Node becomes unavailable. The Job controller cannot remove such Pods.
-  Either the kubelet in the Node recovers and marks the Pod as failed; or the
-  kube-apiserver removes the Node and the garbage collector removes the orphan
-  Pods.
+- Reduced availability of Job Pods per completion index as, in addition to
+  the time necessary to create a new Pod, we need to account for the time of
+  deleting the failed Pod.

 However, stable Pod names can be offered later as a new value for
 `.spec.completionMode` for Jobs.