@@ -58,19 +58,20 @@ Items marked with (R) are required *prior to targeting to a milestone / release*

  ## Summary

- This KEP extends kubernetes with user-friendly support for running
- embarrassingly parallel jobs.
+ This KEP extends kubernetes with user-friendly support for running parallel jobs.

- Here, parallel means multiple pods. By embarrassingly parallel, it means that
- the pods have no dependencies between each other.
- In particular, neither ordering between pods nor gang scheduling are supported.
+ Here, parallel means multiple pods per Job. Jobs can be:
+ - Embarrassingly parallel, where the pods have no dependencies between each other.
+ - Tightly coupled, where the Pods communicate among themselves to make progress.

  We propose the addition of completion indexes into the Pods of a *Job
  [with fixed completion count]* to support running embarrassingly parallel
- programs, with a focus on ease of use.
+ programs, with a focus on ease of use for workload partitioning.
  We call this new Job pattern an *Indexed Job*, because each Pod of the Job
  specializes to work on a particular index, as if the Pods were elements of an
  array.
+ With the addition of a headless Service, Pods can address another Pod with a
+ specific index with a DNS lookup, because the index is part of the hostname.

  [with fixed completion count]: https://kubernetes.io/docs/concepts/workloads/controllers/job/#parallel-jobs

@@ -94,18 +95,46 @@ own APIs and controllers or adopt third party implementations. Each
  implementation splits the ecosystem, making it harder for higher level systems
  for Job queueing or workflows to support all of them.

+ Additionally, the Pods within a Job can't easily address and communicate with
+ each other, making it hard to run tightly coupled parallel Jobs using the Job
+ API.
+
+ Third-party operators cover these use cases by defining their own APIs, leading
+ to fragmentation of the ecosystem. The operators use mainly two networking
+ patterns: (1) fronting each index with a Service or (2) creating Pods with
+ stable hostnames based on their index.
+
+ The problem with using a Service per index is twofold:
+ - Extra latency: traffic has to go through iptables rules so that the Service VIP
+   gets replaced with the actual Pod IP, for every packet sent or received.
+ - Scale problems: each Service creates an associated Endpoints resource;
+   moreover, Services require programming on each node, hence generating a lot of
+   control traffic.
+
+ Creating Pods with stable hostnames doesn't have these problems. Pods can
+ address each other with a DNS lookup and communicate directly using Pod IPs.
+ A popular operator chose to use a StatefulSet to handle Pod creation and
+ management with these characteristics. Due to limitations, the operator now
+ manages plain pods. These limitations of StatefulSet were:
+ - Pods are created serially.
+ - Pods can be replaced without leaving notice of failures.
+ - Pods cannot run to completion (containers restart on success or failure).
+
  [Job patterns]: https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-patterns

  ### Goals

  - Support the *indexed Job* pattern by adding completion indexes to each Pod
    of a Job in *fixed completion count* mode.
+ - Add stable hostnames to Pods based on the index to simplify communication
+   among themselves.

  ### Non-Goals

  - Support for work lists, where each Pod receives a different element of a
    static list. This can be implemented by users from completion indexes.
  - Support for completion index in non-parallel Jobs or Jobs with a work queue.
+ - Network programming for indexed Jobs. This is left to headless Services.
  - All-or-nothing scheduling.

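
As a rough illustration of the first non-goal, a workload could implement a work list itself on top of the completion index. The sketch below assumes the index is available in the `JOB_COMPLETION_INDEX` environment variable; `WORK_ITEMS` and `select_work_item` are hypothetical names used only for this example.

```python
import os

# Hypothetical static work list; in practice it could be read from a file
# shipped in the image or mounted into the Pod.
WORK_ITEMS = ["alpha.csv", "beta.csv", "gamma.csv"]


def select_work_item(items, environ=os.environ):
    """Return the element of `items` assigned to this Pod's completion index."""
    index = int(environ["JOB_COMPLETION_INDEX"])
    if not 0 <= index < len(items):
        raise IndexError(f"completion index {index} has no matching work item")
    return items[index]
```

With this approach, `.spec.completions` would need to match the length of the list so that every element is covered by exactly one index.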
  ## Proposal
@@ -114,29 +143,62 @@ for Job queueing or workflows to support all of them.

  #### Story 1

- As a Job author, I can create an array Job where each Pod receives an ordered
+ As a Job author, I can create an Indexed Job where each Pod receives an ordered
  completion index. I can use the index in my binary through an environment
  variable or a file to statically select the load the Pod should work on.

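
One way a binary might statically select its load is to split a known number of items into contiguous ranges, one per index. This is only a sketch under assumptions: the index is read from the `JOB_COMPLETION_INDEX` environment variable, and `shard_bounds` and `shard_from_env` are hypothetical helpers, not part of the Job API.

```python
import os


def shard_bounds(total_items, completions, index):
    """Return the half-open range [start, end) of items owned by `index`.

    Items are split as evenly as possible; the first
    `total_items % completions` indexes receive one extra item.
    """
    if not 0 <= index < completions:
        raise ValueError("index must be in [0, completions)")
    base, extra = divmod(total_items, completions)
    start = index * base + min(index, extra)
    end = start + base + (1 if index < extra else 0)
    return start, end


def shard_from_env(total_items, completions, environ=os.environ):
    """Compute this Pod's range from its completion index."""
    index = int(environ["JOB_COMPLETION_INDEX"])
    return shard_bounds(total_items, completions, index)
```

Because the ranges depend only on the index, a replacement Pod for the same index picks up exactly the same slice of work.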
  ``` yaml
  apiVersion: batch/v1
  kind: Job
  metadata:
-   name: parallel-work
+   name: my-job
  spec:
    completions: 100
    parallelism: 100
+   completionMode: Indexed
    template:
      spec:
        containers:
        - name: task
          image: registry.example.com/processing-image
-         command: ["./process", "--index", "$INDEX"]
-         env:
-         - name: INDEX
-           valueFrom:
-             fieldRef:
-               fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
+         command: ["./process", "--index", "$JOB_COMPLETION_INDEX"]
+ ```
+
+ #### Story 2
+
+ As a Job author, I can create an Indexed Job where pods can address each other
+ by the hostname that can be built from the index.
+
+ ``` yaml
+ apiVersion: batch/v1
+ kind: Job
+ metadata:
+   name: my-job
+ spec:
+   completions: 100
+   parallelism: 100
+   completionMode: Indexed
+   template:
+     metadata:
+       labels:
+         job: my-job
+     spec:
+       subdomain: my-job-svc
+       containers:
+       - name: task
+         image: registry.example.com/processing-image
+         command: ["./process", "--index", "$JOB_COMPLETION_INDEX", "--hosts-pattern", "my-job-{{.id}}.my-job-svc"]
+ ```
+
+ ``` yaml
+ apiVersion: v1
+ kind: Service
+ metadata:
+   name: my-job-svc
+ spec:
+   clusterIP: None
+   selector:
+     job: my-job
  ```

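
To illustrate how a workload might expand such a hosts pattern, the following sketch builds the per-index names for the Job and Service of this story. It assumes names of the form `$(job-name)-$(index).$(subdomain)` resolve from Pods in the same namespace; `peer_hostnames` and `peers_of` are hypothetical helpers.

```python
def peer_hostnames(job_name, service_name, completions):
    """Build the per-index DNS names `$(job-name)-$(index).$(service)`."""
    return [f"{job_name}-{i}.{service_name}" for i in range(completions)]


def peers_of(index, job_name, service_name, completions):
    """Return every hostname except the one belonging to `index`."""
    hosts = peer_hostnames(job_name, service_name, completions)
    return hosts[:index] + hosts[index + 1:]
```

For example, `peers_of(0, "my-job", "my-job-svc", 100)` would list the 99 names from `my-job-1.my-job-svc` to `my-job-99.my-job-svc`.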
  ### Notes/Constraints/Caveats (Optional)
@@ -148,19 +210,35 @@ because work lists can be implemented in a startup script using the completion
    index as building block.
  * The semantics of an indexed Job are similar to a StatefulSet, in the sense
    that Pods have an associated index.
-   However, the APIs have a major difference: a StatefulSet doesn't have completion
-   semantics, as opposed to Jobs.
+   However, the APIs have major differences:
+   - a StatefulSet doesn't have completion semantics, as opposed to Jobs.
+   - a StatefulSet creates pods serially, whereas Job creates all Pods in
+     parallel.

  [indexed Job]: https://github.com/kubernetes/community/blob/b21d1b27c8c748bf81283c2d89cde2becb5f2709/contributors/design-proposals/apps/indexed-job.md

  ### Risks and Mitigations

- Jobs have a known issue in which more than one Pod can be started even if
- parallelism and completion are set to 1 ([reference]). In the case of indexed
- Jobs, this translates to more than one Pod having the same index.
+ - More than one pod created per index.
+
+   Jobs have a known issue in which more than one Pod can be started even if
+   parallelism and completion are set to 1 ([reference]). In the case of indexed
+   Jobs, this translates to more than one Pod having the same index.
+
+   Just like for existing Job patterns, workloads have to handle duplicates at the
+   application level.
+
+ - Scalability and latency of DNS programming.

- Just like for existing Job patterns, workloads have to handle duplicates at the
- application level.
+   DNS programming requires the update of EndpointSlices and writing DNS records.
+   This might not scale well for short-lived Jobs with high parallelism.
+   Moreover, Pods need to be prepared to retry lookups in the case where the
+   records didn't have time to update.
+
+   However, network programming is opt-in (users need to create a matching
+   headless Service). Moreover, workloads have other means of obtaining IPs,
+   such as querying/watching the API server. Vendors can also choose to implement
+   alternate DNS programming tailored for Jobs.

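
A workload could prepare for DNS records that are not yet published with a small retry loop. The sketch below is one possible approach, not a prescribed mechanism: `resolve_with_retry`, its defaults, and the exponential backoff are assumptions made for illustration. The resolver and sleep function are injectable so the logic can be exercised without a cluster.

```python
import socket
import time


def resolve_with_retry(hostname, attempts=5, initial_delay=0.5,
                       resolver=socket.gethostbyname, sleep=time.sleep):
    """Resolve `hostname`, retrying with exponential backoff on failure.

    Re-raises the last lookup error once `attempts` lookups have failed.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return resolver(hostname)
        except OSError:  # socket.gaierror is a subclass of OSError
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= 2
```

A Pod might call `resolve_with_retry("my-job-3.my-job-svc")` before opening a connection to the peer with index 3.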
  [reference]: https://kubernetes.io/docs/concepts/workloads/controllers/job/#handling-pod-and-container-failures

@@ -197,6 +275,8 @@ type JobSpec struct {
  // `Indexed` means that the Pods of a
  // Job get an associated completion index from 0 to (.spec.completions - 1),
  // available in the annotation batch.kubernetes.io/job-completion-index.
+ // The Pod hostnames are set to $(job-name)-$(index) and the names to
+ // $(job-name)-$(index)-$(random-suffix).
  // The Job is considered complete when there is one successfully completed Pod
  // for each index.
  // When value is `Indexed`, .spec.completions must be specified and
@@ -259,6 +339,13 @@ The Job controller doesn't add the environment variable if there is a name
  conflict with an existing environment variable. Users can specify other
  environment variables for the same annotation.

+ The Pod name takes the form `$(job-name)-$(index)-$(random-string)`,
+ which can be used for quickly identifying Pods for a specific index.
+
+ The Pod hostname takes the form `$(job-name)-$(index)` which can be used to
+ address the Pod from others, when the Job is used in combination with a headless
+ Service.
+
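
As an illustration of these naming forms, tooling could recover the index from a hostname or Pod name by splitting on the trailing hyphen-separated fields. This is a minimal sketch with hypothetical helper names; it assumes the random string itself contains no hyphen.

```python
def index_from_hostname(hostname):
    """Split `$(job-name)-$(index)` into (job_name, index)."""
    job_name, sep, index = hostname.rpartition("-")
    if not sep or not job_name or not (index.isascii() and index.isdigit()):
        raise ValueError(f"not an Indexed Job hostname: {hostname!r}")
    return job_name, int(index)


def index_from_pod_name(pod_name):
    """Split `$(job-name)-$(index)-$(random-string)` into (job_name, index)."""
    prefix, sep, suffix = pod_name.rpartition("-")
    if not sep or not suffix:
        raise ValueError(f"not an Indexed Job Pod name: {pod_name!r}")
    return index_from_hostname(prefix)
```

Splitting from the right keeps the parsing correct when the Job name itself contains hyphens, as in `my-job`.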
  ### Job completion and restart policy

  When dealing with Indexed Jobs, the Job controller keeps track of Pod
@@ -317,7 +404,7 @@ Reducing parallelism is unaffected by completion index.

  Unit, integration and E2E tests cover the following Indexed Job mechanics:

- - Creation with indexed Pod names and index annotations.
+ - Creation with index annotations and indexed pod hostnames.
  - Scale up and down.
  - Pod failures.

@@ -335,6 +422,7 @@ gate enabled and disabled.
  #### Alpha -> Beta Graduation

  - Complete features:
+   - Index as part of the pod name and hostname.
    - Indexed Jobs when tracking completion with finalizers.
      [kubernetes/enhancements#2307](https://github.com/kubernetes/enhancements/issues/2307).

@@ -517,7 +605,8 @@ the existing API objects?**
    than 1MB.

    - API type(s): Pod, only when created with the new completion mode.
-   - Estimated increase in size: new annotation of about 50 bytes.
+   - Estimated increase in size: new annotation of about 50 bytes and hostname
+     which includes the index.

  * **Will enabling / using this feature result in increasing time taken by any
    operations covered by [existing SLIs/SLOs]?**