Commit 3bd6507

Add stable hostnames to Indexed Job
as part of Beta graduation.
1 parent ebce933 commit 3bd6507

1 file changed: +128 -28 lines changed

keps/sig-apps/2214-indexed-job/README.md

Lines changed: 128 additions & 28 deletions
@@ -9,6 +9,7 @@
99
- [Proposal](#proposal)
1010
- [User Stories (Optional)](#user-stories-optional)
1111
- [Story 1](#story-1)
12+
- [Story 2](#story-2)
1213
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
1314
- [Risks and Mitigations](#risks-and-mitigations)
1415
- [Design Details](#design-details)
@@ -58,19 +59,21 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
5859

5960
## Summary
6061

61-
This KEP extends kubernetes with user-friendly support for running
62-
embarrassingly parallel jobs.
62+
This KEP extends Kubernetes with user-friendly support for running parallel jobs.
6363

64-
Here, parallel means multiple pods. By embarrassingly parallel, it means that
65-
the pods have no dependencies between each other.
66-
In particular, neither ordering between pods nor gang scheduling are supported.
64+
Here, parallel means multiple pods per Job. Jobs can be:
65+
- Embarrassingly parallel, where the pods have no dependencies between each other.
66+
- Tightly coupled, where the Pods communicate among themselves to make progress
67+
[kubernetes/kubernetes#99497](https://github.com/kubernetes/kubernetes/issues/99497)
6768

6869
We propose the addition of completion indexes into the Pods of a *Job
6970
[with fixed completion count]* to support running embarrassingly parallel
70-
programs, with a focus on ease of use.
71+
programs, with a focus on ease of use for workload partitioning.
7172
We call this new Job pattern an *Indexed Job*, because each Pod of the Job
7273
specializes to work on a particular index, as if the Pods were elements of an
7374
array.
75+
With the addition of a headless Service, Pods can address another Pod with a
76+
specific index with a DNS lookup, because the index is part of the hostname.
7477

7578
[with fixed completion count]: https://kubernetes.io/docs/concepts/workloads/controllers/job/#parallel-jobs
7679

@@ -94,18 +97,45 @@ own APIs and controllers or adopt third party implementations. Each
9497
implementation splits the ecosystem, making it harder for higher level systems
9598
for Job queueing or workflows to support all of them.
9699

100+
Additionally, the Pods within a Job can't easily address and communicate with
101+
each other, making it hard to run tightly coupled parallel Jobs using the Job
102+
API.
103+
104+
Third-party operators cover these use cases by defining their own APIs, leading
105+
to fragmentation of the ecosystem. The operators use mainly two networking
106+
patterns: (1) fronting each index with a Service or (2) creating Pods with
107+
stable hostnames based on their index.
108+
109+
Using a Service per index has scalability problems. Besides the Service
110+
objects themselves, the control plane creates an Endpoints object for each one.
111+
112+
Creating Pods with stable hostnames mitigates this problem. The control plane
113+
requires only one Service and one Endpoint (or a few EndpointSlices) to inform
114+
the DNS programming. Pods can address each other with a DNS lookup and
115+
communicate directly using Pod IPs.
116+
117+
A popular operator chose to use a StatefulSet to handle Pod creation and
118+
management with these characteristics. Due to limitations, the operator now
119+
manages plain pods. These limitations of StatefulSet were:
120+
- Pods are created serially.
121+
- Pods can be replaced without leaving notice of failures.
122+
- Pods cannot run to completion (containers restart on success or failure).
123+
97124
[Job patterns]: https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-patterns
98125

99126
### Goals
100127

101128
- Support the *indexed Job* pattern by adding completion indexes to each Pod
102129
of a Job in *fixed completion count* mode.
130+
- Add stable hostnames to Pods based on the index to simplify communication
131+
among themselves.
103132

104133
### Non-Goals
105134

106135
- Support for work lists, where each Pod receives a different element of a
107136
static list. This can be implemented by users from completion indexes.
108137
- Support for completion index in non-parallel Jobs or Jobs with a work queue.
138+
- Network programming for indexed Jobs. This is left to headless Services.
109139
- All-or-nothing scheduling.
110140

111141
## Proposal
@@ -114,29 +144,62 @@ for Job queueing or workflows to support all of them.
114144

115145
#### Story 1
116146

117-
As a Job author, I can create an array Job where each Pod receives an ordered
147+
As a Job author, I can create an Indexed Job where each Pod receives an ordered
118148
completion index. I can use the index in my binary through an environment
119149
variable or a file to statically select the load the Pod should work on.
120150

121151
```yaml
122152
apiVersion: batch/v1
123153
kind: Job
124154
metadata:
125-
  name: parallel-work
155+
  name: my-job
126156
spec:
127157
  completions: 100
128158
  parallelism: 100
159+
  completionMode: Indexed
129160
  template:
130161
    spec:
131162
      containers:
132163
      - name: task
133164
        image: registry.example.com/processing-image
134-
        command: ["./process", "--index", "$INDEX"]
135-
        env:
136-
        - name: INDEX
137-
          valueFrom:
138-
            fieldRef:
139-
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
165+
        command: ["./process", "--index", "$(JOB_COMPLETION_INDEX)"]
166+
```
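One way a binary might consume the index for static work partitioning is sketched below. This is only an illustration under stated assumptions: the helper names `completionIndex` and `shardBounds` are hypothetical, and the even-split policy is one of many possible choices. Only the `JOB_COMPLETION_INDEX` environment variable comes from the proposal.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// completionIndex reads the index that the Job controller exposes through
// the JOB_COMPLETION_INDEX environment variable.
func completionIndex() (int, error) {
	raw, ok := os.LookupEnv("JOB_COMPLETION_INDEX")
	if !ok {
		return 0, fmt.Errorf("JOB_COMPLETION_INDEX is not set")
	}
	return strconv.Atoi(raw)
}

// shardBounds (hypothetical helper) returns the half-open range [start, end)
// of work items owned by index when totalItems are split as evenly as
// possible across completions Pods. The first (totalItems % completions)
// indexes receive one extra item.
func shardBounds(index, completions, totalItems int) (start, end int) {
	base := totalItems / completions
	extra := totalItems % completions
	offset := index
	if offset > extra {
		offset = extra
	}
	start = index*base + offset
	end = start + base
	if index < extra {
		end++
	}
	return start, end
}
```

A Pod would call `completionIndex()` at startup and then process only the items in the range returned by `shardBounds`, using the same `completions` value as the Job spec.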
167+
168+
#### Story 2
169+
170+
As a Job author, I can create an Indexed Job where pods can address each other
171+
by the hostname that can be built from the index.
172+
173+
```yaml
174+
apiVersion: batch/v1
175+
kind: Job
176+
metadata:
177+
  name: my-job
178+
spec:
179+
  completions: 100
180+
  parallelism: 100
181+
  completionMode: Indexed
182+
  template:
183+
    metadata:
184+
      labels:
185+
        job: my-job
186+
    spec:
187+
      subdomain: my-job-svc
188+
      containers:
189+
      - name: task
190+
        image: registry.example.com/processing-image
191+
        command: ["./process", "--index", "$(JOB_COMPLETION_INDEX)", "--hosts-pattern", "my-job-{{.id}}.my-job-svc"]
192+
```
193+
194+
```yaml
195+
apiVersion: v1
196+
kind: Service
197+
metadata:
198+
  name: my-job-svc
199+
spec:
200+
  clusterIP: None
201+
  selector:
202+
    job: my-job
140203
```
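As a rough illustration of how a workload could expand the hosts pattern from Story 2, the sketch below builds the peer names from the Job name, the headless Service name, and the completion count. The function names are hypothetical and not part of the proposal. The names are relative to the Pod's own namespace, matching the `my-job-{{.id}}.my-job-svc` pattern above.

```go
package main

import "fmt"

// peerHostname (hypothetical helper) returns the DNS name of the Pod with the
// given completion index, in the form <job-name>-<index>.<subdomain>.
func peerHostname(jobName, subdomain string, index int) string {
	return fmt.Sprintf("%s-%d.%s", jobName, index, subdomain)
}

// peerHostnames lists the names of all Pods of an Indexed Job with the given
// number of completions, in index order.
func peerHostnames(jobName, subdomain string, completions int) []string {
	hosts := make([]string, 0, completions)
	for i := 0; i < completions; i++ {
		hosts = append(hosts, peerHostname(jobName, subdomain, i))
	}
	return hosts
}
```

For the manifests above, `peerHostnames("my-job", "my-job-svc", 100)` yields `my-job-0.my-job-svc` through `my-job-99.my-job-svc`.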
141204
142205
### Notes/Constraints/Caveats (Optional)
@@ -148,14 +211,18 @@ because work lists can be implemented in a startup script using the completion
148211
index as building block.
149212
* The semantics of an indexed Job are similar to a StatefulSet, in the sense
150213
that Pods have an associated index.
151-
However, the APIs have a major difference: a StatefulSet doesn't have completion
152-
semantics, as opposed to Jobs.
214+
However, the APIs have major differences:
215+
- a StatefulSet doesn't have completion semantics, as opposed to Jobs.
216+
- a StatefulSet creates pods serially, whereas Job creates all Pods in
217+
parallel.
218+
- a StatefulSet gives Pods stable hostnames, a Job doesn't.
153219
154220
[indexed Job]: https://github.com/kubernetes/community/blob/b21d1b27c8c748bf81283c2d89cde2becb5f2709/contributors/design-proposals/apps/indexed-job.md
155221
156222
### Risks and Mitigations
157223
158224
- More than one pod per index
225+
159226
Jobs have a known issue in which more than one Pod can be started even if
160227
parallelism and completion are set to 1 ([reference]). In the case of indexed
161228
Jobs, this translates to more than one Pod having the same index.
@@ -172,6 +239,29 @@ semantics, as opposed to Jobs.
172239
Pods. The controller processes the remaining operations in subsequent syncs,
173240
which it schedules with no delay.
174241
242+
- Scalability and latency of DNS programming, if users choose to pair the
243+
Indexed Job with a headless service.
244+
245+
DNS programming requires the update of Endpoint or EndpointSlices by the
246+
control plane and updating DNS records by the DNS provider.
247+
This might not scale well for short-lived Jobs with a high degree of
248+
parallelism.
249+
250+
Thus, Pods need to be prepared to:
251+
- Retry lookups, when the control plane didn't have time to update the records.
252+
- Handle the IPs for a CNAME to change, in the case of a Pod failure.
253+
- Handle more than one IP for the CNAME. This might happen temporarily when
254+
the job controller creates more than one pod per index. The controller
255+
corrects this in the next sync, deleting the Pod that started last, which
256+
should correspond to the last IP added to the record.
257+
In short, Pods are ephemeral and resolutions might change, so users shouldn't
258+
rely on DNS caches.
259+
260+
However, DNS programming is opt-in (users need to create a matching
261+
headless Service). Moreover, workloads have other means of obtaining IPs,
262+
such as querying/watching the API server. Vendors can also choose to implement
263+
alternate DNS programming tailored for Jobs.
264+
175265
[reference]: https://kubernetes.io/docs/concepts/workloads/controllers/job/#handling-pod-and-container-failures
176266
177267
## Design Details
@@ -207,6 +297,8 @@ type JobSpec struct {
207297
// `Indexed` means that the Pods of a
208298
// Job get an associated completion index from 0 to (.spec.completions - 1),
209299
// available in the annotation batch.kubernetes.io/job-completion-index.
300+
// The Pod hostnames are set to $(job-name)-$(index) and the names to
301+
// $(job-name)-$(index)-$(random-suffix).
210302
// The Job is considered complete when there is one successfully completed Pod
211303
// for each index.
212304
// When value is `Indexed`, .spec.completions must be specified and
@@ -269,6 +361,16 @@ The Job controller doesn't add the environment variable if there is a name
269361
conflict with an existing environment variable. Users can specify other
270362
environment variables for the same annotation.
271363
364+
<<[UNRESOLVED this deviates from the rest of the controllers ]>>
365+
The Pod name takes the form `$(job-name)-$(index)-$(random-string)`,
366+
which can be used for quickly identifying Pods for a specific index when listing
367+
pods or looking at logs.
368+
<<[/UNRESOLVED]>>
369+
370+
The Pod hostname takes the form `$(job-name)-$(index)` which can be used to
371+
address the Pod from others, when the Job is used in combination with a headless
372+
Service.
373+
272374
### Job completion and restart policy
273375

274376
When dealing with Indexed Jobs, the Job controller keeps track of Pod
@@ -327,7 +429,7 @@ Reducing parallelism is unaffected by completion index.
327429

328430
Unit, integration and E2E tests cover the following Indexed Job mechanics:
329431

330-
- Creation with indexed Pod names and index annotations.
432+
- Creation with index annotations and indexed pod hostnames.
331433
- Scale up and down.
332434
- Pod failures.
333435

@@ -345,6 +447,7 @@ gate enabled and disabled.
345447
#### Alpha -> Beta Graduation
346448

347449
- Complete features:
450+
- Index as part of the pod name and hostname.
348451
- Indexed Jobs when tracking completion with finalizers.
349452
[kubernetes/enhancements#2307](https://github.com/kubernetes/enhancements/issues/2307).
350453

@@ -534,7 +637,8 @@ the existing API objects?**
534637
than 1MB.
535638

536639
- API type(s): Pod, only when created with the new completion mode.
537-
- Estimated increase in size: new annotation of about 50 bytes.
640+
- Estimated increase in size: new annotation of about 50 bytes and hostname
641+
which includes the index.
538642

539643
* **Will enabling / using this feature result in increasing time taken by any
540644
operations covered by [existing SLIs/SLOs]?**
@@ -606,11 +710,9 @@ _This section must be completed when targeting beta graduation to a release._
606710

607711
Completion indexes could also be part of the Pod name, leading to stable Pod
608712
names. This allows 2 things:
609-
- Uniqueness for each completion index, freeing applications from having to
610-
handle duplicated indexes.
611-
- Predictable hostnames, which benefits applications that need to communicate
612-
to Pods of a Job (or among Pods of the same Job) without having to do
613-
discovery.
713+
- Uniqueness for each completion index. This frees applications from having to
714+
handle duplicated indexes. When used along with a headless Service, there
715+
are fewer chances for a DNS record to refer to more than one Pod.
614716

615717
Stable pod names require the Job controller to remove failed Pods before
616718
creating a new one with the same index. This has some downsides:
@@ -620,11 +722,9 @@ _This section must be completed when targeting beta graduation to a release._
620722
the status of the Job, affecting retry backoffs and backoff limit. This
621723
needs to change before stable Pod names can be implemented
622724
[#28486](https://github.com/kubernetes/kubernetes/issues/28486).
623-
- Reduced availability of Job Pods per completion index. This happens when
624-
a Node becomes unavailable. The Job controller cannot remove such Pods.
625-
Either the kubelet in the Node recovers and marks the Pod as failed; or the
626-
kube-apiserver removes the Node and the garbage collector removes the orphan
627-
Pods.
725+
- Reduced availability of Job Pods per completion index as, in addition to
726+
the time necessary to create a new Pod, we need to account for the time of
727+
deleting the failed Pod.
628728

629729
However, stable Pod names can be offered later as a new value for
630730
`.spec.completionMode` for Jobs.
