Commit 190f20b

Add stable hostnames to Indexed Job
as part of Beta graduation. Signed-off-by: Aldo Culquicondor <[email protected]>
1 parent 1f145b5 commit 190f20b

1 file changed

+112
-23
lines changed

keps/sig-apps/2214-indexed-job/README.md

Lines changed: 112 additions & 23 deletions
@@ -58,19 +58,20 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
5858

5959
## Summary
6060

61-
This KEP extends kubernetes with user-friendly support for running
62-
embarrassingly parallel jobs.
61+
This KEP extends Kubernetes with user-friendly support for running parallel jobs.
6362

64-
Here, parallel means multiple pods. By embarrassingly parallel, it means that
65-
the pods have no dependencies between each other.
66-
In particular, neither ordering between pods nor gang scheduling are supported.
63+
Here, parallel means multiple pods per Job. Jobs can be:
64+
- Embarrassingly parallel, where the pods have no dependencies between each other.
65+
- Tightly coupled, where the Pods communicate among themselves to make progress.
6766

6867
We propose the addition of completion indexes into the Pods of a *Job
6968
[with fixed completion count]* to support running embarrassingly parallel
70-
programs, with a focus on ease of use.
69+
programs, with a focus on ease of use for workload partitioning.
7170
We call this new Job pattern an *Indexed Job*, because each Pod of the Job
7271
specializes to work on a particular index, as if the Pods were elements of an
7372
array.
73+
With the addition of a headless Service, Pods can address another Pod with a
74+
specific index with a DNS lookup, because the index is part of the hostname.
7475

7576
[with fixed completion count]: https://kubernetes.io/docs/concepts/workloads/controllers/job/#parallel-jobs
7677

@@ -94,18 +95,46 @@ own APIs and controllers or adopt third party implementations. Each
9495
implementation splits the ecosystem, making it harder for higher level systems
9596
for Job queueing or workflows to support all of them.
9697

98+
Additionally, the Pods within a Job can't easily address and communicate with
99+
each other, making it hard to run tightly coupled parallel Jobs using the Job
100+
API.
101+
102+
Third-party operators cover these use cases by defining their own APIs, leading
103+
to fragmentation of the ecosystem. The operators use mainly two networking
104+
patterns: (1) fronting each index with a Service or (2) creating Pods with
105+
stable hostnames based on their index.
106+
107+
The problem with using a Service per index is twofold:
108+
- Extra latency: traffic has to go through iptables rules so that the Service VIP
109+
gets replaced with the actual Pod IP for every packet sent or received.
110+
- Scale problems: each Service creates an associated Endpoints resource.
111+
Moreover, Services require programming on each node, which generates a lot of
112+
control traffic.
113+
114+
Creating Pods with stable hostnames doesn't have these problems. Pods can
115+
address each other with a DNS lookup and communicate directly using Pod IPs.
116+
A popular operator chose to use a StatefulSet to handle Pod creation and
117+
management with these characteristics. Due to limitations, the operator now
118+
manages plain pods. These limitations of StatefulSet were:
119+
- Pods are created serially.
120+
- Pods can be replaced without leaving notice of failures.
121+
- Pods cannot run to completion (containers restart on success or failure).
122+
97123
[Job patterns]: https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-patterns
98124

99125
### Goals
100126

101127
- Support the *indexed Job* pattern by adding completion indexes to each Pod
102128
of a Job in *fixed completion count* mode.
129+
- Add stable hostnames to Pods based on the index to simplify communication
130+
among themselves.
103131

104132
### Non-Goals
105133

106134
- Support for work lists, where each Pod receives a different element of a
107135
static list. This can be implemented by users from completion indexes.
108136
- Support for completion index in non-parallel Jobs or Jobs with a work queue.
137+
- Network programming for indexed Jobs. This is left to headless Services.
109138
- All-or-nothing scheduling.
110139

111140
## Proposal
@@ -114,29 +143,62 @@ for Job queueing or workflows to support all of them.
114143

115144
#### Story 1
116145

117-
As a Job author, I can create an array Job where each Pod receives an ordered
146+
As a Job author, I can create an Indexed Job where each Pod receives an ordered
118147
completion index. I can use the index in my binary through an environment
119148
variable or a file to statically select the load the Pod should work on.
120149

121150
```yaml
122151
apiVersion: batch/v1
123152
kind: Job
124153
metadata:
125-
  name: parallel-work
154+
  name: my-job
126155
spec:
127156
  completions: 100
128157
  parallelism: 100
158+
  completionMode: Indexed
129159
  template:
130160
    spec:
131161
      containers:
132162
      - name: task
133163
        image: registry.example.com/processing-image
134-
        command: ["./process", "--index", "$INDEX"]
135-
        env:
136-
        - name: INDEX
137-
          valueFrom:
138-
            fieldRef:
139-
              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
164+
        command: ["./process", "--index", "$(JOB_COMPLETION_INDEX)"]
165+
```
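A minimal sketch of the static partitioning that Story 1 describes, written in Python for illustration. It assumes the binary reads the `JOB_COMPLETION_INDEX` environment variable directly. The strided split and the function names `shard_for_index` and `index_from_env` are hypothetical and not part of the KEP.

```python
import os


def shard_for_index(items, index, completions):
    """Return the items that the Pod with this completion index should handle."""
    if not 0 <= index < completions:
        raise ValueError(f"index {index} is outside [0, {completions})")
    # Strided partitioning (an assumption of this sketch): index i takes
    # items i, i + completions, i + 2 * completions, ...
    return items[index::completions]


def index_from_env(environ=os.environ):
    """Read the completion index exposed to the container as an environment variable."""
    return int(environ["JOB_COMPLETION_INDEX"])
```

With `completions: 100`, the Pod with index 3 would handle items 3, 103, 203, and so on. Any deterministic split works, as long as every index maps to a disjoint share of the work.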
166+
167+
#### Story 2
168+
169+
As a Job author, I can create an Indexed Job where pods can address each other
170+
by the hostname that can be built from the index.
171+
172+
```yaml
173+
apiVersion: batch/v1
174+
kind: Job
175+
metadata:
176+
  name: my-job
177+
spec:
178+
  completions: 100
179+
  parallelism: 100
180+
  completionMode: Indexed
181+
  template:
182+
    metadata:
183+
      labels:
184+
        job: my-job
185+
    spec:
186+
      subdomain: my-job-svc
187+
      containers:
188+
      - name: task
189+
        image: registry.example.com/processing-image
190+
        command: ["./process", "--index", "$(JOB_COMPLETION_INDEX)", "--hosts-pattern", "my-job-{{.id}}.my-job-svc"]
191+
```
192+
193+
```yaml
194+
apiVersion: v1
195+
kind: Service
196+
metadata:
196+
  name: my-job-svc
197+
spec:
198+
  clusterIP: None
199+
  selector:
200+
    job: my-job
140202
```
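A sketch of how a worker could expand the host pattern from Story 2 into its list of peers. It assumes the hostname form `$(job-name)-$(index)` described in this KEP, and that the short name `<hostname>.<subdomain>` resolves from Pods in the same namespace. `peer_hostnames` and `other_peers` are hypothetical helpers.

```python
def peer_hostnames(job_name, subdomain, completions):
    """List the per-index hostnames of an Indexed Job behind a headless Service."""
    return [f"{job_name}-{index}.{subdomain}" for index in range(completions)]


def other_peers(job_name, subdomain, completions, own_index):
    """Same list, minus the entry for the calling Pod's own index."""
    hosts = peer_hostnames(job_name, subdomain, completions)
    return hosts[:own_index] + hosts[own_index + 1:]
```

For the manifests above, `peer_hostnames("my-job", "my-job-svc", 100)` would start with `my-job-0.my-job-svc` and end with `my-job-99.my-job-svc`.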
141203
142204
### Notes/Constraints/Caveats (Optional)
@@ -148,19 +210,35 @@ because work lists can be implemented in a startup script using the completion
148210
index as building block.
149211
* The semantics of an indexed Job are similar to a StatefulSet, in the sense
150212
that Pods have an associated index.
151-
However, the APIs have a major difference: a StatefulSet doesn't have completion
152-
semantics, as opposed to Jobs.
213+
However, the APIs have major differences:
214+
- a StatefulSet doesn't have completion semantics, as opposed to Jobs.
215+
- a StatefulSet creates pods serially, whereas Job creates all Pods in
216+
parallel.
153217
154218
[indexed Job]: https://github.com/kubernetes/community/blob/b21d1b27c8c748bf81283c2d89cde2becb5f2709/contributors/design-proposals/apps/indexed-job.md
155219
156220
### Risks and Mitigations
157221
158-
Jobs have a known issue in which more than one Pod can be started even if
159-
parallelism and completion are set to 1 ([reference]). In the case of indexed
160-
Jobs, this translates to more than one Pod having the same index.
222+
- More than one pod created per index.
223+
224+
Jobs have a known issue in which more than one Pod can be started even if
225+
parallelism and completion are set to 1 ([reference]). In the case of indexed
226+
Jobs, this translates to more than one Pod having the same index.
227+
228+
Just like for existing Job patterns, workloads have to handle duplicates at the
229+
application level.
230+
231+
- Scalability and latency of DNS programming.
161232
162-
Just like for existing Job patterns, workloads have to handle duplicates at the
163-
application level.
233+
DNS programming requires the update of EndpointSlices and writing DNS records.
234+
This might not scale well for short-lived Jobs with a high degree of parallelism.
235+
Moreover, Pods need to be prepared to retry lookups in case the
236+
records have not had time to update.
237+
238+
However, network programming is opt-in (users need to create a matching
239+
headless Service). Moreover, workloads have other means of obtaining IPs,
240+
such as querying/watching the API server. Vendors can also choose to implement
241+
alternate DNS programming tailored for Jobs.
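Because DNS records may lag behind Pod creation, a workload can wrap its lookups in a retry loop. The Python sketch below shows one way to do this with exponential backoff. The attempt count, the delays, and the name `resolve_with_retry` are illustrative assumptions, not values recommended by the KEP.

```python
import socket
import time


def resolve_with_retry(hostname, attempts=5, initial_delay=0.5,
                       resolver=socket.gethostbyname, sleep=time.sleep):
    """Resolve `hostname`, retrying with exponential backoff while lookups fail."""
    delay = initial_delay
    for attempt in range(attempts):
        try:
            return resolver(hostname)
        except socket.gaierror:
            if attempt == attempts - 1:
                raise  # the record never appeared; let the caller decide
            sleep(delay)
            delay *= 2  # back off so many Pods do not hammer DNS in lockstep
```

The `resolver` and `sleep` parameters exist only so the loop can be exercised without a real DNS server. A production workload might add jitter or an overall deadline.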
164242
165243
[reference]: https://kubernetes.io/docs/concepts/workloads/controllers/job/#handling-pod-and-container-failures
166244
@@ -197,6 +275,8 @@ type JobSpec struct {
197275
// `Indexed` means that the Pods of a
198276
// Job get an associated completion index from 0 to (.spec.completions - 1),
199277
// available in the annotation batch.kubernetes.io/job-completion-index.
278+
// The Pod hostnames are set to $(job-name)-$(index) and the names to
279+
// $(job-name)-$(index)-$(random-suffix).
200280
// The Job is considered complete when there is one successfully completed Pod
201281
// for each index.
202282
// When value is `Indexed`, .spec.completions must be specified and
@@ -259,6 +339,13 @@ The Job controller doesn't add the environment variable if there is a name
259339
conflict with an existing environment variable. Users can specify other
260340
environment variables for the same annotation.
261341
342+
The Pod name takes the form `$(job-name)-$(index)-$(random-string)`,
343+
which can be used for quickly identifying Pods for a specific index.
344+
345+
The Pod hostname takes the form `$(job-name)-$(index)`, which other Pods can use to
346+
address the Pod when the Job is used in combination with a headless
347+
Service.
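As an illustration of the naming scheme, the Python sketch below recovers the Job name and index from a Pod name. It assumes the random string contains no hyphen, which the KEP does not state. `split_pod_name` is a hypothetical helper.

```python
def split_pod_name(pod_name):
    """Split `$(job-name)-$(index)-$(random-string)` into (job name, index)."""
    # Split from the right because the Job name itself may contain hyphens.
    # Assumption of this sketch: the random string has no hyphen in it.
    job_name, index, _random = pod_name.rsplit("-", 2)
    return job_name, int(index)
```

The Pod hostname is then the Job name and the index joined by a hyphen, for example `my-job-42`.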
348+
262349
### Job completion and restart policy
263350

264351
When dealing with Indexed Jobs, the Job controller keeps track of Pod
@@ -317,7 +404,7 @@ Reducing parallelism is unaffected by completion index.
317404

318405
Unit, integration and E2E tests cover the following Indexed Job mechanics:
319406

320-
- Creation with indexed Pod names and index annotations.
407+
- Creation with index annotations and indexed pod hostnames.
321408
- Scale up and down.
322409
- Pod failures.
323410

@@ -335,6 +422,7 @@ gate enabled and disabled.
335422
#### Alpha -> Beta Graduation
336423

337424
- Complete features:
425+
- Index as part of the pod name and hostname.
338426
- Indexed Jobs when tracking completion with finalizers.
339427
[kubernetes/enhancements#2307](https://github.com/kubernetes/enhancements/issues/2307).
340428

@@ -517,7 +605,8 @@ the existing API objects?**
517605
than 1MB.
518606

519607
- API type(s): Pod, only when created with the new completion mode.
520-
- Estimated increase in size: new annotation of about 50 bytes.
608+
- Estimated increase in size: new annotation of about 50 bytes and a hostname
609+
that includes the index.
521610

522611
* **Will enabling / using this feature result in increasing time taken by any
523612
operations covered by [existing SLIs/SLOs]?**
