@@ -63,7 +63,8 @@ This KEP extends kubernetes with user-friendly support for running parallel jobs

  Here, parallel means multiple pods per Job. Jobs can be:
  - Embarrassingly parallel, where the pods have no dependencies between each other.
- - Tightly coupled, where the Pods communicate among themselves to make progress.
+ - Tightly coupled, where the Pods communicate among themselves to make progress
+ ([kubernetes/kubernetes#99497](https://github.com/kubernetes/kubernetes/issues/99497)).

  We propose the addition of completion indexes into the Pods of a *Job
  [with fixed completion count]* to support running embarrassingly parallel
@@ -223,7 +224,7 @@ However, the APIs have major differences:

  - More than one pod created per index.

- Jobs have a known issue in which more than one Pod can be started even if
+ Jobs have a known rare issue in which more than one Pod can be started even if
  parallelism and completion are set to 1 ([reference]). In the case of indexed
  Jobs, this translates to more than one Pod having the same index.

@@ -232,10 +233,21 @@ However, the APIs have major differences:

  - Scalability and latency of DNS programming.

- DNS programming requires the update of EndpointSlices and writing DNS records.
+ DNS programming requires the update of EndpointSlices by the endpoint
+ controller and the update of DNS records by the DNS provider.
  This might not scale well for short-lived Jobs with high parallelism.
- Moreover, Pods need to be prepared to retry lookups in case the
- records didn't have time to update.
+ Moreover, Pods need to be prepared to:
+ - Retry lookups in case the records didn't have time to update.
+ - Handle more than one IP for the CNAME. This might happen temporarily when:
+ - the job controller creates more than one pod per index or
+ - the job controller creates a replacement of a failed Pod before the DNS
+ provider clears the record for the failed pod. This will be uncommon
+ as the endpoint controller should see the failed Pod before it sees the
+ replacement Pod.
+ <UNRESOLVED>
+ The recommendation for applications is to request a new DNS resolution until
+ the DNS server returns one IP.
+ </UNRESOLVED>

  However, network programming is opt-in (users need to create a matching
  headless Service). Moreover, workloads have other means of obtaining IPs,
@@ -682,11 +694,9 @@ _This section must be completed when targeting beta graduation to a release._

  Completion indexes could also be part of the Pod name, leading to stable Pod
  names. This allows 2 things:
- - Uniqueness for each completion index, freeing applications from having to
- handle duplicated indexes.
- - Predictable hostnames, which benefits applications that need to communicate
- with Pods of a Job (or among Pods of the same Job) without having to do
- discovery.
+ - Uniqueness for each completion index. This frees applications from having to
+ handle duplicated indexes. When used along with a headless Service, there
+ is less chance of a DNS record referring to more than one Pod.

  Stable pod names require the Job controller to remove failed Pods before
  creating a new one with the same index. This has some downsides:
@@ -696,11 +706,9 @@ _This section must be completed when targeting beta graduation to a release._
  the status of the Job, affecting retry backoffs and backoff limit. This
  needs to change before stable Pod names can be implemented
  [#28486](https://github.com/kubernetes/kubernetes/issues/28486).
- - Reduced availability of Job Pods per completion index. This happens when
- a Node becomes unavailable. The Job controller cannot remove such Pods.
- Either the kubelet in the Node recovers and marks the Pod as failed; or the
- kube-apiserver removes the Node and the garbage collector removes the orphan
- Pods.
+ - Reduced availability of Job Pods per completion index: in addition to
+ the time needed to create a new Pod, we need to account for the time needed
+ to delete the failed Pod.

  However, stable Pod names can be offered later as a new value for
  `.spec.completionMode` for Jobs.
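
Note on the DNS hunk: the recommendation inside `<UNRESOLVED>` (request a new DNS resolution until the DNS server returns one IP) could look roughly like the sketch below on the application side. This is only an illustration under stated assumptions, not part of the KEP: the names `resolve_single_ip` and `default_lookup`, the attempt count, and the delay are all hypothetical, and a dual-stack cluster or a caching local resolver may need different handling.

```python
import socket
import time


def default_lookup(hostname):
    # socket.getaddrinfo returns (family, type, proto, canonname, sockaddr)
    # tuples; sockaddr[0] is the IP address for both IPv4 and IPv6.
    return [info[4][0] for info in socket.getaddrinfo(hostname, None)]


def resolve_single_ip(hostname, lookup=default_lookup, attempts=5,
                      delay_seconds=1.0, sleep=time.sleep):
    """Retry the lookup until exactly one distinct IP is returned.

    Returns the IP as a string, or None if every attempt returned
    zero IPs or more than one IP. All parameters are illustrative.
    """
    for attempt in range(attempts):
        try:
            ips = sorted(set(lookup(hostname)))
        except OSError:
            # The record may not be programmed yet; treat it as an empty answer.
            ips = []
        if len(ips) == 1:
            return ips[0]
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return None
```

A caller would pass the per-index hostname exposed through the headless Service and decide for itself what to do when `None` comes back, for example failing the Pod so that the Job controller retries it.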