Add stable hostnames to Indexed Job

alculquicondor · alculquicondor · commit 190f20b566df · 2021-04-15T11:39:25.000-04:00
as part of Beta graduation.

Signed-off-by: Aldo Culquicondor <acondor@google.com>
diff --git a/keps/sig-apps/2214-indexed-job/README.md b/keps/sig-apps/2214-indexed-job/README.md
@@ -58,19 +58,20 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 
 ## Summary
 
-This KEP extends kubernetes with user-friendly support for running
-embarrassingly parallel jobs.
+This KEP extends kubernetes with user-friendly support for running parallel jobs.
 
-Here, parallel means multiple pods. By embarrassingly parallel, it means that
-the pods have no dependencies between each other.
-In particular, neither ordering between pods nor gang scheduling are supported.
+Here, parallel means multiple pods per Job. Jobs can be:
+- Embarrassingly parallel, where the pods have no dependencies between each other.
+- Tightly coupled, where the Pods communicate among themselves to make progress.
 
 We propose the addition of completion indexes into the Pods of a *Job
 [with fixed completion count]* to support running embarrassingly parallel
-programs, with a focus on ease of use.
+programs, with a focus on ease of use for workload partitioning.
 We call this new Job pattern an *Indexed Job*, because each Pod of the Job
 specializes to work on a particular index, as if the Pods were elements of an
 array.
+With the addition of a headless Service, a Pod can address the Pod with a
+specific index through a DNS lookup, because the index is part of the hostname.
 
 [with fixed completion count]: https://kubernetes.io/docs/concepts/workloads/controllers/job/#parallel-jobs
 
@@ -94,18 +95,46 @@ own APIs and controllers or adopt third party implementations. Each
 implementation splits the ecosystem, making it harder for higher level systems
 for Job queueing or workflows to support all of them.
 
+Additionally, the Pods within a Job can't easily address and communicate with
+each other, making it hard to run tightly coupled parallel Jobs using the Job
+API.
+
+Third-party operators cover these use cases by defining their own APIs, leading
+to fragmentation of the ecosystem. The operators use mainly two networking
+patterns: (1) fronting each index with a Service or (2) creating Pods with
+stable hostnames based on their index.
+
+The problem with using a Service per index is twofold:
+- Extra latency: traffic has to go through iptables rules so that the Service
+  VIP is replaced with the actual Pod IP, for every packet sent or received.
+- Scale problems: each Service creates an associated Endpoints resource and
+  requires programming on each node, which generates a lot of control
+  traffic.
+
+Creating Pods with stable hostnames doesn't have these problems. Pods can
+address each other with a DNS lookup and communicate directly using Pod IPs.
+A popular operator chose to use a StatefulSet to handle Pod creation and
+management with these characteristics. Due to limitations, the operator now
+manages plain pods. These limitations of StatefulSet were:
+- Pods are created serially.
+- Pods can be replaced without leaving notice of failures.
+- Pods cannot run to completion (containers restart on success or failure).
+
 [Job patterns]: https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-patterns
 
 ### Goals
 
 - Support the *indexed Job* pattern by adding completion indexes to each Pod
   of a Job in *fixed completion count* mode.
+- Add stable hostnames to Pods based on the index to simplify communication 
+  among themselves.
 
 ### Non-Goals
 
 - Support for work lists, where each Pod receives a different element of a
   static list. This can be implemented by users from completion indexes.
 - Support for completion index in non-parallel Jobs or Jobs with a work queue.
+- Network programming for indexed Jobs. This is left to headless Services.
 - All-or-nothing scheduling.
 
 ## Proposal
@@ -114,29 +143,62 @@ for Job queueing or workflows to support all of them.
 
 #### Story 1
 
-As a Job author, I can create an array Job where each Pod receives an ordered
+As a Job author, I can create an Indexed Job where each Pod receives an ordered
 completion index. I can use the index in my binary through an environment
 variable or a file to statically select the load the Pod should work on.
 
 ```yaml
 apiVersion: batch/v1
 kind: Job
 metadata:
-  name: parallel-work
+  name: my-job
 spec:
   completions: 100
   parallelism: 100
+  completionMode: Indexed
   template:
     spec:
       containers:
       - name: task
         image: registry.example.com/processing-image
-        command: ["./process",  "--index", "$INDEX"]
-        env:
-        - name: INDEX
-          valueFrom:
-            fieldRef:
-              fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index'] 
+        command: ["./process",  "--index", "$JOB_COMPLETION_INDEX"]
+```
+
+#### Story 2
+
+As a Job author, I can create an Indexed Job where pods can address each other
+by the hostname that can be built from the index.
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: my-job
+spec:
+  completions: 100
+  parallelism: 100
+  completionMode: Indexed
+  template:
+    metadata:
+      labels:
+        job: my-job
+    spec:
+      subdomain: my-job-svc
+      containers:
+      - name: task
+        image: registry.example.com/processing-image
+        command: ["./process",  "--index", "$JOB_COMPLETION_INDEX", "--hosts-pattern", "my-job-{{.id}}.my-job-svc"]
+```
+
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: my-job-svc
+spec:
+  clusterIP: None
+  selector:
+    job: my-job
 ```
 
 ### Notes/Constraints/Caveats (Optional)
@@ -148,19 +210,35 @@ because work lists can be implemented in a startup script using the completion
 index as building block.
 * The semantics of an indexed Job are similar to a StatefulSet, in the sense
 that Pods have an associated index.
-However, the APIs have a major difference: a StatefulSet doesn't have completion
-semantics, as opposed to Jobs.
+However, the APIs have major differences:
+  - a StatefulSet doesn't have completion semantics, as opposed to Jobs.
+  - a StatefulSet creates pods serially, whereas Job creates all Pods in
+    parallel.
 
 [indexed Job]: https://github.com/kubernetes/community/blob/b21d1b27c8c748bf81283c2d89cde2becb5f2709/contributors/design-proposals/apps/indexed-job.md
 
 ### Risks and Mitigations
 
-Jobs have a known issue in which more than one Pod can be started even if
-parallelism and completion are set to 1 ([reference]). In the case of indexed
-Jobs, this translates to more than one Pod having the same index.
+- More than one pod created per index.
+
+  Jobs have a known issue in which more than one Pod can be started even if
+  parallelism and completion are set to 1 ([reference]). In the case of indexed
+  Jobs, this translates to more than one Pod having the same index.
+  
+  Just like for existing Job patterns, workloads have to handle duplicates at the
+  application level.
+
+- Scalability and latency of DNS programming.
 
-Just like for existing Job patterns, workloads have to handle duplicates at the
-application level.
+  DNS programming requires the update of EndpointSlices and writing DNS records.
+  This might not scale well for short-lived Jobs with high parallelism.
+  Moreover, Pods need to be prepared to retry lookups in case the records
+  have not been updated yet.
+  
+  However, network programming is opt-in (users need to create a matching
+  headless Service). Moreover, workloads have other means of obtaining IPs,
+  such as querying/watching the API server. Vendors can also choose to implement
+  alternate DNS programming tailored for Jobs.
 
 [reference]: https://kubernetes.io/docs/concepts/workloads/controllers/job/#handling-pod-and-container-failures
 
@@ -197,6 +275,8 @@ type JobSpec struct {
   // `Indexed` means that the Pods of a
   // Job get an associated completion index from 0 to (.spec.completions - 1),
   // available in the annotation batch.kubernetes.io/job-completion-index.
+  // The Pod hostnames are set to $(job-name)-$(index) and the names to
+  // $(job-name)-$(index)-$(random-suffix).
   // The Job is considered complete when there is one successfully completed Pod
   // for each index.
   // When value is `Indexed`, .spec.completions must be specified and
@@ -259,6 +339,13 @@ The Job controller doesn't add the environment variable if there is a name
 conflict with an existing environment variable. Users can specify other
 environment variables for the same annotation.
 
+The Pod name takes the form `$(job-name)-$(index)-$(random-string)`,
+which can be used for quickly identifying Pods for a specific index.
+
+The Pod hostname takes the form `$(job-name)-$(index)`, which other Pods can
+use to address it when the Job is used in combination with a headless
+Service.
+
 ### Job completion and restart policy
 
 When dealing with Indexed Jobs, the Job controller keeps track of Pod
@@ -317,7 +404,7 @@ Reducing parallelism is unaffected by completion index.
 
 Unit, integration and E2E tests cover the following Indexed Job mechanics:
 
-  - Creation with indexed Pod names and index annotations.
+  - Creation with index annotations and indexed pod hostnames.
   - Scale up and down.
   - Pod failures.
   
@@ -335,6 +422,7 @@ gate enabled and disabled.
 #### Alpha -> Beta Graduation
 
 - Complete features:
+  - Index as part of the pod name and hostname.
   - Indexed Jobs when tracking completion with finalizers.
     [kubernetes/enhancements#2307](https://github.com/kubernetes/enhancements/issues/2307).
     
@@ -517,7 +605,8 @@ the existing API objects?**
       than 1MB.
   
   - API type(s): Pod, only when created with the new completion mode.
-  - Estimated increase in size: new annotation of about 50 bytes.
+  - Estimated increase in size: a new annotation of about 50 bytes and a
+    hostname that includes the index.
 
 * **Will enabling / using this feature result in increasing time taken by any 
 operations covered by [existing SLIs/SLOs]?**