Notes on DNS programming

alculquicondor · alculquicondor · commit d855b1f131d2 · 2021-04-19T14:20:28.000-04:00
Signed-off-by: Aldo Culquicondor <acondor@google.com>
diff --git a/keps/sig-apps/2214-indexed-job/README.md b/keps/sig-apps/2214-indexed-job/README.md
@@ -63,7 +63,8 @@ This KEP extends kubernetes with user-friendly support for running parallel jobs
 
 Here, parallel means multiple pods per Job. Jobs can be:
 - Embarrassingly parallel, where the pods have no dependencies between each other.
-- Tightly coupled, where the Pods communicate among themselves to make progress.
+- Tightly coupled, where the Pods communicate among themselves to make progress
+  ([kubernetes/kubernetes#99497](https://github.com/kubernetes/kubernetes/issues/99497)).
 
 We propose the addition of completion indexes into the Pods of a *Job
 [with fixed completion count]* to support running embarrassingly parallel
@@ -223,7 +224,7 @@ However, the APIs have major differences:
 
 - More than one pod created per index.
 
-  Jobs have a known issue in which more than one Pod can be started even if
+  Jobs have a known rare issue in which more than one Pod can be started even if
   parallelism and completion are set to 1 ([reference]). In the case of indexed
   Jobs, this translates to more than one Pod having the same index.
   
@@ -232,10 +233,21 @@ However, the APIs have major differences:
 
 - Scalability and latency of DNS programming.
 
-  DNS programming requires the update of EndpointSlices and writing DNS records.
+  DNS programming requires the update of EndpointSlices by the endpoint
+  controller and updating DNS records by the DNS provider.
   This might not scale well for short-lived Jobs with high number of parallelism.
-  Moreoever, Pods need to be prepared to retry lookups in the case were the
-  records didn't have time to update.
+  Moreover, Pods need to be prepared to:
+  - Retry lookups in the case where the records didn't have time to update.
+  - Handle more than one IP for the CNAME. This might happen temporarily when:
+    - the job controller creates more than one pod per index or
+    - the job controller creates a replacement of a failed Pod before the DNS
+      provider clears the record for the failed pod. This will be uncommon
+      as the endpoint controller should see the failed Pod before it sees the
+      replacement Pod.
+    <UNRESOLVED>
+    The recommendation for applications is to request a new DNS resolution until
+    the DNS server returns one IP.
+    </UNRESOLVED>
   
   However, network programming is opt-in (users need to create a matching
   headless Service). Moreover, workloads have other means of obtaining IPs,
@@ -682,11 +694,9 @@ _This section must be completed when targeting beta graduation to a release._
 
   Completion indexes could also be part of the Pod name, leading to stable Pod
   names. This allows 2 things:
-  - Uniqueness for each completion index, freeing applications from having to
-    handle duplicated indexes.
-  - Predictable hostnames, which benefits applications that need to communicate
-    to Pods of a Job (or among Pods of the same Job) without having to do
-    discovery.
+  - Uniqueness for each completion index. This frees applications from having to
+    handle duplicated indexes. When used along with a headless Service, there
+    are fewer chances for a DNS record to refer to more than one Pod.
   
   Stable pod names require the Job controller to remove failed Pods before
   creating a new one with the same index. This has some downsides:
@@ -696,11 +706,9 @@ _This section must be completed when targeting beta graduation to a release._
     the status of the Job, affecting retry backoffs and backoff limit. This
     needs to change before stable Pod names can be implemented
     [#28486](https://github.com/kubernetes/kubernetes/issues/28486).
-  - Reduced availability of Job Pods per completion index. This happens when
-    a Node becomes unavailable. The Job controller cannot remove such Pods.
-    Either the kubelet in the Node recovers and marks the Pod as failed; or the
-    kube-apiserver removes the Node and the garbage collector removes the orphan
-    Pods.
+  - Reduced availability of Job Pods per completion index because, in addition
+    to the time necessary to create a new Pod, we need to account for the time
+    needed to delete the failed Pod.
     
   However, stable Pod names can be offered later as a new value for
   `.spec.completionMode` for Jobs.
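
The DNS programming hunk above asks Pods to retry lookups and to tolerate a name that temporarily resolves to more than one IP. The recommendation to re-resolve until a single IP comes back is still marked `<UNRESOLVED>` in the KEP, so the following is only a minimal sketch of what an application could do, not a prescribed implementation. The function names, the attempt count, and the delay are assumptions made for illustration. The lookup and sleep functions are injectable so the retry logic can be exercised without a real DNS server.

```python
import socket
import time
from typing import Callable, Iterable, Optional, Set


def lookup_ips(hostname: str) -> Set[str]:
    """Return the distinct IPs the system resolver reports for hostname."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def resolve_single_ip(
    hostname: str,
    lookup: Callable[[str], Iterable[str]] = lookup_ips,
    attempts: int = 10,          # hypothetical default, tune per workload
    delay_seconds: float = 1.0,  # hypothetical default, tune per workload
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Resolve hostname until exactly one IP is returned, or give up.

    Returns the IP as a string, or None if every attempt yielded zero IPs
    (record not written yet) or several IPs (duplicate or stale record).
    """
    for attempt in range(attempts):
        try:
            ips = set(lookup(hostname))
        except socket.gaierror:
            # The record may not have been programmed yet; treat as "no IPs".
            ips = set()
        if len(ips) == 1:
            return next(iter(ips))
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return None
```

A caller would pass whatever hostname it uses to reach a peer Pod and treat a `None` result as "peer not reachable yet". Resolver-side caching can delay the moment a stale record disappears, so the delay between attempts may need to exceed the record TTL.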