## Network programming latency SLIs/SLOs details

### Definition

| Status | SLI | SLO |
| --- | --- | --- |
| __WIP__ | Latency of programming a single instance of the in-cluster load balancing mechanism (e.g. iptables on a given node), measured from when the service spec or the list of its `Ready` pods changes to when it is reflected in that load balancing mechanism, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, 99th percentile of (99th percentiles across all programmers (e.g. iptables)) per cluster-day <= X |

### User stories
- As a user of vanilla Kubernetes, I want some guarantee of how quickly new backends
of my service will become targets of in-cluster load-balancing
- As a user of vanilla Kubernetes, I want some guarantee of how quickly deleted
(or unhealthy) backends of my service will be removed from in-cluster
load-balancing
- As a user of vanilla Kubernetes, I want some guarantee of how quickly changes
to the service specification (including creation) will be reflected in in-cluster
load-balancing

### Other notes
- We are consciously focusing on in-cluster load-balancing for the purpose of
this SLI, as external load-balancing is clearly provider specific (which makes
it hard to set an SLO for it).
- However, in the future it should be possible to formulate the SLI for external
load-balancing in pretty much the same way for consistency.
- An SLI measuring end-to-end time from pod creation was also considered,
but rejected because it is application specific, which would make introducing
an SLO for it impossible.

### Caveats
- The SLI is formulated for a single "programmer" (e.g. iptables on a single
node), even though that value itself is not very interesting for the user.
In case there are multiple programmers in the cluster, the aggregation across
them is done only at the SLO level (and only that gives a value that is somehow
interesting for the user). The reason for doing it this way is the feasibility
of computing it efficiently (a sketch of the aggregation follows this list):
  - if we were doing the aggregation at the SLI level (i.e. the SLI would be
  formulated like "... reflected in in-cluster load-balancing mechanism and
  visible from 99% of programmers"), computing that SLI would be extremely
  difficult. In order to decide e.g. whether a pod transition to the
  Ready state is reflected, we would have to know when exactly it was reflected
  in 99% of programmers (e.g. iptables). That requires tracking metrics on
  a per-change basis (which we can't do efficiently).
  - we admit that the SLO is a bit weaker in that form (i.e. it doesn't necessarily
  force that a given change is reflected in 99% of programmers with a given
  99th percentile latency), but it's a close enough approximation.
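
For illustration, here is a minimal sketch of the "99th percentile of per-programmer
99th percentiles" aggregation from the SLO above, assuming latency samples per
programmer are already available (e.g. scraped from the Prometheus metric described
in the next section). The function names and the nearest-rank percentile definition
are illustrative assumptions, not part of this proposal.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the p-th percentile (0 < p <= 100) of the samples using a
// simple nearest-rank definition; Prometheus histograms would interpolate instead.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return math.NaN()
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p/100*float64(len(sorted)))) - 1
	if rank < 0 {
		rank = 0
	}
	return sorted[rank]
}

// sloValue aggregates per-programmer latency samples (in seconds) collected over
// a cluster-day: first a 99th percentile per programmer (the SLI), then a 99th
// percentile across programmers (the SLO-level aggregation).
func sloValue(perProgrammerLatencies map[string][]float64) float64 {
	perProgrammerP99 := make([]float64, 0, len(perProgrammerLatencies))
	for _, latencies := range perProgrammerLatencies {
		perProgrammerP99 = append(perProgrammerP99, percentile(latencies, 99))
	}
	return percentile(perProgrammerP99, 99)
}

func main() {
	// Hypothetical per-node samples of network programming latency, in seconds.
	samples := map[string][]float64{
		"node-1": {0.8, 1.2, 0.9, 30.0},
		"node-2": {1.1, 1.3, 0.7, 2.0},
	}
	fmt.Printf("SLO value for this cluster-day: %.1fs\n", sloValue(samples))
}
```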

### How to measure the SLI
The method of measuring this SLI is not obvious, so for completeness we describe
here how it will be implemented, together with all its caveats (a rough code sketch
follows the list):
1. We assume that for the in-cluster load-balancing programming we are using
Kubernetes `Endpoints` objects.
1. We will introduce a dedicated annotation for the `Endpoints` object (name TBD).
1. The Endpoints controller (while updating a given `Endpoints` object) will set
the value of that annotation to the timestamp of the change that triggered
this update:
- for a pod transition between the `Ready` and `NotReady` states, the timestamp is
  simply part of the pod condition
- TBD for service updates (ideally we will add a `LastUpdateTimestamp` field in
  object metadata next to the already existing `CreationTimestamp`; the data is
  already present at the storage layer, so it won't be hard to propagate it)
1. The in-cluster load-balancing programmer will export a Prometheus metric
once done with programming. The latency of the operation is defined as the
difference between the timestamp of when the operation is done and the timestamp
recorded in the newly introduced annotation.
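
Below is a minimal sketch of steps 2-4, assuming a hypothetical annotation key and
metric name (both still TBD in this proposal) and leaving out everything else the
Endpoints controller and the programmer (e.g. kube-proxy) actually do.

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical annotation key; the real name is still TBD.
const triggerTimeAnnotation = "endpoints.kubernetes.io/last-change-trigger-time"

// Step 3 (Endpoints controller side): record the timestamp of the change that
// triggered this Endpoints update, e.g. the pod's Ready condition transition time.
func annotateEndpoints(annotations map[string]string, triggerTime time.Time) {
	annotations[triggerTimeAnnotation] = triggerTime.Format(time.RFC3339Nano)
}

// Hypothetical metric; the name and buckets are placeholders for illustration.
var networkProgrammingLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "network_programming_duration_seconds",
	Help:    "Time from a service/pod change to it being programmed in-cluster.",
	Buckets: prometheus.ExponentialBuckets(0.1, 2, 12),
})

func init() {
	prometheus.MustRegister(networkProgrammingLatency)
}

// Step 4 (programmer side, e.g. kube-proxy): once the rules for the Endpoints
// object have been programmed, observe the difference between "now" and the
// trigger timestamp carried in the annotation.
func recordProgrammingLatency(annotations map[string]string, programmedAt time.Time) {
	raw, ok := annotations[triggerTimeAnnotation]
	if !ok {
		return // object predates the annotation; nothing to measure
	}
	triggerTime, err := time.Parse(time.RFC3339Nano, raw)
	if err != nil {
		return // malformed annotation; skip rather than skew the metric
	}
	networkProgrammingLatency.Observe(programmedAt.Sub(triggerTime).Seconds())
}

func main() {
	annotations := map[string]string{}
	annotateEndpoints(annotations, time.Now().Add(-2*time.Second))
	recordProgrammingLatency(annotations, time.Now()) // observes roughly 2 seconds
}
```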

#### Caveats
There are a couple of caveats to that measurement method:
1. A single `Endpoints` object may batch multiple pod state transitions. <br/>
In that case, we simply choose the oldest one (and do not expose all timestamps,
to avoid theoretically unbounded growth of the object). That makes the metric
imprecise, but the batching period should be relatively small compared
to the whole end-to-end flow.
1. A single pod may transition its state multiple times within the batching
period. <br/>
For that case, we will add an additional cache in the Endpoints controller, caching
the first observed transition timestamp for each pod (a sketch of such a cache
follows this list). The cache entry will be cleared when the controller picks up
the pod into an Endpoints object update. This is consistent with choosing the
oldest update in the point above. <br/>
Initially, we may consider simply ignoring this fact.
1. Components may fall out of the watch history window and thus miss some watch
events. <br/>
This may be the case for both the Endpoints controller and kube-proxy (or other
network programmers if used instead). It becomes a problem only when a single
object changed multiple times in the meantime (otherwise informers will deliver
the events to handlers on relisting). Additionally, this can happen only when
components are too slow in processing events (which would already be reflected
in the metrics) or (sometimes) after a kube-apiserver restart. Given that, we are
going to neglect this problem to avoid unnecessary complications for little
or no gain.
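
A minimal sketch of the cache from the second caveat, keeping only the first observed
transition timestamp per pod until the controller includes that pod in an Endpoints
update; all type and method names here are illustrative assumptions.

```go
package main

import (
	"sync"
	"time"
)

// triggerTimeCache remembers, per pod, the timestamp of the first state transition
// observed since the pod was last included in an Endpoints update. Later transitions
// within the same batching window are ignored, which is consistent with choosing the
// oldest timestamp above.
type triggerTimeCache struct {
	mu    sync.Mutex
	times map[string]time.Time // key: pod "namespace/name"
}

func newTriggerTimeCache() *triggerTimeCache {
	return &triggerTimeCache{times: map[string]time.Time{}}
}

// observeTransition is called from the pod event handler; only the first
// transition per pod is recorded.
func (c *triggerTimeCache) observeTransition(pod string, at time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, seen := c.times[pod]; !seen {
		c.times[pod] = at
	}
}

// pop is called when the controller picks the pod up into an Endpoints object
// update; it returns the recorded timestamp (if any) and clears the entry.
func (c *triggerTimeCache) pop(pod string) (time.Time, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	t, ok := c.times[pod]
	delete(c.times, pod)
	return t, ok
}

func main() {
	cache := newTriggerTimeCache()
	cache.observeTransition("default/pod-a", time.Now().Add(-3*time.Second))
	cache.observeTransition("default/pod-a", time.Now()) // ignored: not the first one
	if triggerTime, ok := cache.pop("default/pod-a"); ok {
		_ = triggerTime // would be written into the Endpoints annotation
	}
}
```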

### Test scenario

__TODO: Describe test scenario.__