## In-cluster network latency SLIs/SLOs details

### Definition

| Status | SLI | SLO |
| --- | --- | --- |
| __WIP__ | In-cluster network latency from a single node, measured as the latency of a once-per-second ping from the node to a "null service", computed as the 99th percentile over the last 5 minutes. | In a default Kubernetes installation with RTT between nodes <= Y, the 99th percentile of (99th percentile over all nodes) per cluster-day is <= X |
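
The SLI above maps to a very small per-node prober. The following is a minimal sketch in Go, assuming the "null service" is a plain HTTP service; the service address, the use of HTTP GET, and printing the percentile (rather than exporting it as a metric) are illustrative assumptions, not part of the definition.

```go
package main

import (
	"fmt"
	"net/http"
	"sort"
	"time"
)

const (
	probePeriod = 1 * time.Second // one ping per second, per the SLI
	window      = 5 * time.Minute // the percentile is computed over the last 5 minutes
	// Hypothetical address of the "null service"; not defined by this document.
	nullService = "http://null-service.default.svc/"
)

// percentile returns the p-th percentile (0 <= p <= 1) of the given samples.
func percentile(samples []time.Duration, p float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	return sorted[int(float64(len(sorted)-1)*p)]
}

func main() {
	maxSamples := int(window / probePeriod) // keep only the last 5 minutes of samples
	var samples []time.Duration

	for range time.Tick(probePeriod) {
		start := time.Now()
		resp, err := http.Get(nullService)
		if err != nil {
			continue // a failed probe is an availability problem, not a latency sample
		}
		resp.Body.Close()

		samples = append(samples, time.Since(start))
		if len(samples) > maxSamples {
			samples = samples[1:]
		}
		// The per-node SLI: 99th percentile of ping latency over the last 5 minutes.
		fmt.Printf("in-cluster network latency p99 (5m): %v\n", percentile(samples, 0.99))
	}
}
```

The SLO then aggregates this per-node value twice: across all nodes of a cluster, and across a cluster-day, taking the 99th percentile at each step.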
### User stories
- As a user of vanilla Kubernetes, I want some guarantee of how fast my HTTP
request to some Kubernetes service reaches its endpoint.

### Other notes
- We obviously can't give any guarantee in the general case, because cluster
administrators may configure their clusters however they want.
- As a result, we define the SLI very generically (it applies no matter how
your cluster is set up), but we provide the SLO only for default installations,
with the additional requirement that the low-level RTT between nodes is lower
than Y.
- Network latency is one of the most crucial aspects of application
performance, especially in the microservices world. As a result, to meet user
expectations, we need to provide some guarantees around it.
- We decided on the SLI definition as formulated above, because:
  - it represents a user-oriented end-to-end flow - among other things, it
  involves the latency of the in-cluster network programming mechanism
  (e.g. iptables).
  __TODO:__ We considered making DNS resolution part of it, but decided not
  to mix them. However, longer term we should consider joining them.
  - it is easily measurable in all running clusters without any significant
  overhead (e.g. measuring the latencies of requests coming from all pods on
  a given node would require additional instrumentation, such as a sidecar
  for each of them, and that overhead may not be acceptable in many cases)
  - it is not application-specific

### Caveats
- The SLI is formulated for a single node, even though users are mostly
interested in the aggregation across all nodes (that is done only at the SLO
level). However, it provides very similar guarantees and makes the SLI fairly
easy to measure.
- The prober itself is fairly trivial and needs only a negligible amount of
resources. However, to avoid any visible overhead in the cluster (in terms of
additionally needed components), **we will make it part of kube-proxy (which
is running on every node anyway) and make it possible to switch it off via a
config option (potentially also a flag)**.
- We don't have any "null service" running in the cluster, so an administrator
has to set one up to make the SLI measurable in a real cluster. We will provide
an appropriate "null service" in our tests (see the sketch after this list).
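
A "null service" backend can be as simple as a server that returns an empty response immediately, so that the measured latency is dominated by the network path and the service-routing machinery rather than by the backend itself. Below is a minimal sketch; the port is an illustrative assumption, and the backend we provide in tests may differ.

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	// Do no work: reply with an empty 200 as quickly as possible, so probes
	// measure the in-cluster network and service routing, not the backend.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

To exercise the in-cluster network programming mechanism (e.g. iptables), this backend should be exposed behind a regular Kubernetes Service and probed through the service's virtual IP rather than a pod IP.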
### Test scenario

__TODO: Describe test scenario.__