
Commit c8fefce

Network latency SLI
1 parent b519bab commit c8fefce

File tree

2 files changed: +50 -0 lines changed
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
## In-cluster network latency SLIs/SLOs details

### Definition

| Status | SLI | SLO |
| --- | --- | --- |
| __WIP__ | In-cluster network latency from a single node, measured as the latency of a per-second ping from the node to a "null service", measured as 99th percentile over last 5 minutes. | In default Kubernetes installation with RTT between nodes <= Y, 99th percentile of (99th percentiles across all nodes) per cluster-day <= X |
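To make the SLO aggregation concrete, below is a minimal sketch of one possible reading of "99th percentile of (99th percentiles across all nodes) per cluster-day": pool every per-node 5-minute 99th-percentile reading reported during the day and take the 99th percentile of the pooled set. The names (`p99`, `clusterDayValue`) and the pooling interpretation are illustrative assumptions, not part of the definition.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// p99 returns the 99th percentile of the given latency values.
func p99(values []time.Duration) time.Duration {
	if len(values) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), values...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	return sorted[int(float64(len(sorted)-1)*0.99)]
}

// clusterDayValue pools the per-node 5-minute p99 readings gathered over one
// cluster-day and returns the 99th percentile of the pooled set. The SLO then
// requires this value to be <= X (for default installations with RTT <= Y).
func clusterDayValue(perNodeReadings map[string][]time.Duration) time.Duration {
	var pooled []time.Duration
	for _, readings := range perNodeReadings {
		pooled = append(pooled, readings...)
	}
	return p99(pooled)
}

func main() {
	// Toy input: two nodes, each with a few 5-minute-window p99 readings.
	readings := map[string][]time.Duration{
		"node-a": {2 * time.Millisecond, 3 * time.Millisecond, 2 * time.Millisecond},
		"node-b": {4 * time.Millisecond, 5 * time.Millisecond, 3 * time.Millisecond},
	}
	fmt.Println("cluster-day aggregate:", clusterDayValue(readings))
}
```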
### User stories
- As a user of vanilla Kubernetes, I want some guarantee of how fast my HTTP
request to some Kubernetes service reaches its endpoint.
### Other notes
- We obviously can't give any guarantee in the general case, because cluster
administrators may configure the cluster as they want.
- As a result, we define the SLI to be very generic (no matter how your cluster
is set up), but we provide the SLO only for default installations with an additional
requirement that the low-level RTT between nodes is lower than Y.
- Network latency is one of the most crucial aspects from the point of view
of application performance, especially in a microservices world. As a result, to
meet user expectations, we need to provide some guarantees around it.
- We decided on the SLI definition as formulated above, because:
  - it represents a user-oriented end-to-end flow - it involves, among others, the
  latency of the in-cluster network programming mechanism (e.g. iptables).
  __TODO:__ We considered making DNS resolution part of it, but decided not
  to mix them. However, longer term we should consider joining them.
  - it is easily measurable in all running clusters without any significant
  overhead (e.g. measuring request latencies coming from all pods on a given
  node would require some additional instrumentation, such as a sidecar for
  each of them, and that overhead may not be acceptable in many cases); a sketch
  of such a lightweight prober follows this list
  - it is not application-specific
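As an illustration of how lightweight such a prober can be, here is a minimal sketch that issues one request per second to the "null service" and reports the 99th percentile of the latencies observed over the last 5 minutes. It assumes the "ping" is an HTTP request and uses a hypothetical service URL and a plain in-memory sliding window; the actual probing will live inside kube-proxy (see the caveats below) and would export a metric rather than print.

```go
package main

import (
	"fmt"
	"net/http"
	"sort"
	"time"
)

// sample is a single probe result.
type sample struct {
	at      time.Time
	latency time.Duration
}

// p99 returns the 99th percentile of the latencies in the given samples.
func p99(samples []sample) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	latencies := make([]time.Duration, 0, len(samples))
	for _, s := range samples {
		latencies = append(latencies, s.latency)
	}
	sort.Slice(latencies, func(i, j int) bool { return latencies[i] < latencies[j] })
	return latencies[int(float64(len(latencies)-1)*0.99)]
}

func main() {
	// Hypothetical in-cluster "null service" endpoint (see the Caveats section).
	const nullServiceURL = "http://null-service.default.svc.cluster.local/"
	const window = 5 * time.Minute

	var samples []sample
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()

	for now := range ticker.C {
		start := time.Now()
		if resp, err := http.Get(nullServiceURL); err == nil {
			resp.Body.Close()
			samples = append(samples, sample{at: start, latency: time.Since(start)})
		}
		// Keep only the samples from the last 5 minutes (the SLI window).
		cutoff := now.Add(-window)
		for len(samples) > 0 && samples[0].at.Before(cutoff) {
			samples = samples[1:]
		}
		// The per-node SLI value: 99th percentile over the last 5 minutes.
		fmt.Printf("in-cluster network latency p99 (5m): %v\n", p99(samples))
	}
}
```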
### Caveats
- The SLI is formulated for a single node, even though users are mostly
interested in the aggregation across all nodes (that is done only at the SLO
level). However, that provides very similar guarantees and makes it fairly
easy to measure.
- The prober reporting this SLI is fairly trivial and itself needs only a negligible
amount of resources. However, to avoid any visible overhead in the cluster
(in terms of additionally needed components) **we will make it part of kube-proxy
(which is running on every node anyway) and make it possible to switch it off
via a config option (potentially also a flag)**.
- We don't have any "null service" running in the cluster, so an administrator has
to set one up to make the SLI measurable in a real cluster. We will provide
an appropriate "null service" in our tests; a minimal example is sketched below.
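For illustration, the "null service" backend can be as trivial as an HTTP handler that does no work and immediately returns 200, put behind a regular ClusterIP Service so that the measured latency is dominated by the network path. The sketch below is an assumption about what such a backend could look like, not the one we will ship with the tests.

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	// "Null" endpoint: no work, immediate 200, so a probe against it measures
	// (almost) pure in-cluster network plus service-routing latency.
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Any standard Deployment plus ClusterIP Service in front of such a backend would be sufficient for measurement purposes; the manifests are omitted here.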
### Test scenario
__TODO: Describe test scenario.__

sig-scalability/slos/slos.md

Lines changed: 1 addition & 0 deletions
@@ -107,6 +107,7 @@ Prerequisite: Kubernetes cluster is available and serving.
| __Official__ | Latency of non-streaming read-only API calls for every (resource, scope) pair, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, for every (resource, scope) pair, excluding virtual and aggregated resources and Custom Resource Definitions, 99th percentile per cluster-day<sup>[1](#footnote1)</sup> (a) <= 1s if `scope=resource` (b) <= 5s if `scope=namespace` (c) <= 30s if `scope=cluster` | [Details](./api_call_latency.md) |
| __Official__ | Startup latency of stateless and schedulable pods, excluding time to pull images and run init containers, measured from pod creation timestamp to when all its containers are reported as started and observed via watch, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, 99th percentile per cluster-day<sup>[1](#footnote1)</sup> <= 5s | [Details](./pod_startup_latency.md) |
| __WIP__ | Latency of programming a single (e.g. iptables on a given node) in-cluster load balancing mechanism, measured from when service spec or list of its `Ready` pods change to when it is reflected in load balancing mechanism, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, 99th percentile of (99th percentiles across all programmers (e.g. iptables)) per cluster-day<sup>[1](#footnote1)</sup> <= X | [Details](./networking_programming_latency.md) |
| __WIP__ | In-cluster network latency from a single node, measured as the latency of a per-second ping from the node to a "null service", measured as 99th percentile over last 5 minutes. | In default Kubernetes installation with RTT between nodes <= Y, 99th percentile of (99th percentiles across all nodes) per cluster-day<sup>[1](#footnote1)</sup> <= X | [Details](./network_latency.md) |

<a name="footnote1">\[1\]</a> For the purpose of visualization it will be a
sliding window. However, for the purpose of reporting the SLO, it means one
