
Network latency SLI #2636

Merged
merged 1 commit on Nov 19, 2018
54 changes: 54 additions & 0 deletions sig-scalability/slos/network_latency.md
@@ -0,0 +1,54 @@
## In-cluster network latency SLIs/SLOs details

### Definition

| Status | SLI | SLO |
| --- | --- | --- |
| __WIP__ | In-cluster network latency from a single prober pod, measured as the latency of per-second pings from that pod to a "null service", measured as 99th percentile over last 5 minutes. | In default Kubernetes installation with RTT between nodes <= Y, 99th percentile of (99th percentile over all prober pods) per cluster-day <= X |
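
For illustration only, here is a minimal sketch of the per-pod measurement,
assuming the probe is a plain HTTP GET against a hypothetical in-cluster
address of the "null service" and that the 99th percentile is computed over a
rolling 5-minute window of once-per-second samples (the real prober may look
different):

```go
// Illustrative sketch only - not the actual prober implementation.
package main

import (
	"fmt"
	"math"
	"net/http"
	"sort"
	"time"
)

// sample pairs a measured latency with the time it was taken, so that
// samples older than the 5-minute SLI window can be discarded.
type sample struct {
	at      time.Time
	latency time.Duration
}

// percentile99 returns the 99th percentile of the latencies in the window.
func percentile99(window []sample) time.Duration {
	if len(window) == 0 {
		return 0
	}
	ds := make([]time.Duration, len(window))
	for i, s := range window {
		ds[i] = s.latency
	}
	sort.Slice(ds, func(i, j int) bool { return ds[i] < ds[j] })
	rank := int(math.Ceil(0.99 * float64(len(ds)))) // 1-based rank
	return ds[rank-1]
}

func main() {
	// Hypothetical in-cluster address of the "null service".
	const target = "http://null-service.probes.svc.cluster.local/"
	client := &http.Client{Timeout: 5 * time.Second}

	var window []sample
	ticker := time.NewTicker(time.Second) // one probe per second, per the SLI
	defer ticker.Stop()

	for now := range ticker.C {
		start := time.Now()
		if resp, err := client.Get(target); err == nil {
			resp.Body.Close()
			window = append(window, sample{at: start, latency: time.Since(start)})
		} // a real prober would also account for failed probes

		// Drop samples that fell out of the 5-minute SLI window.
		cutoff := now.Add(-5 * time.Minute)
		for len(window) > 0 && window[0].at.Before(cutoff) {
			window = window[1:]
		}

		fmt.Printf("p99 over last 5m: %v (%d samples)\n",
			percentile99(window), len(window))
	}
}
```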

### User stories
- As a user of vanilla Kubernetes, I want some guarantee of how fast my HTTP
request to a Kubernetes service reaches its endpoint.

### Other notes
- We obviously can't give any guarantee in the general case, because cluster
administrators may configure their clusters however they want.
- As a result, we define the SLI to be very generic (it applies no matter how
your cluster is set up), but we provide the SLO only for default installations,
with the additional requirement that the low-level RTT between nodes is lower
than Y.
- Network latency is one of the most crucial aspects of application
performance, especially in a microservices world. As a result, to meet user
expectations, we need to provide some guarantees around it.
- We decided on the SLI definition as formulated above, because:
  - it represents a user-oriented, end-to-end flow - among other things, it
    covers the latency of the in-cluster network programming mechanism
    (e.g. iptables). <br/>
    __TODO:__ We considered making DNS resolution part of it, but decided not
    to mix them. However, longer term we should consider joining them.
  - it is easily measurable in all running clusters in which we can run probers
    (e.g. measuring the latencies of requests coming from all pods on a given
    node would require additional instrumentation, such as a sidecar for
    each of them, and that overhead may not be acceptable in many cases)
  - it is not application-specific

### Caveats
- The SLI is formulated for a single prober pod, even though users are mostly
interested in the aggregation across all pods (that is done only at the SLO
level). However, this provides very similar guarantees and makes the SLI fairly
easy to measure.
- The RTT between nodes may differ significantly if nodes are in different
topologies (e.g. GCP zones). Given that topology-aware service routing is not
natively supported in Kubernetes yet, we explicitly acknowledge that, depending
on the pinged endpoint, results may differ significantly if nodes span multiple
topologies.
- The prober itself is fairly trivial and needs only a negligible amount of
resources. Unfortunately, there isn't any existing component to which we can
attach that functionality (e.g. kube-proxy runs in the host network), so
**we will create a dedicated set of prober pods**, with their number
proportional to cluster size.
- We don't have any "null service" running in the cluster, so an administrator
has to set one up to make the SLI measurable in a real cluster. In tests, we
will create a service on top of the prober pods (see the sketch after this
list).
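
As mentioned above, in tests the "null service" will simply be put on top of
the prober pods. Below is a minimal, hedged sketch of creating such a service
with client-go; the namespace (`probes`), the pod label (`app: network-prober`)
and the port numbers are hypothetical placeholders, not names defined by this
proposal:

```go
// Illustrative sketch of the test-only setup - not part of this proposal.
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes this runs inside the cluster (e.g. from a test-harness pod).
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The "null service" simply selects the prober pods themselves, so every
	// probe exercises the regular in-cluster service routing path.
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "null-service", Namespace: "probes"},
		Spec: corev1.ServiceSpec{
			Selector: map[string]string{"app": "network-prober"},
			Ports: []corev1.ServicePort{{
				Port:       80,                   // port the probers ping
				TargetPort: intstr.FromInt(8080), // port the prober pods listen on
			}},
		},
	}

	if _, err := client.CoreV1().Services("probes").Create(
		context.TODO(), svc, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```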

### Test scenario

__TODO: Describe test scenario.__
1 change: 1 addition & 0 deletions sig-scalability/slos/slos.md
@@ -108,6 +108,7 @@ Prerequisite: Kubernetes cluster is available and serving.
| __Official__ | Startup latency of stateless and schedulable pods, excluding time to pull images and run init containers, measured from pod creation timestamp to when all its containers are reported as started and observed via watch, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, 99th percentile per cluster-day<sup>[1](#footnote1)</sup> <= 5s | [Details](./pod_startup_latency.md) |
| __WIP__ | Latency of programming a single (e.g. iptables on a given node) in-cluster load balancing mechanism, measured from when service spec or list of its `Ready` pods change to when it is reflected in load balancing mechanism, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, 99th percentile of (99th percentiles across all programmers (e.g. iptables)) per cluster-day<sup>[1](#footnote1)</sup> <= X | [Details](./network_programming_latency.md) |
| __WIP__ | Latency of programming a single in-cluster dns instance, measured from when service spec or list of its `Ready` pods change to when it is reflected in that dns instance, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, 99th percentile of (99th percentiles across all dns instances) per cluster-day <= X | [Details](./dns_programming_latency.md) |
| __WIP__ | In-cluster network latency from a single prober pod, measured as the latency of per-second pings from that pod to a "null service", measured as 99th percentile over last 5 minutes. | In default Kubernetes installation with RTT between nodes <= Y, 99th percentile of (99th percentile over all prober pods) per cluster-day <= X | [Details](./network_latency.md) |

<a name="footnote1">\[1\]</a> For the purpose of visualization it will be a
sliding window. However, for the purpose of reporting the SLO, it means one