Network latency SLI #2636
Merged
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,54 @@ | ||
## In-cluster network latency SLIs/SLOs details | ||
|
||
### Definition | ||
|
||
| Status | SLI | SLO | | ||
| --- | --- | --- | | ||
| __WIP__ | In-cluster network latency from a single prober pod, measured as the latency of a per-second ping from that pod to a "null service", reported as the 99th percentile over the last 5 minutes. | In a default Kubernetes installation with RTT between nodes <= Y, 99th percentile of (99th percentile over all prober pods) per cluster-day <= X | | ||
|
||
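The aggregation in the table above can be sketched as follows. This is only one possible reading of the definition, not the project's implementation: the function names (`percentile`, `pod_sli`, `cluster_sli`, `slo_met`) are hypothetical, the nearest-rank percentile method is an assumption (the document does not say which method is used), and `x` stands for the still-unspecified threshold X.

```python
def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100 (assumed method)."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of q * n / 100, avoiding floating-point rounding.
    rank = (q * len(ordered) + 99) // 100
    return ordered[rank - 1]


def pod_sli(latencies_last_5_min):
    """SLI for one prober pod: p99 of its per-second pings over 5 minutes."""
    return percentile(latencies_last_5_min, 99)


def cluster_sli(pod_slis):
    """Aggregate across prober pods: p99 over all per-pod SLI values."""
    return percentile(pod_slis, 99)


def slo_met(cluster_slis_for_day, x):
    """SLO check: p99 of the cluster-level values over a cluster-day <= X."""
    return percentile(cluster_slis_for_day, 99) <= x
```

With one ping per second, a 5-minute window holds 300 samples, so `pod_sli` would typically receive a list of 300 latencies per pod.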
### User stories | ||
- As a user of vanilla Kubernetes, I want some guarantee of how fast my HTTP | ||
request to a Kubernetes service reaches its endpoint. | ||
|
||
### Other notes | ||
- We obviously can't give any guarantee in the general case, because cluster | ||
administrators may configure their clusters however they want. | ||
- As a result, we define the SLI to be very generic (no matter how your cluster | ||
is set up), but we provide an SLO only for default installations, with the additional | ||
requirement that low-level RTT between nodes is lower than Y. | ||
- Network latency is one of the most crucial aspects of application | ||
performance, especially in the microservices world. As a result, to | ||
meet user expectations, we need to provide some guarantees around it. | ||
- We chose the SLI definition as formulated above, because: | ||
  - it represents a user-oriented end-to-end flow: among other things, it involves | ||
    the latency of the in-cluster network programming mechanism (e.g. iptables). <br/> | ||
    __TODO:__ We considered making DNS resolution part of it, but decided not | ||
    to mix them. However, longer term we should consider joining them. | ||
  - it is easily measurable in all running clusters in which we can run probers | ||
    (by contrast, measuring request latencies coming from all pods on a given | ||
    node would require some additional instrumentation, such as a sidecar for | ||
    each of them, and that overhead may not be acceptable in many cases) | ||
  - it is not application-specific | ||
|
||
### Caveats | ||
- The SLI is formulated for a single prober pod, even though users are mostly | ||
interested in the aggregation across all pods (which is done only at the SLO | ||
level). However, this provides very similar guarantees and makes the SLI fairly | ||
easy to measure. | ||
- The RTT between nodes may differ significantly if nodes are in different | ||
topologies (e.g. GCP zones). However, given that topology-aware service routing | ||
is not natively supported in Kubernetes yet, we explicitly acknowledge that, | ||
depending on the pinged endpoint, results may differ significantly if nodes | ||
span multiple topologies. | ||
- The prober itself is fairly trivial and needs only a negligible | ||
amount of resources. Unfortunately, there isn't any existing component to which we can | ||
attach that functionality (e.g. KubeProxy is running in host network), so | ||
**we will create a dedicated set of prober pods**, with their number | ||
proportional to cluster size. | ||
- We don't have any "null service" running in the cluster, so an administrator has | ||
to set one up to make the SLI measurable in a real cluster. In tests, we will | ||
create a service on top of prober pods. | ||
|
||
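To make the prober described above more concrete, here is a minimal sketch of what a single prober might do, under several assumptions that are not part of this proposal: the ping is an HTTP GET, the `Prober` class and `http_ping` helper are hypothetical names, and the null-service URL would be supplied by whoever deploys it. Handling of failed or timed-out pings is left out because the document does not specify it.

```python
import time
import urllib.request
from collections import deque

WINDOW_SECONDS = 300  # 5 minutes of per-second pings


def http_ping(url, timeout=1.0):
    """Send one HTTP GET to the (hypothetical) null service URL and drain the body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()


class Prober:
    """Keeps the last 5 minutes of ping latencies for one prober pod."""

    def __init__(self, ping, clock=time.perf_counter):
        self._ping = ping    # zero-argument callable that performs one ping
        self._clock = clock  # injectable so the prober can be tested offline
        self.samples = deque(maxlen=WINDOW_SECONDS)

    def probe_once(self):
        """Time a single ping and record its latency in seconds."""
        start = self._clock()
        self._ping()
        latency = self._clock() - start
        self.samples.append(latency)
        return latency

    def p99(self):
        """Nearest-rank 99th percentile over the current window (assumed method)."""
        if not self.samples:
            raise ValueError("no samples yet")
        ordered = sorted(self.samples)
        rank = (99 * len(ordered) + 99) // 100
        return ordered[rank - 1]
```

A prober pod could then call `probe_once()` once per second, for example with `Prober(lambda: http_ping(url))`, and export `p99()` as its SLI value. Passing the ping as a callable keeps the measurement logic independent of the transport, which is a design choice of this sketch rather than something the proposal prescribes.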
### Test scenario | ||
|
||
__TODO: Describe test scenario.__ |