Network latency SLI #2636
Conversation
| Status | SLI | SLO |
| --- | --- | --- |
| __WIP__ | In-cluster network latency from a single node, measured as the latency of a per-second ping to kubernetes.default.svc.cluster.local/, reported as the 99th percentile over the last 5 minutes. | In a default Kubernetes installation with RTT between nodes <= Y, 99th percentile of (99th percentile over all nodes) per cluster-day <= X |
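The windowed percentile in the SLI column can be sketched as follows. This is illustrative only; the class and method names (`LatencySLI`, `record`, `p99`) are ours, not from the proposal, and it uses the nearest-rank percentile method as one reasonable choice.

```python
import math
from collections import deque

class LatencySLI:
    """Per-node tracker: one RTT sample per second, 99th percentile
    over the last 5 minutes, as described in the SLI column."""

    WINDOW = 5 * 60  # one ping per second for 5 minutes

    def __init__(self):
        self.samples = deque(maxlen=self.WINDOW)  # old samples fall out

    def record(self, rtt_ms):
        self.samples.append(rtt_ms)

    def p99(self):
        if not self.samples:
            return None
        ordered = sorted(self.samples)
        # nearest-rank method: the p99 is the ceil(0.99 * n)-th smallest value
        return ordered[math.ceil(0.99 * len(ordered)) - 1]
```

The SLO would then aggregate these per-node p99 values again (99th percentile across nodes, per cluster-day) and compare against X.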
Should we use the master for this SLI? It is "special" in some sense, as in some implementations, the master lives on a different networking plane.
Also, ping from where. And you probably mean TCP RTT, not ICMP.
Yes - the "different networking plane" argument is a very good one. We should probably use some service that lives on nodes.
Do we have any service that we can use for that (in vanilla setup)?
About the ping - it's described above (maybe it should be here too). It's from nodes [described in the last section]
And yes - I think it's TCP RTT.
(Also, we should probably use a non-TLS-based endpoint to eliminate that latency)
Does it make sense to say "RTT to a null service" where null service is defined to be some hello-world serving container?
> Also, we should probably use a non-TLS-based endpoint to eliminate that latency

Yes - that absolutely makes sense.

> Does it make sense to say "RTT to a null service" where null service is defined to be some hello-world serving container?

Yes - that would be the ideal goal. Though, what I wanted to avoid is creating a dedicated service for that (or at least, creating dedicated pods to serve that service), because we would like to measure this in user clusters too, and we don't want to consume additional user resources.
Maybe the workaround would be to expose this "null service" from kube-proxy pods; then we would just need to create a service (which is not that big a deal).
The drawback is that there would then be a backend on every single node, so this won't work well in large clusters.
[Ideally, I would say let's expose it from a subset of kube-proxies, but then we would need to add labels to some kube-proxies, which turns out to be too complicated, I think...]
Maybe you have something better in mind?
As mentioned above, we don't seem to have anything satisfying our needs right now.
Making a service on top of kube-proxies would be problematic in large clusters, and choosing only a subset complicates cluster setup.
So we decided to require that the null service be set up by the cluster administrator.
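The "null service" discussed here is just a hello-world endpoint serving plain HTTP (no TLS, per the comment above). A minimal sketch of what a cluster administrator might deploy, using only the standard library (handler and helper names are ours):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class NullHandler(BaseHTTPRequestHandler):
    """Answers every GET with a tiny plain-HTTP 200 -- no TLS, no real work,
    so the measured latency is dominated by the network path."""

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # stay quiet under a once-per-second probe

def serve(port=0):
    """Bind the null service; port 0 picks a free port.
    The caller runs serve_forever() (e.g. in the container entrypoint)."""
    return HTTPServer(("", port), NullHandler)
```

In practice this would be packaged as a container behind a regular Service so probers can resolve and hit it cluster-wide.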
### Other notes
- We obviously can't give any guarantee in a general case, because cluster
administrators may configure cluster as the want (e.g. number of DNS replicas).
the => they
done
Force-pushed from c8fefce to 522e84e
@bowei - PTAL
Force-pushed from 522e84e to 02ee22e
cc: @lzang
easy to measure.
- The prober reporting that is fairly trivial and itself needs only negligible
amount of resources. However, to avoid any visible overhead in the cluster
(in terms of additionally needed components) **we will make it part of kube-proxy**
This is going to test (as you stated in the SLI column) traffic from nodes. I don't think that is the most representative or useful metric. It seems more correct to test from a pod, through a service, to a pod, across nodes in the same primary topology.
I don't know how to generically express "same primary topology". In GCP that is zone, but topology is arbitrary -- maybe that needs to be parameterized?
We might ALSO want to SLI pod-to-pod without a service (or with a headless service) with the same conditions.
> This is going to test (as you stated in the SLI column) traffic from nodes. I don't think that is the most representative or useful metric. It seems more correct to test from a pod, through a service, to a pod, across nodes in the same primary topology.

Kube-proxy is also a pod. So why isn't this what you want?
Re same primary topology - given that we currently don't really support that fully, I'm a bit sceptical about doing that. Once we have topology-aware service routing (#2846) we can update this SLI.
@thockin - WDYT?
Discussed that offline with @thockin.
The main points were:
- we can't use kube-proxy, because it's in the host network - so I switched to using a dedicated prober
- about topologies - to make explicit that we acknowledge very different latencies when nodes are in different topologies, I added an explicit point about that.
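A dedicated prober as described above boils down to timing a plain-HTTP GET (so it measures TCP round trips, not ICMP, per the earlier comment) once per second. A minimal sketch using only the standard library; the function name and the millisecond unit are our choices:

```python
import time
import urllib.request

def probe_once(url, timeout=1.0):
    """One probe: wall-clock time of a plain-HTTP GET to the null
    service, in milliseconds. Because the request rides on a fresh TCP
    connection, the measurement includes the TCP connection RTT."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return (time.monotonic() - start) * 1000.0
```

The prober pod would call this once per second against the null service's cluster DNS name and feed samples into the sliding-window p99 described in the SLI table.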
Force-pushed from 02ee22e to 70ffdf1
/lgtm
Are we going to measure this SLI in users' clusters in prod? From the discussion above, it seems that we would need users to set up this test, since it consumes their resources. Do we know if users are willing to do that? Also, would the result vary significantly with the user cluster's load level and deployment scenario? How do we account for that when we get the data? For example, the result could differ if the user is running an Istio sidecar or has network policy enabled.
It depends on the provider whether they are willing to do that, whether they have permissions from customers, etc. Yes - it will vary, but that's the purpose of different SLIs - to know where we are :-)
/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: thockin, wojtek-t

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/assign @bowei @thockin @dcbw @caseydavenport
@kubernetes/sig-network-pr-reviews @kubernetes/sig-scalability-pr-reviews