Commit c52c7e8

author m1093782566 committed
add service topology kep
1 parent 04ff155 commit c52c7e8

File tree: 2 files changed (+219 -1 lines changed)

keps/NEXT_KEP_NUMBER

+1-1
@@ -1 +1 @@
-31
+32
+218
@@ -0,0 +1,218 @@
---
kep-number: 31
title: Topology-aware service routing
status: Pending
authors:
  - "@m1093782566"
owning-sig: sig-network
reviewers:
  - "@thockin"
  - "@johnbelamaric"
approvers:
  - "@thockin"
creation-date: 2018-10-24
last-updated: 2018-10-26
---

# Topology-aware service routing

## Table of Contents

* [Motivation](#motivation)
  * [Goals](#goals)
  * [Non-goals](#non-goals)
  * [Use cases](#use-cases)
  * [Background](#background)
* [Proposal](#proposal)
* [Implementation History](#implementation-history)
  * [Service API changes](#service-api-changes)
  * [Endpoints API changes](#endpoints-api-changes)
  * [Endpoints Controller changes](#endpoints-controller-changes)
  * [Kube-proxy changes](#kube-proxy-changes)
  * [DNS changes](#dns-changes)
    * [CoreDNS changes](#coredns-changes)
    * [Kube-dns changes](#kube-dns-changes)

## Motivation

Provide a generic way to implement "local service" routing, i.e. topology-aware routing of services.

Locality is defined by the user and can be any topology-related attribute. "Local" means "the same topology level", e.g. the same node, the same rack, the same failure zone, the same failure region, the same cloud provider, etc. Two nodes are considered "local" if they have the same value for a particular node label, called the "topology key".
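
For example (node names and label values here are purely illustrative), the following two nodes are "local" to each other at the "zone" level, because they share the same value for the "zone" label key, but not at the "host" level:

```yaml
kind: Node
metadata:
  name: node-a
  labels:
    host: node-a
    zone: zone-1
---
kind: Node
metadata:
  name: node-b
  labels:
    host: node-b
    zone: zone-1
```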

### Goals

Provide a generic way to support topology-aware routing of services across arbitrary topological domains, e.g. node, rack, zone, region, etc., expressed via node labels.

### Non-goals

* Scheduler spreading to implement this sort of topology guarantee
* Dynamic availability
* Health-checking
* Capacity-based or load-based spillover

### Use cases

* Logging agents such as fluentd. fluentd is deployed as a DaemonSet and applications only need to communicate with the fluentd instance on the same node.
* Sharded services that keep per-node local information in each shard.
* Authenticating proxies such as [aws-es-proxy](https://github.com/kopeio/aws-es-proxy).
* In the container identity working group, giving DaemonSet pods a unique identity per host is on the 2018 plan, and ensuring that local pods can communicate securely with local node services is a key goal there. -- from @smarterclayton
* Regional data transfer costs in multi-AZ setups: for instance, in AWS with a multi-AZ setup, half of the traffic will cross AZs, incurring regional data transfer costs, whereas if the backend were local, the traffic would not cross the AZ boundary.
* Performance benefits of node-local or rack-local traffic: lower latency and higher bandwidth.

### Background

Multi-zone cluster deployments are a pain point because cross-zone network traffic is charged while in-zone traffic is not. In addition, cross-node traffic may carry sensitive metadata from other nodes. Therefore, users prefer service backends that are close to them, e.g. in the same zone, rack or host, for security, performance and cost reasons.

The Kubernetes scheduler can constrain a pod to run only on particular nodes/zones. However, the Kubernetes service proxy just randomly picks an available backend, which can be very far from the client, so we need a topology-aware service routing solution in Kubernetes: basically, a way to find the nearest service backend or, put differently, to let people configure whether traffic should ALWAYS reach a local service backend. In this way they can reduce network latency, improve security, save money and so on. However, because topology is arbitrary (zone, region, rack, generator, whatever), we should allow arbitrary locality.

`ExternalTrafficPolicy` was added in v1.4, but only for NodePort and external LB traffic. `NodeName` was added to `EndpointAddress` to allow kube-proxy to filter local endpoints for various future purposes.

Based on our experience with advanced routing setups and a recent demo of enabling this feature in Kubernetes, this document introduces a more generic way to support arbitrary service topology.

## Proposal

This proposal builds off of earlier requests to [use local pods only for kube-proxy loadbalancing](https://github.com/kubernetes/kubernetes/issues/7433) and the [node-local service proposal](https://github.com/kubernetes/kubernetes/pull/28637). However, this document proposes that we should not only take care of the particular "node-local" use case, but also figure out a more generic mechanism.

Locality is user-defined. When we set the topology key "hostname" for a service, we expect each node to carry a different value for the node label key "hostname".

Users can control the level of topology. For example, if someone runs a logging agent as a DaemonSet, they can set a "hard" topology requirement for same-host. If the "hard" requirement is not met, the connection simply fails with "service not available".

Someone else may set a "soft" topology requirement for same-host, meaning they "prefer" same-host endpoints but can accept endpoints on other hosts when, for some reason, no local backend is available on a given host.

If multiple endpoints satisfy the "hard" or "soft" topology requirement, we will randomly pick one by default.

The routing decision is expected to be implemented by kube-proxy, and by kube-dns/CoreDNS for headless services.

## Implementation History

### Service API changes

Users need a way to declare which services are topology-aware and what "local" means for a particular service's backends.

In this proposal, we give the service owner a chance to configure the service's locality. A new optional field, `topologyKeys`, would be introduced to `ServiceSpec` as a string slice.

```go
type ServiceSpec struct {
	// topologyKeys is a preference-order list of topology keys. If backends exist for
	// index [0], they will always be chosen; only if no backends exist for index [0]
	// will backends for index [1] be considered.
	// If this field is specified and no index has any backends, the service has no
	// backends, and connections will fail. We say these requirements are hard.
	// In order to express a soft requirement, we may give a special node label key ""
	// (the empty string), which means "match all nodes".
	TopologyKeys []string `json:"topologyKeys" protobuf:"bytes,1,opt,name=topologyKeys"`
}
```

An example of a `Service` with topology keys:

```yaml
kind: Service
metadata:
  name: service-local
spec:
  topologyKeys: ["host", "zone"]
```

In the example above, we will first try to find backends on the same host. If no backends match, we will then try our luck in the same zone. If we finally cannot find any backends on the same host or in the same zone, the service has no satisfied backends and connections will fail.

If we configure `topologyKeys` as `["host", ""]`, we make a best effort to find backends on the same host and do not fail the connection if no matching backends are found.
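
For illustration, such a soft-requirement variant of the example above might look like this (reusing the hypothetical "host" label key; the trailing empty string means "match all nodes"):

```yaml
kind: Service
metadata:
  name: service-local-soft
spec:
  # Prefer same-host backends, but fall back to any backend instead of failing.
  topologyKeys: ["host", ""]
```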

### Endpoints API changes

Although `NodeName` was already added to `EndpointAddress`, we want `Endpoints` to carry more of the node's topological information, so that topology information other than the hostname can be used.

This proposal will add a new `Topologies` field to `Endpoints.Subsets.Addresses` for identifying which topological domain the backend pod lives in.

```go
type EndpointAddress struct {
	// Labels of the node hosting the endpoint.
	Topologies map[string]string
}
```

Please note that we only copy the labels that we know are needed for the topological constraints. In other words, only the node labels referenced by `serviceSpec.topologyKeys` are copied from the node to the endpoint.
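
To make the shape concrete, an `Endpoints` object for the `service-local` example above might carry entries like the following (IPs, node names, label values, and the serialized field name `topologies` are illustrative assumptions, not part of the proposal):

```yaml
kind: Endpoints
metadata:
  name: service-local
subsets:
  - addresses:
      - ip: 10.244.1.5
        nodeName: node-a
        # Only the labels referenced by serviceSpec.topologyKeys are copied here.
        topologies:
          host: node-a
          zone: zone-1
```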

### Endpoints Controller changes

The endpoints controller will populate the `Topologies` property of each `EndpointAddress`. We want `EndpointAddress.Topologies` to tell the load balancer, such as kube-proxy, which topological domain the endpoint lives in.

The endpoints controller will need to watch Nodes in order to know the labels of the node hosting each endpoint, and it will copy the node labels referenced by the service spec's topology keys to the `EndpointAddress`.

The endpoints controller will also maintain an extra cache, `NodeToPodsCache`, which maps a node's name to the pods running on it. Node additions, deletions and label changes will trigger a `NodeToPodsCache` re-index; a rough sketch of such a cache follows below.
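
A minimal illustration of what such a cache could look like (the type and method names here are illustrative, not a settled API):

```go
import (
	"sync"

	v1 "k8s.io/api/core/v1"
)

// NodeToPodsCache maps a node name to the pods currently scheduled on it.
type NodeToPodsCache struct {
	mu         sync.RWMutex
	nodeToPods map[string][]*v1.Pod
}

// Reindex rebuilds the mapping for one node, e.g. after the node is added,
// deleted, or its labels change.
func (c *NodeToPodsCache) Reindex(nodeName string, pods []*v1.Pod) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nodeToPods[nodeName] = pods
}
```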

So, the new logic of the endpoints controller might look like:

```go
// Watch Nodes so we know the labels of the node hosting each endpoint.
go watchNodes()

// In each sync loop, for a given service, sync its endpoints.
for i, pod := range serviceBackends {
	node := nodeCache[pod.Spec.NodeName]
	endpointAddress := v1.EndpointAddress{
		Topologies: map[string]string{},
	}
	// Copy to the endpoint only the topology-related labels of the node it runs on,
	// i.e. only the node labels referenced by the service spec's topology keys.
	for _, topologyKey := range service.Spec.TopologyKeys {
		endpointAddress.Topologies[topologyKey] = node.Labels[topologyKey]
	}
	endpoints.Subsets[i].Addresses = append(endpoints.Subsets[i].Addresses, endpointAddress)
}
```

### Kube-proxy changes

Kube-proxy will respect the topology keys of each service, so kube-proxy instances on different nodes may create different proxy rules.

Kube-proxy will watch (or periodically get -- which approach has better performance?) its own node and, if `service.TopologyKeys` is not empty, will only select the endpoints that are in the same topological domain as that node.

The new logic of kube-proxy might look like:

```go
// Watch (or periodically get) this kube-proxy's own Node object by node name.
go watchNode(nodeName)

for _, topologyKey := range service.Spec.TopologyKeys {
	endpointsMeetRequirement := make([]endpointInfo, 0)
	for i := range endpoints.Subsets {
		ss := endpoints.Subsets[i]
		for j := range ss.Addresses {
			// Check whether the endpoint is in the same topological domain
			// as the node running this kube-proxy.
			if ss.Addresses[j].Topologies[topologyKey] == node.Labels[topologyKey] {
				// endpoint is the proxy's endpointInfo for ss.Addresses[j].
				endpointsMeetRequirement = append(endpointsMeetRequirement, endpoint)
			}
		}
	}
	// Randomly pick one if at least one endpoint (>=1) matched this key.
	if len(endpointsMeetRequirement) != 0 {
		// route the request to a randomly chosen matching endpoint
		return
	}
}
// The connection fails because no endpoint matched any topology key.
```

### DNS changes

We should consider this kind of topology support for headless services in CoreDNS and kube-dns. As the DNS servers will respect the topology keys of each headless service, clients/pods on different nodes may get different DNS responses.

#### CoreDNS changes

CoreDNS already has the client-IP-to-Pod mapping, and it will need to watch nodes to get topology-related information. When a client/pod sends a request for a headless service domain to CoreDNS, CoreDNS will retrieve the node labels of both the client and the backend Pods. CoreDNS will select only the IPs of the backend Pods that are in the same topological domain as the client Pod, and then write the A records.

The new logic of CoreDNS might look like:

```go
// Watch Nodes so we can look up node labels for both clients and backends.
go watchNodes()

clientIP := request.SourceIP
clientPod := clientIPToPod[clientIP]
clientNode := nodes[clientPod.Spec.NodeName]
for _, topologyKey := range headlessService.Spec.TopologyKeys {
	matched := false
	for _, backendPod := range serviceEndpoints {
		backendNode := nodes[backendPod.Spec.NodeName]
		if clientNode.Labels[topologyKey] == backendNode.Labels[topologyKey] {
			// write the backendPod IP as an A record
			matched = true
		}
	}
	// Stop at the first topology key (in preference order) that yields any backends.
	if matched {
		return
	}
}
```

#### Kube-dns changes

Kube-dns would first need to build the client-IP-to-Pod mapping; the other new behaviors are the same as for CoreDNS. A rough sketch of such a mapping follows.
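
A minimal illustration, assuming the mapping is built by watching Pods (names like `clientIPToPod`, `onPodUpdate`, and `onPodDelete` are hypothetical, not an existing kube-dns API):

```go
import (
	v1 "k8s.io/api/core/v1"
)

// clientIPToPod maps a pod IP (seen as the DNS client source IP) to the Pod.
var clientIPToPod = map[string]*v1.Pod{}

// onPodUpdate keeps the mapping current as pods are created or updated.
func onPodUpdate(pod *v1.Pod) {
	if pod.Status.PodIP != "" {
		clientIPToPod[pod.Status.PodIP] = pod
	}
}

// onPodDelete removes the entry when a pod goes away.
func onPodDelete(pod *v1.Pod) {
	delete(clientIPToPod, pod.Status.PodIP)
}
```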
