Commit 8d605e4 by m1093782566

add service topology kep

1 parent 04ff155 commit 8d605e4
1 file changed: +257 −0

---
kep-number: 31
title: Topology-aware service routing
status: Pending
authors:
  - "@m1093782566"
owning-sig: sig-network
reviewers:
  - "@thockin"
  - "@johnbelamaric"
approvers:
  - "@thockin"
creation-date: 2018-10-24
last-updated: 2018-10-24
---

# Topology-aware service routing

## Table of Contents

* [Motivation](#motivation)
  * [Goals](#goals)
  * [Non-goals](#non-goals)
  * [Use cases](#use-cases)
  * [Background](#background)
* [Proposal](#proposal)
* [Implementation history](#implementation-history)
  * [API changes](#api-changes)
  * [Endpoints Controller changes](#endpoints-controller-changes)
  * [Kube-proxy changes](#kube-proxy-changes)

## Motivation

Figure out a generic way to implement "local service" routing, that is, topology-aware routing of services.

Locality is defined by the user and can be any topology-related attribute. "Local" means "in the same topology domain", e.g. the same node, same rack, same failure zone, same failure region, same cloud provider, etc.

### Goals

Provide a generic way to support topology-aware routing of services across arbitrary topological domains, e.g. node, rack, zone, region, etc.

### Non-goals

Using scheduler spreading to implement this sort of topology guarantee.

### Use cases

* Logging agents such as fluentd. Deploy fluentd as a DaemonSet so that applications only need to communicate with the fluentd instance on the same node.
* A sharded service that keeps per-node local information in each shard.
* Authenticating proxies such as [aws-es-proxy](https://github.com/kopeio/aws-es-proxy).
* In the container identity WG, being able to give DaemonSet pods a unique identity per host is on the 2018 plan, and ensuring that local pods can communicate securely with local node services is a key goal there. -- from @smarterclayton
* Regional data costs in a multi-AZ setup - for instance, in AWS with a multi-AZ setup, half of the traffic will cross AZs and incur regional data transfer costs, whereas traffic that stays local does not.
* Performance benefits (node-local/rack-local): lower latency and higher bandwidth.

### Background

This is a pain point for multi-zone cluster deployments, since cross-zone network traffic is charged while in-zone traffic is not. In addition, cross-node traffic may carry sensitive metadata from other nodes. Therefore, for security, performance and cost reasons, users prefer service backends that are close to them, e.g. in the same zone, rack or host.

The Kubernetes scheduler can constrain a pod to run only on particular nodes or zones. However, the Kubernetes service proxy just randomly picks an available backend for service routing, and that backend can be very far from the client, so we need a topology-aware service routing solution in Kubernetes - essentially, a way to find the nearest service backend. In other words, we want to let people configure whether to ALWAYS reach a local service backend. That way, they can reduce network latency, improve security, save money and so on. However, because topology is arbitrary - zone, region, rack, whatever - we should allow arbitrary locality.

`ExternalTrafficPolicy` was added in v1.4, but it only applies to NodePort and external LB traffic. `NodeName` was added to `EndpointAddress` to allow kube-proxy to filter local endpoints for various future purposes.

Based on our experience with advanced routing setups and a recent demo of enabling this feature in Kubernetes, this document introduces a more generic way to support arbitrary service topology.

## Proposal

This proposal builds off of earlier requests to [use local pods only for kube-proxy loadbalancing](https://github.com/kubernetes/kubernetes/issues/7433) and the [node-local service proposal](https://github.com/kubernetes/kubernetes/pull/28637). However, this document proposes that not only the particular "node-local" use case be taken care of, but that a more generic mechanism be worked out.

Locality is a "user-defined" thing. When we set the topology key "hostname" for a service, we expect each node to carry a different label value for the key "hostname".

Users can control the level of topology. For example, if someone runs a logging agent as a DaemonSet, they can set a "hard" topology requirement for same-host. If the "hard" requirement is not met, the request simply fails with "service not available".

If someone sets a "soft" topology requirement for same-host, they "prefer" same-host endpoints but can accept endpoints on other hosts when, for some reason, no local backend is available on a given host.

If multiple endpoints satisfy the "hard" or "soft" topology requirement, we will randomly pick one by default. The routing decision is expected to be implemented at the L3/L4 VIP level, such as in kube-proxy.

## Implementation history

### API changes

The user needs a way to declare which services are "local services" and what the "topology key" of a "local service" is.

This will be accomplished through a new object type, `ServicePolicy`. Services carrying the specified labels will be selected by the label selector in a `ServicePolicy`, and the `ServicePolicy` declares the topology policy for their endpoints.

Cluster administrators can configure which services are "local" and which topology they prefer via `ServicePolicy`. `ServicePolicy` is a namespace-scoped resource and is strictly optional. Policies other than topological preference could also be configured in `ServicePolicy`, but this proposal will not cover them.

```go
type ServicePolicy struct {
	TypeMeta
	ObjectMeta

	// Spec is the specification of the topology policy of this ServicePolicy.
	Spec TopologyPolicySpec
}

type TopologyPolicySpec struct {
	// ServiceSelector selects the services to which this ServicePolicy object applies.
	// A service may be selected by more than one ServicePolicy; in that case the
	// topology rules are combined additively.
	// This field is NOT optional; an empty ServiceSelector will result in an error.
	ServiceSelector metav1.LabelSelector `json:"serviceSelector" protobuf:"bytes,1,opt,name=serviceSelector"`

	// Topology is used to achieve a "local" service at a given topology level.
	// Users can control whatever topology level they want.
	// +optional
	Topology ServiceTopology `json:"topology" protobuf:"bytes,2,opt,name=topology"`
}

// ServiceTopology defines a service topology requirement.
type ServiceTopology struct {
	// Valid values for Mode are "ignored", "required" and "preferred".
	// "ignored" is the default value and the associated topology key will have no effect.
	// "required" is the "hard" requirement for the topology key; an example would be
	// "only visit service backends in the same zone".
	// If the topology requirements specified by this field are not met, the LB,
	// such as kube-proxy, will not pick endpoints for the service.
	// "preferred" is the "soft" requirement for the topology key; an example would be
	// "prefer to visit service backends in the same rack, but OK to other racks if none match".
	// +optional
	Mode ServiceTopologyMode `json:"mode" protobuf:"bytes,1,opt,name=mode"`

	// Key is the key of the node label that the system uses to denote such a
	// topology domain. There are some built-in topology keys, e.g.
	// kubernetes.io/hostname, failure-domain.beta.kubernetes.io/zone and
	// failure-domain.beta.kubernetes.io/region.
	// The built-in topology keys are good examples and we recommend users follow a
	// similar pattern for portability, but this is NOT enforced.
	// Users can define whatever topology key they like since topology is arbitrary.
	// +optional
	Key string `json:"key" protobuf:"bytes,2,opt,name=key"`
}
```

An example of `ServicePolicy`:

```yaml
kind: ServicePolicy
metadata:
  name: service-policy-example
  namespace: test
spec:
  serviceSelector:
    matchLabels:
      app: test
  topology:
    key: kubernetes.io/hostname
    mode: required
```

In this example, services in namespace `test` with label `app=test` will be chosen. Requests to these services will be routed only to backends on nodes with the same value for `kubernetes.io/hostname` as the originating pod's node. For "same host" to work as intended, every host should carry a unique `kubernetes.io/hostname` label value.

We can configure multiple `ServicePolicy` objects targeting a single service. In this case, the service carries multiple topology requirements and the relationship between all the requirements is a logical `AND`, for example:

```yaml
kind: ServicePolicy
metadata:
  name: service-policy-example-1
  namespace: foo
spec:
  serviceSelector:
    matchLabels:
      app: bar
  topology:
    key: kubernetes.io/region
    mode: required
---
kind: ServicePolicy
metadata:
  name: service-policy-example-2
  namespace: foo
spec:
  serviceSelector:
    matchLabels:
      app: bar
  topology:
    key: kubernetes.io/switch
    mode: required
```

In this example, services in namespace `foo` with label `app=bar` are governed by both `service-policy-example-1` and `service-policy-example-2`. Requests to these services will be routed only to backends that are both in the same region and on the same switch as the node running the requesting kube-proxy.
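
Combining multiple policies is a plain logical `AND` over the topology keys. A minimal sketch of that check (a hypothetical helper for illustration, not part of the proposed API):

```go
// matchesAllTopologies reports whether an endpoint satisfies every "required"
// topology constraint relative to the labels of the node running kube-proxy.
func matchesAllTopologies(endpointTopology, nodeLabels map[string]string, requiredKeys []string) bool {
	for _, key := range requiredKeys {
		if endpointTopology[key] != nodeLabels[key] {
			return false
		}
	}
	return true
}
```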

Although `NodeName` was already added to `EndpointAddress`, we want `Endpoints` to carry more of the node's topology information so as to allow topology levels other than hostname.

So, we create a new `Topology` field in `Endpoints.Subsets.Addresses` to identify which topology domains the endpoint's pod lives in, e.g. which host, rack, zone, region, etc. In other words, we copy the topology-related labels of the node hosting the endpoint into `EndpointAddress.Topology`.

```go
type EndpointAddress struct {
	// Topology carries the topology-related labels of the node hosting the endpoint.
	Topology map[string]string
}
```
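
For illustration, an address for an endpoint whose pod runs on node `node-1` in zone `us-east-1a` might be populated as follows (a hypothetical sketch; the exact set of keys depends on the `ServicePolicy` objects that select the service):

```go
endpointAddress := EndpointAddress{
	// Existing fields such as IP and NodeName are omitted for brevity.
	Topology: map[string]string{
		"kubernetes.io/hostname":                   "node-1",
		"failure-domain.beta.kubernetes.io/zone":   "us-east-1a",
		"failure-domain.beta.kubernetes.io/region": "us-east-1",
	},
}
```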

### Endpoints Controller changes

The Endpoints Controller will populate the `Topology` field for each `EndpointAddress`. We want `EndpointAddress.Topology` to tell the LB, such as kube-proxy, which topological domains (e.g. host, rack, zone, region, etc.) the endpoint is in.

The Endpoints Controller will need to watch two extra resources: ServicePolicy and Nodes. It watches ServicePolicy to know which services have topological preferences, and it watches Nodes to know the labels of the node hosting each endpoint so that it can copy the node labels referenced in the service's topology constraints to `EndpointAddress`.

The Endpoints Controller will maintain two extra caches: `NodeToPodsCache` and `ServiceToPoliciesCache`.

`NodeToPodsCache` maps a node's name to the pods running on it. A node being added or deleted, or a change to its labels, triggers a `NodeToPodsCache` reindex.

`ServiceToPoliciesCache` maps a Service's namespaced name to all of its ServicePolicy objects.
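
As a rough sketch (the shapes below are assumptions of this proposal, not a final implementation), the two caches could look like:

```go
// NodeToPodsCache maps a node name to the pods currently running on it
// (*v1.Pod is k8s.io/api/core/v1.Pod). It is reindexed when a node is added,
// deleted, or has its labels changed.
type NodeToPodsCache map[string][]*v1.Pod

// ServiceToPoliciesCache maps a service's namespaced name ("namespace/name")
// to all ServicePolicy objects that select that service.
type ServiceToPoliciesCache map[string][]*ServicePolicy
```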

So, the new logic of the Endpoints Controller might look like:

```go
// In addition to the existing watches, watch Node and ServicePolicy.
// In each sync loop, given a service, sync its endpoints:
for i, pod := range serviceBackends {
	servicePolicies := ServiceToPoliciesCache[service.Name]
	node := nodeCache[pod.Spec.NodeName]
	endpointAddress := &v1.EndpointAddress{Topology: map[string]string{}}
	// Copy the topology-related labels of the node hosting the endpoint to the
	// endpoint address. Only node labels referenced in the service's topology
	// constraints are included.
	for _, servicePolicy := range servicePolicies {
		topoKey := servicePolicy.Spec.Topology.Key
		endpointAddress.Topology[topoKey] = node.Labels[topoKey]
	}
	endpoints.Subsets[i].Addresses = []v1.EndpointAddress{*endpointAddress}
}
```
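
The per-address label copy in the loop above could be factored into a small helper along these lines (a sketch under this proposal's assumptions; `ServicePolicy` is the type defined earlier):

```go
// buildTopology copies from the node's labels only the topology keys referenced
// by the ServicePolicy objects that select this service.
func buildTopology(nodeLabels map[string]string, policies []*ServicePolicy) map[string]string {
	topology := make(map[string]string)
	for _, policy := range policies {
		key := policy.Spec.Topology.Key
		if value, ok := nodeLabels[key]; ok {
			topology[key] = value
		}
	}
	return topology
}
```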

### Kube-proxy changes

Kube-proxy will respect the topology keys of each service, so kube-proxy instances on different nodes may create different proxy rules.

Kube-proxy will watch its own node and will find the endpoints that are in the same topology domain as that node when `service.Topology.Mode != ignored`.

The new logic of kube-proxy might look like:

```go
// Kube-proxy watches the Node object matching its own node name.
switch service.Topology.Mode {
case "ignored":
	// Route each request to a random endpoint, as today.
case "required":
	endpointsMeetHardRequirement := make([]v1.EndpointAddress, 0)
	topologyKey := service.Topology.Key
	// Filter out endpoints that do not meet the "hard" topology requirement.
	for i := range endpoints.Subsets {
		ss := endpoints.Subsets[i]
		for j := range ss.Addresses {
			// Check whether the endpoint is in the same topology domain as the
			// node running kube-proxy.
			if ss.Addresses[j].Topology[topologyKey] == node.Labels[topologyKey] {
				endpointsMeetHardRequirement = append(endpointsMeetHardRequirement, ss.Addresses[j])
			}
		}
	}
	if len(endpointsMeetHardRequirement) != 0 {
		// If multiple endpoints match, route each request to one of them randomly.
	}
	// Otherwise, no endpoints are programmed and the service is unavailable on this node.
case "preferred":
	// First try to find endpoints that meet the "soft" topology requirement.
	// If none match, kube-proxy programs all available endpoints and lets the
	// kernel route each request randomly to one of them.
}
```
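
To make the filtering step concrete, here is a self-contained sketch (illustrative only; `EndpointAddress` is the extended type above, and the actual random pick stays in the existing proxier code paths):

```go
// filterByTopology returns the endpoints whose topology value for the given key
// matches the labels of the node running kube-proxy. In "required" mode an empty
// result means the service is unavailable on this node; in "preferred" mode the
// caller falls back to the full endpoint list when nothing matches.
func filterByTopology(addresses []EndpointAddress, nodeLabels map[string]string, key, mode string) []EndpointAddress {
	if mode == "ignored" || key == "" {
		return addresses
	}
	matched := make([]EndpointAddress, 0, len(addresses))
	for _, addr := range addresses {
		if addr.Topology[key] == nodeLabels[key] {
			matched = append(matched, addr)
		}
	}
	if mode == "preferred" && len(matched) == 0 {
		// Soft requirement: fall back to all available endpoints.
		return addresses
	}
	// Hard requirement: may be empty, meaning "service not available" on this node.
	return matched
}
```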
