KEP: Topology-aware service routing #2846

Conversation
// "preferred" is the "soft" requirement for topology key and an example would be
// "prefer to visit service backends in the same rack, but OK to other racks if none match"
// +optional
Mode ServicetopologyMode `json:"mode" protobuf:"bytes,1,opt,name=mode"`
Missing definition for ServiceTopologyMode?
oops, ServiceTopologyMode is a string type.
ObjectMeta

// specification of the topology policy of this ServicePolicy
Spec TopologyPolicySpec
If we are to have other things in ServicePolicy in the future, does it make sense to have

Spec ServicePolicySpec

then

type ServicePolicySpec struct {
    Topology TopologyPolicySpec
    // Room in future for other things
}
Yeah, makes sense. The form you suggested has better scalability.

done

### Kube-proxy changes
Should this impact kube-dns as well?
To support it for headless services, it needs to be supported in DNS as well, we can add it to CoreDNS when the time comes.
cc: @chrisohaver @fturib
Good question - it would be somewhat complex for DNS to do split-horizon responses based on client IP, but that is the implication of this (for headless services, anyway). @johnbelamaric
especially since client IP -> Node mapping is not always as easy as a CIDR (some people do not use that and assign arbitrary /32s or multiple /28s)
It requires watching pods, which isn't something we would want to do in general, and not something we would want to adapt kube-dns to do. CoreDNS can optionally do that already, so if the customer is willing to accept the cost of that we can do it for headless.
For purposes of this KEP we can make it clusterIP services only, and then the headless version can become a feature of CoreDNS maybe. If we do that though, should we enforce anything in the API if the user configures topology for a headless service (assuming we go the route of adding this to Service rather than a separate resource)? Otherwise it could cause expectations that aren't really being met.
I think I missed the headless service part in the proposal; I will take a deep look at the effect.
For other kinds of service, I assume there is no impact on DNS?
The document is updated to include the DNS part changes.
Force-pushed from 8d605e4 to da9255f.
I love this, any thoughts around allowing weighting of preference, more importantly in the case of ...
Thanks for moving this to a KEP. I think if we streamline this to not add new API kinds it will reduce the scope and make it easier to move forward.
Figure out a generic way to implement the "local service" route, say "topology aware routing of service".

Locality is defined by the user; it can be any topology-related thing. "Local" means the "same topology level", e.g. same node, same rack, same failure zone, same failure region, same cloud provider, etc.
Two nodes are considered "local" if they have the same value for a particular label, called the "topology key".
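That definition is simple enough to state as code; a tiny sketch (the function and parameter names are illustrative, not from the KEP):

```go
package topology

// sameTopologyDomain reports whether two nodes are "local" to each other for a
// given topology key: both nodes carry the label and the values are equal.
// E.g. with key "topology.kubernetes.io/zone", two nodes in the same zone are
// local, while a node missing the label is never local to anything.
func sameTopologyDomain(nodeALabels, nodeBLabels map[string]string, topologyKey string) bool {
	a, okA := nodeALabels[topologyKey]
	b, okB := nodeBLabels[topologyKey]
	return okA && okB && a == b
}
```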
mode: required

In our example, services in namespace `foo` with label `app=bar` will be governed by both `service-policy-example-1` and `service-policy-example-2`. Requests to these services will be routed only to backends that satisfy both the same region and the same switch as kube-proxy.
So, if it said region required and switch preferred, how would that look in the eventual rules from kube-proxy? How would I express:
prefer the local node, but allow any service backend if there are none on the local node
Endpoints with a specific label will be selected by the label selector in `ServicePolicy`, and `ServicePolicy` will declare the topology policy for those endpoints.

Cluster administrators can configure which services are "local" and what topology they prefer via `ServicePolicy`. `ServicePolicy` is a namespace-scoped resource and is strictly optional. We can configure policies other than topology preference in `ServicePolicy`, but this proposal will not cover them.
I like the separate ServicePolicy, but in general I think there is an effort to avoid adding new core API kinds. Adding this as an optional `topology` field in Service with a list of `key`, `mode` pairs may be easier to move forward.
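For illustration, the "list of `key`, `mode` pairs" shape could look roughly like the following sketch (type and field names are assumptions, not the KEP's final API):

```go
package servicetopology

// ServiceTopologyConstraint is one key/mode pair as suggested above.
type ServiceTopologyConstraint struct {
	// Key is a node label key, e.g. "kubernetes.io/hostname" or a zone label.
	Key string `json:"key"`
	// Mode is "required" (hard) or "preferred" (soft).
	Mode string `json:"mode"`
}

// ServiceSpecFragment shows only the hypothetical addition to ServiceSpec.
type ServiceSpecFragment struct {
	// Topology lists topology constraints for routing to this Service's backends.
	// +optional
	Topology []ServiceTopologyConstraint `json:"topology,omitempty"`
}
```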
Endpoint Controller will populate the `Topology` for each `EndpointAddress`. We want `EndpointAddress.Topology` to tell the LB, such as kube-proxy, what topology domain (e.g. host, rack, zone, region, etc.) the endpoint is in.

Endpoints controller will need to watch two extra resources: ServicePolicy and Nodes. It watches ServicePolicy to know which services have topology preferences, and it watches Nodes to know the labels of the node hosting the endpoint and to copy the node labels referenced in the service spec's topology constraints to EndpointAddress.
Ok, I see that's what you're suggesting here. If we push ServicePolicy into ServiceSpec then we don't need the extra watch for ServicePolicy.
Yes, but we still need to watch node. Right?
yes, endpoints controller will need to watch all nodes. kube-proxy will only need to know about (and watch) its own node (if it doesn't already).
Although `NodeName` was already added to `EndpointAddress`, we want `Endpoints` to carry more of the node's topology information, allowing topology levels other than hostname.

So, create a new `Topology` field in `Endpoints.Subsets.Addresses` to identify what topology domain the endpoint's pod is in, e.g. which host, rack, zone, region, etc. In other words, copy the topology-related labels of the node hosting the endpoint to `EndpointAddress.Topology`.
Topology is arbitrary. How do we decide what data to use to populate this?
a) It's the entire node.metadata.labels
b) It's from a new field on node, similar to labels but distinct for topology
c) It's a subset of node labels (e.g. a known prefix)
d) Some rule copies topology-related labels from node -> pod (and then repeat this question)
e) Only copying the labels that we know are needed by the topology constraints (e.g. service slices on foo.com/rack, so we copy that label from nodes -> endpoints)?
Or something else?
Below it suggests copying the labels based on which ones are used by the ServicePolicy that selects this service. I think if we put the keys in the service spec then that's fairly simple to do here.
Voting e).
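A minimal sketch of option (e), assuming the endpoints controller already knows the topology keys that apply to the service (function and variable names are illustrative):

```go
package endpointscontroller

// copyTopologyLabels returns only the node labels referenced by the service's
// topology keys; nothing else from the node is copied into the EndpointAddress.
func copyTopologyLabels(nodeLabels map[string]string, topologyKeys []string) map[string]string {
	out := make(map[string]string, len(topologyKeys))
	for _, key := range topologyKeys {
		if key == "" {
			// "" is the proposed "match every node" wildcard discussed below;
			// there is no corresponding label to copy.
			continue
		}
		if value, ok := nodeLabels[key]; ok {
			out[key] = value
		}
	}
	return out
}
```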
### Non-goals

Scheduler spreading to implement this sort of topology guarantee.
non-goal: dynamic availability, health-checking, capacity-based or load-based spillover
Maybe just say: Definition of topology other than by node labels.
I want to be clear to readers that we're not requiring dynamic balancing here, which would be a major departure.
The user needs a way to declare which service is a "local service" and what the "topology key" of the "local service" is.

This will be accomplished through a new API object, `ServicePolicy`.
I wonder if ServicePolicy is too complicated for this. I thought for a while we would need a lot more configurability, but now that we have reduced it to 1 or 2 fields, I am not sure this meets the bar for a new resource. We need to majorly refactor Service and Endpoints anyway, so maybe it's simpler to just be simpler for now?
e.g. Just add a single field to `ServiceSpec`:
// Preference-order list of topology keys. How this list is interpreted is implementation-defined.
// E.g. a valid implementation might be to treat this field as a hard constraint: if backends exist for
// index [0], they will always be chosen; only if no backends exist for index [0] will backends
// for index [1] be considered. Another valid implementation might be to monitor load and
// "spill over" into less-preferred topologies dynamically.
//
// If this field is specified and all indices have no backends, the service has no backends, and
// connections will fail.
Topologies []string
(the field name can be debated)
We could argue that we need the "fall back to global" mode option (hard/soft), but even that could be satisfied by simply labeling all nodes or maybe by allowing "" as a special case meaning "match all nodes".
Less API is more good. How little can we get away with?
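For illustration, a minimal sketch of the hard-constraint reading described above, as kube-proxy might apply it (all names are assumptions; "" is the proposed match-everything wildcard):

```go
package proxy

// endpoint carries an address plus the topology labels copied for it.
type endpoint struct {
	IP       string
	Topology map[string]string
}

// filterByTopology walks the preference-ordered keys: the first key that yields
// any matching endpoints wins; later keys are consulted only if earlier ones
// match nothing. "" matches every endpoint (fall back to all). If every key is
// exhausted without a match, the service has no usable backends.
func filterByTopology(endpoints []endpoint, nodeLabels map[string]string, topologyKeys []string) []endpoint {
	for _, key := range topologyKeys {
		if key == "" {
			return endpoints
		}
		nodeValue, nodeHasKey := nodeLabels[key]
		if !nodeHasKey {
			continue // this node is in no domain for this key; try the next key
		}
		var matched []endpoint
		for _, ep := range endpoints {
			if v, ok := ep.Topology[key]; ok && v == nodeValue {
				matched = append(matched, ep)
			}
		}
		if len(matched) > 0 {
			return matched
		}
	}
	return nil
}
```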
We can get rid of the hard/soft by the "" meaning match all. But it seems like a bit of a "magic" value. There is logic to it, but it's not explicit.
So we have:

topologyKey: ["node", "zone", ""]

versus

topology:
- key: node
  mode: preferred
- key: zone
  mode: preferred

🤔
In the former, which is nice and compact, you have this implicit logic:
- all values are a preference except the last value which is a requirement
And that is really the ONLY logic that makes sense. In the latter you can express meaningless things like:
topology:
- key: node
  mode: required
- key: zone
  mode: required
In fact since topology is arbitrary, you could come up with rules that are not satisfied by any endpoint, although that would be a pretty pathological case.
So...I am leaning towards the first one. I wonder though if the "hard" requirement of the last key should not be the default. That is, which would be least surprising to the user - to restrict to only those topologies unless they add the magic value, or to fallback to any endpoint unless there is some other magic value?
A slice of topology keys would be simpler in the API and we can get rid of the new resource. But it seems not straightforward to me to express a HARD requirement. Can you give an example?
Hard:
topologyKey: ["kubernetes.io/hostname"]
Soft:
topologyKey: ["kubernetes.io/hostname", ""]
This seems pretty natural. But my point above is asking which is more common, or more expected, hard or soft? If we want it to be soft unless the user is explicit about wanting it to be a hard requirement (which is "safer" in a sense), then we need to treat the first one as soft and do something to flag that it's hard.
It could be worthwhile to solicit more input on user expectations for this, great point.
My assumption (with little data) is that "hard" is the more common case. Most use cases have been "give me my node-local agent".
It's not that I am making the last value special in its last-ness. I am saying that (yeah, it's a little ugly) "" is an implicit label that every node satisfies. E.g. if I wrote:
topologyKey: ["","kubernetes.io/hostname"]
Callers would never ever evaluate the "hostname" key because the "" key would match any node, and should never be empty (except in a 0-node cluster).
So yeah, it's "special", but by its meaning, not its position.
I understand that it's not position, and I think it's easy to figure out what it means when you see it. I just think it may be non-obvious that in order to have it match "any" you need an empty string in there. If the name were `topologyConstraints` or `topologiesAllowed` or something similar (it's really just labels, not inherently "topology") then my concern goes away. The `topologyKey` name comes from the podspec, which has additional context beyond just that name.
Endpoints with a specific label will be selected by the label selector in `ServicePolicy`, and `ServicePolicy` will declare the topology policy for those endpoints.

Cluster administrators can configure which services are "local" and what topology they prefer via `ServicePolicy`. `ServicePolicy` is a namespace-scoped resource and is strictly optional. We can configure policies other than topology preference in `ServicePolicy`, but this proposal will not cover them.
This is the crux for me - why would cluster admins decide this? It needs to be the same people deciding the deployment (e.g., if I deploy zonal, I want the service zonal).
This does raise a point that we really do want Deployment to be more topology aware, so I can have N instances in each zone. I know there's a KEP for that, don't know the status.
I remember we have discussed whether service policy should be part of Service. If we configure it outside of Service in a more admin-owned way, it looks more like Istio.
Probably users (not only cluster admins) should be able to configure the service policy? After all, service policy is namespace-scoped in this proposal.
yes, I don't think it's an admin thing, it's a service owner thing
Agree. If we had a bunch of policies we wanted to configure or if it was sufficiently complicated, a new resource might be justified, but I don't think it is.
In our example, services in namespace `foo` with label `app=bar` will be chosen. Requests to these services will be routed only to backends on nodes with the same value for `kubernetes.io/hostname` as the originating pod's node. If we want the "same host", probably every host should carry a unique `kubernetes.io/hostname` label.

We can configure multiple `ServicePolicy` objects targeting a single service. In this case, the service will carry multiple topology requirements and the relationship among all the requirements is logical `AND`, for example,
This semantic is pretty complicated, and is one of the drawbacks of the new-resource approach, IMO.
Do we really think there's enough of a use-case where policy does not map 1:1 with service?
It depends; most of my customers only use one service policy, which looks like:
- host-only or deny all
- prefer same zone, then round robin
Thanks @thockin @johnbelamaric @bowei. I will take a deep look at the landscape of not adding a new API kind and the impact on headless services. Hopefully I can finish it soon.
@m1093782566 Great, but don't fret the headless piece too much, I think it would add too much scope to make it part of this. I don't think you can do it without either knowing the source IP -> node mapping, or having a node-local agent filter the IPs from the DNS response (and it would have to know which of those are local).
I think headless and DNS need to be considered - it might be doable to implement as a followup, but I think we need to know if/how to solve it.
Force-pushed from da9255f to d965b81.
Force-pushed from 515e477 to c381546.
Per the discussion on API and headless service, I just pushed an update. PTAL. Thanks! @johnbelamaric Please feel free to correct me, especially in the DNS part.
Force-pushed from c381546 to c52c7e8.
Ok, if the number of watches is not a concern then I can see your point. But if the amount of data processed is the issue, then wouldn't Tim's suggestion of a field selection make a bigger difference than you thought? Especially if we can tweak watches to only fire updates if it is one of the specific watched fields that changes?
…On Wed, Oct 31, 2018 at 12:25 AM Wojciech Tyczynski < ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In keps/sig-network/0031-service-topology.md
<#2846 (comment)>:
> +In our example above, we will firstly try to find the backends in the same host. If no backends match, we will then try the lucky of same zone. If finally we can't find any backends in the same host or same zone, then we say the service has no satisfied backends and connections will fail.
+
+If we configure topologyKeys as `["host", ""]`, we just do the effort to find the backends in the same host and will not fail the connection if no matched backends found.
+
+### Endpoints API changes
+
+Although `NodeName` was already added to `EndpointAddress`, we want `Endpoints` to carry more node's topological informations so that allowing more topology information other than hostname.
+
+This proposal will create a new `Topologies` field in `Endpoints.Subsets.Addresses` for identifying what topological domain the backend pod exists.
+
+```go
+type EndpointAddress struct {
+ // labels of node hosting the endpoint
+ Topologies map[string]string
+}
+```
Adding the field means that we have one more watch (nodes) from the endpoints controller, versus one per node from kube-proxy.
The number of watches itself is not the issue. It's the amount of data that needs to be processed and sent that is causing issues.
So the key factors here are:
- the topology labels change extremely rarely, so the amount of watch events to send here will be negligible
- if we add those to the endpoints, endpoints change frequently, and an update of a single backend of the service means that the whole endpoints object (containing all endpoints with topology keys etc.) is sent to all kube-proxies.
That seems to be waaaay more expensive.
Also, note that we do NOT need to copy all topology labels to all endpoint
addresses.
Sure - I get that. But we can't assume that people won't be using that feature and therefore that we don't need to care about performance.
I spoke briefly with @lavalamp and he seems to think option 5 was viable. What I described here is kind of an extension of that.
@thockin ^^
Sure - but I would estimate that it won't happen within the next year or so.
Yes, if you want something that works today, building a mapping object and a controller is probably the way to go.
Sounds like a good enough reason to me!
OK, I misunderstood, I thought you were proposing the PodLocator to be 1:1 with nodes, with an array of pod IPs in each one.
@chrisohaver Resources with arrays in them have repeatedly turned out to be troublesome. We should not repeat that mistake. :)
Ah - I see now. The "IPs" array in your proposal is to hold dual-stack IPv4/IPv6 for a single Pod. ObjectMeta would hold the pod's namespace. So that works.
I'm thinking that selective filtering need not be terribly complicated at first: kubernetes/kubernetes#70620
@m1093782566 I think we should update the KEP to reflect the creation of the PodLocator. The watch notification filtering could be a ways off and I don't want to make this dependent on it.
This helps with network traffic, but we still need to do filtering on the apiserver side. And given that the apiserver side is my main concern, I'm not entirely sure how much it would help.
@johnbelamaric Sure. I will update the proposal in a few days - I am at KubeCon Shanghai now.
REMINDER: KEPs are moving to k/enhancements on November 30. Please attempt to merge this KEP before then to signal consensus. Any questions regarding this move should be directed to that thread and not asked on GitHub.
Force-pushed from c52c7e8 to 27be266.
@thockin @johnbelamaric @wojtek-t Per the discussion, I just updated the KEP to reflect the creation of the PodLocator. Please take a look when you have a chance. Thanks!
Kube-proxy will watch its own node and will find the endpoints that are in the same topology domain as the node if `service.TopologyKeys` is not empty.

Kube-proxy will watch PodLocator in addition to Service and Endpoints. For each Endpoints object, kube-proxy will find the original Pod via EndpointAddress.TargetRef, and therefore get the PodLocator object and its topology information. Kube-proxy will only create proxy rules for endpoints that are in the same topology domain as the node running kube-proxy.
Why does kube-proxy need PodLocator? It already knows its own Node.
I think kube-proxy still needs PodLocator if it doesn't watch ALL Nodes (only knowing its own Node is not sufficient), as it needs to map from `Endpoints.subsets[].addresses.nodeName` -> `PodLocator.nodeName` -> `PodLocator.nodeLabels` - it's option (3) :)
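For reference, a rough sketch of that mapping (the PodLocator schema is not final; field and function names here are assumptions):

```go
package proxy

// PodLocator is a slimmed-down per-pod object carrying just enough placement
// information for topology-aware routing.
type PodLocator struct {
	Namespace  string
	Name       string
	IPs        []string          // pod IPs (e.g. dual-stack IPv4/IPv6)
	NodeName   string
	NodeLabels map[string]string // topology-relevant labels of the hosting node
}

// nodeLabelsForEndpoint resolves an endpoint address to its node's labels by
// looking up the PodLocator for the pod referenced by EndpointAddress.TargetRef.
func nodeLabelsForEndpoint(locators map[string]PodLocator, podNamespace, podName string) (map[string]string, bool) {
	loc, ok := locators[podNamespace+"/"+podName]
	if !ok {
		return nil, false
	}
	return loc.NodeLabels, true
}
```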
See the other (larger) comment wherein we discuss tradeoffs of watching Nodes vs PodLocators.
### DNS server changes

We should consider this kind of topology support for headless services in CoreDNS and kube-dns. As the DNS servers will respect topology keys for each headless service, different clients/pods on different nodes may get different DNS responses.
Is the plan to do this all at the same time? Or to alpha without this and add it for beta?
> Or to alpha without this and add it for beta?

Sounds like a good plan; we can achieve this feature step by step. I will update this section and outline that headless services will be supported in the beta stage.
Force-pushed from aa09e63 to 1c94537.
Force-pushed from 1c94537 to 6f7d1f8.
KEPs have moved to k/enhancements. Any questions regarding this move should be directed to that thread and not asked on GitHub.
@justaugustus: Closed this PR.
I just created a PR in k/enhancements; you can continue to put comments there. From my understanding, this proposal is almost ready to be merged?
It's a pain point for multi-zone cluster deployments, since cross-zone network traffic is charged while in-zone traffic is not. In addition, cross-node traffic may carry sensitive metadata from other nodes. Therefore, users always prefer service backends that are close to them, e.g. in the same zone, rack, or host, for security, performance, and cost reasons.
The Kubernetes scheduler can constrain a pod to run only on particular nodes/zones. However, the Kubernetes service proxy just randomly picks an available backend for service routing, and that backend can be very far from the user, so we need a topology-aware service routing solution in Kubernetes. Basically, it should find the nearest service backend.
We have already supported this feature in Huawei Cloud and would like to contribute it to the community if people are happy with it.