KEP: Topology-aware service routing #2846

Closed · wants to merge 1 commit
2 changes: 1 addition & 1 deletion keps/NEXT_KEP_NUMBER
@@ -1 +1 @@
31
32
168 changes: 168 additions & 0 deletions keps/sig-network/0031-service-topology.md
@@ -0,0 +1,168 @@
---
kep-number: 31
title: Topology-aware service routing
status: Pending
authors:
- "@m1093782566"
owning-sig: sig-network
reviewers:
- "@thockin"
- "@johnbelamaric"
approvers:
- "@thockin"
creation-date: 2018-10-24
last-updated: 2018-10-26
---

# Topology-aware service routing

## Table of Contents

* [Motivation](#motivation)
  * [Goals](#goals)
  * [Non-goals](#non-goals)
  * [Use cases](#use-cases)
  * [Background](#background)
* [Proposal](#proposal)
* [Implementation history](#implementation-history)
  * [Service API changes](#service-api-changes)
  * [New PodLocator resource](#new-podlocator-resource)
  * [New PodLocator controller](#new-podlocator-controller)
  * [Kube-proxy changes](#kube-proxy-changes)
  * [DNS server changes (in beta stage)](#dns-server-changes-in-beta-stage)
## Motivation

Figure out a generic way to implement "local service" routing, i.e. topology-aware routing of services.

Locality is defined by the user and can be any topology-related concept. "Local" means "the same topology level", e.g. same node, same rack, same failure zone, same failure region, same cloud provider, etc. Two nodes are considered "local" if they have the same value for a particular label, called the "topology key".

### Goals

A generic way to support topology-aware routing of services in arbitrary topological domains, e.g. node, rack, zone, region, etc., determined by node labels.

### Non-goals

* Scheduler spreading to implement this sort of topology guarantee
* Dynamic Availability
* Health-checking
* Capacity-based or load-based spillover

### Use cases

* Logging agents such as fluentd. Deploy fluentd as a DaemonSet and applications only need to communicate with the fluentd instance on the same node.
* A sharded service that keeps per-node local information in each shard.
* Authenticating proxies such as [aws-es-proxy](https://github.com/kopeio/aws-es-proxy).
* In the container identity WG, being able to give DaemonSet pods a unique identity per host is on the 2018 plan, and ensuring local pods can communicate with local node services securely is a key goal there. -- from @smarterclayton
* Regional data costs in a multi-AZ setup - for instance, in AWS with a multi-AZ setup, half of the traffic will cross AZs, incurring regional data transfer costs, whereas if traffic stayed local it wouldn't hit the inter-AZ network.
* Performance benefit (node-local/rack-local): lower latency and higher bandwidth.

### Background

Cross-zone traffic is a pain point for multi-zone cluster deployments since cross-zone network traffic is charged, while in-zone traffic is not. In addition, cross-node traffic may carry sensitive metadata from other nodes. Therefore, users always prefer service backends that are close to them, e.g. in the same zone, rack or host, for security, performance and cost reasons.

The Kubernetes scheduler can constrain a pod to run only on particular nodes/zones. However, the Kubernetes service proxy just randomly picks an available backend for service routing, and that backend can be very far from the client, so we need a topology-aware service routing solution in Kubernetes - basically, a way to find the nearest service backend. In other words, allow people to configure whether to ALWAYS reach a local service backend. That way they can reduce network latency, improve security, save money and so on. However, because topology is arbitrary - zone, region, rack, generator, whatever - we should allow arbitrary locality.

`ExternalTrafficPolicy` was added in v1.4, but only for NodePort and external LB traffic. NodeName was added to `EndpointAddress` to allow kube-proxy to filter local endpoints for various future purposes.

Based on our experience with advanced routing setups and a recent demo of enabling this feature in Kubernetes, this document introduces a more generic way to support arbitrary service topology.

## Proposal

This proposal builds off of earlier requests to [use local pods only for kube-proxy loadbalancing](https://github.com/kubernetes/kubernetes/issues/7433) and the [node-local service proposal](https://github.com/kubernetes/kubernetes/pull/28637). However, this document proposes that not only the particular "node-local" use case be covered, but that a more generic mechanism be provided.

Locality is "user-defined". When we set the topology key "hostname" for a service, we expect nodes to carry different label values for the key "hostname".

Users can control the level of topology. For example, if someone runs a logging agent as a DaemonSet, they can set a "hard" topology requirement for same-host. If the "hard" requirement is not met, the connection simply fails with "service not available".

If someone sets a "soft" topology requirement for same-host, they "prefer" same-host endpoints but can accept endpoints on other hosts when, for some reason, no local backend is available.

If multiple endpoints satisfy the "hard" or "soft" topology requirement, we will randomly pick one by default.

Routing decisions are expected to be implemented by kube-proxy, and by kube-dns/CoreDNS for headless services.
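
To make the hard/soft semantics concrete, here is a minimal Go sketch of the intended selection logic (the function name `filterByTopology` and its signature are made up for illustration; they are not part of any existing component):

```go
// filterByTopology returns the indices of the endpoints that satisfy the service's
// topology keys, walking the keys in preference order. The special key "" matches
// all nodes and therefore turns the requirement into a "soft" one; if no key yields
// any backends, the requirement was "hard" and the service has no usable backends.
func filterByTopology(topologyKeys []string, nodeLabels map[string]string, endpointNodeLabels []map[string]string) []int {
	all := make([]int, len(endpointNodeLabels))
	for i := range endpointNodeLabels {
		all[i] = i
	}
	if len(topologyKeys) == 0 {
		return all // no topology configured: keep today's behavior and use every backend
	}
	for _, key := range topologyKeys {
		if key == "" {
			return all // "match all nodes": accept any remaining backend
		}
		var matched []int
		for i, labels := range endpointNodeLabels {
			if v, ok := labels[key]; ok && v == nodeLabels[key] {
				matched = append(matched, i)
			}
		}
		if len(matched) > 0 {
			return matched // backends exist for this key; later keys are never considered
		}
	}
	return nil // hard requirement not met: connections will fail
}
```

One of the surviving endpoints would then be picked at random, as described above.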


## Implementation history

### Service API changes

Users need a way to declare which services are topology-aware and what "local backends" means for each such service.

In this proposal, we give the service owner a chance to configure the service's locality. A new optional property, `topologyKeys`, would be introduced to `ServiceSpec` as a string slice.

```go
type ServiceSpec struct {
	// topologyKeys is a preference-order list of topology keys. If backends exist for
	// index [0], they will always be chosen; only if no backends exist for index [0]
	// will backends for index [1] be considered.
	// If this field is specified and no index has any backends, the service has no
	// backends, and connections will fail. We say these requirements are "hard".
	// In order to express a "soft" requirement, we may give a special node label key ""
	// which means "match all nodes".
	// +optional
	TopologyKeys []string `json:"topologyKeys" protobuf:"bytes,1,opt,name=topologyKeys"`
}
```

An example of a `Service` with topology keys:

```yaml
kind: Service
apiVersion: v1
metadata:
  name: service-local
spec:
  topologyKeys: ["host", "zone"]
```


In the example above, we will first try to find backends on the same host. If no backends match, we will then try our luck in the same zone. If we finally cannot find any backends in the same host or the same zone, the service has no satisfied backends and connections will fail.

If we configure `topologyKeys` as `["host", ""]`, we make a best effort to find backends on the same host and will not fail the connection if no matching backends are found.
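
For illustration, the soft variant of the example above could be written as follows ("host" is the same made-up label key used earlier; a real manifest would use an actual node label key):

```yaml
kind: Service
apiVersion: v1
metadata:
  name: service-local-soft
spec:
  topologyKeys: ["host", ""]   # prefer same-host backends; "" accepts any backend
```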

### New PodLocator resource

As EndpointAddress already contains a nodeName field, we can build a service that pre-cooks the mapping from each Pod to its topology, and then let all interested components (at least kube-proxy, kube-dns and CoreDNS) just watch that pre-cooked object and do the necessary mapping internally. Given that we don't know which labels are topology labels, we are going to copy all node labels.

```go
// PodLocator represents information about where a pod exists in arbitrary space. This is useful for things like
// being able to reverse-map pod IPs to topology labels, without needing to watch all Pods or all Nodes.
type PodLocator struct {
	metav1.TypeMeta
	// +optional
	metav1.ObjectMeta

	// NOTE: Fields in this resource must be relatively small and relatively low-churn.

	IPs        []PodIPInfo // being added for dual-stack support
	NodeName   string
	NodeLabels map[string]string
}
```
Member

This may visibly increase the size of the Endpoints object, which (due to the fact that all kube-proxies are watching all Endpoints objects) is causing a bunch of scalability/performance related issues.

So let me suggest a slightly different approach (which is somewhat in line with what @thockin already mentioned in a different comment):

* EndpointAddress already contains a nodeName field
* I think we can safely assume that the topologies for a given node don't change frequently

So what if we would:

* build a service that will be pre-cooking the node -> its topologies mapping
* this pretty much won't be changing at all
* let all interested components just watch that pre-cooked object and do the necessary mapping internally?

@thockin @johnbelamaric for thoughts

Author

So, with your suggested new approach, it seems we don't need the Topologies field in the Endpoints object, and the Endpoints controller will not need to watch nodes?

Member

Correct - though, before you change anything, I would like to hear others' opinions about that.

Member

This would mean though that kube-proxy needs to watch this new object, in order to be able to map nodename -> topology and apply the filtering. Is that really better than adding a field to endpointaddress?

Adding the field means that we have one more watch (nodes) from the endpoints controller, versus one per node from kube-proxy.

Also, note that we do NOT need to copy all topology labels to all endpoint addresses. Since we know the topology labels that are needed to make the service policy decision for each specific endpoint, we only need to copy those to the endpoint addresses. So, if someone does not define any service policy, then Topologies is nil. If someone defines a service policy for "node local", then the kubernetes.io/hostname label gets copied to each endpoint address for that service. No other labels need to get copied.

Member

> Adding the field means that we have one more watch (nodes) from the endpoints controller, versus one per node from kube-proxy.

The number of watches itself is not the issue.
It's the amount of data that needs to be processed and sent that is causing issues.

So the key factors here are:

* the topology labels change extremely rarely - so the amount of watch events to send here will be negligible
* if we add those to the endpoints, endpoints change frequently, and an update of a single backend of the service means that the whole Endpoints object (containing all endpoints with topology keys etc.) is sent to all kube-proxies.

That seems to be waaaay more expensive.

> Also, note that we do NOT need to copy all topology labels to all endpoint addresses.

Sure - I get that. But we can't assume that people won't be using that feature and that we therefore don't need to care about performance.

I spoke briefly with @lavalamp and he seems to think option 5 was viable.

What I described here is kind of an extension of that.

@thockin ^^

Member

To be clear, I don't think we can avoid PodLocator. I agree that DNS needs it, at least.

The question for me is whether kube-proxy uses PodLocator (which I think is risky because high-churn clusters could be worse than watching nodes) or NodeLocator (which I really don't want to implement because it feels like a hack) or Nodes (which @wojtek-t correctly ruled out).

There must be a CPU efficient way to do metadata-only watches, but it may be a big enough redesign that it's simply infeasible. E.g. Brian suggested storing metadata as a separate resource in etcd. I'm not sure that this problem (avoiding NodeLocator or NodeMeta as a resource) rises to the level of requiring such a large change, but maybe worth considering.

I expect that decomposing all objects into storage is a HUGE effort, but what if every write operation to an object key also wrote to a parallel (hidden) key in etcd which was just the metadata of that object? Then watching metadata could be less costly. Are we open to using transactions in etcd for this?

I'm just trying to avoid setting what I think is a really unfortunate precedent, just because we have insufficient infrastructure to do the right thing...

Author

> Other than short-lived pods, the PodLocator should be rarely changing, which was kind of the point behind it. Certainly for alpha, I would think we can implement the PodLocator, then we can get some data on the performance impact of having kube-proxy watch this in clusters with many short-lived pods.

I agree with most of @johnbelamaric's opinions. If we want to keep this proposal simple, PodLocator should be a good starting point.

Member

@thockin - I guess I may be a bit too paranoid here...
I think that splitting object metadata into a separate object in etcd would work, but it's a hell of a lot of work. But fortunately, we are serving watches (by default) from the apiserver anyway using the watchcache.
So actually I can imagine customizing it a bit more and allowing it to produce a separate stream for just metadata. What I was afraid of is that there will be multiple watchers (equal to the number of nodes) watching the same thing. But those won't have any selectors etc., so we should be able to do all the computations once (i.e. check if something changed).
The question now is: whether we send all updates of Node metadata or only those where something really changed (note that the RV always changes). If all, that is a really huge amount of data, which is not going to work in my opinion...

But I think that with some additional work in the apiserver storage layer, it should be possible to do it somewhat efficiently without any huge changes. So I think it may actually work.

The argument that I still don't fully buy is why we are afraid of watching all PodLocators from all nodes. The amount of data (at least as of now) would still be smaller than the amount of data from nodes (assuming we send out all updates, not only "real updates"). But if we start thinking more about the future and the goal of supporting much higher churn, that may no longer be the case. So I am actually probably starting to buy this argument (at least to some extent).

Member

@m1093782566 I am OK to start by using PodLocator in kube-proxy, but we should arrange the code such that it could be replaced with Node metadata if/when we can make that work.

@wojtek-t

> I think that splitting object metadata to a separate object in etcd would work, but it's a hell of a lot of work

NB I was advocating replicating it (storing twice), not splitting, which I HOPED would be less work :)

> The question now is: whether we send all updates of Node metadata or only those where something really changed (note that RV always changes)

I think we'd have to filter out RV changes that don't affect the metadata itself.

> But I think that with some additional work in the apiserver storage layer, it should be possible to do it somewhat efficiently without any huge changes.

This is what I was hoping you'd eventually say if I threw enough bad ideas at you. We can start with PodLocator, but I'd really like to get metadata watches onto the plan as a general thing.

> why we are afraid of watching all PodLocators from all nodes

The number of PodLocators is (generally) going to scale with cluster size. E.g. a 5k node cluster with 150k Pods has 150k PodLocators. If each node churns 1 pod per 10 seconds, that's 500 QPS to 5000 nodes == 2.5M QPS through the API server for literally no value -- all kube-proxy really needs is node labels, which probably change a few times per DAY in an autoscaled cluster.

So as above, I am OK to use PodLocator to get started on the feature, but I think it will explode in high-scale clusters, so we'll either need efficient metadata watches or we'll need a NodeLocator pre-cooker.

Author

@thockin

Definitely!


In order to reference a PodLocator back to its Pod easily, the PodLocator namespace and name would be 1:1 with the Pod namespace and name. In other words, PodLocator is a lightweight object which stores a Pod's location/topology information.
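
For illustration, a hypothetical PodLocator for a Pod `web-0` in namespace `default` might look like the following (the apiVersion, IP and label values are placeholders, not settled by this KEP):

```yaml
kind: PodLocator
apiVersion: v1                 # placeholder; the API group/version is not decided here
metadata:
  name: web-0                  # 1:1 with the Pod's name
  namespace: default           # 1:1 with the Pod's namespace
ips:
  - ip: 10.244.1.17
nodeName: node-a
nodeLabels:
  kubernetes.io/hostname: node-a
  failure-domain.beta.kubernetes.io/zone: us-east-1a
```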

### New PodLocator controller

A new PodLocator controller will watch and cache all Pods and Nodes, then pre-cook the pod name -> {pod IPs, node name, node labels} mapping.

When a Pod is added, the PodLocator controller will create a new PodLocator object whose namespace and name are 1:1 with the Pod's namespace and name, then populate the Pod's IP(s), node name and node labels into the new object.

When a Pod is updated, the PodLocator controller will first check whether its IPs or Spec.NodeName changed. If so, it will update the corresponding PodLocator object accordingly; otherwise it will ignore the change.

When a Pod is deleted, the PodLocator controller will delete the corresponding PodLocator object.

When a Node is updated, the PodLocator controller will first check whether its labels changed. If so, it will update all the PodLocators whose corresponding Pods are running on that node.

When a Node is deleted, the PodLocator controller will reset the NodeName and NodeLabels of all the PodLocators whose corresponding Pods were running on it.
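
A minimal sketch of the Pod-update check described above, assuming the standard core/v1 Pod type (the helper name `podLocatorChanged` is made up for illustration):

```go
import v1 "k8s.io/api/core/v1"

// podLocatorChanged reports whether a Pod update needs to be reflected in its
// PodLocator: only a change of the pod's IP or of the node it is assigned to
// matters; all other pod churn is ignored by the PodLocator controller.
func podLocatorChanged(oldPod, newPod *v1.Pod) bool {
	if oldPod.Spec.NodeName != newPod.Spec.NodeName {
		return true
	}
	if oldPod.Status.PodIP != newPod.Status.PodIP {
		return true
	}
	return false
}
```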

### Kube-proxy changes
Member

Should this impact kube-dns as well?

Member

To support it for headless services, it needs to be supported in DNS as well; we can add it to CoreDNS when the time comes.
cc: @chrisohaver @fturib

Member (@thockin, Oct 24, 2018)

Good question - it would be somewhat complex for DNS to do split-horizon responses based on client IP, but that is the implication of this (for headless services, anyway). @johnbelamaric - especially since the client IP -> Node mapping is not always as easy as a CIDR (some people do not use that and assign arbitrary /32s or multiple /28s).

Member

It requires watching pods, which isn't something we would want to do in general, and not something we would want to adapt kube-dns to do. CoreDNS can optionally do that already, so if the customer is willing to accept the cost we can do it for headless services.

For the purposes of this KEP we can make it clusterIP services only, and then the headless version can become a feature of CoreDNS, maybe. If we do that though, should we enforce anything in the API if the user configures topology for a headless service (assuming we go the route of adding this to Service rather than a separate resource)? Otherwise it could cause expectations that aren't really being met.

Author (@m1093782566, Oct 25, 2018)

I think I missed the headless service part in the proposal; I will take a deep look at the effect.

For other kinds of service, I assume there is no impact on DNS?

Author

The document is updated to include the DNS part changes.


Kube-proxy will respect topology keys for each service, so kube-proxy instances on different nodes may create different proxy rules.

Kube-proxy will watch its own node and, if `service.TopologyKeys` is not empty, will find the endpoints that are in the same topological domain as that node.

Kube-proxy will watch PodLocator in addition to Service and Endpoints. For each Endpoints object, kube-proxy will find the original Pod via EndpointAddress.TargetRef, and from that the PodLocator object and its topology information. Kube-proxy will only create proxy rules for endpoints that are in the same topological domain as the node running kube-proxy, as sketched below.
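
A rough sketch of that lookup, assuming kube-proxy keeps a local cache of PodLocator objects keyed by `namespace/name` (the helper name and the cache are illustrative only; the `PodLocator` fields follow the struct proposed earlier):

```go
import v1 "k8s.io/api/core/v1"

// nodeLabelsForEndpoint is a hypothetical sketch of the lookup kube-proxy would do:
// resolve the backing Pod via EndpointAddress.TargetRef, then read the node labels
// recorded on the corresponding PodLocator. The labels returned here would feed the
// same preference-order filtering over service.TopologyKeys described in the Proposal.
func nodeLabelsForEndpoint(addr v1.EndpointAddress, locators map[string]PodLocator) (map[string]string, bool) {
	if addr.TargetRef == nil {
		return nil, false // no backing Pod, so no topology information is available
	}
	// PodLocator name/namespace are 1:1 with the Pod's, so TargetRef is enough to find it.
	key := addr.TargetRef.Namespace + "/" + addr.TargetRef.Name
	loc, ok := locators[key]
	if !ok {
		return nil, false
	}
	return loc.NodeLabels, true
}
```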
Member

Why does kube-proxy need PodLocator? It already knows its own Node.

Author (@m1093782566, Nov 28, 2018)

I think kube-proxy still needs PodLocator if it doesn't watch ALL Nodes (knowing only its own Node is not sufficient), as it needs to map from Endpoints.subsets[].addresses.nodeName -> PodLocator.nodeName -> PodLocator.nodeLabels - it's option (3) :)

Member

See the other (larger) comment wherein we discuss the tradeoffs of watching Nodes vs PodLocators.


### DNS server changes (in beta stage)

We should consider this kind of topology support for headless services in CoreDNS and kube-dns. As the DNS servers will respect topology keys for each headless service, different clients/pods on different nodes may get different DNS responses.
Member

Is the plan to do this all at the same time? Or to alpha without this and add it for beta?

Author (@m1093782566, Nov 28, 2018)


> Or to alpha without this and add it for beta?

Sounds like a good plan; we can achieve this feature step by step. I will update this section and state that headless services will be supported in the beta stage.


In order to handle headless services, the DNS server needs to know the node corresponding to the client IP address in the DNS request - i.e., it needs to map PodIP -> Node. Kubernetes DNS servers (including kube-dns and CoreDNS) will watch PodLocator objects. When a client/pod requests a headless service domain from the DNS server, the DNS server will retrieve the node labels of both the client and the backend Pods via PodLocator. The DNS server will only select the IPs of backend Pods that are in the same topological domain as the client Pod, and then write the A records.
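
As a rough sketch of that selection, assuming the DNS server keeps PodLocators indexed by pod IP (all names here are illustrative, not existing CoreDNS or kube-dns code):

```go
// filterHeadlessIPs maps the client IP back to its PodLocator to learn the client's
// node labels, then keeps only the backend IPs whose PodLocator matches the client
// on the service's topology keys, walking the keys in preference order ("" matches
// every node). If no key yields a match, no A records are written.
func filterHeadlessIPs(clientIP string, backendIPs []string, topologyKeys []string, locatorsByIP map[string]PodLocator) []string {
	client, ok := locatorsByIP[clientIP]
	if !ok {
		return backendIPs // unknown client: no topology information, answer with all backends
	}
	for _, key := range topologyKeys {
		if key == "" {
			return backendIPs // soft fallback: any backend is acceptable
		}
		var matched []string
		for _, ip := range backendIPs {
			if b, ok := locatorsByIP[ip]; ok && b.NodeLabels[key] == client.NodeLabels[key] {
				matched = append(matched, ip)
			}
		}
		if len(matched) > 0 {
			return matched
		}
	}
	return nil
}
```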