KEP: Topology-aware service routing #2846

Conversation
// "preferred" is the "soft" requirement for topology key and an example would be
// "prefer to visit service backends in the same rack, but OK to other racks if none match"
// +optional
Mode ServicetopologyMode `json:"mode" protobuf:"bytes,1,opt,name=mode"`
Missing definition for ServiceTopologyMode?
oops, ServiceTopologyMode is a string type.
ObjectMeta

// specification of the topology policy of this ServicePolicy
Spec TopologyPolicySpec
If we are to have other things in ServicePolicy in the future, does it make sense to have

Spec ServicePolicySpec

then

type ServicePolicySpec struct {
    Topology TopologyPolicySpec
    // Room in future for other things
}
Yeah, makes sense. The form you suggested has better scalability.

done

### Kube-proxy changes
Should this impact kube-dns as well?
To support it for headless services, it needs to be supported in DNS as well, we can add it to CoreDNS when the time comes.
cc: @chrisohaver @fturib
Good question - it would be somewhat complex for DNS to do split-horizon responses based on client IP, but that is the implication of this (for headless services, anyway). @johnbelamaric
especially since client IP -> Node mapping is not always as easy as a CIDR (some people do not use that and assign arbitrary /32s or multiple /28s)
It requires watching pods, which isn't something we would want to do in general, and not something we would want to adapt kube-dns to do. CoreDNS can optionally do that already, so if the customer is willing to accept the cost of that we can do it for headless.
For purposes of this KEP we can make it clusterIP services only, and then the headless version can become a feature of CoreDNS maybe. If we do that though, should we enforce anything in the API if the user configures topology for a headless service (assuming we go the route of adding this to Service rather than a separate resource)? Otherwise it could cause expectations that aren't really being met.
I think I missed the headless service part in the proposal; I will take a deep look at the effect.
For other kinds of service, I assume there is no impact on DNS?
The document is updated to include the DNS part changes.
Force-pushed from 8d605e4 to da9255f.
I love this, any thoughts around allowing weighting of preference, more importantly in the case of ...
Thanks for moving this to a KEP. I think if we streamline this to not add new API kinds it will reduce the scope and make it easier to move forward.
Figure out a generic way to implement the "local service" route, say "topology aware routing of service".

Locality is defined by the user; it can be any topology-related thing. "Local" means the "same topology level", e.g. same node, same rack, same failure zone, same failure region, same cloud provider, etc.
Two nodes are considered "local" if they have the same value for a particular label, called the "topology key".
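That definition is simple enough to state as code; a tiny sketch (the function and parameter names are illustrative, not from the KEP):

```go
package topology

// sameTopologyDomain reports whether two nodes are "local" to each other for a
// given topology key: both nodes carry the label and the values are equal.
// E.g. with key "topology.kubernetes.io/zone", two nodes in the same zone are
// local, while a node missing the label is never local to anything.
func sameTopologyDomain(nodeALabels, nodeBLabels map[string]string, topologyKey string) bool {
	a, okA := nodeALabels[topologyKey]
	b, okB := nodeBLabels[topologyKey]
	return okA && okB && a == b
}
```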
mode: required

In our example, services in namespace `foo` with label `app=bar` will be governed by both `service-policy-example-1` and `service-policy-example-2`. Requests to these services will be routed only to backends that satisfy both the same region and the same switch as kube-proxy.
So, if it said region required and switch preferred, how would that look in the eventual rules from kube-proxy? How would I express:
prefer the local node, but allow any service backend if there are none on the local node
Endpoints with a specific label will be selected by the label selector in `ServicePolicy`, and `ServicePolicy` will declare the topology policy for those endpoints.

Cluster administrators can configure which services are "local" and what topology they prefer via `ServicePolicy`. `ServicePolicy` is a namespace-scoped resource and is strictly optional. We can configure policies other than topology preference in `ServicePolicy`, but this proposal will not cover them.
I like the separate ServicePolicy, but in general I think there is an effort to avoid adding new core API kinds. Adding this as an optional `topology` field in Service with a list of `key`, `mode` pairs may be easier to move forward.
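For illustration, the "list of `key`, `mode` pairs" shape could look roughly like the following sketch (type and field names are assumptions, not the KEP's final API):

```go
package servicetopology

// ServiceTopologyConstraint is one key/mode pair as suggested above.
type ServiceTopologyConstraint struct {
	// Key is a node label key, e.g. "kubernetes.io/hostname" or a zone label.
	Key string `json:"key"`
	// Mode is "required" (hard) or "preferred" (soft).
	Mode string `json:"mode"`
}

// ServiceSpecFragment shows only the hypothetical addition to ServiceSpec.
type ServiceSpecFragment struct {
	// Topology lists topology constraints for routing to this Service's backends.
	// +optional
	Topology []ServiceTopologyConstraint `json:"topology,omitempty"`
}
```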
Endpoint Controller will populate the `Topology` for each `EndpointAddress`. We want `EndpointAddress.Topology` to tell the LB, such as kube-proxy, what topology domain (e.g. host, rack, zone, region, etc.) the endpoint is in.

Endpoints controller will need to watch two extra resources: ServicePolicy and Nodes. It watches ServicePolicy to know which services have topology preferences, and it watches Nodes to know the labels of the node hosting the endpoint and to copy the node labels referenced in the service spec's topology constraints to EndpointAddress.
Ok, I see that's what you're suggesting here. If we push ServicePolicy into ServiceSpec then we don't need the extra watch for ServicePolicy.
Yes, but we still need to watch node. Right?
yes, endpoints controller will need to watch all nodes. kube-proxy will only need to know about (and watch) its own node (if it doesn't already).
Although `NodeName` was already added to `EndpointAddress`, we want `Endpoints` to carry more of the node's topology information, allowing topology levels other than hostname.

So, create a new `Topology` field in `Endpoints.Subsets.Addresses` to identify what topology domain the endpoint's pod is in, e.g. which host, rack, zone, region, etc. In other words, copy the topology-related labels of the node hosting the endpoint to `EndpointAddress.Topology`.
Topology is arbitrary. How do we decide what data to use to populate this?
a) It's the entire node.metadata.labels
b) It's from a new field on node, similar to labels but distinct for topology
c) It's a subset of node labels (e.g. a known prefix)
d) Some rule copies topology-related labels from node -> pod (and then repeat this question)
e) Only copying the labels that we know are needed by the topology constraints (e.g. service slices on foo.com/rack, so we copy that label from nodes -> endpoints)?
Or something else?
Below it suggests copying the labels based on which ones are used by the ServicePolicy that selects this service. I think if we put the keys in the service spec then that's fairly simple to do here.
Voting e).
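A minimal sketch of option (e), assuming the endpoints controller already knows the topology keys that apply to the service (function and variable names are illustrative):

```go
package endpointscontroller

// copyTopologyLabels returns only the node labels referenced by the service's
// topology keys; nothing else from the node is copied into the EndpointAddress.
func copyTopologyLabels(nodeLabels map[string]string, topologyKeys []string) map[string]string {
	out := make(map[string]string, len(topologyKeys))
	for _, key := range topologyKeys {
		if key == "" {
			// "" is the proposed "match every node" wildcard discussed below;
			// there is no corresponding label to copy.
			continue
		}
		if value, ok := nodeLabels[key]; ok {
			out[key] = value
		}
	}
	return out
}
```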
### Non-goals

Scheduler spreading to implement this sort of topology guarantee.
non-goal: dynamic availability, health-checking, capacity-based or load-based spillover
Maybe just say: Definition of topology other than by node labels.
I want to be clear to readers that we're not requiring dynamic balancing here, which would be a major departure.
The user needs a way to declare which service is a "local service" and what the "topology key" of the "local service" is.

This will be accomplished through a new API object, `ServicePolicy`.
I wonder if ServicePolicy is too complicated for this. I thought for a while we would need a lot more configurability, but now that we have reduced it to 1 or 2 fields, I am not sure this meets the bar for a new resource. We need to majorly refactor Service and Endpoints anyway, so maybe it's simpler to just be simpler for now?
e.g. Just add a single field to `ServiceSpec`:
// Preference-order list of topology keys. How this list is interpreted is implementation-defined.
// E.g. a valid implementation might be to treat this field as a hard constraint: if backends exist for
// index [0], they will always be chosen; only if no backends exist for index [0] will backends
// for index [1] be considered. Another valid implementation might be to monitor load and
// "spill over" into less-preferred topologies dynamically.
//
// If this field is specified and all indices have no backends, the service has no backends, and
// connections will fail.
Topologies []string
(the field name can be debated)
We could argue that we need the "fall back to global" mode option (hard/soft), but even that could be satisfied by simply labeling all nodes or maybe by allowing "" as a special case meaning "match all nodes".
Less API is more good. How little can we get away with?
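For illustration, a minimal sketch of the hard-constraint reading described above, as kube-proxy might apply it (all names are assumptions; "" is the proposed match-everything wildcard):

```go
package proxy

// endpoint carries an address plus the topology labels copied for it.
type endpoint struct {
	IP       string
	Topology map[string]string
}

// filterByTopology walks the preference-ordered keys: the first key that yields
// any matching endpoints wins; later keys are consulted only if earlier ones
// match nothing. "" matches every endpoint (fall back to all). If every key is
// exhausted without a match, the service has no usable backends.
func filterByTopology(endpoints []endpoint, nodeLabels map[string]string, topologyKeys []string) []endpoint {
	for _, key := range topologyKeys {
		if key == "" {
			return endpoints
		}
		nodeValue, nodeHasKey := nodeLabels[key]
		if !nodeHasKey {
			continue // this node is in no domain for this key; try the next key
		}
		var matched []endpoint
		for _, ep := range endpoints {
			if v, ok := ep.Topology[key]; ok && v == nodeValue {
				matched = append(matched, ep)
			}
		}
		if len(matched) > 0 {
			return matched
		}
	}
	return nil
}
```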
We can get rid of the hard/soft by the "" meaning match all. But it seems like a bit of a "magic" value. There is logic to it, but it's not explicit.
So we have:

topologyKey: ["node", "zone", ""]

versus

topology:
- key: node
  mode: preferred
- key: zone
  mode: preferred

🤔
In the former, which is nice and compact, you have this implicit logic:
- all values are a preference except the last value which is a requirement
And that is really the ONLY logic that makes sense. In the latter you can express meaningless things like:
topology:
- key: node
  mode: required
- key: zone
  mode: required
In fact since topology is arbitrary, you could come up with rules that are not satisfied by any endpoint, although that would be a pretty pathological case.
So...I am leaning towards the first one. I wonder though if the "hard" requirement of the last key should not be the default. That is, which would be least surprising to the user - to restrict to only those topologies unless they add the magic value, or to fallback to any endpoint unless there is some other magic value?
A slice of topology keys would be simpler in the API and we can get rid of the new resource. But it seems not straightforward to me to express a HARD requirement. Can you give an example?
Hard:
topologyKey: ["kubernetes.io/hostname"]
Soft:
topologyKey: ["kubernetes.io/hostname", ""]
This seems pretty natural. But my point above is asking which is more common, or more expected, hard or soft? If we want it to be soft unless the user is explicit about wanting it to be a hard requirement (which is "safer" in a sense), then we need to treat the first one as soft and do something to flag that it's hard.
It could be worthwhile to solicit more input on user expectations for this, great point.
My assumption (with little data) is that "hard" is the more common case. Most use cases have been "give me my node-local agent".
It's not that I am making the last value special in its last-ness. I am saying that (yeah, it's a little ugly) "" is an implicit label that every node satisfies. E.g. if I wrote:
topologyKey: ["","kubernetes.io/hostname"]
Callers would never ever evaluate the "hostname" key because the "" key would match any node, and should never be empty (except in a 0-node cluster).
So yeah, it's "special", but by its meaning, not its position.
I understand that it's not position, and I think it's easy to figure out what it means when you see it. I just think it may be non-obvious that in order to have it match "any" you need an empty string in there. If the name were `topologyConstraints` or `topologiesAllowed` or something similar (it's really just labels, not inherently "topology") then my concern goes away. The `topologyKey` name comes from the podspec, which has additional context beyond just that name.
Endpoints with a specific label will be selected by the label selector in `ServicePolicy`, and `ServicePolicy` will declare the topology policy for those endpoints.

Cluster administrators can configure which services are "local" and what topology they prefer via `ServicePolicy`. `ServicePolicy` is a namespace-scoped resource and is strictly optional. We can configure policies other than topology preference in `ServicePolicy`, but this proposal will not cover them.
This is the crux for me - why would cluster admins decide this? It needs to be the same people deciding the deployment (e.g., if I deploy zonal, I want the service zonal).
This does raise a point that we really do want Deployment to be more topology aware, so I can have N instances in each zone. I know there's a KEP for that, don't know the status.
I remember we have discussed whether service policy should be part of Service. If we configure it outside of Service in a more admin-owned way, it looks more like Istio.
Probably users (not only cluster admins) should be able to configure the service policy? After all, service policy is namespace-scoped in this proposal.
yes, I don't think it's an admin thing, it's a service owner thing
Agree. If we had a bunch of policies we wanted to configure or if it was sufficiently complicated, a new resource might be justified, but I don't think it is.
In our example, services in namespace `foo` with label `app=bar` will be chosen. Requests to these services will be routed only to backends on nodes with the same value for `kubernetes.io/hostname` as the originating pod's node. If we want the "same host", probably every host should carry a unique `kubernetes.io/hostname` label.

We can configure multiple `ServicePolicy` objects targeting a single service. In this case, the service will carry multiple topology requirements and the relationship among all the requirements is logical `AND`, for example,
This semantic is pretty complicated, and is one of the drawbacks of the new-resource approach, IMO.
Do we really think there's enough of a use-case where policy does not map 1:1 with service?
It depends; most of my customers only use one service policy, which looks like:
- host-only or deny all
- prefer same zone, then round robin
Thanks @thockin @johnbelamaric @bowei. I will take a deep look at the landscape of not adding a new API kind and the impact on headless services. Hopefully I can finish it soon.
@m1093782566 Great, but don't fret the headless piece too much, I think it would add too much scope to make it part of this. I don't think you can do it without either knowing the source IP -> node mapping, or having a node-local agent filter the IPs from the DNS response (and it would have to know which of those are local).
I think headless and DNS need to be considered - it might be doable to implement as a followup, but I think we need to know if/how to solve it.
Force-pushed from da9255f to d965b81.
Force-pushed from 515e477 to c381546.
Per the discussion on API and headless service, I just pushed an update. PTAL. Thanks! @johnbelamaric Please feel free to correct me, especially in the DNS part.
Force-pushed from c381546 to c52c7e8.
Ok, if the number of watches is not a concern then I can see your point. But if the amount of data processed is the issue, then wouldn't Tim's suggestion of a field selection make a bigger difference than you thought? Especially if we can tweak watches to only fire updates if it is one of the specific watched fields that changes?
…On Wed, Oct 31, 2018 at 12:25 AM Wojciech Tyczynski < ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In keps/sig-network/0031-service-topology.md
<#2846 (comment)>:
> +In our example above, we will firstly try to find the backends in the same host. If no backends match, we will then try the lucky of same zone. If finally we can't find any backends in the same host or same zone, then we say the service has no satisfied backends and connections will fail.
+
+If we configure topologyKeys as `["host", ""]`, we just do the effort to find the backends in the same host and will not fail the connection if no matched backends found.
+
+### Endpoints API changes
+
+Although `NodeName` was already added to `EndpointAddress`, we want `Endpoints` to carry more node's topological informations so that allowing more topology information other than hostname.
+
+This proposal will create a new `Topologies` field in `Endpoints.Subsets.Addresses` for identifying what topological domain the backend pod exists.
+
+```go
+type EndpointAddress struct {
+ // labels of node hosting the endpoint
+ Topologies map[string]string
+}
+```
Adding the field means that we have one more watch (nodes) from the endpoints controller, versus one per node from kube-proxy.
The number of watches itself is not the issue. It's the amount of data that needs to be processed and sent that is causing issues.
So the key factors here are:
- the topology labels change extremely rarely, so the amount of watch events to send here will be negligible
- if we add those to the endpoints, endpoints change frequently, and an update of a single backend of the service means that the whole endpoints object (containing all endpoints with topology keys etc.) is sent to all kube-proxies.
That seems to be waaaay more expensive.
Also, note that we do NOT need to copy all topology labels to all endpoint
addresses.
Sure - I get that. But we can't assume that people won't be using that feature and therefore that we don't need to care about performance.
I spoke briefly with @lavalamp and he seems to think option 5 was viable. What I described here is kind of an extension of that.
@thockin ^^
Sure - but I would estimate that it won't happen within the next year or so.
Yes, if you want something that works today, building a mapping object and a controller is probably the way to go.
Sounds like a good enough reason to me!
OK, I misunderstood, I thought you were proposing the PodLocator to be 1:1 with nodes, with an array of pod IPs in each one.
@chrisohaver Resources with arrays in them have repeatedly turned out to be troublesome. We should not repeat that mistake. :)
Ah - I see now. The "IPs" array in your proposal is to hold dual-stack IPv4/IPv6 for a single Pod. ObjectMeta would hold the pod's namespace. So that works.
I'm thinking that selective filtering need not be terribly complicated at first: kubernetes/kubernetes#70620
@m1093782566 I think we should update the KEP to reflect the creation of the PodLocator. The watch notification filtering could be a ways off and I don't want to make this dependent on it.
This helps with network traffic, but we still need to do filtering on the apiserver side. And given that the apiserver side is my main concern, I'm not entirely sure how much it would help.
@johnbelamaric Sure. I will update the proposal in a few days - I am at KubeCon Shanghai now.
REMINDER: KEPs are moving to k/enhancements on November 30. Please attempt to merge this KEP before then to signal consensus. Any questions regarding this move should be directed to that thread and not asked on GitHub.
Force-pushed from c52c7e8 to 27be266.
@thockin @johnbelamaric @wojtek-t Per the discussion, I just updated the KEP to reflect the creation of the PodLocator. Please take a look when you have a chance. Thanks!
Kube-proxy will watch its own node and will find the endpoints that are in the same topology domain as the node if `service.TopologyKeys` is not empty.

Kube-proxy will watch PodLocator in addition to Service and Endpoints. For each Endpoints object, kube-proxy will find the original Pod via EndpointAddress.TargetRef, and therefore get the PodLocator object and its topology information. Kube-proxy will only create proxy rules for endpoints that are in the same topology domain as the node running kube-proxy.
Why does kube-proxy need PodLocator? It already knows its own Node.
I think kube-proxy still needs PodLocator if it doesn't watch ALL Nodes (only knowing its own Node is not sufficient), as it needs to map from `Endpoints.subsets[].addresses.nodeName` -> `PodLocator.nodeName` -> `PodLocator.nodeLabels` - it's option (3) :)
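For reference, a rough sketch of that mapping (the PodLocator schema is not final; field and function names here are assumptions):

```go
package proxy

// PodLocator is a slimmed-down per-pod object carrying just enough placement
// information for topology-aware routing.
type PodLocator struct {
	Namespace  string
	Name       string
	IPs        []string          // pod IPs (e.g. dual-stack IPv4/IPv6)
	NodeName   string
	NodeLabels map[string]string // topology-relevant labels of the hosting node
}

// nodeLabelsForEndpoint resolves an endpoint address to its node's labels by
// looking up the PodLocator for the pod referenced by EndpointAddress.TargetRef.
func nodeLabelsForEndpoint(locators map[string]PodLocator, podNamespace, podName string) (map[string]string, bool) {
	loc, ok := locators[podNamespace+"/"+podName]
	if !ok {
		return nil, false
	}
	return loc.NodeLabels, true
}
```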
See the other (larger) comment wherein we discuss tradeoffs of watching Nodes vs PodLocators.
### DNS server changes

We should consider this kind of topology support for headless services in CoreDNS and kube-dns. As the DNS servers will respect topology keys for each headless service, different clients/pods on different nodes may get different DNS responses.
Is the plan to do this all at the same time? Or to alpha without this and add it for beta?
> Or to alpha without this and add it for beta?

Sounds like a good plan; we can achieve this feature step by step. I will update this section and outline that headless services will be supported in the beta stage.
Force-pushed from aa09e63 to 1c94537.
Force-pushed from 1c94537 to 6f7d1f8.
KEPs have moved to k/enhancements. Any questions regarding this move should be directed to that thread and not asked on GitHub.
@justaugustus: Closed this PR.
I just created a PR in k/enhancements; you can continue to put comments there. From my understanding, this proposal is almost ready to be merged?
It's a pain point for multi-zone cluster deployments, since cross-zone network traffic is charged while in-zone traffic is not. In addition, cross-node traffic may carry sensitive metadata from other nodes. Therefore, users always prefer service backends that are close to them, e.g. in the same zone, rack, or host, for security, performance, and cost reasons.
The Kubernetes scheduler can constrain a pod to run only on particular nodes/zones. However, the Kubernetes service proxy just randomly picks an available backend for service routing, and that backend can be very far from the user, so we need a topology-aware service routing solution in Kubernetes. Basically, it should find the nearest service backend.
We have already supported this feature in Huawei Cloud and would like to contribute it to the community if people are happy with it.