KEP: Topology-aware service routing #2846
This may visibly increase the size of the Endpoints object, which (because all kube-proxies watch all Endpoints objects) already causes a bunch of scalability/performance issues.
So let me suggest a somewhat different approach (roughly in line with what @thockin already mentioned in a different comment):
So what if we would:
@thockin @johnbelamaric for thoughts
So, with your suggested new approach, it seems we don't need the Topologies field in the Endpoints object, and the Endpoints controller will not need to watch nodes?
Correct - though, before you change anything, I would like to hear others' opinions about that.
This would mean, though, that kube-proxy needs to watch this new object in order to map nodeName -> topology and apply the filtering. Is that really better than adding a field to EndpointAddress?
Adding the field means that we have one more watch (nodes) from the endpoints controller, versus one per node from kube-proxy.
Also, note that we do NOT need to copy all topology labels to all endpoint addresses. Since we know the topology labels that are needed to make the service policy decision for each specific endpoint, we only need to copy those to the endpoint addresses. So, if someone does not define any service policy, then Topologies is nil. If someone defines a service policy for "node local", then the kubernetes.io/hostname label gets copied to each endpoint address for that service. No other labels need to get copied.
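For illustration, a minimal sketch of that label-copying step on the endpoints controller side, assuming the service policy is expressed as a list of topology keys (the names below are placeholders, not the final API):

```go
// Illustrative sketch only: ServiceTopologyKeys and the returned Topologies
// map are assumptions about the eventual API, not its final shape.
package endpoints

// copyNeededTopologyLabels copies only the topology labels that the
// service's policy actually references from the backing node's labels.
// If the service defines no topology policy, nothing is copied and the
// endpoint address's Topologies stays nil.
func copyNeededTopologyLabels(serviceTopologyKeys []string, nodeLabels map[string]string) map[string]string {
	if len(serviceTopologyKeys) == 0 {
		return nil // no service policy -> Topologies stays nil
	}
	topologies := make(map[string]string, len(serviceTopologyKeys))
	for _, key := range serviceTopologyKeys {
		// e.g. key == "kubernetes.io/hostname" for a "node local" policy
		if value, ok := nodeLabels[key]; ok {
			topologies[key] = value
		}
	}
	return topologies
}
```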
Number of watches itself is not the issue.
It's the amount of data that needs to be processed and sent that causes the issues.
So the key factors here are:
That seems to be waaaay more expensive.
Sure - I get that. But we can't assume that people won't be using that feature and therefore decide not to care about its performance.
What I described here is kind of an extension of that.
@thockin ^^
To be clear, I don't think we can avoid PodLocator. I agree that DNS needs it, at least.
The question for me is whether kube-proxy uses PodLocator (which I think is risky because high-churn clusters could be worse than watching nodes) or NodeLocator (which I really don't want to implement because it feels like a hack) or Nodes (which @wojtek-t correctly ruled out).
There must be a CPU efficient way to do metadata-only watches, but it may be a big enough redesign that it's simply infeasible. E.g. Brian suggested storing metadata as a separate resource in etcd. I'm not sure that this problem (avoiding NodeLocator or NodeMeta as a resource) rises to the level of requiring such a large change, but maybe worth considering.
I expect that decomposing all objects into storage is a HUGE effort, but what if every write operation to an object key also wrote to a parallel (hidden) key in etcd which was just the metadata of that object? Then watching metadata could be less costly? Are we open to using transactions in etcd for this?
I'm just trying to avoid setting what I think is a really unfortunate precedent, just because we have insufficient infrastructure to do the right thing...
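A rough sketch of that "parallel metadata key" idea, using an etcd transaction so the two writes stay atomic; the key layout (/registry/... mirrored under /meta/...) is invented here purely for illustration:

```go
// Sketch only: the mirrored key layout is an assumption, not an existing
// apiserver convention.
package main

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// putWithMetadataMirror writes the full object and, in the same etcd
// transaction, a parallel hidden key holding just the serialized metadata,
// so metadata-only watchers could watch the mirror prefix instead of the
// full objects.
func putWithMetadataMirror(cli *clientv3.Client, key string, fullObject, metadataOnly []byte) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	_, err := cli.Txn(ctx).Then(
		clientv3.OpPut(key, string(fullObject)),          // e.g. /registry/nodes/node-1
		clientv3.OpPut("/meta"+key, string(metadataOnly)), // e.g. /meta/registry/nodes/node-1
	).Commit()
	return err
}
```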
I agree with most of @johnbelamaric's opinions. If we want to keep this proposal simple, PodLocator should be a good starting point.
@thockin - I guess I may be a bit too paranoid here...
I think that splitting object metadata into a separate object in etcd would work, but it's a hell of a lot of work. Fortunately, we are serving watches (by default) from the apiserver anyway, using the watchcache.
So actually I can imagine customizing it a bit more and allowing it to produce a separate stream for just metadata. What I was afraid of is that there will be multiple watchers (equal to the number of nodes) all watching the same thing. But those won't have any selectors etc., so we should be able to do all the computations once (i.e. check whether something changed).
The question now is whether we send all updates of Node metadata or only those where something really changed (note that the RV always changes). If all, that is a really huge amount of data, which is not going to work in my opinion...
But I think that with some additional work in apiserver storage layer, it should be possible to do it somewhat efficiently without any huge changes. So I think it may actually work.
The argument that I still don't fully buy is why we are afraid of watching all PodLocators from all nodes. The amount of data (at least as of now) would still be smaller than the amount of data from nodes (assuming we send out all updates, not only "real" updates). But if we start thinking more about the future and the goal of supporting much higher churn, that may no longer be the case. So I'm probably starting to buy this argument (at least to some extent).
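A minimal sketch of the "only send real changes" filtering, assuming a hypothetical helper in the watch path that drops events where nothing besides the ResourceVersion changed:

```go
// Sketch of an assumed helper, not existing apiserver code.
package watchcache

import (
	"reflect"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// metadataReallyChanged reports whether old and new ObjectMeta differ in
// anything other than ResourceVersion (which changes on every write), so
// RV-only updates can be dropped instead of fanned out to all watchers.
func metadataReallyChanged(oldMeta, newMeta metav1.ObjectMeta) bool {
	oldCopy := oldMeta.DeepCopy()
	newCopy := newMeta.DeepCopy()
	oldCopy.ResourceVersion = ""
	newCopy.ResourceVersion = ""
	return !reflect.DeepEqual(oldCopy, newCopy)
}
```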
@m1093782566 I am OK to start by using PodLocator in kube-proxy, but we should arrange the code such that it could be replaced with Node metadata if/when we can make that work.
@wojtek-t
NB I was advocating replicating it (storing twice) not splitting, which I HOPED would be less work :)
I think we'd have to filter out RV changes that don't affect the metadata itself
This is what I was hoping you'd eventually say if I threw enough bad ideas at you. We can start with PodLocator, but I'd really like to get metadata watches onto the plan as a general thing.
The number of PodLocators is (generally) going to scale with cluster size. E.g. a 5k-node cluster with 150k Pods has 150k PodLocators. If each node churns 1 pod per 10 seconds, that's 500 QPS to 5000 nodes == 2.5M QPS through the API server for literally no value -- all kube-proxy really needs is node labels, which probably change a few times per DAY in an autoscaled cluster.
So as above, I am ok to use PodLocator to get started on the feature, but I think it will explode in high-scale clusters, so we'll either need efficient metadata watches or we'll need a NodeLocator pre-cooker.
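The same back-of-the-envelope math, spelled out with the illustrative numbers from above:

```go
// Illustrative arithmetic only; the cluster size and churn rate are the
// example numbers from the comment above, not measurements.
package main

import "fmt"

func main() {
	nodes := 5000
	podChurnPerNodePerSec := 0.1 // 1 pod per 10 seconds per node

	clusterChurnQPS := float64(nodes) * podChurnPerNodePerSec // ~500 PodLocator updates/sec
	fanOutQPS := clusterChurnQPS * float64(nodes)             // every update goes to every kube-proxy

	fmt.Printf("~%.0f updates/sec fanned out to %d watchers = ~%.1fM QPS\n",
		clusterChurnQPS, nodes, fanOutQPS/1e6) // ~500 * 5000 = ~2.5M QPS
}
```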
@thockin
Definitely!
Should this impact kube-dns as well?
To support it for headless services, it needs to be supported in DNS as well; we can add it to CoreDNS when the time comes.
cc: @chrisohaver @fturib
Good question - it would be somewhat complex for DNS to do split-horizon responses based on client IP, but that is the implication of this (for headless services, anyway). @johnbelamaric
especially since client IP -> Node mapping is not always as easy as a CIDR (some people do not use that and assign arbitrary /32s or multiple /28s)
It requires watching pods, which isn't something we would want to do in general, and not something we would want to adapt kube-dns to do. CoreDNS can optionally do that already, so if the customer is willing to accept the cost, we can do it for headless services.
For the purposes of this KEP we can make it clusterIP services only, and then the headless version can become a CoreDNS feature, maybe. If we do that, though, should we enforce anything in the API when a user configures topology for a headless service (assuming we go the route of adding this to Service rather than a separate resource)? Otherwise it could create expectations that aren't really being met.
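A purely hypothetical sketch of what such an API check could look like, assuming the topology keys end up on the Service spec (the function and parameter names here are placeholders):

```go
// Hypothetical validation sketch; not part of any existing API machinery.
package validation

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// validateTopologyForHeadless rejects topology configuration on a headless
// service (ClusterIP == "None"), so users don't get expectations that DNS
// will not actually meet yet.
func validateTopologyForHeadless(svc *v1.Service, topologyKeys []string) error {
	if svc.Spec.ClusterIP == v1.ClusterIPNone && len(topologyKeys) > 0 {
		return fmt.Errorf("topology-aware routing is not supported for headless service %s/%s",
			svc.Namespace, svc.Name)
	}
	return nil
}
```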
I think I missed the headless service part in the proposal; I will take a deeper look at its effect.
For other kinds of services, I assume there is no impact on DNS?
The document has been updated to include the DNS changes.
Why does kube-proxy need PodLocator? It already knows its own Node.
I think kube-proxy still needs PodLocator if it doesn't watch ALL Nodes (knowing only its own Node is not sufficient), as it needs to map from
Endpoints.subsets[].addresses.nodeName
->PodLocator.nodeName
->PodLocator.nodeLabels
- it's option (3) :)
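A sketch of that mapping as kube-proxy might do it, with PodLocator as a hypothetical stand-in type (the real shape would come from the API definition in this KEP):

```go
// Sketch only: PodLocator here is a placeholder with just the fields the
// mapping above needs.
package proxy

// PodLocator maps an endpoint's node name to that node's topology labels.
type PodLocator struct {
	NodeName   string
	NodeLabels map[string]string
}

// topologyForEndpoint resolves Endpoints.subsets[].addresses.nodeName to
// the node's topology labels via the PodLocators kube-proxy watches.
func topologyForEndpoint(endpointNodeName string, locators []PodLocator) map[string]string {
	for _, loc := range locators {
		if loc.NodeName == endpointNodeName {
			return loc.NodeLabels
		}
	}
	return nil // unknown node: no topology info; the caller decides how to filter
}
```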
See the other (larger) comment wherein we discuss the tradeoffs of watching nodes vs. PodLocators.
Is the plan to do this all at the same time? Or to alpha without this and add it for beta?
Sounds like a good plan; we can deliver this feature step by step. I will update this section and note that headless services will be supported in the beta stage.