kube-proxy: initialization check race leads to stale UDP conntrack #126468
Comments
Possibly this issue would be addressed by #126130

/sig network

You can work around this in the meantime with an init container flushing all UDP conntrack entries.

Just to clarify, an init container on the kube-proxy pod?

If I'm reading the code correctly, when kube-dns is missing from … (kubernetes/pkg/proxy/iptables/proxier.go, line 987 at f8d5b20) …

If so, I think even with an init container flushing conntrack, there's still a period between the iptables restore and the next iptables sync where a DNS client could create a conntrack entry to the svc VIP.

I see. I don't have time to test it, but then you need to do two things; a strawman approach to add to the existing kube-proxy manifest: …

I'm now able to reproduce the bug reliably using the test client and steps here: https://github.com/wedaly/dns-blackhole-tester
After applying the patch to wait for …
Ensure kube-proxy waits for the services/endpointslices informer caches to be synced *and* all pre-sync events delivered before setting isInitialized=true. Otherwise, in clusters with many services, some services may be missing from svcPortMap when kube-proxy starts (e.g. during daemonset rollout). This can cause kube-proxy to temporarily remove service DNAT rules and then skip cleanup of UDP conntrack entries to a service VIP (see kubernetes#126468). Fix it by waiting for the informer event handler to finish delivering all pre-sync events.
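To make the failure mode described in that commit message concrete, here is a toy sketch. All names here (`servicePortName`, `svcPortMap`, `deletedUDPEndpoints`) are simplified stand-ins chosen for illustration, not kube-proxy's actual types: if the first sync runs before the kube-dns Service event has been delivered, the conntrack cleanup for its VIP is skipped.

```go
package main

import "fmt"

// Simplified, hypothetical stand-in for kube-proxy state; for illustration only.
type servicePortName string

func main() {
	// UDP endpoints whose last ready endpoint disappeared in this sync.
	deletedUDPEndpoints := []servicePortName{"kube-system/kube-dns:dns"}

	// Because isInitialized was set before every pre-sync Service event was
	// delivered, kube-dns has not yet been added to the service map.
	svcPortMap := map[servicePortName]string{
		// "kube-system/kube-dns:dns": "10.0.0.10", // missing on the first sync
	}

	for _, spn := range deletedUDPEndpoints {
		vip, ok := svcPortMap[spn]
		if !ok {
			// Cleanup is skipped, so the client's stale UDP conntrack entry
			// to the service VIP survives and keeps blackholing DNS traffic.
			fmt.Printf("skip conntrack cleanup for %s: not in service map\n", spn)
			continue
		}
		fmt.Printf("flush stale UDP conntrack entries for VIP %s\n", vip)
	}
}
```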
What happened?
AKS had a customer report repeated issues in their clusters where DNS traffic from a pod to the kube-dns service VIP was blackholed by a stale UDP conntrack entry, and only recovered after the entry was deleted manually, for example:
conntrack -D -p udp --src 10.120.1.150 --sport 49660
The issue occurred only in clusters with many services and endpoints (~10k services and ~6k endpoints). However, the customer has seen this issue repeatedly for months, about 1-5 times per week across their clusters.
What did you expect to happen?
kube-proxy's logic for deleting stale UDP conntrack entries should have removed the conntrack entry for the kube-dns service VIP automatically in the first sync after the service and endpointslice caches were initialized.
How can we reproduce it (as minimally and precisely as possible)?
Unfortunately, the customer said they were unable to reproduce this issue in other environments outside of their production clusters. AKS engineers were also unable to repro this issue.
Update (2024-08-03): I'm now able to reproduce the issue using a DNS client that reuses the same src IP / port and clears DNAT conntrack entries between queries, in a k8s 1.29.7 cluster with 2,000 services. The bug is triggered reliably in this setup with
kubectl rollout restart -n kube-system ds/kube-proxy
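For illustration, here is a minimal sketch of such a DNS client: it binds a fixed local UDP port so every query reuses the same source IP/port, which is what lets a single stale conntrack entry blackhole all subsequent queries. This is an assumption-level sketch (using the well-known github.com/miekg/dns package and made-up addresses), not the code from the repository linked below; the real tester also clears the DNAT conntrack entry between queries, which is omitted here.

```go
package main

import (
	"fmt"
	"net"
	"time"

	"github.com/miekg/dns" // assumed dependency, used only to build the query
)

func main() {
	// Fixed source port: every query reuses the same UDP 5-tuple, so one stale
	// conntrack entry affects all of them. Addresses below are examples.
	laddr := &net.UDPAddr{Port: 49660}
	raddr := &net.UDPAddr{IP: net.ParseIP("10.0.0.10"), Port: 53} // kube-dns VIP

	msg := new(dns.Msg)
	msg.SetQuestion("kubernetes.default.svc.cluster.local.", dns.TypeA)
	query, err := msg.Pack()
	if err != nil {
		panic(err)
	}

	for {
		conn, err := net.DialUDP("udp", laddr, raddr)
		if err != nil {
			panic(err)
		}
		conn.SetDeadline(time.Now().Add(2 * time.Second))
		buf := make([]byte, 512)
		if _, err = conn.Write(query); err == nil {
			_, err = conn.Read(buf)
		}
		if err != nil {
			fmt.Println("query failed (possible conntrack blackhole):", err)
		} else {
			fmt.Println("query answered")
		}
		conn.Close()
		time.Sleep(time.Second)
	}
}
```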
The test client code, scripts, and steps to reproduce are in this repository: https://github.com/wedaly/dns-blackhole-tester

Anything else we need to know?
The customer shared kube-proxy logs (at verbosity "3") from when the issue occurred. Unfortunately, I don't have permission to share the full logs publicly, but I think what I'm seeing in the logs gives a clue about what's happening.
From the logs, kube-proxy logged `OnEndpointSlicesSynced()` and `OnServiceSynced()`, and service/endpointslice update events kept arriving after that, but before the first iptables sync (`-v=3` doesn't log which services were processed). Between `OnServiceSynced()` at 05:00:02.47743 and the next sync at 05:00:05.799348, we see thousands of lines that look like services and endpointslices updates.

I believe this could explain why kube-proxy does not clean up the stale UDP conntrack entry:

- kube-proxy's `OnEndpointSlicesSynced()` and `OnServiceSynced()` fire once `endpointSliceInformer.Informer().HasSynced` and `serviceInformer.Informer().HasSynced` return true, respectively. The docs for the `HasSynced` method in the `SharedInformer` interface say that this indicates the informer cache is up-to-date, but warn: "Note that this doesn't tell you if an individual handler is synced!! For that, please call HasSynced on the handle returned by AddEventHandler."
- So at the time of the first iptables sync, some services may not yet have been delivered to kube-proxy's event handlers and added to `svcPortMap`.
- If `svcPortMap` is missing kube-dns, but `DeletedUDPEndpoints` includes kube-dns, then kube-proxy would skip deletion of the stale conntrack entry. Subsequent syncs would see at least one endpoint for kube-dns, so would continue to skip deletion.

If this theory is correct, then I wonder if kube-proxy could change the initialization check to ensure that all pre-sync events from the informer cache are delivered before the first iptables sync, maybe like this:
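The snippet from the original comment isn't preserved in this copy, so here is a minimal sketch of the idea, assuming client-go's `ResourceEventHandlerRegistration.HasSynced` (the handle returned by `AddEventHandler`) is used in addition to the informer's `HasSynced`. The wiring below (a standalone informer factory and a trivial handler) is illustrative and not the actual kube-proxy patch:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 0)
	serviceInformer := factory.Core().V1().Services().Informer()

	// AddEventHandler returns a registration whose HasSynced() only reports
	// true after all pre-sync (initial list) events have been delivered to
	// this handler, which is the stronger guarantee this issue asks for.
	reg, err := serviceInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			svc := obj.(*v1.Service)
			fmt.Printf("add %s/%s\n", svc.Namespace, svc.Name) // e.g. update svcPortMap here
		},
	})
	if err != nil {
		panic(err)
	}

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)

	// serviceInformer.HasSynced alone only says the cache is up to date;
	// reg.HasSynced additionally waits for this handler to receive every
	// pre-sync event.
	if !cache.WaitForCacheSync(stopCh, serviceInformer.HasSynced, reg.HasSynced) {
		panic("timed out waiting for caches to sync")
	}

	// Only at this point would kube-proxy set isInitialized=true and run the
	// first iptables sync.
	fmt.Println("all pre-sync service events delivered")
}
```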
Should kube-proxy wait until all pre-sync events are delivered before setting `isInitialized=true`?

Kubernetes version
kube-proxy version is v1.28.5
Note that the kube-proxy image that AKS built for 1.28.5 already includes the backported fix for the iptables partial-sync race condition: #122757
Cloud provider
Azure Kubernetes Service
OS version
Linux (I don't have the exact version but could get it if it's relevant)
Install tools
N/A
Container runtime (CRI) and version (if applicable)
N/A
Related plugins (CNI, CSI, ...) and versions (if applicable)
N/A