
kube-proxy: initialization check race leads to stale UDP conntrack #126468


Closed
wedaly opened this issue Jul 30, 2024 · 8 comments · Fixed by #126532
Labels: kind/bug · sig/network · triage/accepted

@wedaly (Member) commented Jul 30, 2024

What happened?

An AKS customer reported repeated issues in their clusters where:

  1. kube-proxy would redeploy (e.g. due to AKS deploying a new kube-proxy image with CVE fixes)
  2. Envoy's c-ares DNS client would send repeated DNS queries to the kube-dns service VIP from the same source IP address and port, creating a UDP conntrack entry to the svc VIP and keeping it alive. For example, with kube-dns svc VIP 192.168.0.10, conntrack showed:
# conntrack -L | grep 192.168.0.10 | grep UNREPLIED
udp      17 29 src=10.120.1.150 dst=192.168.0.10 sport=49660 dport=53 [UNREPLIED] src=192.168.0.10 dst=10.120.1.150 sport=53 dport=49660 mark=0 use=1
conntrack v1.4.5 (conntrack-tools): 2273 flow entries have been shown
  3. DNS queries to the svc VIP were blackholed until the customer manually ran a command like: conntrack -D -p udp --src 10.120.1.150 --sport 49660

The issue occurred only in clusters with many services and endpoints (~10k services and ~6k endpoints). However, the customer had seen it repeatedly for months, about 1-5 times per week across their clusters.

What did you expect to happen?

The kube-proxy code that deletes stale UDP conntrack entries should have removed the conntrack entry to the kube-dns svc VIP automatically in the first sync after the service and endpointslice caches were initialized.

How can we reproduce it (as minimally and precisely as possible)?

Unfortunately, the customer said they were unable to reproduce this issue in any environment outside of their production clusters. AKS engineers were also unable to reproduce it.

Update (2024-08-03): I'm now able to reproduce the issue using a DNS client that reuses the same src IP / port and clears DNAT conntrack entries between queries, in a k8s 1.29.7 cluster with 2,000 services. The bug is triggered reliably in this setup with kubectl rollout restart -n kube-system ds/kube-proxy. The test client code, scripts, and steps to reproduce are in this repository: https://github.com/wedaly/dns-blackhole-tester
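For illustration only, here is a minimal sketch of that kind of client, assuming a fixed UDP source port (49660) and the kube-dns svc VIP 192.168.0.10 from the conntrack output above. This is not the actual dns-blackhole-tester code, and the DNAT conntrack flushing between queries would be done separately (e.g. by a script running conntrack -D):

// repro_client.go: minimal sketch (not the actual dns-blackhole-tester code) of a
// DNS client that pins its UDP source port, so every query to the kube-dns service
// VIP reuses the same conntrack 5-tuple. The VIP and port below are illustrative.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	resolver := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{
				// Fixed local port so repeated queries share one UDP 5-tuple.
				LocalAddr: &net.UDPAddr{Port: 49660},
			}
			// Query the kube-dns service VIP directly (illustrative address).
			return d.DialContext(ctx, "udp", "192.168.0.10:53")
		},
	}

	for {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		addrs, err := resolver.LookupHost(ctx, "kubernetes.default.svc.cluster.local")
		cancel()
		fmt.Println(addrs, err)
		time.Sleep(time.Second)
	}
}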

Anything else we need to know?

The customer shared kube-proxy logs (at verbosity "3") from when the issue occurred. Unfortunately, I don't have permission to share the full logs publicly, but I think what I'm seeing in the logs gives a clue about what's happening.

  1. By 05:00:02.477433, nodes, endpointslices, and services had all synced:
I0723 05:00:02.000779       1 config.go:322] "Calling handler.OnNodeSynced()"
I0723 05:00:02.177877       1 config.go:104] "Calling handler.OnEndpointSlicesSynced()"
I0723 05:00:02.477433       1 config.go:195] "Calling handler.OnServiceSynced()"
  2. The kube-dns endpointslices were processed after OnEndpointSlicesSynced(), but before the first iptables sync:
I0723 05:00:02.760960       1 endpointslicecache.go:358] "Setting endpoints for service port name" portName="kube-system/kube-dns:dns
  3. The first iptables sync at 05:00:02.811044 shows 4658 services, far fewer than the number of services in the cluster. (We can't tell whether kube-dns is one of these services, since kube-proxy at -v=3 doesn't log which services were processed.)
I0723 05:00:02.811044       1 proxier.go:790] "Syncing iptables rules"
I0723 05:00:04.611400       1 proxier.go:1497] "Reloading service iptables data" numServices=4658 numEndpoints=3188 numFilterChains=6 numFilterRules=2919 numNATChains=9867 numNATRules=18214
I0723 05:00:05.721503       1 proxier.go:784] "SyncProxyRules complete" elapsed="3.244002731s"
  4. Between OnServiceSynced() at 05:00:02.477433 and the next sync at 05:00:05.799348, we see thousands of lines that look like service and endpointslice updates:
I0723 05:00:02.492956       1 endpointslicecache.go:358] "Setting endpoints for service port name"
...
I0723 05:00:02.810688       1 utils.go:133] "Skipping service due to cluster IP"
  5. The second sync at 05:00:05.799348 shows 9605 services:
I0723 05:00:05.799348       1 proxier.go:1497] "Reloading service iptables data" numServices=9605 numEndpoints=6244 numFilterChains=6 numFilterRules=5978 numNATChains=4978 numNATRules=14818

I believe this could explain why kube-proxy does not clean up the stale UDP conntrack entry:

  • kube-proxy detects that the endpointslice and service caches are synced using endpointSliceInformer.Informer().HasSynced and serviceInformer.Informer().HasSynced respectively.
  • The comments on the HasSynced method in the SharedInformer interface say that it indicates the informer cache is up-to-date, but warn: "Note that this doesn't tell you if an individual handler is synced!! For that, please call HasSynced on the handle returned by AddEventHandler."
  • This comment, along with the output of kube-proxy logs, makes me suspect that kube-proxy is initiating a sync before it has received and processed all the services and endpoints in the informer cache.
  • If so, this could explain why the "delete conntrack" codepath is skipped.

If this theory is correct, then I wonder if kube-proxy could change the initialization check to ensure that all pre-sync events from the informer cache are delivered before the first iptables sync, maybe like this:

diff --git a/pkg/proxy/config/config.go b/pkg/proxy/config/config.go
index 373cd0a1c64..5097ab52715 100644
--- a/pkg/proxy/config/config.go
+++ b/pkg/proxy/config/config.go
@@ -78,11 +78,10 @@ type EndpointSliceConfig struct {
 // NewEndpointSliceConfig creates a new EndpointSliceConfig.
 func NewEndpointSliceConfig(ctx context.Context, endpointSliceInformer discoveryv1informers.EndpointSliceInformer, resyncPeriod time.Duration) *EndpointSliceConfig {
 	result := &EndpointSliceConfig{
-		listerSynced: endpointSliceInformer.Informer().HasSynced,
-		logger:       klog.FromContext(ctx),
+		logger: klog.FromContext(ctx),
 	}
 
-	_, _ = endpointSliceInformer.Informer().AddEventHandlerWithResyncPeriod(
+	handlerRegistration, _ := endpointSliceInformer.Informer().AddEventHandlerWithResyncPeriod(
 		cache.ResourceEventHandlerFuncs{
 			AddFunc:    result.handleAddEndpointSlice,
 			UpdateFunc: result.handleUpdateEndpointSlice,
@@ -91,6 +90,8 @@ func NewEndpointSliceConfig(ctx context.Context, endpointSliceInformer discovery
 		resyncPeriod,
 	)
 
+	result.listerSynced = handlerRegistration.HasSynced
+
 	return result
 }
 
@@ -171,11 +172,10 @@ type ServiceConfig struct {
 // NewServiceConfig creates a new ServiceConfig.
 func NewServiceConfig(ctx context.Context, serviceInformer v1informers.ServiceInformer, resyncPeriod time.Duration) *ServiceConfig {
 	result := &ServiceConfig{
-		listerSynced: serviceInformer.Informer().HasSynced,
-		logger:       klog.FromContext(ctx),
+		logger: klog.FromContext(ctx),
 	}
 
-	_, _ = serviceInformer.Informer().AddEventHandlerWithResyncPeriod(
+	handlerRegistration, _ := serviceInformer.Informer().AddEventHandlerWithResyncPeriod(
 		cache.ResourceEventHandlerFuncs{
 			AddFunc:    result.handleAddService,
 			UpdateFunc: result.handleUpdateService,
@@ -184,6 +184,8 @@ func NewServiceConfig(ctx context.Context, serviceInformer v1informers.ServiceIn
 		resyncPeriod,
 	)
 
+	result.listerSynced = handlerRegistration.HasSynced
+
 	return result
 }

Should kube-proxy wait until all pre-sync events are delivered before setting isInitialized=true?
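For context, here is a standalone sketch of the client-go pattern the diff above relies on: AddEventHandlerWithResyncPeriod (and AddEventHandler) return a registration whose HasSynced only reports true once all pre-sync events have been delivered to the handler, whereas the informer's own HasSynced only says the underlying cache is populated. The kubeconfig loading and the Service handler below are illustrative, not kube-proxy code:

// handler_sync_sketch.go: standalone sketch of the handler-registration HasSynced
// pattern; the setup and handler here are illustrative, not kube-proxy code.
package main

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/klog/v2"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(client, 30*time.Minute)
	svcInformer := factory.Core().V1().Services().Informer()

	// AddEventHandler returns a registration; its HasSynced covers handler
	// delivery, not just population of the informer cache.
	registration, err := svcInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			svc := obj.(*v1.Service)
			klog.InfoS("Service added", "namespace", svc.Namespace, "name", svc.Name)
		},
	})
	if err != nil {
		klog.Fatal(err)
	}

	ctx := context.Background()
	factory.Start(ctx.Done())

	// Waiting on registration.HasSynced (instead of svcInformer.HasSynced)
	// guarantees every pre-existing Service has been handed to AddFunc before
	// proceeding -- the property kube-proxy needs before its first sync.
	if !cache.WaitForCacheSync(ctx.Done(), registration.HasSynced) {
		klog.Fatal("timed out waiting for the Service handler to sync")
	}
	klog.Info("all pre-sync Service events delivered")
}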

Kubernetes version

kube-proxy version is v1.28.5

I0723 05:00:01.875477       1 server.go:846] "Version info" version="v1.28.5"

Note that the kube-proxy image that AKS built for 1.28.5 includes the backported fix for the race condition in iptables partial sync: #122757

Cloud provider

Azure Kubernetes Service

OS version

Linux (I don't have the exact version but could get it if it's relevant)

Install tools

N/A

Container runtime (CRI) and version (if applicable)

N/A

Related plugins (CNI, CSI, ...) and versions (if applicable)

N/A

@wedaly added the kind/bug label Jul 30, 2024
@k8s-ci-robot added the needs-sig and needs-triage labels Jul 30, 2024
@wedaly (Member, Author) commented Jul 30, 2024

Possibly this issue would be addressed by #126130

@MikeZappa87 commented:

/sig network

@k8s-ci-robot added the sig/network label and removed the needs-sig label Jul 30, 2024
@aojea (Member) commented Jul 30, 2024

Possibly this issue would be addressed by #126130

You can work around this in the meantime with an init container flushing all UDP entries: conntrack -D -p udp

@andrewtagg-db commented:

Possibly this issue would be addressed by #126130

You can work around this in the meantime with an init container flushing all UDP entries: conntrack -D -p udp

Just to clarify, an init container on the kube-proxy pod?

@wedaly (Member, Author) commented Jul 31, 2024

Possibly this issue would be addressed by #126130

You can work around this in the meantime with an init container flushing all UDP entries: conntrack -D -p udp

If I'm reading the code correctly, when kube-dns is missing from svcPortMap, the iptables restore on the first sync won't install DNAT rules for kube-dns:

for svcName, svc := range proxier.svcPortMap {

If so, I think even with an init container flushing conntrack, there's still a period between iptables restore and the next iptables sync where a DNS client could create a conntrack entry to the svc VIP.

@aojea (Member) commented Aug 1, 2024

If so, I think even with an init container flushing conntrack, there's still a period between iptables restore and the next iptables sync where a DNS client could create a conntrack entry to the svc VIP.

I see. I don't have time to test it, but then you need to do two things:

  1. clean up the kube-proxy iptables rules to avoid having stale entries for services whose endpoints no longer exist
  2. flush the existing UDP entries directed to Services

A strawman approach to add to the existing kube-proxy manifest:

  initContainers:
  - name: cleanup
    image: kube-proxy-amd64:v1.29.6
    command:
    - /bin/sh
    - -c
    - exec kube-proxy --cleanup && conntrack -D -p udp || true

@shaneutt (Member) commented Aug 1, 2024

@wedaly joined us on the call today, and @aojea confirmed that we should move forward with the patch he proposed:

/triage accepted
/assign @wedaly

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label Aug 1, 2024
@wedaly (Member, Author) commented Aug 3, 2024

I'm now able to reproduce the bug reliably using the test client and steps here: https://github.com/wedaly/dns-blackhole-tester

After applying the patch to wait for handlerRegistration.HasSynced, the issue no longer occurs.

wedaly pushed a commit to wedaly/kubernetes that referenced this issue Aug 3, 2024
Ensure kube-proxy waits for the services/endpointslices informer
caches to be synced *and* all pre-sync events delivered before
setting isInitialized=true. Otherwise, in clusters with many services,
some services may be missing from svcPortMap when kube-proxy starts
(e.g. during daemonset rollout). This can cause kube-proxy to temporarily
remove service DNAT rules and then skip cleanup of UDP conntrack entries
to a service VIP (see kubernetes#126468).

Fix it by waiting for the informer event handler to finish
delivering all pre-sync events.
wedaly pushed a commit to wedaly/kubernetes that referenced this issue Aug 14, 2024
wedaly pushed a commit to wedaly/kubernetes that referenced this issue Aug 14, 2024
wedaly pushed a commit to wedaly/kubernetes that referenced this issue Aug 14, 2024
wedaly pushed a commit to wedaly/kubernetes that referenced this issue Aug 14, 2024
hungnguyen243 pushed a commit to hungnguyen243/kubernetes that referenced this issue Aug 16, 2024
SoulPancake pushed a commit to SoulPancake/k8s that referenced this issue Sep 11, 2024
princepereira pushed a commit to princepereira/kubernetes that referenced this issue Feb 19, 2025