
kube-proxy: initialization check race leads to stale UDP conntrack #126468


Closed
wedaly opened this issue Jul 30, 2024 · 8 comments · Fixed by #126532
Labels: kind/bug · sig/network · triage/accepted

@wedaly (Member) commented Jul 30, 2024

What happened?

An AKS customer reported repeated issues in their clusters where:

  1. kube-proxy would redeploy (e.g. due to AKS deploying a new kube-proxy image with CVE fixes)
  2. Envoy's c-ares DNS client would send repeated DNS queries to the kube-dns service VIP from the same source IP address and port, creating a UDP conntrack entry to the svc VIP and keeping it alive. For example, with kube-dns svc VIP 192.168.0.10, conntrack showed:
# conntrack -L | grep 192.168.0.10 | grep UNREPLIED
udp      17 29 src=10.120.1.150 dst=192.168.0.10 sport=49660 dport=53 [UNREPLIED] src=192.168.0.10 dst=10.120.1.150 sport=53 dport=49660 mark=0 use=1
conntrack v1.4.5 (conntrack-tools): 2273 flow entries have been shown
  3. DNS queries to the svc VIP were blackholed until the customer manually ran a command like: conntrack -D -p udp --src 10.120.1.150 --sport 49660

The issue occurred only in clusters with many services and endpoints (~10k services and ~6k endpoints). However, the customer had seen it repeatedly for months, about 1-5 times per week across their clusters.

What did you expect to happen?

The kube-proxy code that deletes stale UDP conntrack entries should have removed the conntrack entry to the kube-dns svc VIP automatically in the first sync after the service and endpointslice caches were initialized.

How can we reproduce it (as minimally and precisely as possible)?

Unfortunately, the customer said they were unable to reproduce this issue in any environment outside of their production clusters. AKS engineers were also unable to reproduce it.

Update (2024-08-03): I'm now able to reproduce the issue using a DNS client that reuses the same src IP / port and clears DNAT conntrack entries between queries, in a k8s 1.29.7 cluster with 2,000 services. The bug is triggered reliably in this setup with kubectl rollout restart -n kube-system ds/kube-proxy. The test client code, scripts, and steps to reproduce are in this repository: https://github.com/wedaly/dns-blackhole-tester
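For illustration only, here is a minimal sketch of that kind of client, assuming a fixed UDP source port (49660) and the kube-dns svc VIP 192.168.0.10 from the conntrack output above. This is not the actual dns-blackhole-tester code, and the DNAT conntrack flushing between queries would be done separately (e.g. by a script running conntrack -D):

// repro_client.go: minimal sketch (not the actual dns-blackhole-tester code) of a
// DNS client that pins its UDP source port, so every query to the kube-dns service
// VIP reuses the same conntrack 5-tuple. The VIP and port below are illustrative.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	resolver := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{
				// Fixed local port so repeated queries share one UDP 5-tuple.
				LocalAddr: &net.UDPAddr{Port: 49660},
			}
			// Query the kube-dns service VIP directly (illustrative address).
			return d.DialContext(ctx, "udp", "192.168.0.10:53")
		},
	}

	for {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		addrs, err := resolver.LookupHost(ctx, "kubernetes.default.svc.cluster.local")
		cancel()
		fmt.Println(addrs, err)
		time.Sleep(time.Second)
	}
}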

Anything else we need to know?

The customer shared kube-proxy logs (at verbosity "3") from when the issue occurred. Unfortunately, I don't have permission to share the full logs publicly, but I think what I'm seeing in the logs gives a clue about what's happening.

  1. By 05:00:02.477433, nodes, endpointslices, and services had all synced:
I0723 05:00:02.000779       1 config.go:322] "Calling handler.OnNodeSynced()"
I0723 05:00:02.177877       1 config.go:104] "Calling handler.OnEndpointSlicesSynced()"
I0723 05:00:02.477433       1 config.go:195] "Calling handler.OnServiceSynced()"
  2. The kube-dns endpointslices were processed after OnEndpointSlicesSynced(), but before the first iptables sync:
I0723 05:00:02.760960       1 endpointslicecache.go:358] "Setting endpoints for service port name" portName="kube-system/kube-dns:dns
  3. The first iptables sync at 05:00:02.811044 shows 4658 services, far fewer than the number of services in the cluster. (We can't tell whether kube-dns is one of these services, since kube-proxy at -v=3 doesn't log which services were processed.)
I0723 05:00:02.811044       1 proxier.go:790] "Syncing iptables rules"
I0723 05:00:04.611400       1 proxier.go:1497] "Reloading service iptables data" numServices=4658 numEndpoints=3188 numFilterChains=6 numFilterRules=2919 numNATChains=9867 numNATRules=18214
I0723 05:00:05.721503       1 proxier.go:784] "SyncProxyRules complete" elapsed="3.244002731s"
  4. Between OnServiceSynced() at 05:00:02.477433 and the next sync at 05:00:05.799348, we see thousands of lines that look like service and endpointslice updates:
I0723 05:00:02.492956       1 endpointslicecache.go:358] "Setting endpoints for service port name"
...
I0723 05:00:02.810688       1 utils.go:133] "Skipping service due to cluster IP"
  5. The second sync at 05:00:05.799348 shows 9605 services:
I0723 05:00:05.799348       1 proxier.go:1497] "Reloading service iptables data" numServices=9605 numEndpoints=6244 numFilterChains=6 numFilterRules=5978 numNATChains=4978 numNATRules=14818

I believe this could explain why kube-proxy does not clean up the stale UDP conntrack entry:

  • kube-proxy detects that the endpointslice and service caches are synced using endpointSliceInformer.Informer().HasSynced and serviceInformer.Informer().HasSynced respectively.
  • The comments on the HasSynced method in the SharedInformer interface say that it indicates the informer cache is up-to-date, but warn: "Note that this doesn't tell you if an individual handler is synced!! For that, please call HasSynced on the handle returned by AddEventHandler."
  • This comment, along with the output of kube-proxy logs, makes me suspect that kube-proxy is initiating a sync before it has received and processed all the services and endpoints in the informer cache.
  • If so, this could explain why the "delete conntrack" codepath is skipped.

If this theory is correct, then I wonder if kube-proxy could change the initialization check to ensure that all pre-sync events from the informer cache are delivered before the first iptables sync, maybe like this:

diff --git a/pkg/proxy/config/config.go b/pkg/proxy/config/config.go
index 373cd0a1c64..5097ab52715 100644
--- a/pkg/proxy/config/config.go
+++ b/pkg/proxy/config/config.go
@@ -78,11 +78,10 @@ type EndpointSliceConfig struct {
 // NewEndpointSliceConfig creates a new EndpointSliceConfig.
 func NewEndpointSliceConfig(ctx context.Context, endpointSliceInformer discoveryv1informers.EndpointSliceInformer, resyncPeriod time.Duration) *EndpointSliceConfig {
 	result := &EndpointSliceConfig{
-		listerSynced: endpointSliceInformer.Informer().HasSynced,
-		logger:       klog.FromContext(ctx),
+		logger: klog.FromContext(ctx),
 	}
 
-	_, _ = endpointSliceInformer.Informer().AddEventHandlerWithResyncPeriod(
+	handlerRegistration, _ := endpointSliceInformer.Informer().AddEventHandlerWithResyncPeriod(
 		cache.ResourceEventHandlerFuncs{
 			AddFunc:    result.handleAddEndpointSlice,
 			UpdateFunc: result.handleUpdateEndpointSlice,
@@ -91,6 +90,8 @@ func NewEndpointSliceConfig(ctx context.Context, endpointSliceInformer discovery
 		resyncPeriod,
 	)
 
+	result.listerSynced = handlerRegistration.HasSynced
+
 	return result
 }
 
@@ -171,11 +172,10 @@ type ServiceConfig struct {
 // NewServiceConfig creates a new ServiceConfig.
 func NewServiceConfig(ctx context.Context, serviceInformer v1informers.ServiceInformer, resyncPeriod time.Duration) *ServiceConfig {
 	result := &ServiceConfig{
-		listerSynced: serviceInformer.Informer().HasSynced,
-		logger:       klog.FromContext(ctx),
+		logger: klog.FromContext(ctx),
 	}
 
-	_, _ = serviceInformer.Informer().AddEventHandlerWithResyncPeriod(
+	handlerRegistration, _ := serviceInformer.Informer().AddEventHandlerWithResyncPeriod(
 		cache.ResourceEventHandlerFuncs{
 			AddFunc:    result.handleAddService,
 			UpdateFunc: result.handleUpdateService,
@@ -184,6 +184,8 @@ func NewServiceConfig(ctx context.Context, serviceInformer v1informers.ServiceIn
 		resyncPeriod,
 	)
 
+	result.listerSynced = handlerRegistration.HasSynced
+
 	return result
 }

Should kube-proxy wait until all pre-sync events are delivered before setting isInitialized=true?
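For context, here is a standalone sketch of the client-go pattern the diff above relies on: AddEventHandlerWithResyncPeriod (and AddEventHandler) return a registration whose HasSynced only reports true once all pre-sync events have been delivered to the handler, whereas the informer's own HasSynced only says the underlying cache is populated. The kubeconfig loading and the Service handler below are illustrative, not kube-proxy code:

// handler_sync_sketch.go: standalone sketch of the handler-registration HasSynced
// pattern; the setup and handler here are illustrative, not kube-proxy code.
package main

import (
	"context"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/klog/v2"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		klog.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(client, 30*time.Minute)
	svcInformer := factory.Core().V1().Services().Informer()

	// AddEventHandler returns a registration; its HasSynced covers handler
	// delivery, not just population of the informer cache.
	registration, err := svcInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			svc := obj.(*v1.Service)
			klog.InfoS("Service added", "namespace", svc.Namespace, "name", svc.Name)
		},
	})
	if err != nil {
		klog.Fatal(err)
	}

	ctx := context.Background()
	factory.Start(ctx.Done())

	// Waiting on registration.HasSynced (instead of svcInformer.HasSynced)
	// guarantees every pre-existing Service has been handed to AddFunc before
	// proceeding -- the property kube-proxy needs before its first sync.
	if !cache.WaitForCacheSync(ctx.Done(), registration.HasSynced) {
		klog.Fatal("timed out waiting for the Service handler to sync")
	}
	klog.Info("all pre-sync Service events delivered")
}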

Kubernetes version

kube-proxy version is v1.28.5

I0723 05:00:01.875477       1 server.go:846] "Version info" version="v1.28.5"

Note that the kube-proxy image that AKS built for 1.28.5 includes the backported fix for the race condition in iptables partial sync: #122757

Cloud provider

Azure Kubernetes Service

OS version

Linux (I don't have the exact version but could get it if it's relevant)

Install tools

N/A

Container runtime (CRI) and version (if applicable)

N/A

Related plugins (CNI, CSI, ...) and versions (if applicable)

N/A

@wedaly added the kind/bug label Jul 30, 2024
@k8s-ci-robot added the needs-sig and needs-triage labels Jul 30, 2024
@wedaly (Member, Author) commented Jul 30, 2024

Possibly this issue would be addressed by #126130

@MikeZappa87 commented:

/sig network

@k8s-ci-robot added the sig/network label and removed the needs-sig label Jul 30, 2024
@aojea (Member) commented Jul 30, 2024

Possibly this issue would be addressed by #126130

You can work around this in the meantime with an init container flushing all UDP entries: conntrack -D -p udp

@andrewtagg-db commented:

Possibly this issue would be addressed by #126130

You can work around this in the meantime with an init container flushing all UDP entries: conntrack -D -p udp

Just to clarify, an init container on the kube-proxy pod?

@wedaly (Member, Author) commented Jul 31, 2024

Possibly this issue would be addressed by #126130

You can work around this in the meantime with an init container flushing all UDP entries: conntrack -D -p udp

If I'm reading the code correctly, when kube-dns is missing from svcPortMap, the iptables restore on the first sync won't install DNAT rules for kube-dns:

for svcName, svc := range proxier.svcPortMap {

If so, I think even with an init container flushing conntrack, there's still a period between iptables restore and the next iptables sync where a DNS client could create a conntrack entry to the svc VIP.

@aojea (Member) commented Aug 1, 2024

If so, I think even with an init container flushing conntrack, there's still a period between iptables restore and the next iptables sync where a DNS client could create a conntrack entry to the svc VIP.

I see. I don't have time to test it, but then you need to do two things:

  1. clean up the kube-proxy iptables rules to avoid having stale entries for services whose endpoints no longer exist
  2. flush the existing UDP entries directed to Services

A strawman approach to add to the existing kube-proxy manifest:

  initContainers:
  - name: cleanup
    image: kube-proxy-amd64:v1.29.6
    command:
    - /bin/sh
    - -c
    - exec kube-proxy --cleanup && conntrack -D -p udp || true

@shaneutt (Member) commented Aug 1, 2024

@wedaly joined us on the call today, and @aojea confirmed that we should move forward with the patch he proposed:

/triage accepted
/assign @wedaly

@k8s-ci-robot added the triage/accepted label and removed the needs-triage label Aug 1, 2024
@wedaly (Member, Author) commented Aug 3, 2024

I'm now able to reproduce the bug reliably using the test client and steps here: https://github.com/wedaly/dns-blackhole-tester

After applying the patch to wait for handlerRegistration.HasSynced, the issue no longer occurs.

wedaly pushed a commit to wedaly/kubernetes that referenced this issue Aug 3, 2024
Ensure kube-proxy waits for the services/endpointslices informer
caches to be synced *and* all pre-sync events delivered before
setting isInitialized=true. Otherwise, in clusters with many services,
some services may be missing from svcPortMap when kube-proxy starts
(e.g. during daemonset rollout). This can cause kube-proxy to temporarily
remove service DNAT rules and then skip cleanup of UDP conntrack entries
to a service VIP (see kubernetes#126468).

Fix it by waiting for the informer event handler to finish
delivering all pre-sync events.
wedaly pushed a commit to wedaly/kubernetes that referenced this issue Aug 14, 2024
wedaly pushed a commit to wedaly/kubernetes that referenced this issue Aug 14, 2024
wedaly pushed a commit to wedaly/kubernetes that referenced this issue Aug 14, 2024
wedaly pushed a commit to wedaly/kubernetes that referenced this issue Aug 14, 2024
hungnguyen243 pushed a commit to hungnguyen243/kubernetes that referenced this issue Aug 16, 2024
SoulPancake pushed a commit to SoulPancake/k8s that referenced this issue Sep 11, 2024
princepereira pushed a commit to princepereira/kubernetes that referenced this issue Feb 19, 2025