
Blocking network access to workload clusters causes reconcile queue to grow #8306


Closed
vignesh-goutham opened this issue Mar 17, 2023 · 7 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@vignesh-goutham

What steps did you take and what happened?

I have a management cluster managing 10 workload clusters. These are tiny clusters with 1 control-plane node and 1 worker node each. I blocked network traffic from the management cluster to 4 of the 10 workload clusters. All the controllers take ~30 seconds to invalidate their caches and reflect the CR status, which ranges from MHC failing to KCP unavailable. I do see in the logs that controller-runtime times out trying to create a client + cache for the remote clusters that are blocked.

E0316 21:50:09.719081       1 controller.go:326] "Reconciler error" err="error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"eksa-system/vgg-cloudstack-b\": Get \"https://10.80.180.51:6443/api?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" machine="eksa-system/vgg-cloudstack-b-nlwqk" namespace="eksa-system" name="vgg-cloudstack-b-nlwqk" reconcileID=bf9d8be2-13d8-4b55-a070-ed10249e0443

The clusters in this state are actually fine; all reconcile loops work as expected for the clusters that do have network connectivity to the management cluster.

Now, if I try to create another workload cluster, that's when things get interesting. I noticed that all controllers take a long time to create their respective CRs and also to update their status. For example, creating a new cluster takes about 5 mins in normal conditions, but with this network block in place, it takes upwards of 30 mins. It takes about 5 mins for a machine to go from pending to provisioned, even though a provider ID has been assigned after 1 min. I've observed this become worse as more clusters lose connectivity with the management cluster. The timeout for creating the client is 10 seconds, and this quickly adds up with multiple clusters having their connectivity blocked.
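
To illustrate why this adds up (a rough sketch of the pattern, assuming the `remote.ClusterCacheTracker` API rather than quoting the actual machine controller), each reconcile of a Machine that belongs to an unreachable cluster holds a worker for the full client-creation timeout:

```go
package example

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"sigs.k8s.io/cluster-api/controllers/remote"
)

// reconcileWithRemote is a hypothetical helper sketching the pattern: with
// network access to the workload cluster blocked, GetClient only returns after
// the ~10s dial timeout, so every such Machine occupies one of the (default 10)
// workers for that long on each attempt.
func reconcileWithRemote(ctx context.Context, tracker *remote.ClusterCacheTracker, clusterKey client.ObjectKey) (ctrl.Result, error) {
	remoteClient, err := tracker.GetClient(ctx, clusterKey)
	if err != nil {
		// Reached only after the dial timeout; the item is requeued with
		// backoff and will block another worker on the next attempt,
		// which is what makes the queue depth grow.
		return ctrl.Result{}, err
	}
	_ = remoteClient // ... normal reconciliation against the workload cluster ...
	return ctrl.Result{}, nil
}
```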

I have some metrics I pulled off the controllers in Grafana, attached here. Please note that the queue depth and unfinished work stay at 0 without any network blocks. Note how both of them spike in the chart below.

[Chart: Create during blocked network]

I noticed that there are concurrency flags for each controller, which all default to 10. I agree that setting this to some high number depending on the environment could help in this situation. That said, such a network block might occur in a prod environment due to incorrect network policies or other reasons, and it shouldn't cause issues for other cluster operations like create/upgrade/delete; it might also be hard to estimate the concurrency number required for an environment with a dynamic number of clusters.

I have 2 suggestions:

  1. Can we add an argument to expose and set up controller rate limiting? (See the sketch after this list.)
  2. Can we expose arguments for client config, like the timeout?

I'd prefer rate limiting (option 1) over option 2, though.
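
Something along these lines could work (a rough sketch using controller-runtime's existing `controller.Options`; the `setupMachineController` helper, the `concurrency` parameter, and the limiter values are placeholders, not existing CAPI flags):

```go
package example

import (
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// setupMachineController wires a configurable worker count and rate limiter
// into the controller; the limiter has the same shape as the client-go default,
// but with knobs that could be exposed as command-line arguments.
func setupMachineController(mgr ctrl.Manager, r reconcile.Reconciler, concurrency int) error {
	limiter := workqueue.NewMaxOfRateLimiter(
		// Per-item exponential backoff: 5ms base, capped at 1000s.
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		// Overall token bucket: 10 adds/sec with a burst of 100.
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
	return ctrl.NewControllerManagedBy(mgr).
		For(&clusterv1.Machine{}).
		WithOptions(controller.Options{
			MaxConcurrentReconciles: concurrency,
			RateLimiter:             limiter,
		}).
		Complete(r)
}
```

Note that newer controller-runtime releases moved to typed workqueues, so the exact rate-limiter constructors and option types may differ from this sketch.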

I also tried pausing (spec.paused) all the clusters that had the network blocks. It took some time for the controller to finish the jobs already queued up, but once the queue cleared, no operation spiked the queue depth again. Creating a cluster was as smooth as it was before the network block.
[Chart: Post spec.paused + create cluster]

Notice how the queue dropped to 0 and stayed there after pausing via spec.paused.
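
For reference, the pause was just a one-field patch; a minimal sketch assuming a controller-runtime client and the CAPI v1beta1 types:

```go
package example

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// pauseCluster sets spec.paused on a Cluster so the CAPI controllers stop
// reconciling it and its owned objects, keeping unreachable clusters from
// re-entering the work queues.
func pauseCluster(ctx context.Context, c client.Client, namespace, name string) error {
	cluster := &clusterv1.Cluster{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, cluster); err != nil {
		return err
	}
	patch := client.MergeFrom(cluster.DeepCopy())
	cluster.Spec.Paused = true
	return c.Patch(ctx, cluster, patch)
}
```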

What did you expect to happen?

Expose a rate limiting option that could stop the queue from growing in situations like this.

Cluster API version

1.2.0

Kubernetes version

1.23

Anything else you would like to add?

No response

Label(s) to be applied

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 17, 2023
@killianmuldoon
Contributor

This is awesome work @vignesh-goutham! There have been a number of improvements to the ClusterCacheTracker since v1.2.0. The issues fixed by those changes could account for this problem.

Would you be able to test this with a more up-to-date version of CAPI?

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 17, 2023
@vignesh-goutham
Author

Thanks. Sure, I can test with CAPI 1.3.4. I'll post my findings once I have something.

@sbueringer
Member

sbueringer commented Mar 20, 2023

@vignesh-goutham Thx for the extensive analysis. Specifically, this change (#7537), which was also backported to v1.2.x, might have already resolved this issue or at least improved the behavior.

I would recommend v1.3.5 instead of v1.3.4. There were 2 more fixes to the ClusterCacheTracker there (https://github.com/kubernetes-sigs/cluster-api/releases/tag/v1.3.5); they shouldn't impact this behavior, but are definitely nice to have.

@vignesh-goutham
Author

vignesh-goutham commented Apr 4, 2023

Sorry for the late reply. I upgraded CAPI to 1.3.5, and it's so much better than 1.2.0. I ran some more tests with this setup to try to simulate a more stressed situation. The machine controller queue grew a bit (still not as bad as with 1.2.0) when I ran 2 workload cluster creates and 1 workload cluster upgrade in parallel. I did observe some delay, on the order of 2 extra minutes, versus what it took when running without blocking network access to 4 of the 11 workload clusters.

[Chart: machine controller queue depth during the parallel create/upgrade test]

Please ignore the dashboard name pointing to 1.3.4. The CAPI version was verified to be 1.3.5.

I let the clusters sit over the weekend and saw something very interesting. This setup has 1 management cluster and 13 workload clusters, with network access from the management cluster to 4 of them blocked. The machine controller queue on occasion shot up to a depth of 45. I had verified that no new machines were rolled out, and the environment was pretty stable as well. Any idea why this might have occurred? I still think implementing a controller rate limiter would be beneficial, especially since a situation like this could get aggravated by the much larger number of clusters a production environment would run, say upwards of 50 workload clusters. Please take a look at the chart below.

[Chart: machine controller queue depth over the weekend]

I'd love to hear any other suggestions to improve this condition as well. I can take a shot at implementing it.

@sbueringer
Member

Maybe it's just because all Machines are regularly reconciled, either because of the syncPeriod or because they are hitting a code path which requeues. E.g. there is one when we are not able to create a client to access the workload cluster:

if errors.Is(err, remote.ErrClusterLocked) {
    log.V(5).Info("Requeuing because another worker has the lock on the ClusterCacheTracker")
    return ctrl.Result{Requeue: true}, nil
}
return res, err

I wonder what logs you are seeing at the time of the spikes. I also wonder if there are a lot more spikes and they are just not visible in monitoring because the periodic metric scrapes don't detect spikes which happen in between the scrapes.
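
For context, a sketch of where that resync interval is configured, assuming manager-level options as in controller-runtime releases contemporary with CAPI v1.3 (the value shown is illustrative, not necessarily the CAPI default):

```go
package example

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// newManager shows the SyncPeriod knob: every watched object is re-queued at
// least once per SyncPeriod, so a large Machine count alone produces periodic
// queue-depth spikes, and unreachable clusters amplify them because each
// resynced Machine blocks a worker on the remote-client timeout.
func newManager() (ctrl.Manager, error) {
	syncPeriod := 10 * time.Minute // illustrative value
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		SyncPeriod: &syncPeriod,
	})
}
```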

@fabriziopandini
Member

/close
The versions discussed in this issue are now out of support.
It would be great to repeat the same test with new releases (great work btw) and eventually report back in a new issue.

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

/close
The versions discussed in this issue are now out of support.
It would be great to repeat the same test with new releases (great work btw) and eventually report back in a new issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
