
Blocking network access to workload clusters causes reconcile queue to grow #8306


Closed
vignesh-goutham opened this issue Mar 17, 2023 · 7 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@vignesh-goutham

What steps did you take and what happened?

I have a management cluster managing 10 workload clusters. These are tiny clusters with 1 control-plane node and 1 worker node each. I blocked network traffic from the management cluster to 4 of the 10 workload clusters. All the controllers take ~30 seconds to invalidate their caches and reflect the CR status, which ranges from MHC failing to KCP unavailable. I do see in the logs that controller-runtime times out trying to create a client + cache for the remote clusters that are blocked.

E0316 21:50:09.719081       1 controller.go:326] "Reconciler error" err="error creating client and cache for remote cluster: error creating dynamic rest mapper for remote cluster \"eksa-system/vgg-cloudstack-b\": Get \"https://10.80.180.51:6443/api?timeout=10s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" machine="eksa-system/vgg-cloudstack-b-nlwqk" namespace="eksa-system" name="vgg-cloudstack-b-nlwqk" reconcileID=bf9d8be2-13d8-4b55-a070-ed10249e0443

The clusters in this state are actually fine; all reconcile loops work as expected for the clusters that do have network connectivity to the management cluster.

Now, if I try to create another workload cluster, that's when things get interesting. I noticed that all controllers take a long time to create their respective CRs and also to update their status. For example, creating a new cluster takes about 5 mins in normal conditions, but with this network block in place, it takes upwards of 30 mins. It takes about 5 mins for a machine to go from pending to provisioned, even though a provider ID has been assigned after 1 min. I've observed this become worse as more clusters lose connectivity with the management cluster. The timeout for creating the client is 10 seconds, and this quickly adds up with multiple clusters having their connectivity blocked.
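
To illustrate why this adds up (a rough sketch of the pattern, assuming the `remote.ClusterCacheTracker` API rather than quoting the actual machine controller), each reconcile of a Machine that belongs to an unreachable cluster holds a worker for the full client-creation timeout:

```go
package example

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	"sigs.k8s.io/cluster-api/controllers/remote"
)

// reconcileWithRemote is a hypothetical helper sketching the pattern: with
// network access to the workload cluster blocked, GetClient only returns after
// the ~10s dial timeout, so every such Machine occupies one of the (default 10)
// workers for that long on each attempt.
func reconcileWithRemote(ctx context.Context, tracker *remote.ClusterCacheTracker, clusterKey client.ObjectKey) (ctrl.Result, error) {
	remoteClient, err := tracker.GetClient(ctx, clusterKey)
	if err != nil {
		// Reached only after the dial timeout; the item is requeued with
		// backoff and will block another worker on the next attempt,
		// which is what makes the queue depth grow.
		return ctrl.Result{}, err
	}
	_ = remoteClient // ... normal reconciliation against the workload cluster ...
	return ctrl.Result{}, nil
}
```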

I have some metrics I pulled off the controllers in Grafana, attached here. Please note that the queue depth and unfinished work stay at 0 without any network blocks. Note how both of them spike in the chart below.

[Chart: Create during blocked network]

I noticed that there are concurrency flags for each controller, which all default to 10. I agree that setting this to some high number depending on the environment could help in this situation. That said, such a network block might occur in a prod environment due to incorrect network policies or other reasons, and it shouldn't cause issues for other cluster operations like create/upgrade/delete; it might also be hard to estimate the concurrency number required for an environment with a dynamic number of clusters.

I have 2 suggestions:

  1. Can we add an argument to expose and set up controller rate limiting? (See the sketch after this list.)
  2. Can we expose arguments for client config, like the timeout?

I'd prefer rate limiting (option 1) over option 2, though.
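
Something along these lines could work (a rough sketch using controller-runtime's existing `controller.Options`; the `setupMachineController` helper, the `concurrency` parameter, and the limiter values are placeholders, not existing CAPI flags):

```go
package example

import (
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// setupMachineController wires a configurable worker count and rate limiter
// into the controller; the limiter has the same shape as the client-go default,
// but with knobs that could be exposed as command-line arguments.
func setupMachineController(mgr ctrl.Manager, r reconcile.Reconciler, concurrency int) error {
	limiter := workqueue.NewMaxOfRateLimiter(
		// Per-item exponential backoff: 5ms base, capped at 1000s.
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		// Overall token bucket: 10 adds/sec with a burst of 100.
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
	return ctrl.NewControllerManagedBy(mgr).
		For(&clusterv1.Machine{}).
		WithOptions(controller.Options{
			MaxConcurrentReconciles: concurrency,
			RateLimiter:             limiter,
		}).
		Complete(r)
}
```

Note that newer controller-runtime releases moved to typed workqueues, so the exact rate-limiter constructors and option types may differ from this sketch.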

I also tried pausing (spec.paused) all the clusters that had the network blocks. It took some time for the controller to finish the jobs already queued up, but once the queue cleared, no operation spiked the queue depth again. Creating a cluster was as smooth as it was before the network block.
[Chart: Post spec.paused + create cluster]

Notice how the queue dropped to 0 and stayed there after pausing via spec.paused.
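
For reference, the pause was just a one-field patch; a minimal sketch assuming a controller-runtime client and the CAPI v1beta1 types:

```go
package example

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// pauseCluster sets spec.paused on a Cluster so the CAPI controllers stop
// reconciling it and its owned objects, keeping unreachable clusters from
// re-entering the work queues.
func pauseCluster(ctx context.Context, c client.Client, namespace, name string) error {
	cluster := &clusterv1.Cluster{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, cluster); err != nil {
		return err
	}
	patch := client.MergeFrom(cluster.DeepCopy())
	cluster.Spec.Paused = true
	return c.Patch(ctx, cluster, patch)
}
```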

What did you expect to happen?

Expose a rate limiting option that could stop the queue from growing in situations like this.

Cluster API version

1.2.0

Kubernetes version

1.23

Anything else you would like to add?

No response

Label(s) to be applied

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 17, 2023
@killianmuldoon
Contributor

This is awesome work @vignesh-goutham! There have been a number of improvements to the ClusterCacheTracker since v1.2.0. The issues fixed by those changes could account for this problem.

Would you be able to test this with a more up-to-date version of CAPI?

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 17, 2023
@vignesh-goutham
Author

Thanks. Sure, I can test with CAPI 1.3.4. I'll post my findings once I have something.

@sbueringer
Member

sbueringer commented Mar 20, 2023

@vignesh-goutham Thx for the extensive analysis. Specifically, this change (#7537), which was also backported to v1.2.x, might have already resolved this issue or at least improved the behavior.

I would recommend v1.3.5 instead of v1.3.4. There were 2 more fixes to the ClusterCacheTracker there (https://github.com/kubernetes-sigs/cluster-api/releases/tag/v1.3.5); they shouldn't impact this behavior, but are definitely nice to have.

@vignesh-goutham
Author

vignesh-goutham commented Apr 4, 2023

Sorry for the late reply. I upgraded CAPI to 1.3.5, and it's so much better than 1.2.0. I ran some more tests with this setup to try to simulate a more stressed situation. The machine controller queue grew a bit (still not as bad as with 1.2.0) when I ran 2 workload cluster creates and 1 workload cluster upgrade in parallel. I did observe some delay, on the order of 2 extra minutes, versus what it took when running without blocking network access to 4 of the 11 workload clusters.

[Chart: machine controller queue depth during the parallel create/upgrade test]

Please ignore the dashboard name pointing to 1.3.4. The CAPI version was verified to be 1.3.5.

I let the clusters sit over the weekend and saw something very interesting. This setup has 1 management cluster and 13 workload clusters, with network access from the management cluster to 4 of them blocked. The machine controller queue on occasion shot up to a depth of 45. I had verified that no new machines were rolled out, and the environment was pretty stable as well. Any idea why this might have occurred? I still think implementing a controller rate limiter would be beneficial, especially since a situation like this could get aggravated by the much larger number of clusters a production environment would run, say upwards of 50 workload clusters. Please take a look at the chart below.

[Chart: machine controller queue depth over the weekend]

I'd love to hear any other suggestions to improve this condition as well. I can take a shot at implementing it.

@sbueringer
Member

Maybe it's just because all Machines are regularly reconciled, either because of the syncPeriod or because they are hitting a code path which requeues. E.g. there is one when we are not able to create a client to access the workload cluster:

if errors.Is(err, remote.ErrClusterLocked) {
    log.V(5).Info("Requeuing because another worker has the lock on the ClusterCacheTracker")
    return ctrl.Result{Requeue: true}, nil
}
return res, err

I wonder what logs you are seeing at the time of the spikes. I also wonder if there are a lot more spikes and they are just not visible in monitoring because the periodic metric scrapes don't detect spikes which happen in between the scrapes.
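
For context, a sketch of where that resync interval is configured, assuming manager-level options as in controller-runtime releases contemporary with CAPI v1.3 (the value shown is illustrative, not necessarily the CAPI default):

```go
package example

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// newManager shows the SyncPeriod knob: every watched object is re-queued at
// least once per SyncPeriod, so a large Machine count alone produces periodic
// queue-depth spikes, and unreachable clusters amplify them because each
// resynced Machine blocks a worker on the remote-client timeout.
func newManager() (ctrl.Manager, error) {
	syncPeriod := 10 * time.Minute // illustrative value
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		SyncPeriod: &syncPeriod,
	})
}
```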

@fabriziopandini
Member

/close
The versions discussed in this issue are now out of support.
It would be great to repeat the same test with new releases (great work btw) and eventually report back in a new issue.

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

/close
The versions discussed in this issue are now out of support.
It would be great to repeat the same test with new releases (great work btw) and eventually report back in a new issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
