Drain workload clusters of service Type=Loadbalancer on teardown #3075


Closed
randomvariable opened this issue May 19, 2020 · 28 comments
Labels
  • help wanted: Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@randomvariable
Member

User Story

As a cluster operator, I delete a cluster (e.g. in AWS), but the cluster won't delete its infrastructure objects: developers had created Services of Type=LoadBalancer that consumed the cluster's resources, the cloud provider controller has already been deleted and therefore can't remove them, and I have to clean up the resources manually.

Detailed Description

An extension of kubernetes-sigs/cluster-api-provider-aws#1718: there is logic in CAPA to delete all ELBv1s matching the cluster, but as soon as users add a different load balancer provisioner, say NLBs or ALBs, we would have to add additional code to CAPA to handle it.

This could be extended in the future to other AWS resources.

We might want to think of a mechanism that, as a first step, tears down Services of Type=LoadBalancer in the workload cluster prior to destroying the cluster.
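For illustration, a drain step along these lines could run against the workload cluster before the infrastructure is deleted. This is only a minimal client-go sketch (the package and function names are made up, and waiting for cloud resources to actually disappear is elided), not existing CAPI or CAPA code:

```go
package teardown

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// drainLoadBalancerServices deletes every Service of Type=LoadBalancer in the workload
// cluster so the cloud provider controller can release the backing cloud load balancers
// while it is still running, i.e. before the cluster itself is torn down.
func drainLoadBalancerServices(ctx context.Context, c kubernetes.Interface) error {
	svcs, err := c.CoreV1().Services(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		return fmt.Errorf("listing services: %w", err)
	}
	for i := range svcs.Items {
		svc := &svcs.Items[i]
		if svc.Spec.Type != corev1.ServiceTypeLoadBalancer {
			continue
		}
		if err := c.CoreV1().Services(svc.Namespace).Delete(ctx, svc.Name, metav1.DeleteOptions{}); err != nil {
			return fmt.Errorf("deleting service %s/%s: %w", svc.Namespace, svc.Name, err)
		}
	}
	// A real implementation would also wait, with a timeout, for the cloud resources
	// to actually disappear before letting cluster deletion proceed.
	return nil
}
```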


/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label May 19, 2020
@vincepri
Member

vincepri commented May 19, 2020

This is an interesting one. @detiber @ncdc @CecileRobertMichon, thoughts on getting this into v0.3.x? It could be an optional flag on the Cluster object. I'd also discuss it at a community meeting to get more thoughts.

@detiber
Member

detiber commented May 19, 2020

@vincepri we could also introduce it as an experimental command-line flag to start with, rather than modifying the Cluster object.

@randomvariable
Member Author

randomvariable commented May 19, 2020

Feature gate seems appropriate to me. I've already put it down on the agenda for 2020/05/20.
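As a rough illustration of the feature-gate shape, using k8s.io/component-base/featuregate directly; the `ClusterResourceCleanup` gate name is purely hypothetical and does not exist in Cluster API:

```go
package feature

import (
	"k8s.io/component-base/featuregate"
)

const (
	// ClusterResourceCleanup is a hypothetical, alpha-gated feature that would drain
	// Services of Type=LoadBalancer from workload clusters before deletion.
	ClusterResourceCleanup featuregate.Feature = "ClusterResourceCleanup"
)

// MutableGates holds the experimental gates; controllers would check
// MutableGates.Enabled(ClusterResourceCleanup) before running the drain logic.
var MutableGates featuregate.MutableFeatureGate = featuregate.NewFeatureGate()

func init() {
	// Off by default while experimental.
	_ = MutableGates.Add(map[featuregate.Feature]featuregate.FeatureSpec{
		ClusterResourceCleanup: {Default: false, PreRelease: featuregate.Alpha},
	})
}
```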

@ncdc
Contributor

ncdc commented May 20, 2020

We had talked about this previously as an alternative to infra providers having to manage this. I'm OK with experimenting with it.

@vincepri
Member

/milestone Next

@k8s-ci-robot k8s-ci-robot added this to the Next milestone May 20, 2020
@CecileRobertMichon
Contributor

I looked into this a bit more and I think this doesn't apply to CAPZ. Azure has resource groups, so all the resources for the cluster are contained within a resource group, including the load balancer created by the cloud provider for exposing services. Also, the Azure vnet doesn't have any dependencies on the load balancer; only network interfaces do, so since the AzureMachines get deleted first, it's all good. When the resource group gets deleted, all the remaining resources also get deleted, including the LB and public IP associated with the service.

The only scenario where resources created out of band wouldn't be deleted is if the user brings their own resource group, in which case the resource group is "unmanaged" and CAPZ won't delete it, so LBs created by the cloud provider and any other resources manually created by the user won't get deleted. I think that's okay, though, since CAPZ can still delete all the Azure resources it created and manages.

I just tried to repro this to make sure: I was able to create a CAPZ cluster, expose a simple Service of type LoadBalancer (and verified that an LB got provisioned in my resource group), and then delete the cluster with kubectl delete cluster, and all resources were deleted without a problem.

@randomvariable
Member Author

Thanks. I figured resource groups (if only AWS had that feature :( ) would mitigate having to do this in Azure.

If this is really AWS-specific, but we also don't want CAPA to have to keep up with any number of exotic AWS LB provisioners, then perhaps a variant of @michaelgugino's machine lifecycle hooks proposal is relevant, with CAPA issuing the deletes for LoadBalancer-type Services in the workload cluster.

@voor
Member

voor commented May 26, 2020

I've been kicking this around in the old noggin, and I'm wondering if we're taking what might become a more significant issue and trying to band-aid fix it. As more external dependencies are impacted by clusters, perhaps there should be a better way to handle graceful destruction in general. I can't think of other immediate comparables, but this is something that has come up in carvel-dev/kapp-controller#7, since certain things that were dependent on the cluster are now continually trying to remediate resources that no longer exist.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 24, 2020
@vincepri
Member

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 24, 2020
@yastij
Member

yastij commented Aug 25, 2020

this issue should be generic to other first-class kubernetes objects such as PVs

@detiber
Member

detiber commented Aug 26, 2020

this issue should be generic to other first-class kubernetes objects such as PVs

I think we need to be much more conservative around PVs, especially since it could result in unexpected data loss.

@randomvariable
Member Author

💯 Intent around reclaim policies during the lifetime of a cluster and at the end of its life is tricky. We definitely should not delete PVs that were part of PVCs that don't have a Delete reclaim policy.
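To make that conservatism concrete, a cleanup pass could restrict itself to volumes the cluster would have reclaimed anyway. A minimal client-go sketch (not existing CAPI code, and the function name is invented):

```go
package cleanup

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deletablePersistentVolumes returns only the PVs a cleanup pass could safely remove:
// claimed volumes whose reclaim policy is already Delete. Anything with Retain (or
// Recycle) is left strictly alone to avoid unexpected data loss.
func deletablePersistentVolumes(ctx context.Context, c kubernetes.Interface) ([]corev1.PersistentVolume, error) {
	pvs, err := c.CoreV1().PersistentVolumes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var out []corev1.PersistentVolume
	for _, pv := range pvs.Items {
		if pv.Spec.PersistentVolumeReclaimPolicy != corev1.PersistentVolumeReclaimDelete {
			continue
		}
		// Pre-provisioned volumes that were never claimed are the operator's call.
		if pv.Spec.ClaimRef == nil {
			continue
		}
		out = append(out, pv)
	}
	return out, nil
}
```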

@maplain

maplain commented Oct 8, 2020

I think a generic solution is needed as long as people use an agent in a cluster to manage external infrastructure resources. My personal understanding is that this is not CAPI-specific; it's more of a mismatch between the cluster lifecycle and the external infrastructure resource lifecycles. The former is bound to the "Cluster" CR (or other constructs in other scenarios), whereas the latter is bound to the agent (Pod or CNI?) running in the Cluster.

The disconnect is between the "Cluster" CR's lifecycle events and the agent.

Most of the time a simple signal to the agent might be enough, like the existing mechanisms: a non-blocking preStop hook, or SIGTERM + grace period + SIGKILL.
Sometimes, though, the signal should be rich enough for the receiver to distinguish lifecycle events originating from different sources. E.g. if a stateless agent receives a SIGTERM, should it clean up all resources on the external infrastructure or do nothing? If it's a cluster shutdown, it should go with the former, but if it's initiated by a rolling update, it should probably go with the latter.

There are three high-level approaches I can think of: rich "signaling" to the in-cluster subscriber, a higher-level construct like Azure's resource group mentioned above, or a webhook-like entity sitting in the critical path for executing various hooks.
The 2nd makes sense when it's easy to wrap everything's lifecycle into one big envelope, but it breaks the moment the user goes beyond its boundary, e.g. using another resource group, or provisioning resources on another cloud.
The 1st builds an interface and requires the agent's support. The 2nd and 3rd make good managed solutions and keep the agent simple (if not just moving the logic around).

For concrete examples:

  1. could be implemented as a CAPI-level controller, adding/removing a finalizer on the Cluster and communicating with the in-cluster agent out-of-band (a rough sketch follows below);
  2. Azure's resource group, as mentioned above;
  3. I've seen a proxy-based implementation in a non-CAPI world for SDN resource cleanup (not only LBs).
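A minimal sketch of what option 1 could look like with a controller-runtime reconciler; the finalizer name and the drainExternalResources helper are invented for illustration, and nothing here is an existing CAPI API:

```go
package cleanup

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// Hypothetical finalizer owned by an external-resource cleanup controller.
const externalResourcesFinalizer = "cleanup.example.x-k8s.io/external-resources"

// Reconciler holds Cluster deletion back until external resources are drained.
type Reconciler struct {
	client.Client
}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	cluster := &clusterv1.Cluster{}
	if err := r.Get(ctx, req.NamespacedName, cluster); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	if cluster.DeletionTimestamp.IsZero() {
		// Cluster is alive: make sure deletion will be blocked until cleanup runs.
		if !controllerutil.ContainsFinalizer(cluster, externalResourcesFinalizer) {
			controllerutil.AddFinalizer(cluster, externalResourcesFinalizer)
			return ctrl.Result{}, r.Update(ctx, cluster)
		}
		return ctrl.Result{}, nil
	}

	// Cluster is being deleted: ask the in-cluster agent (out-of-band) to clean up
	// what it owns, e.g. Type=LoadBalancer Services, then release the finalizer.
	if err := drainExternalResources(ctx, cluster); err != nil {
		return ctrl.Result{}, err
	}
	controllerutil.RemoveFinalizer(cluster, externalResourcesFinalizer)
	return ctrl.Result{}, r.Update(ctx, cluster)
}

// drainExternalResources is a placeholder for the out-of-band communication with the
// in-cluster agent (e.g. deleting LoadBalancer Services as sketched earlier).
func drainExternalResources(ctx context.Context, cluster *clusterv1.Cluster) error {
	return nil
}
```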

@maplain

maplain commented Oct 21, 2020

Would something like a ClusterResourceGarbageCollector help in this situation?

This controller would simply add a finalizer on the Cluster CR when it's created, and create a provider-specific CR to request GC of any external resources for that Cluster when it's being deleted. Status would be synced from the provider-specific CR back to the Cluster, and once everything finishes successfully, the finalizer would be removed. Of course, this description over-simplifies the problem a bit, but I would love to hear your thoughts before going any further.
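Purely for illustration, such a provider-specific GC request CR might be shaped like the following hypothetical type (kubebuilder-style markers; none of these names exist in any provider today):

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// AWSClusterGarbageCollectionSpec asks the provider to clean up external resources
// (e.g. load balancers created by the in-cluster cloud provider) for one Cluster.
type AWSClusterGarbageCollectionSpec struct {
	// ClusterName is the name of the CAPI Cluster being deleted.
	ClusterName string `json:"clusterName"`
	// ResourceTypes lists which kinds of external resources to reclaim.
	// +optional
	ResourceTypes []string `json:"resourceTypes,omitempty"`
}

// AWSClusterGarbageCollectionStatus is synced back to the Cluster by the CRGC so the
// finalizer can be removed once cleanup is done.
type AWSClusterGarbageCollectionStatus struct {
	// Completed is true when all requested external resources are gone.
	// +optional
	Completed bool `json:"completed,omitempty"`
	// FailureMessage records a terminal cleanup error, if any.
	// +optional
	FailureMessage *string `json:"failureMessage,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// AWSClusterGarbageCollection is a hypothetical provider-specific GC request.
type AWSClusterGarbageCollection struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   AWSClusterGarbageCollectionSpec   `json:"spec,omitempty"`
	Status AWSClusterGarbageCollectionStatus `json:"status,omitempty"`
}
```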

@ncdc @vincepri @yastij what do you think?

@detiber
Member

detiber commented Oct 21, 2020

@maplain I think that could indeed be a potential solution. My main concern is that it seems to externalize the responsibility to users. I think it would be a fine approach for the short term; I would just like us to continue to pursue a better solution longer term.

I'm not sure we have a good way within Cluster API itself to know all the types that could exist that would need to be cleaned up on Cluster deletion (and which explicitly should not be, such as certain persistent storage).

Ideally I would love to see this be something that Kubernetes provided first-class support for: CRDs could register themselves as managing external resources along with appropriate deletion policies related to the cluster's lifecycle, there would be a way to trigger cleanup of said resources from a centralized endpoint, and a way of verifying that operations have completed.

Then it would just be a matter of us invoking the endpoint for cleanup and waiting for things to complete (with a possible timeout). We would have to make sure that this process doesn't yank the infrastructure we are running on out from underneath us, but it prevents us from putting the onus on users to explicitly configure what types of resources to clean up.
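To make the registration idea concrete, one hedged way to approximate it without new Kubernetes machinery would be a well-known annotation on the CRD itself; the annotation keys and policy values below are invented, and Kubernetes has no such convention today:

```go
package cleanup

import (
	"context"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Hypothetical convention: a CRD annotates itself as owning external resources and
// declares what should happen to them when the cluster goes away.
const (
	externalResourcesAnnotation = "cleanup.example.x-k8s.io/external-resources" // "true"
	deletionPolicyAnnotation    = "cleanup.example.x-k8s.io/deletion-policy"    // "Delete" | "Retain"
)

// crdsNeedingCleanup returns the CRDs that registered themselves for cleanup with a
// "Delete" policy; everything else (e.g. persistent storage marked "Retain") is skipped.
func crdsNeedingCleanup(ctx context.Context, c apiextensionsclient.Interface) ([]apiextensionsv1.CustomResourceDefinition, error) {
	crds, err := c.ApiextensionsV1().CustomResourceDefinitions().List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var out []apiextensionsv1.CustomResourceDefinition
	for _, crd := range crds.Items {
		ann := crd.GetAnnotations()
		if ann[externalResourcesAnnotation] == "true" && ann[deletionPolicyAnnotation] == "Delete" {
			out = append(out, crd)
		}
	}
	return out, nil
}
```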

@maplain

maplain commented Oct 21, 2020

thx for your reply @detiber !

Yes, you're right. I was thinking that's something users would have to configure at the very beginning, e.g. like the --infrastructure flag when initializing a management cluster.
I guess to make it more flexible (or complex), a ConfigMap could be added so the CRGC can discover the various providers at runtime (a rough sketch follows below).
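A rough sketch of that ConfigMap-based discovery; the ConfigMap name, namespace, and data layout are all invented for illustration:

```go
package crgc

import (
	"context"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// discoverGCProviders reads a hypothetical ConfigMap in the management cluster that
// maps provider names to the GroupVersionKind of their GC request CR, so the
// ClusterResourceGarbageCollector doesn't need to be compiled against any provider.
func discoverGCProviders(ctx context.Context, c kubernetes.Interface) (map[string]string, error) {
	cm, err := c.CoreV1().ConfigMaps("capi-system").Get(ctx, "cluster-resource-gc-providers", metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	providers := map[string]string{}
	for provider, gvk := range cm.Data {
		// e.g. "aws" -> "infrastructure.example.io/v1alpha1, Kind=AWSClusterGarbageCollection"
		providers[strings.TrimSpace(provider)] = strings.TrimSpace(gvk)
	}
	return providers, nil
}
```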

The CRD approach is very interesting. If my understanding is correct, it's similar to the ConfigMap mentioned above in that it allows the discovery of providers for a certain lifecycle event. What's better is that it's "decentralized", so CAPI controllers are agnostic of those providers, which register themselves in a "self-service" way as a "cleanup" service.
This requires the API to support semantics like expressing the relationship between objects, expressing actions/hooks/lifecycle events/triggers on that relationship, policies on the trigger, etc.
Does this understanding match what you have in mind?

Also, do you mind sharing some pointers so I could study more in this direction? Thx in advance!

@detiber
Member

detiber commented Oct 22, 2020

@maplain I wasn't necessarily thinking in terms of trying to encode the dependencies in the API so much as just a simple way to trigger deletion automatically. Dependency management, ordering, etc. can be left to the individual controllers, which could leverage finalizers and handle things as they need to.

I don't necessarily have any pointers to give, other than any such proposal would need to be created against Kubernetes itself and would likely involve SIG Cluster Lifecycle and SIG API Machinery.

@maplain

maplain commented Oct 23, 2020

gotcha. While thinking about ClusterResourceGarbageCollector, I discovered #3132, which seems a lot like what I want. Will play around with it first.

@vincepri
Member

/assign @randomvariable
to do some research

@sbueringer
Member

/cc @chrischdi Not sure if it's already implemented in CAPO, but it sounds like the same problem.

@vincepri
Member

/milestone v1.2

@yastij to start a document to outline the scope of the work

@k8s-ci-robot k8s-ci-robot modified the milestones: Next, v1.2 Feb 11, 2022
@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini fabriziopandini removed this from the v1.2 milestone Jul 29, 2022
@fabriziopandini fabriziopandini removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini
Member

/triage accepted
/unassign @randomvariable
/help

@k8s-ci-robot
Contributor

@fabriziopandini:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/triage accepted
/unassign @randomvariable
/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. labels Sep 30, 2022
@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Jan 19, 2024
@lubronzhan
Contributor

Taking a look 👀

@fabriziopandini
Member

/close

This issue has not been updated in over a year, and as discussed above there are provider-specific considerations that make it very complex to find a generalized solution unless someone invests time in this effort.

Also, as noted above there are existing options to allow each provider to handle deletion of resources/explore this problem space, e.g.:

  • using finalizers
  • using lifecycle hooks (a minimal sketch follows below)
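For the lifecycle-hook route, a Runtime SDK extension could implement a BeforeClusterDelete handler roughly like the sketch below. This assumes the Runtime SDK hook types under sigs.k8s.io/cluster-api/exp/runtime/hooks/api/v1alpha1 (exact package paths and fields may differ by version), and workloadClusterClean is a made-up placeholder:

```go
package handlers

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	runtimehooksv1 "sigs.k8s.io/cluster-api/exp/runtime/hooks/api/v1alpha1"
)

// DoBeforeClusterDelete blocks Cluster deletion until external resources (e.g.
// LoadBalancer-backed Services) have been drained from the workload cluster.
func DoBeforeClusterDelete(ctx context.Context, req *runtimehooksv1.BeforeClusterDeleteRequest, resp *runtimehooksv1.BeforeClusterDeleteResponse) {
	done, err := workloadClusterClean(ctx, &req.Cluster)
	if err != nil {
		resp.SetStatus(runtimehooksv1.ResponseStatusFailure)
		resp.SetMessage(err.Error())
		return
	}
	if !done {
		// Ask Cluster API to call the hook again until the drain has finished.
		resp.SetStatus(runtimehooksv1.ResponseStatusSuccess)
		resp.RetryAfterSeconds = 30
		return
	}
	resp.SetStatus(runtimehooksv1.ResponseStatusSuccess)
}

// workloadClusterClean is a placeholder: it would list Services of Type=LoadBalancer
// in the workload cluster and report whether they (and their cloud resources) are gone.
func workloadClusterClean(ctx context.Context, cluster *clusterv1.Cluster) (bool, error) {
	return true, nil
}
```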

We can re-open (or re-create) this issue whenever the conditions and the required community consensus to work on it exist.

@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

/close


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
