CAPI waiting forever for the volume to be detached #6285
I think this happens because we ignore all DaemonSets during drain: https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/machine/machine_controller.go#L532 I don't think we can just change the hard-coded value to false; maybe we should make it configurable.
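For reference, the relevant knob lives in the `drain.Helper` from `k8s.io/kubectl/pkg/drain` that the machine controller builds before draining. Below is a minimal sketch of how the hard-coded value could be surfaced as an option; the `ignoreDaemonSets` parameter is hypothetical, not an existing CAPI field.

```go
package drainexample

import (
	"context"
	"os"

	"k8s.io/client-go/kubernetes"
	kubedrain "k8s.io/kubectl/pkg/drain"
)

// buildDrainer sketches how the DaemonSet behaviour could become configurable
// instead of the hard-coded `IgnoreAllDaemonSets: true` used today. The
// ignoreDaemonSets parameter would come from a new (hypothetical) API field.
func buildDrainer(ctx context.Context, client kubernetes.Interface, ignoreDaemonSets bool) *kubedrain.Helper {
	return &kubedrain.Helper{
		Ctx:                 ctx,
		Client:              client,
		Force:               true,             // evict pods not managed by a controller
		IgnoreAllDaemonSets: ignoreDaemonSets, // currently hard-coded to true
		DeleteEmptyDirData:  true,
		GracePeriodSeconds:  -1, // use each pod's own termination grace period
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}
}
```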
This is intrinsic to DaemonSets: even if the property were false, drain would just be blocked; it wouldn't gracefully terminate the pod, since it would be automatically recreated by the DaemonSet.
What about making it configurable whether we wait for all the volumes to be detached or not? Would it be problematic to delete a node where a DaemonSet still has a volume attached? Once the node is deleted, the pod will be deleted and the PVC too. Would this be too risky?
I think this probably depends on the infrastructure provider. I saw some weird issues in cases like this with OpenStack, but that might be due to that specific OpenStack environment. In the OpenStack case the CSI pod running on that node would be responsible for unmounting etc. (I don't remember the details). I'm not sure how safe this is in general and how gracefully this will be handled by CSI and the infrastructure if a node/server is just deleted with attached volumes.
I think this old issue kubernetes/kubernetes#54368 (comment) could be relevant here. It is about StatefulSets, but a DaemonSet with volumes comes quite close. From a safety perspective I guess the best thing would be to shut down the node, but then we are already in infrastructure provider territory.
The draining of DaemonSets has apparently been discussed upstream in kubernetes/kubernetes#75482 for a long time, but it has always been postponed (it was first milestoned to v1.16) up until now.
You can use `nodeDrainTimeout`. As for what cluster autoscaler does, we could consider doing something similar, i.e. taint deleting Nodes and evict DS pods manually. I'd question how strong the use case is for supporting this opinionatedly vs enabling a consumer to do it as in #6285 (comment). To get more context, can you please elaborate on your use case for running a DS with a volume?
We have the node drain timeout, but CAPI is still blocking on the volumes even after the timeout has expired. Maybe that could be modified so that we don't wait for the volumes to be detached after the timeout has expired? Regarding the use case, we don't have much info. It is a customer using those and they did not give us any specific information. Extending the timeout to the volumes would be an acceptable solution for us.
@maelk By looking at the code I'd be surprised if that's the case, can you validate/confirm and share logs?
/milestone v1.2
/priority awaiting-more-evidence
/milestone Next
Sorry for the delay, I was on PTO. I verified the inputs we got, and actually the node drain timeout was not set. So you are right about the current behaviour.
What we were thinking could be doable is to introduce a new timeout, i.e. a `volumeDetachTimeout`, after which we stop waiting for the volumes to be detached.
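As a rough illustration of what such a field could look like on the Machine API, here is a trimmed-down sketch; the type and field names are only a proposal for discussion, not the actual upstream API.

```go
package v1example

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// MachineDeletionTimeouts is a trimmed-down illustration; the real MachineSpec
// has many more fields and the new field name below is only a proposal.
type MachineDeletionTimeouts struct {
	// NodeDrainTimeout bounds the total time spent draining the node.
	// A nil value means "wait forever", which is today's behaviour.
	NodeDrainTimeout *metav1.Duration `json:"nodeDrainTimeout,omitempty"`

	// VolumeDetachTimeout (proposed) bounds how long the controller waits
	// for all volumes to be detached after a successful drain.
	// A nil value keeps the current behaviour of waiting indefinitely.
	VolumeDetachTimeout *metav1.Duration `json:"volumeDetachTimeout,omitempty"`
}
```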
Can you elaborate on this so we can understand the full picture?
@enxebre great, thanks! I will try to put a draft PR up soon.
Here is our use case with Rook:
Sorry for getting to this issue late. Also, a couple of comments about API changes if we go down this path:
@fabriziopandini correct.
I will look into adding it to #6413, which is in flight now.
The first part is covered I believe; would you mind elaborating more on the contract update part?
👍
/assign
I agree with @enxebre that it seems we are implementing a knob for a use case that can be solved in other ways.
Today we stick to drain's default behaviour, and that prevents DaemonSets from being evicted. I created #6421 to document the current state of things. Now we should discuss whether we want to implement DaemonSet eviction (#6158). If so, we need to think about how to handle the transition, as this would be a breaking change to existing API behaviour. Alternatively, if there are arguments against supporting DaemonSet eviction in core, this could be done via a hook implementation.
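If the hook route were chosen, a consumer could hold the Machine with a pre-drain deletion hook annotation and run its own DaemonSet-aware drain before releasing it. A rough sketch, assuming the `pre-drain.delete.hook.machine.cluster.x-k8s.io` annotation prefix from the machine deletion phase hooks; the hook name and owner value are made up for illustration.

```go
package hookexample

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// The annotation key follows the machine deletion phase hooks convention;
// the "drain-daemonsets" hook name is illustrative only.
const preDrainHook = "pre-drain.delete.hook.machine.cluster.x-k8s.io/drain-daemonsets"

// holdMachine adds the hook so CAPI pauses before its own drain, giving an
// external controller time to evict DaemonSet pods (and wait for volumes).
func holdMachine(ctx context.Context, c client.Client, m *clusterv1.Machine) error {
	patch := client.MergeFrom(m.DeepCopy())
	if m.Annotations == nil {
		m.Annotations = map[string]string{}
	}
	m.Annotations[preDrainHook] = "example-drain-controller"
	return c.Patch(ctx, m, patch)
}

// releaseMachine removes the hook once the custom drain has completed,
// letting the Machine deletion proceed.
func releaseMachine(ctx context.Context, c client.Client, m *clusterv1.Machine) error {
	patch := client.MergeFrom(m.DeepCopy())
	delete(m.Annotations, preDrainHook)
	return c.Patch(ctx, m, patch)
}
```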
@fabriziopandini @sbueringer @vincepri @CecileRobertMichon I would appreciate your opinions on how to move forward with this issue.
@enxebre thanks, that is helpful!
I am not sure I got the point here: are you referring to introducing a new hook or making use of the existing ones? If the latter, from my understanding,
@enxebre is there anything missing in this issue, or can we maybe remove this label?
How would an implementation that drains DaemonSets look like?

Regarding the variant with drain: is it something like our machine controller setting a taint and then deleting all DaemonSet pods (the taint being there so the DaemonSet pods are not rescheduled)? Would we just evict all DaemonSet pods?

As far as I'm aware, Pod deletion has a dependency on CNI working. Afaik if CNI is gone (i.e. the Calico pod has been drained) the kubelet will fail to delete Pods with an error like "Failed to stop Pod sandbox" (IIRC). I think we would also have a problem if the CSI Pod is drained too early. I'm not sure, but it sounds like we need a mechanism to decide either a) which DaemonSet pods get drained or b) in which order all DaemonSet pods are drained.

I didn't have the time to read through the upstream issues, but given that they haven't been implemented yet, is anyone aware of concerns that are showstoppers for CAPI? If there is no showstopper upstream, should we consider implementing it there instead and then using it via the drain helper that we already use today?

In any case, if we start evicting DaemonSet pods we have to think about how we handle the change in behavior; the only option I see is to add a new field to enable it and disable it by default.
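If we went the taint-plus-manual-eviction route, the controller-side logic could look roughly like the sketch below. The taint key is made up (I believe cluster-autoscaler uses its own `ToBeDeletedByClusterAutoscaler` taint for a similar purpose), and this deliberately ignores the CNI/CSI ordering problem raised above.

```go
package dsdrain

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deletingTaint is a made-up taint key for this sketch. DaemonSet pods do not
// tolerate arbitrary taints, so this prevents them from being recreated.
var deletingTaint = corev1.Taint{
	Key:    "example.cluster.x-k8s.io/deleting",
	Effect: corev1.TaintEffectNoSchedule,
}

// evictDaemonSetPods taints the node so evicted DaemonSet pods are not
// recreated on it, then evicts every DaemonSet-owned pod still running there.
func evictDaemonSetPods(ctx context.Context, cs kubernetes.Interface, nodeName string) error {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Taints = append(node.Spec.Taints, deletingTaint)
	if _, err := cs.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		return err
	}

	pods, err := cs.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		owner := metav1.GetControllerOf(pod)
		if owner == nil || owner.Kind != "DaemonSet" {
			continue // regular pods are already handled by the normal drain
		}
		// Eviction (rather than plain deletion) still respects PDBs, if any.
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		if err := cs.CoreV1().Pods(pod.Namespace).EvictV1(ctx, eviction); err != nil {
			return err
		}
	}
	return nil
}
```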
@enxebre @sbueringer we have checked this again and confirmed that this problem still exists in a case where we have a Deployment pod with a volume attached to it and that pod is NOT owned by a DaemonSet, so the earlier conclusion that this is DaemonSet-specific does not hold. In short, it is not only a DaemonSet-specific issue but rather a wider one where CAPI waits for volumes to be detached when the volume is attached to any resource (it could be a DS or a Deployment pod). Here I am attaching more logs from the cluster where we have seen CAPI blocking the machine deletion (machine stuck in "Deleting" state forever) and hope this helps to better understand the problem.

---

We have a cluster where Ceph is used as the storage CSI. Ceph runs on the master nodes (3), and a container mounting a network-file PVC in RWX mode runs on a worker using Ceph.
Below you can see the YAML definitions of the resources involved in the test case, together with the observed state:
- Pod YAML
- PersistentVolume YAML
- PVC YAML attached to the pod
- Machine stuck in Deleting state
- Corresponding Node
- CAPI logs while trying to delete the machine
The logs keep repeating the same error. However, if the node is deleted manually, the Node object is gone, CAPI no longer checks for the volume, and it proceeds with deletion of the machine as expected:
Since this is not only specific to DS (up until now we were thinking that was the case) but is a wider, more generic problem (other resources with volumes, not only DS), how do you feel about the idea of introducing a volumeDetachTimeout, which we proposed initially and have been discussing in #6285 (comment)?
Thanks for sharing the context above @furkatgofurov7
In the case of a Pod, this is happening because the pod is legitimately evicted but nothing is handling volume detachment, correct? This seems safe, and I'd expect you to handle the retirement of that volume via your cloud provider/KCM/storage automation/manually, as enabled by #4945. Could you clarify why that is not happening? Where is the DaemonSet pod with the attached volume coming from, and why is Rook not taking care of detaching it? I'm trying to picture how much of this is specific to your particular setup vs a generic need for CAPI to enable a disruptive volume timeout beyond NodeDrainTimeout.
The attachment comes from the Pod object, as in k8s volume attachment is controlled by the Pod lifecycle.
So, detaching can only happen if the pod gets deleted. The pod might not get deleted if a PDB prevents it or if it is a pod with DaemonSet tolerations. There should be no other case as long as the drain taint is set correctly and nothing deletes it; that ensures drained pods cannot be scheduled back to the node and no new pods can be scheduled there. Pods with PDBs do not necessarily have PVCs, but generally only stateful applications need PDBs.

The need to wait for volumes to detach is really specific to vSphere and not applicable to other storage backends. I would assume this volume wait to be an optional feature, as it changes the default behaviour of k8s drain. For vSphere it makes sense to delay machine deletion until all volumes are detached, as otherwise the volumes would get deleted. However, does this belong in CAPI, or should a vSphere-specific controller or its CSI add its own finalizer to the Machine object and remove it only after all volumes are detached? That would equally block CAPI from deleting the machine until it is ready for deletion, and it would keep the CAPI code more generic, without vendor-specific logic. But at least there should be a possibility to disable this volume wait, as it prevents some k8s functionality that is otherwise supported.
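For context, the wait introduced by #4945 essentially boils down to checking whether the Node still reports attached volumes. A simplified sketch of that check is below; it is not the upstream code, and CAPI's actual implementation may differ in details.

```go
package volumewait

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// nodeHasAttachedVolumes approximates the check behind "waiting for volumes to
// be detached": as long as the Node status or any VolumeAttachment object
// still references the node, machine deletion keeps waiting.
func nodeHasAttachedVolumes(ctx context.Context, cs kubernetes.Interface, nodeName string) (bool, error) {
	node, err := cs.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	if len(node.Status.VolumesAttached) > 0 {
		return true, nil
	}

	attachments, err := cs.StorageV1().VolumeAttachments().List(ctx, metav1.ListOptions{})
	if err != nil {
		return false, err
	}
	for _, va := range attachments.Items {
		if va.Spec.NodeName == nodeName {
			return true, nil
		}
	}
	return false, nil
}
```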
@mape90 IMO the drain wait is not specific to vSphere.
@MaxRink But yes, it is nice if all volumes are detached before stopping the node. Stopping the node before detaching the volume should not be an issue, though: it would be just the same as a power loss, and volumes should not get corrupted by that, otherwise you have consistency issues in your filesystem or database anyway. And as long as the Machine is shut down before the Node object is deleted, the volumes will not be used by multiple nodes at the same time, since the VolumeAttachments are still there and they will block multi-attach if the volume is RWO. So I would still say an option to disable the volume detach wait would be beneficial to enable DaemonSets with volumes. Or you would have a specific timeout for the volume wait, as NodeDrainTimeout is not something that should be used in healthy production clusters: it will break applications running on k8s because it ignores PDBs.
@MaxRink I think there is still a valid case, as explained above by @mape90, where there would be a need for some kind of timeout when volume detachment is impossible. That concern was raised by @vincepri originally in #4945, and I understand it was stated that without setting nodeDrainTimeout things might be pending until manual intervention. But given that there are use cases where giving up waiting for volume detachment after a certain timeout is desirable, why not add it as an optional possibility?
Thanks for the detailed elaboration @mape90.
I don't think the drain process is changed in any way by CAPI; we rely on "k8s.io/kubectl/pkg/drain". I'd say WaitForNodeVolumes is an opinionated phase of the CAPI Machine deletion lifecycle, which is coupled to our drainTimeout API semantics.
I think this is a valid point.
So can you confirm 5, 6, 7 and 8 are not happening in your case because of two different scenarios:

I wouldn't be opposed to relaxing the assumption about cloud providers and introducing a volumeDetachTimeout decoupled from drainTimeout (particularly for A, if you can confirm the scenario @mape90), which defaults to infinite so it keeps backward compatibility and is defensive by default. I'd like to hear the POV from other providers as well, cc @CecileRobertMichon @richardcase
A) Yes, as long as there is a Pod on the node, all attachments are kept there. This is the k8s way to protect volumes from being used by multiple nodes at once. In case the node cannot communicate with the API, the volumes stay attached to the node forever. In case of failure it is assumed that something checks that the node is actually down before removing the Node object.

B) The logic is the same as in A, however here the pod is just never planned to be moved out. However, also in this case the volume has to be RWX, so there is no risk in multi-attaching, as that is supported.
@enxebre hi! Is there anything missing or that needs to be clarified to move forward with this issue?
@sbueringer @enxebre Can we triage this issue and agree on a solution? It would be nice to move forward with this issue since we are facing it in one of our deployments.
Apologies for the delay, I'm ok to add the field as in #6285 (comment) cc @sbueringer @fabriziopandini
Sounds fine to me as well. We only have to discuss and document how the new timeout works together with nodeDrainTimeout.
Frankly speaking, I'm still confused by the fact that both the existing nodeDrainTimeout and the proposed volumeDrainTimeout are going to delete the node with attached volumes, so I'm not really sure this will actually solve the issue, given that both settings will be the same on all the machines, which was the original objection to using nodeDrainTimeout. That said, my considerations about adding a new field are the same as already expressed in #6285 (comment); I'm mostly concerned about usability (making it clear to users what the knobs are for) and API design (having a well-organized API surface, or at least an idea of how to get there in the next API release).
I see the point in not wanting to break the current node drain + PDB behavior while being able to have an independent timeout only for the volume detach. But as I said, we have to find a good way for both timeouts to work together. One idea:
This would give us 3 subsequent "actions" with their corresponding timeouts: node drain, wait for volume detach, node delete. The current state is not great, we just introduced
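To make the sequencing concrete, here is a minimal sketch of how the three phases and their timeouts could be chained; the helpers, callbacks, and timestamps are illustrative stand-ins for the real controller state, not the actual reconcile code.

```go
package deletionflow

import (
	"errors"
	"time"
)

// phaseTimedOut treats a nil timeout as "wait forever", matching the
// semantics of nodeDrainTimeout today.
func phaseTimedOut(timeout *time.Duration, startedAt time.Time) bool {
	if timeout == nil {
		return false
	}
	return time.Since(startedAt) > *timeout
}

// deletionFlow sketches the three sequential phases, each bounded by its own
// independent timeout.
func deletionFlow(
	drainDone, volumesDetached func() bool,
	drainStarted, drainFinished time.Time,
	nodeDrainTimeout, volumeDetachTimeout *time.Duration,
) error {
	// 1. Node drain, bounded by nodeDrainTimeout.
	if !drainDone() && !phaseTimedOut(nodeDrainTimeout, drainStarted) {
		return errors.New("requeue: drain in progress")
	}
	// 2. Wait for volumes to detach, bounded by the proposed volumeDetachTimeout.
	if !volumesDetached() && !phaseTimedOut(volumeDetachTimeout, drainFinished) {
		return errors.New("requeue: waiting for volumes to be detached")
	}
	// 3. Delete the infrastructure and then the Node object
	//    (which in turn could be bounded by its own timeout).
	return nil
}
```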
@fabriziopandini I thought we were clear about why we were introducing the new timeout in addition to nodeDrainTimeout. As @sbueringer pointed out, waiting for volume detachment indefinitely and assuming that works for everyone is not a good idea, and for that reason we came up with the new volumeDetachTimeout.
I have removed the hold on #6413, thanks for driving this effort.
What steps did you take and what happened:
We have a use case where we are running a DaemonSet that mounts a volume. When draining the node, CAPI does not touch the DaemonSets. Due to #4945, CAPI waits for the volumes to be detached; that is normally not an issue, since all pods are deleted when draining. But since the volumes are attached to the DaemonSet pods, they are never unmounted because the pods keep running, which results in CAPI waiting forever for the volumes to be detached.
What did you expect to happen:
Draining DS and detaching the volume successfully without deadlock
Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]
Environment:
- Kubernetes version (use `kubectl version`): 1.23.3
- OS (e.g. from `/etc/os-release`): SLES 15SP2

/kind bug
[One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels]