
CAPI waiting forever for the volume to be detached #6285


Closed
furkatgofurov7 opened this issue Mar 10, 2022 · 53 comments · Fixed by #6413
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/active Indicates that an issue or PR is actively being worked on by a contributor.

Comments

@furkatgofurov7
Member

furkatgofurov7 commented Mar 10, 2022

What steps did you take and what happened:

We have a use case where we are running a DaemonSet that mounts a volume. When draining the node, CAPI does not touch the DaemonSets. Due to #4945, CAPI waits for the volumes to be detached, and that is normally not an issue since all pods are deleted during draining. But since the volumes are attached to DaemonSet pods, they are never unmounted because those pods keep running, which results in CAPI waiting forever for the volumes to be detached.

What did you expect to happen:
Draining the DaemonSet and detaching the volume successfully, without a deadlock.

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

  • Cluster-api version: v0.4.4
  • Minikube/KIND version: NA
  • Kubernetes version: (use kubectl version): 1.23.3
  • OS (e.g. from /etc/os-release): SLES 15SP2

/kind bug
[One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels]

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 10, 2022

@sbueringer
Member

I think this happens because we drain with ignore-all-daemonsets: https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/machine/machine_controller.go#L532

I don't think we can just change the hard-coded value to false; maybe we should make it configurable.
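For context, CAPI drains nodes via the upstream drain helper from k8s.io/kubectl/pkg/drain. The sketch below is a simplified approximation (not the actual CAPI code) of how that helper is configured; the hard-coded value in question is the IgnoreAllDaemonSets field:

package main

import (
	"context"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/kubectl/pkg/drain"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}

	helper := &drain.Helper{
		Ctx:                 context.TODO(),
		Client:              kubernetes.NewForConfigOrDie(cfg),
		Force:               true,
		IgnoreAllDaemonSets: true, // the hard-coded value: DaemonSet pods are skipped entirely
		DeleteEmptyDirData:  true,
		GracePeriodSeconds:  -1,
		Timeout:             20 * time.Second,
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}

	// Evict/delete all remaining pods on the node, honoring the settings above.
	if err := drain.RunNodeDrain(helper, "my-node"); err != nil {
		panic(err)
	}
}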

@enxebre
Member

enxebre commented Mar 10, 2022

This is intrinsic to DaemonSets: even if the property were false, the drain would just be blocked, and it wouldn't gracefully terminate the pod since it would be automatically recreated by the DaemonSet.
To fully remove the Pod from that Node you could leverage labels/NodeSelectors, so #6255 might come in handy for this scenario.

@maelk
Contributor

maelk commented Mar 10, 2022

What about making it configurable whether we wait for all the volumes to be detached or not? Would it be problematic to delete a node where a DaemonSet still has a volume attached? Once the node is deleted, the pod will be deleted and the PVC too. Would this be too risky?

@sbueringer
Member

sbueringer commented Mar 10, 2022

I think this probably depends on the infrastructure provider. I saw some weird issues in cases like this with OpenStack, but this might be due to that specific OpenStack setup.

In the OpenStack case the CSI pod running on that node would be responsible for unmounting etc. (I don't remember the details). I'm not sure how safe this is in general and how gracefully this will be handled by CSI and the infrastructure if a node/server is just deleted with attached volumes.

@lentzi90
Contributor

I think this old issue kubernetes/kubernetes#54368 (comment) could be relevant here. It is about StatefulSets, but a DaemonSet with volumes comes quite close.

From a safety perspective I guess the best thing would be to shut down the node, but then we are already in infrastructure-provider territory.

@furkatgofurov7
Member Author

furkatgofurov7 commented Mar 11, 2022

The draining of DaemonSets has apparently been discussed for a long time in the upstream k8s issue kubernetes/kubernetes#75482, but it has always been postponed (it was first milestoned for v1.16) up until now.
On the other hand, there is #6158, which is similar to what we have been discussing here.

@enxebre
Member

enxebre commented Mar 15, 2022

You can use .NodeDrainTimeout to bypass draining after a timeout; the machine controller will then proceed to delete the Node even though it has volumes attached. That would have the pertinent implications on each particular platform.

As for what the cluster autoscaler does, we could consider doing something similar, i.e. taint the Nodes being deleted and evict DS pods manually. I'd question how strong the use case is for supporting this opinionatedly vs. enabling a consumer to do it as in #6285 (comment).

To get more context, can you please elaborate on your use case for running a DS with a volume?
Also wil

@maelk
Contributor

maelk commented Mar 15, 2022

We have the node drain timeout, but CAPI is still blocking on the volumes even after the timeout has expired. Maybe that could be modified so that we don't wait for the volumes to be detached after the timeout has expired?

Regarding the use-case, we don't have much info. It is a customer using those and they did not give us any specific information.

Extending the timeout to cover the volumes would be an acceptable solution for us.

@enxebre
Member

enxebre commented Mar 17, 2022

We have the node drain timeout, but CAPI is still blocking on the volumes even after the timeout has expired.

@maelk By looking at the code I'd be surprised if that's the case; can you validate/confirm and share logs?
https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/machine/machine_controller.go#L309-L353

@fabriziopandini
Member

/milestone v1.2

@k8s-ci-robot k8s-ci-robot added this to the v1.2 milestone Mar 22, 2022
@enxebre
Member

enxebre commented Mar 28, 2022

/priority awaiting-more-evidence
to hear from #6285 (comment)

@k8s-ci-robot k8s-ci-robot added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Mar 28, 2022
@vincepri
Member

/milestone Next

@maelk
Contributor

maelk commented Apr 4, 2022

Sorry for the delay, I was on PTO. I verified the inputs we got, and actually the node drain timeout was not set. So you are right about the current behaviour.
Our users do not want to set it because Rook is deployed and using timeouts would override some internal control mechanisms. Could we maybe consider splitting the timeouts, or adding another control knob to change the timeout behaviour for the volume part only?

@furkatgofurov7
Member Author

@enxebre @vincepri what do you think of the idea of splitting the timeouts, i.e. having a general drain timeout and, in addition, a timeout specifically for volumes?

@furkatgofurov7
Member Author

furkatgofurov7 commented Apr 6, 2022

/cc @sbueringer @fabriziopandini

@furkatgofurov7
Member Author

What we were thinking could be doable is to introduce a new timeout, e.g. called volumeDrainTimeout, specifically for volumes.
Then, whenever we check whether we should wait for volumes, we would also have a volumeDrainTimeout to consult. In a case where nodeDrainTimeout is set to 0 (meaning wait indefinitely, which is the scenario we are dealing with) and we have a volume that can never be detached, we would wait until either nodeDrainTimeout or volumeDrainTimeout is exceeded and, based on whichever of the two expires first, move on with the node deletion rather than being stuck waiting.
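A minimal sketch of the proposed check, assuming a hypothetical volumeDetachTimeout field and using the time the wait started as the reference point (names are illustrative, not the final API):

package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// volumeDetachTimeoutExceeded returns true once the (hypothetical)
// volumeDetachTimeout has elapsed since the wait-for-volume-detach phase
// started. A nil timeout keeps today's behaviour of waiting indefinitely.
func volumeDetachTimeoutExceeded(timeout *metav1.Duration, waitStarted metav1.Time) bool {
	if timeout == nil {
		return false // not configured: wait forever, as CAPI does today
	}
	return time.Since(waitStarted.Time) >= timeout.Duration
}

func main() {
	waitStarted := metav1.NewTime(time.Now().Add(-10 * time.Minute))
	timeout := &metav1.Duration{Duration: 5 * time.Minute}
	// Prints true: give up waiting for the detach and proceed with node deletion.
	fmt.Println(volumeDetachTimeoutExceeded(timeout, waitStarted))
}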

@enxebre
Member

enxebre commented Apr 6, 2022

Our users do not want to set it because Rook is deployed and using timeouts would override some internal control mechanisms.

Can you elaborate on this so we can understand the full picture?
If we consider this a generic valid use case not covered by drainTimeout, then either volumeDrainTimeout or supporting draining DaemonSets, e.g. using taints, makes sense to me; otherwise this could be a valid use case for a runtime extension hook.

@furkatgofurov7
Member Author

then either volumeDrainTimeout or supporting draining DaemonSets, e.g. using taints, makes sense to me

@enxebre great, thanks! I will try to put a draft PR up soon.

@maelk
Contributor

maelk commented Apr 8, 2022

Here is our use case with Rook:
In order to preserve the Ceph cluster, Rook needs to tightly control the deprovisioning of the node, since there is a constraint on OSDs going down: if more than one replica of the data is missing, the Ceph cluster becomes unusable until it is back to at most one replica of the data being gone. So to go through the upgrade, it uses PDBs to prevent more than x nodes from going offline at once for reprovisioning. It will prevent the draining of any other node as long as the Ceph cluster is not ready to have more OSDs go down. So until the data is rebalanced and there is room to take another OSD down, the drain of the next node will be blocked.
We do not want to override this mechanism with a drain timeout since it could break our Ceph cluster. However, we have a DaemonSet that mounts a volume, and currently that blocks the deprovisioning of the node without manual action. If we could set a timeout on that specific part of the deprovisioning without impacting the actual draining, then our problem would be solved.
Do you see another alternative?
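For readers unfamiliar with the mechanism described above: Rook caps concurrent OSD disruption with PodDisruptionBudgets. The object below is only an illustration of the idea (name, labels and budget value are assumptions, not Rook's actual PDB):

package main

import (
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// Allow at most one OSD pod to be disrupted at a time: a drain of a second
	// node blocks until the Ceph cluster has rebalanced and the budget is met again.
	maxUnavailable := intstr.FromInt(1)
	pdb := policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "osd-pdb-example", Namespace: "rook-ceph"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &maxUnavailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "rook-ceph-osd"},
			},
		},
	}
	fmt.Printf("%+v\n", pdb.Spec)
}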

@fabriziopandini
Member

fabriziopandini commented Apr 12, 2022

Sorry for getting late to this issue.
I'm not 100% sure I got the difference between forcing deletion due to nodeDrainTimeout expiring or due to a new volumeDrainTimeout, because if I understand the above comments right in both cases we are going to delete the node with attached volumes, or am I wrong?

Also, a couple of comments about API changes if we go down this path:

  • if we are going to add a new timeout, it should go into ClusterTopology as well
  • similarly, it should go in KCP/CP CRs, but this requires a contract update
  • if I look at the machine API we are adding timeouts to spec without a proper organization; wondering if we can do a better job at modeling delete options

/cc @yastij @vincepri

@furkatgofurov7
Member Author

in both cases we are going to delete the node with attached volumes, or am I wrong?

@fabriziopandini correct.

  • if we are going to add a new timeout, it should go into ClusterTopology as well

I will look into adding it to #6413, which is in flight now

similarly, it should go in KCP/CP CRs, but this requires a contract update

The first part is covered, I believe; would you mind elaborating more on the contract update part?

  • if I look at the machine API we are adding timeouts to spec without a proper organization; wondering if we can do a better job at modeling delete options

👍

@furkatgofurov7
Member Author

/assign
/lifecycle active

@fabriziopandini
Member

I agree with @enxebre that it seems we are implementing a knob for a use case that can be solved in other ways.
I don't have preferences on using something different than a DaemonSet or using taints, but I wonder if we should document this somewhere in the book

@enxebre
Member

enxebre commented Apr 18, 2022

but I wonder if we should document this somewhere in the book

Today we stick to drain's default behaviour and that prevents DaemonSets from being evicted. I created #6421 to document the current state of things.

Now, we should discuss if we want to implement DaemonSet eviction (#6158). If so, we need to think about how to make the transition, as this would be an API-breaking change to existing behaviour. Alternatively, if there are arguments against supporting evicting DaemonSets in core, this could be done via a hook implementation.

@furkatgofurov7
Member Author

@fabriziopandini @sbueringer @vincepri @CecileRobertMichon I would appreciate your opinions on how to move forward with this issue.

Today we stick to drain's default behaviour and that prevents DaemonSets from being evicted. I created #6421 to document the current state of things.

@enxebre thanks, that is helpful!

Alternatively, if there are arguments against supporting evicting DaemonSets in core, this could be done via a hook implementation.

I am not sure I got the point here: are you referring to introducing a new hook or making use of the existing ones? If the latter, from my understanding the pre-drain.delete hook is too early and the pre-term.delete hook is too late to help with our use case.

@furkatgofurov7
Member Author

/priority awaiting-more-evidence
to hear from #6285 (comment)

@enxebre is there anything missing in this issue, or can we maybe remove this label?

@enxebre enxebre removed the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Apr 28, 2022
@sbueringer
Member

sbueringer commented Apr 29, 2022

What would an implementation that drains DaemonSets look like?

Regarding the variant with drain: is it something like our machine controller would set a taint and then delete all DaemonSet pods? (The taint would be there so the DaemonSet pods don't get rescheduled; I'm not sure if just setting a taint evicts already-running Pods from a Node.)

Would we just evict all DaemonSet Pods? As far as I'm aware, Pod deletion has a dependency on CNI working. Afaik if CNI is gone (i.e. the Calico pod has been drained) the kubelet will fail to delete Pods with an error like "Failed to stop Pod sandbox" (IIRC). I think we would also have a problem if the CSI Pod is drained too early. I'm not sure, but it sounds like we need a mechanism to decide either a) which DaemonSet pods are getting drained or b) in which order all DaemonSet pods are drained.

I didn't have the time to read through the upstream issues, but given that they haven't been implemented yet, is anyone aware of concerns that are showstoppers for CAPI? If there is no showstopper upstream, should we consider implementing it there instead and then using it via the drain helper that we already use today?

In any case, if we start evicting DaemonSet pods we have to think about how we handle the change in behavior; the only option I see is to add a new field to enable it, disabled by default.
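A rough sketch of the taint-then-evict variant discussed above (similar in spirit to what the cluster autoscaler does): taint the node so DaemonSet pods are not rescheduled, then evict the DaemonSet-owned pods through the eviction API. The taint key and the overall flow are assumptions for illustration, not an agreed design, and this does not address the CNI/CSI ordering concern:

package drainutil

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictDaemonSetPods taints the node and then evicts DaemonSet-owned pods on it.
func evictDaemonSetPods(ctx context.Context, c kubernetes.Interface, nodeName string) error {
	node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	// Hypothetical taint key; DaemonSet pods don't tolerate arbitrary taints,
	// so they should not be recreated on this node after eviction.
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "example.x-k8s.io/draining",
		Effect: corev1.TaintEffectNoSchedule,
	})
	if _, err := c.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		return err
	}

	pods, err := c.CoreV1().Pods("").List(ctx, metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for i := range pods.Items {
		pod := &pods.Items[i]
		owner := metav1.GetControllerOf(pod)
		if owner == nil || owner.Kind != "DaemonSet" {
			continue
		}
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		if err := c.PolicyV1().Evictions(pod.Namespace).Evict(ctx, eviction); err != nil {
			return fmt.Errorf("evicting %s/%s: %w", pod.Namespace, pod.Name, err)
		}
	}
	return nil
}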

@furkatgofurov7
Member Author

furkatgofurov7 commented May 13, 2022

am I right to say that your problem would be automatically solved if the pod having the volume were not owned by a daemon set so it would be naturally evicted as everything else letting the volume go away?

Yes, if the volume was not owned by a daemonset, we would not have the issue. Either direction should be fine for us to solve this issue.

@enxebre @sbueringer we have checked this again and confirmed that this problem still exists in a case where we have a deployment pod with a volume attached to it and that pod is NOT owned by a DaemonSet. So the earlier confirmation, where we said there is no issue when the volume is not owned by a DaemonSet, is not accurate.

So, in short, it is not only a DaemonSet-specific issue but rather a wider one: CAPI waits for volumes to be detached when the volume is attached to any resource (it could be a DS or a deployment pod).

Here, I am attaching more logs from the cluster where we have seen CAPI blocking the machine deletion (machine stuck in “Deleting” state forever) and hope this helps to better understand the problem.

==================================================================================================

We have a cluster where Ceph is used as the storage CSI. Ceph runs on the master nodes (3), and a container on a worker mounts a network-file PVC in RWX mode using Ceph.
Once the worker's BaremetalHost is deleted (in a chain of BM server <=> InfraMachine <=> CAPI machine), the worker node becomes NotReady and the CAPI machine is set into the Failed state. We then try to delete the machine once it is in the Failed state; it enters the "Deleting" phase and gets stuck in that state forever.

# k get machine  
NAMESPACE                    NAME         CLUSTER         NODENAME         PROVIDERID       PHASE    AGE   VERSION
test-machine-namespace     test-machine  test-cluster    test-node-pool   metal3://15xxx   Deleting  10h   v1.23.3

# k get nodes
 NAME            STATUS                        ROLES   AGE   VERSION
test-node-pool   NotReady,SchedulingDisabled   worker  10h   v1.23.3

Below, you can see the YAML definitions of the resources involved in the test case:

Pod YAML:

apiVersion: v1
kind: Pod
metadata:
  name: test-deployment-pod
  namespace: kube-system
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicaSet
    name: test-74fb4cb485
  ………………
spec:
  nodeName: test-node-pool
  volumes:
  - name: registry-data
    persistentVolumeClaim:
      claimName: pvc-test

Persistent volume YAML:

apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: rook-ceph.cephfs.csi.ceph.com
  name: pvc-244556666-n332322
spec:
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: pvc-test
    namespace: kube-system
  csi:
    controllerExpandSecretRef:
      name: rook-csi-cephfs-provisioner
      namespace: rook-ceph
    driver: rook-ceph.cephfs.csi.ceph.com
    nodeStageSecretRef:
      name: rook-csi-cephfs-node
      namespace: rook-ceph
status:
  phase: Bound

and PVC YAML attached to the pod:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-test
  namespace: kube-system
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: network-file
  volumeMode: Filesystem
  volumeName: pvc-244556666-n332322
status:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 10Gi
  phase: Bound

Machine stuck in Deleting state:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  name: test-machine
  namespace: test-machine-namespace
spec:
……..
status:
  conditions:
  - lastTransitionTime: "2022-05-12T23:32:06Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2022-05-12T23:31:51Z"
    status: "True"
    type: BootstrapReady
  - lastTransitionTime: "2022-05-13T09:01:54Z"
    status: "True"
    type: DrainingSucceeded
  - lastTransitionTime: "2022-05-12T23:32:06Z"
    status: "True"
    type: InfrastructureReady
  - lastTransitionTime: "2022-05-13T08:35:32Z"
    message: 'Node condition MemoryPressure is Unknown. Node condition DiskPressure
      is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown. '
    reason: NodeConditionsFailed
    status: Unknown
    type: NodeHealthy
  - lastTransitionTime: "2022-05-13T09:01:24Z"
    status: "True"
    type: PreDrainDeleteHookSucceeded
  - lastTransitionTime: "2022-05-13T09:01:54Z"
    message: Waiting for node volumes to be detached
    reason: WaitingForVolumeDetach
    severity: Info
    status: "False"
    type: VolumeDetachSucceeded
  infrastructureReady: true
  lastUpdated: "2022-05-13T09:01:24Z"
  nodeInfo:
    architecture: amd64
    containerRuntimeVersion: containerd://1.5.7
    kubeProxyVersion: v1.23.3
    kubeletVersion: v1.23.3
  nodeRef:
    apiVersion: v1
    kind: Node
    name: test-node-pool
    uid: 0xxxxx
  phase: Deleting

Corresponding Node:

apiVersion: v1
kind: Node
metadata:
  name: test-node-pool
spec:
  taints:
  - effect: NoSchedule
    key: node.kubernetes.io/unreachable
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
  unschedulable: true
status:
  conditions:
  - message: Calico is running on this node
    reason: CalicoIsUp
    status: "False"
    type: NetworkUnavailable
  - message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: MemoryPressure
  - message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: DiskPressure
  - message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: PIDPressure
  - message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: Ready
  daemonEndpoints:
    kubeletEndpoint:
      Port: 10250
  
  nodeInfo:
    …..
  volumesAttached:
  - devicePath: ""
    name: kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com-rook-ceph-volume1
  - devicePath: ""
    name: kubernetes.io/csi/rook-ceph.cephfs.csi.ceph.com-rook-ceph-volume2
  volumesInUse:
  - kubernetes.io/csi/rook-ceph.cephfs.csi.ceph.com-rook-ceph-volume1
  - kubernetes.io/csi/rook-ceph.rbd.csi.ceph.com-rook-ceph-volume2

CAPI logs while trying to delete the machine:

I0513 09:01:54.844912       1 machine_controller.go:311] controller/machine "msg"="Draining node" "cluster"="test-cluster" "name"="test-machine" "namespace"="test-machine-namespace" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine" "node"="test-node-pool"
E0513 09:01:55.527274       1 machine_controller.go:800] 
1 machine_controller.go:603] controller/machine "msg"="Drain successful" "cluster"="test-cluster" "name"="test-machine" "namespace"="test-machine-namespace" "node"="test-node-pool" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine" 
I0513 09:01:55.527636       1 machine_controller.go:352] controller/machine "msg"="Waiting for node volumes to be detached" "cluster"="test-cluster" "name"="test-machine" "namespace"="test-machine-namespace" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine" "node"="test-node-pool"

and it keeps throwing the same "Waiting for node volumes to be detached" output forever, with different timestamps.
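For reference, the check the controller keeps looping on boils down to looking at the Node's status.volumesAttached list shown above; a simplified sketch (not the exact CAPI code):

package volumewait

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// shouldWaitForNodeVolumes approximates the wait behind the log line above:
// as long as the Node still reports attached volumes, the machine controller
// requeues and keeps waiting before deleting the machine.
func shouldWaitForNodeVolumes(ctx context.Context, c kubernetes.Interface, nodeName string) (bool, error) {
	node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return true, err
	}
	return len(node.Status.VolumesAttached) != 0, nil
}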

However, if the node is deleted manually, CAPI no longer checks for the volumes and proceeds with the deletion of the machine as expected:

E0513 10:25:53.840191 1 machine_controller.go:558] controller/machine "msg"="Could not find node from noderef, it may have already been deleted" "error"="nodes \"test-node-pool\" not found" "cluster"="test-cluster" "name"="test-machine" "namespace"="test-machine-namespace" "node"="test-node-pool" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine"
E0513 10:25:53.840290 1 machine_controller.go:623] controller/machine "msg"="Could not find node from noderef, it may have already been deleted" "error"="Node \"test-node-pool\" not found" "cluster"="test-cluster" "machine"="test-machine" "name"="test-machine" "namespace"="test-machine-namespace" "node"="test-node-pool" "reconciler group"="cluster.x-k8s.io" "reconciler kind"="Machine" 

Since this is not only specific to DaemonSets (up until now we were thinking that was the case) but is wider and more generic (other resources with volumes, not only a DS), how do you feel about the idea of introducing the volumeDetachTimeout we had initially and have been discussing in #6285 (comment)?

@enxebre
Member

enxebre commented May 23, 2022

Thanks for sharing the context above @furkatgofurov7

So, in short, it is not only a DaemonSet-specific issue but rather a wider one: CAPI waits for volumes to be detached when the volume is attached to any resource (it could be a DS or a deployment pod).

In order to preserve the Ceph cluster, Rook needs to tightly control the deprovisioning of the node, since there is a constraint on OSDs going down: if more than one replica of the data is missing, the Ceph cluster becomes unusable until it is back to at most one replica of the data being gone. So to go through the upgrade, it uses PDBs to prevent more than x nodes from going offline at once for reprovisioning. It will prevent the draining of any other node as long as the Ceph cluster is not ready to have more OSDs go down. So until the data is rebalanced and there is room to take another OSD down, the drain of the next node will be blocked.
We do not want to override this mechanism with a drain timeout since it could break our Ceph cluster. However, we have a DaemonSet that mounts a volume

In the case of a Pod, this is happening because the pod is legitimately evicted but nothing is handling volume detachment, correct? This seems safe, and I'd expect you to handle the retirement of that volume via your cloud provider/KCM/storage automation/manually, as enabled by #4945. Could you clarify why that is not happening?

Where is the DaemonSet pod with the attached volume coming from, and why is Rook not taking care of detaching it?

I'm trying to picture how much of this is specific to your particular setup vs. a generic need for CAPI to enable a disruptive volume timeout beyond NodeDrainTimeout.

@mape90

mape90 commented May 31, 2022

The attachment comes from the Pod object: in k8s, volume attachment is controlled by the pod lifecycle.
Volume management with CSI PVCs in k8s happens in the following way:

  1. A Pod is created on a node.
  2. The attach-detach controller in kube-controller-manager sees the new Pod and gets all PVCs belonging to that pod. It then generates VolumeAttachments for the PVCs. (This step is skipped if the CSI driver has attachment disabled.)
  3. The CSI attacher detects the new VolumeAttachment and attaches the volume.
    ... (pod runs)
  4. The Pod is deleted from the API by the node hosting it (or the user force-deletes the pod).
  5. The attach-detach controller in kube-controller-manager detects the pod deletion (or, if it misses it, it checks all VolumeAttachments every 10 min to see if they still have a pod) and marks the VolumeAttachment for deletion.
  6. CSI detects the change on the VolumeAttachment and detaches the volume.
  7. The attach-detach controller in kube-controller-manager deletes the VolumeAttachment.
  8. CSI removes the finalizer from the PVC if this was the last VolumeAttachment for this PVC.

So, detaching can only happen if the pod gets deleted. A pod might not get deleted if a PDB prevents it or if it is a pod with DaemonSet tolerations. There should be no other case as long as the drain taint is set correctly and nothing deletes it; this ensures that drained pods cannot be scheduled back onto the node and that no new pods can be scheduled there.

Pods with a PDB do not necessarily have a PVC, but generally only stateful applications need a PDB.

The need to wait for volumes to detach is really specific to vSphere and not applicable to other storage backends. I would assume this volume wait to be an optional feature, as it changes the default behaviour of a k8s drain.

For vSphere it makes sense to have functionality that delays machine deletion until all volumes are detached, as otherwise the volumes would get deleted. However, does this belong in CAPI, or should a vSphere-specific controller (or its CSI driver) add its own finalizer to the Machine object and remove it only after all volumes are detached? That would also block CAPI from deleting the machine until it is ready for deletion, and this approach would keep the CAPI code more generic and free of vendor-specific code.

But at least there should be a possibility to disable this volume wait, as it prevents some k8s functionality that is otherwise supported.
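To make steps 5-7 above concrete, here is a small client-go sketch (illustrative only) that lists the VolumeAttachment objects still bound to a given node, i.e. exactly what has to go away before the detach is considered done:

package volumecheck

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// attachmentsForNode prints the VolumeAttachments whose spec.nodeName matches
// the given node; these are the objects the attach-detach controller and the
// CSI external-attacher reconcile in the flow described above.
func attachmentsForNode(ctx context.Context, c kubernetes.Interface, nodeName string) error {
	list, err := c.StorageV1().VolumeAttachments().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, va := range list.Items {
		if va.Spec.NodeName != nodeName {
			continue
		}
		pv := "<inline source>"
		if va.Spec.Source.PersistentVolumeName != nil {
			pv = *va.Spec.Source.PersistentVolumeName
		}
		fmt.Printf("%s: pv=%s attached=%v\n", va.Name, pv, va.Status.Attached)
	}
	return nil
}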

@MaxRink
Contributor

MaxRink commented May 31, 2022

@mape90 IMO the drain wait is not specific to vSphere.
Other CSI providers also rely on this.
And also for those that don't, forcefully unbinding block volumes while they are still in use is not a good thing, as that can lead to data or filesystem corruption.

@mape90

mape90 commented Jun 1, 2022

@MaxRink
vSphere is quite special, as there the VMDK disk is destroyed if it is attached to a node when the node is deleted. Usually cloud providers do not destroy attached disks when a node is deleted; they just detach them. So for vSphere the detach wait is non-optional.

But yes, it is nice if all volumes are detached before stopping the node. Still, stopping the node before detaching the volume should not be an issue: it would be just the same as a power loss, and volumes should not get corrupted by that, or you have a consistency issue in your filesystem or database anyway.

And as long as the Machine is shut down before the Node object is deleted, the volumes will not be used by multiple nodes at the same time, as the VolumeAttachments are still there and will block multi-attach if the volume is RWO.

So, I would still say an option to disable the volume detach wait would be beneficial to enable DaemonSets with volumes, or there could be a specific timeout for the volume wait. NodeDrainTimeout is not something that should be used in healthy production clusters, as it will break applications running on k8s by ignoring PDBs.

@furkatgofurov7
Member Author

@MaxRink I think there is still a valid case, as explained above by @mape90, where there would be a need for some kind of timeout when volume detachment is impossible. That concern was raised by @vincepri originally in #4945, and I understand it was stated that not setting nodeDrainTimeout can lead to a situation where things might be pending until manual intervention. But given that there are use cases where giving up waiting for volume detachment after a certain timeout is desirable, why not add it as an optional possibility?
@enxebre @vincepri @sbueringer any thoughts on this?

@enxebre
Member

enxebre commented Jun 6, 2022

Thanks for detailed elaboration @mape90.

I would assume this volume wait to be an optional feature, as it changes the default behaviour of a k8s drain.

I don't think the drain process is changed in any way by CAPI; we rely on "k8s.io/kubectl/pkg/drain". I'd say WaitForNodeVolumes is an opinionated phase of the CAPI Machine deletion lifecycle, which is coupled to our drainTimeout API semantics.

But yes, it is nice if all volumes are detached before stopping the node. Still, stopping the node before detaching the volume should not be an issue: it would be just the same as a power loss, and volumes should not get corrupted by that, or you have a consistency issue in your filesystem or database anyway.

I think this is a valid point.

In the case of a Pod, this is happening because the pod is legitimately evicted but nothing is handling volume detachment, correct?

So can you confirm 5, 6, 7 and 8 are not happening in your case because of two different scenarios:
A - The Pod has a volume attached and the host backed by the Machine was shut down out of band, so there's no ability to detach?
B - The Pod is owned by a DaemonSet, hence is not evicted.

I wouldn't be opposed to relaxing the assumption about cloud providers and introducing a volumeDetachTimeout decoupled from drainTimeout (particularly for A, if you can confirm the scenario @mape90), which defaults to infinite so it keeps backward compatibility and is defensive by default. I'd like to hear the PoV from other providers as well. cc @CecileRobertMichon @richardcase

@mape90

mape90 commented Jun 7, 2022

So can you confirm 5, 6, 7 and 8 are not happening in your case because of two different scenarios:
A - The Pod has a volume attached and the host backed by the Machine was shut down out of band, so there's no ability to detach?
B - The Pod is owned by a DaemonSet, hence is not evicted.

A) Yes, as long as there is a Pod on the node, all attachments are kept there. This is the k8s way of protecting volumes from being used by multiple nodes at once. If the node cannot communicate with the API, the volumes stay attached to the node forever. In case of failure it is assumed that something checks that the node is actually down before removing the Node object.

B) The logic is the same as in A; however, here the pod is just never planned to be moved out. Also, in this case the volume has to be RWX, so there is no risk in multi-attaching as that is supported.

@furkatgofurov7
Member Author

@enxebre hi! Is there anything missing or that needs to be clarified in order to move forward with this issue?

@kashifest
Contributor

@sbueringer @enxebre Can we triage this issue and settle on a solution? It would be nice to move forward since we are facing this in one of our deployments.

@enxebre
Member

enxebre commented Jul 4, 2022

Apologies for the delay; I'm OK with adding the field as in #6285 (comment). cc @sbueringer @fabriziopandini

@sbueringer
Member

Sounds fine to me as well.

We only have to discuss and document how the volumeDetachTimeout works in combination with the nodeDrainTimeout.

@fabriziopandini
Member

fabriziopandini commented Jul 4, 2022

Frankly speaking, I'm still confused by the fact that both the existing nodeDrainTimeout and the proposed volumeDrainTimeout are going to delete the node with attached volumes, so I'm not really sure this will actually solve the issue, given that both settings will be the same on all the machines, which was the original objection to using nodeDrainTimeout.

That said, my considerations about adding a new field are the same as already expressed in #6285 (comment); I'm mostly concerned about usability (make it clear for the users what the knobs are for) and API design (let's have a well-organized API surface, or at least an idea of how to get there in the next API release).

@sbueringer
Member

sbueringer commented Jul 5, 2022

I see the point in not wanting to break the actual node drain + PDB behavior and being able to have an independent timeout only for the volume detach. But as I said, we have to find a good way for both timeouts to work together.

One idea:

  • nodeDrainTimeout => only applies to the actual drain (without volume detach)
  • volumeDetachTimeout => only applies to the wait for volume detach

This would give us 3 subsequent "actions" with their corresponding timeouts: node drain, wait for volume detach, node delete

The current state is not great: we just introduced shouldWaitForNodeVolumes without making it optional, with the assumption that it works for everyone. The consequence is that it breaks drain+delete behavior for folks.
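To make the combination concrete, the API addition could look roughly like the fragment below; the field name and comments are assumptions based on this thread, not the final API (see #6413 for the actual change):

package v1beta1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// MachineSpec fragment sketching two independent timeouts: one for the drain
// itself and one for the wait-for-volume-detach phase that follows it.
type MachineSpec struct {
	// NodeDrainTimeout is the total amount of time the controller spends on
	// draining the node (the existing field). 0 means "drain without any time
	// limitation".
	// +optional
	NodeDrainTimeout *metav1.Duration `json:"nodeDrainTimeout,omitempty"`

	// NodeVolumeDetachTimeout is the hypothetical new field discussed here:
	// how long the controller waits for volumes to be detached after a
	// successful drain. nil would keep today's behaviour of waiting forever.
	// +optional
	NodeVolumeDetachTimeout *metav1.Duration `json:"nodeVolumeDetachTimeout,omitempty"`
}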

@furkatgofurov7
Member Author

furkatgofurov7 commented Jul 18, 2022

@fabriziopandini I thought we were clear about why we are introducing a new timeout separate from nodeDrainTimeout. As @sbueringer pointed out, waiting for volume detachment indefinitely on the assumption that it works for everyone is not a good idea, and for that reason we came up with the new volumeDetachTimeout. Let me know what you think. Perhaps we can unhold #6413 and I could work out the remaining pieces from there?

@fabriziopandini fabriziopandini added the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini fabriziopandini removed this from the Next milestone Jul 29, 2022
@fabriziopandini fabriziopandini removed the triage/accepted Indicates an issue or PR is ready to be actively worked on. label Jul 29, 2022
@fabriziopandini
Member

I have removed the hold on #6413; thanks for driving this effort.
