
CAPI doesn't wait for CSI volume unbinding #4707

Closed
MaxRink opened this issue May 31, 2021 · 8 comments · Fixed by #4945
Labels
kind/bug Categorizes issue or PR as related to a bug.
Milestone

Comments

@MaxRink (Contributor) commented May 31, 2021

What steps did you take and what happened:
I've rolled my workers, which had volumes provisioned by the vSphere CSI attached.
Some of those volumes did not detach properly, as CAPI was too fast and removed the nodes before the CSI controller could fully detach the volumes.

What did you expect to happen:
CAPI waits until volumes are detached.

Anything else you would like to add:
We had a small discussion in the CAPV slack ( https://kubernetes.slack.com/archives/CKFGK3SSD/p1622198292045000 )
@jzhoucliqr already has a patch ( spectrocloud@c340e68 )

Environment:

  • Cluster-api version: alpha3

/kind bug

@k8s-ci-robot added the kind/bug label May 31, 2021
@enxebre (Member) commented May 31, 2021

@MaxRink what signals the CSI controller to detach the volumes? Should the vSphereMachine controller ensure the detach happens gracefully, orthogonally to the CSI controller?

@MaxRink (Contributor, Author) commented May 31, 2021

The pod eviction. After the pods are stopped, the CSI controller will detach the volumes so they can be rebound on another node.
Right now, as far as CAPI is concerned, the node can be safely deleted once all pods are terminated, which gets in the way of the CSI node daemonset and the controller coordinating the volume detachment.
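For context, one way to observe that coordination from the API side is to look at the storage.k8s.io/v1 VolumeAttachment objects still bound to the node. A minimal sketch, assuming a controller-runtime client against the workload cluster; the package and function names are illustrative and not from any existing CAPI code:

```go
package volumes

import (
	"context"

	storagev1 "k8s.io/api/storage/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// volumeAttachmentsForNode returns the VolumeAttachments whose spec still
// references the given node, i.e. volumes the CSI controller has not yet
// detached. Sketch only.
func volumeAttachmentsForNode(ctx context.Context, c client.Client, nodeName string) ([]storagev1.VolumeAttachment, error) {
	list := &storagev1.VolumeAttachmentList{}
	if err := c.List(ctx, list); err != nil {
		return nil, err
	}
	var attached []storagev1.VolumeAttachment
	for _, va := range list.Items {
		if va.Spec.NodeName == nodeName && va.Status.Attached {
			attached = append(attached, va)
		}
	}
	return attached, nil
}
```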

@fabriziopandini (Member) commented Jun 1, 2021

I'm not sure this is really a CAPI bug, given that CAPI is not aware of the type of storage provider in use in each cluster.
Nevertheless, if I got the problem right, machine lifecycle hooks could be a valid solution here; they provide an extension point each provider/user can exploit to check additional conditions before allowing node deletion, such as, in this case, CSI volume cleanup.
cc @yastij @gab-satchi
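For reference, the lifecycle hooks mentioned above are annotations on the Machine object: a controller sets a pre-drain (or pre-terminate) hook and removes it once its own condition, here CSI volume cleanup, is satisfied. A minimal sketch, assuming the pre-drain annotation prefix from the machine deletion phase hooks proposal and a controller-runtime client; the "/csi-detach" suffix and function name are illustrative:

```go
package hooks

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// preDrainHookAnnotation assumes the pre-drain prefix from the machine
// deletion phase hooks proposal; the suffix just names the hook owner
// and is illustrative.
const preDrainHookAnnotation = "pre-drain.delete.hook.machine.cluster.x-k8s.io/csi-detach"

// blockDrain annotates a Machine so the Machine controller pauses before
// draining/deleting it; the owning controller removes the annotation once
// the CSI volumes are confirmed detached.
func blockDrain(ctx context.Context, c client.Client, machine client.Object) error {
	patch := client.MergeFrom(machine.DeepCopyObject().(client.Object))
	annotations := machine.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[preDrainHookAnnotation] = ""
	machine.SetAnnotations(annotations)
	return c.Patch(ctx, machine, patch)
}
```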

@yastij (Member) commented Jun 1, 2021

Generally speaking, I think CAPI should ensure that volumes are properly detached before deleting the machines. That can be done regardless of the storage provider, e.g. as in @jzhoucliqr's patch.

@randomvariable - IIRC CAPA does not have this problem, as non-root volumes are detached and preserved on instance termination, right?
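One provider-agnostic signal the Machine controller could wait on is the kubelet-reported Node.Status.VolumesAttached list, requeuing deletion until it drains to empty. A minimal sketch, assuming a controller-runtime client for the workload cluster; this is only an illustration, not necessarily what eventually merged in #4945:

```go
package machine

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// nodeHasAttachedVolumes reports whether the kubelet still lists attached
// volumes on the Node. A caller in the deletion flow could requeue until
// this returns false before removing the node and machine. Sketch only.
func nodeHasAttachedVolumes(ctx context.Context, workloadClient client.Client, nodeName string) (bool, error) {
	node := &corev1.Node{}
	if err := workloadClient.Get(ctx, client.ObjectKey{Name: nodeName}, node); err != nil {
		return false, err
	}
	return len(node.Status.VolumesAttached) > 0, nil
}
```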

@MaxRink (Contributor, Author) commented Jun 1, 2021

It's not only the vSphere CSI that might have this issue, btw.
Other providers like NetApp Trident, Pure PSO, and other iSCSI-based provisioners also rely on volumes being properly detached before node deletion.

@vincepri (Member) commented Jun 1, 2021

@yastij Are you thinking about inspecting a node and the status of its attached volumes?

@vincepri (Member) commented Jul 6, 2021

/lifecycle awaiting-more-evidence
/milestone Next

@vincepri (Member) commented
/milestone v0.4
