MachineHealthCheck unable to remediate unreachable node with volumes attached #10661
Labels
area/machine
Issues or PRs related to machine lifecycle management
kind/bug
Categorizes issue or PR as related to a bug.
priority/important-soon
Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
triage/accepted
Indicates an issue or PR is ready to be actively worked on.
What steps did you take and what happened?
The DrainingSucceeded status condition on the machine reports status: "True" after the skipWaitForDelete timeout during the drain is exceeded (cluster-api/internal/controllers/machine/machine_controller.go, lines 672 to 675 in a2b7dd1).
What did you expect to happen?
When a MachineHealthCheck attempts to remediate a machine whose underlying EC2 instance is stopped, I expect it to successfully drain the node and replace the machine.
Cluster API version
1.7.1
Kubernetes version
v1.27.13+e709aa5
Anything else you would like to add?
I believe that we can address this by setting GracePeriodSeconds: 1 in the drain configuration (cluster-api/internal/controllers/machine/machine_controller.go, lines 672 to 675 in a2b7dd1), like OpenShift's machinehealthcheck controller does, because for unreachable nodes, deleting pods with a specified grace period will allow for successful volume detachment.
Label(s) to be applied
/kind bug
/area machine