
MachineHealthCheck unable to remediate unreachable node with volumes attached #10661


Closed
mjlshen opened this issue May 22, 2024 · 2 comments · Fixed by #10662
Labels
  area/machine: Issues or PRs related to machine lifecycle management
  kind/bug: Categorizes issue or PR as related to a bug.
  priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

Contributor

mjlshen commented May 22, 2024

What steps did you take and what happened?

  • Create a CAPA cluster with at least one machine/node
  • Apply a MachineHealthCheck that remediates machines whose nodes stop reporting status (a complete example manifest is sketched after this list):
    spec:
      maxUnhealthy: 2
      unhealthyConditions:
      - status: Unknown
        timeout: 8m0s
        type: Ready
  • Run a pod on the cluster that mounts a persistent volume
  • Stop the underlying EC2 instance in AWS
  • Observe that the DrainingSucceeded status condition on the Machine reports status: "True" once the drain's SkipWaitForDeleteTimeoutSeconds is exceeded:
        if noderefutil.IsNodeUnreachable(node) {
            // When the node is unreachable and some pods are not evicted for as long as this timeout, we ignore them.
            drainer.SkipWaitForDeleteTimeoutSeconds = 60 * 5 // 5 minutes
        }
  • The machine is then stuck in a deleting state forever because the volume is not detached
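
For reference, a complete MachineHealthCheck manifest matching the spec fragment above might look like the following; the name, namespace, clusterName, and selector labels are illustrative placeholders rather than values from the original report:

    apiVersion: cluster.x-k8s.io/v1beta1
    kind: MachineHealthCheck
    metadata:
      name: worker-unhealthy-node-check    # placeholder name
      namespace: default                   # placeholder namespace
    spec:
      clusterName: my-capa-cluster         # placeholder cluster name
      maxUnhealthy: 2
      selector:
        matchLabels:
          cluster.x-k8s.io/cluster-name: my-capa-cluster   # placeholder selector
      unhealthyConditions:
      # Remediate machines whose Node Ready condition has been Unknown for 8 minutes,
      # i.e. the node has stopped reporting status.
      - type: Ready
        status: Unknown
        timeout: 8m0s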

What did you expect to happen?

When a MachineHealthCheck attempts to remediate a machine whose underlying EC2 instance is stopped, I expect it to successfully drain the node and replace the machine.

Cluster API version

1.7.1

Kubernetes version

v1.27.13+e709aa5

Anything else you would like to add?

I believe we can address this by setting GracePeriodSeconds: 1, as OpenShift's machinehealthcheck controller does, because for unreachable nodes deleting pods with a specified grace period allows volume detachment to succeed.
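
As a rough sketch of that suggestion (not necessarily the shape of the actual fix in #10662), the drain logic quoted above could additionally set a grace period for unreachable nodes; this assumes the drainer is the k8s.io/kubectl/pkg/drain Helper already used by the Machine controller:

    if noderefutil.IsNodeUnreachable(node) {
        // Existing behavior: when the node is unreachable and some pods are not
        // evicted within this timeout, we stop waiting for them.
        drainer.SkipWaitForDeleteTimeoutSeconds = 60 * 5 // 5 minutes
        // Sketched change: delete/evict pods with a short grace period so their
        // volumes can be detached even though the kubelet on the stopped instance
        // never confirms pod termination.
        drainer.GracePeriodSeconds = 1
    }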

Label(s) to be applied

/kind bug
/area machine

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. area/machine Issues or PRs related to machine lifecycle management needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 22, 2024
Member

enxebre commented May 22, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 22, 2024
Contributor

typeid commented May 22, 2024

/assign

@fabriziopandini fabriziopandini added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jun 5, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates an issue lacks a `priority/foo` label and requires one. label Jun 5, 2024