
MachineHealthCheck unable to remediate unreachable node with volumes attached #10661


Closed
mjlshen opened this issue May 22, 2024 · 2 comments · Fixed by #10662
Labels
  area/machine: Issues or PRs related to machine lifecycle management
  kind/bug: Categorizes issue or PR as related to a bug.
  priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
  triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

Contributor

mjlshen commented May 22, 2024

What steps did you take and what happened?

  • Create a CAPA cluster with at least one machine/node
  • Apply a MachineHealthCheck that remediates machines whose nodes stop reporting status (a complete example manifest is sketched after this list):
    spec:
      maxUnhealthy: 2
      unhealthyConditions:
      - status: Unknown
        timeout: 8m0s
        type: Ready
  • Run a pod on the cluster that mounts a persistent volume
  • Stop the underlying EC2 instance in AWS
  • Observe that the DrainingSucceeded status condition on the Machine reports status: "True" once the drain's SkipWaitForDeleteTimeoutSeconds is exceeded:
        if noderefutil.IsNodeUnreachable(node) {
            // When the node is unreachable and some pods are not evicted for as long as this timeout, we ignore them.
            drainer.SkipWaitForDeleteTimeoutSeconds = 60 * 5 // 5 minutes
        }
  • The machine is then stuck in a deleting state forever because the volume is not detached
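
For reference, a complete MachineHealthCheck manifest matching the spec fragment above might look like the following; the name, namespace, clusterName, and selector labels are illustrative placeholders rather than values from the original report:

    apiVersion: cluster.x-k8s.io/v1beta1
    kind: MachineHealthCheck
    metadata:
      name: worker-unhealthy-node-check    # placeholder name
      namespace: default                   # placeholder namespace
    spec:
      clusterName: my-capa-cluster         # placeholder cluster name
      maxUnhealthy: 2
      selector:
        matchLabels:
          cluster.x-k8s.io/cluster-name: my-capa-cluster   # placeholder selector
      unhealthyConditions:
      # Remediate machines whose Node Ready condition has been Unknown for 8 minutes,
      # i.e. the node has stopped reporting status.
      - type: Ready
        status: Unknown
        timeout: 8m0s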

What did you expect to happen?

When a MachineHealthCheck attempts to remediate a machine whose underlying EC2 instance is stopped, I expect it to successfully drain the node and replace the machine.

Cluster API version

1.7.1

Kubernetes version

v1.27.13+e709aa5

Anything else you would like to add?

I believe we can address this by setting GracePeriodSeconds: 1, as OpenShift's machinehealthcheck controller does, because for unreachable nodes deleting pods with a specified grace period allows volume detachment to succeed.
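
As a rough sketch of that suggestion (not necessarily the shape of the actual fix in #10662), the drain logic quoted above could additionally set a grace period for unreachable nodes; this assumes the drainer is the k8s.io/kubectl/pkg/drain Helper already used by the Machine controller:

    if noderefutil.IsNodeUnreachable(node) {
        // Existing behavior: when the node is unreachable and some pods are not
        // evicted within this timeout, we stop waiting for them.
        drainer.SkipWaitForDeleteTimeoutSeconds = 60 * 5 // 5 minutes
        // Sketched change: delete/evict pods with a short grace period so their
        // volumes can be detached even though the kubelet on the stopped instance
        // never confirms pod termination.
        drainer.GracePeriodSeconds = 1
    }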

Label(s) to be applied

/kind bug
/area machine

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. area/machine Issues or PRs related to machine lifecycle management needs-priority Indicates an issue lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 22, 2024
Member

enxebre commented May 22, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 22, 2024
Contributor

typeid commented May 22, 2024

/assign

@fabriziopandini fabriziopandini added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jun 5, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates an issue lacks a `priority/foo` label and requires one. label Jun 5, 2024