graceful node shutdown restarts pods during shutdown #100184


Closed
yvespp opened this issue Mar 12, 2021 · 5 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
sig/node: Categorizes an issue or PR as relevant to SIG Node.

Comments

@yvespp
Contributor

yvespp commented Mar 12, 2021

The original discussion was in kubernetes/website#26963, but I created this new issue because it has nothing to do with the website/docs.

@bobbypage here is the new issue, thanks for your help!

What happened:

With GracefulNodeShutdown enabled, when I stop a node, the pods on it get deleted but then get started on the same node again.
Maybe it's because the node is not marked as NotReady before the pods are deleted.
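
One way to check this hypothesis while reproducing (a sketch; the node name is just the worker used in the test below) is to watch the node's Ready condition and taints during the shutdown:

kubectl get node yp-test2-worker-0dd0512820d9 -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}{"\n"}'
kubectl describe node yp-test2-worker-0dd0512820d9 | grep -i taint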

What you expected to happen:

When the node is shut down, pods should be deleted and then scheduled to another node or remain in the Pending state.
When the node is started again, pods should be scheduled to that node again.

How to reproduce it (as minimally and precisely as possible):

I tested this on a cluster with 2 worker and 3 control-plane nodes.

Enable GracefulNodeShutdown in the kubelet config /var/lib/kubelet/config.yaml:

featureGates:
  GracefulNodeShutdown: true
shutdownGracePeriod: 1m0s
shutdownGracePeriodCriticalPods: 10s
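
After editing the config, restart the kubelet so it picks up the change. As a sanity check (a sketch, assuming a systemd host), the kubelet should then appear as the holder of a shutdown delay inhibitor lock:

sudo systemctl restart kubelet
systemd-inhibit --list   # kubelet should be listed with mode "delay" for the shutdown operation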

Create a Deployment and/or DaemonSet like this:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: shutdown-test-ds
  name: shutdown-test-ds
spec:
  selector:
    matchLabels:
      app: shutdown-test-ds
  template:
    metadata:
      labels:
        app: shutdown-test-ds
    spec:
      containers:
      - image: busybox
        name: busybox
        command: ['sh', '-c', "echo The app is running!; shutdown() { echo shutting down; exit 0; }; trap 'shutdown' 1 3 9 15; sleep infinity & wait $!; echo sleep over;"]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: shutdown-test-deploy
  name: shutdown-test-deploy
spec:
  replicas: 4
  selector:
    matchLabels:
      app: shutdown-test-deploy
  template:
    metadata:
      labels:
        app: shutdown-test-deploy
    spec:
      containers:
      - image: busybox
        name: busybox
        command: ['sh', '-c', "echo The app is running!; shutdown() { echo shutting down; exit 0; }; trap 'shutdown' 1 3 9 15; sleep infinity & wait $!; echo sleep over;"]
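
Both manifests can be applied to a test namespace, for example (assuming they are saved as shutdown-test.yaml; the namespace matches the one used in the outputs below):

kubectl create namespace shutdown-test
kubectl -n shutdown-test apply -f shutdown-test.yaml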

Shut down a node via systemctl poweroff or another method that triggers the systemd inhibitor locks, and observe the pods on that node with kubectl:
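
To follow the pod state continuously during the shutdown, a watch can be kept running in a second terminal:

kubectl -n shutdown-test get pods -o wide -w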

Before the shutdown of node yp-test2-worker-0dd0512820d9 starts:

[test_deploy]$ kubectl -n shutdown-test get pods,nodes -o wide
NAME                                        READY   STATUS    RESTARTS   AGE   IP            NODE                                                 NOMINATED NODE   READINESS GATES
pod/shutdown-test-deploy-69cff7d9b7-bvzj9   1/1     Running   0          47s   172.28.8.51   yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-gncbf   1/1     Running   0          47s   172.28.8.52   yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-rkpkz   1/1     Running   0          47s   172.28.9.35   yp-test2-worker-800c11be6e97                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-sc4gg   1/1     Running   0          47s   172.28.8.50   yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-ds-nl74c                  1/1     Running   0          38s   172.28.9.36   yp-test2-worker-800c11be6e97                         <none>           <none>
pod/shutdown-test-ds-qcpq8                  1/1     Running   0          40s   172.28.8.53   yp-test2-worker-0dd0512820d9                         <none>           <none>

NAME                                                      STATUS   ROLES                               AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
node/yp-test2-master-6cd3a6ac34c0                         Ready    control-plane,controlplane,master   3h19m   v1.20.4   10.22.141.156   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-master-957a196c2211                         Ready    control-plane,controlplane,master   3h20m   v1.20.4   10.22.141.164   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-master-969874052906                         Ready    control-plane,controlplane,master   3h19m   v1.20.4   10.22.141.157   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-worker-0dd0512820d9                         Ready    worker                              3h11m   v1.20.4   10.22.141.181   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-worker-800c11be6e97                         Ready    worker                              3h11m   v1.20.4   10.22.141.169   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4

Just after the shutdown starts: pods get deleted but are immediately started again. The node is still Ready:

[test_deploy]$ kubectl -n shutdown-test get pods,nodes -o wide
NAME                                        READY   STATUS     RESTARTS   AGE    IP            NODE                                                 NOMINATED NODE   READINESS GATES
pod/shutdown-test-deploy-69cff7d9b7-6glvw   1/1     Running    0          3s     172.28.9.37   yp-test2-worker-800c11be6e97                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-blspc   0/1     Pending    0          3s     <none>        yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-bvzj9   0/1     Shutdown   0          102s   <none>        yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-gncbf   0/1     Shutdown   0          102s   <none>        yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-mv7bd   0/1     Pending    0          3s     <none>        yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-rkpkz   1/1     Running    0          102s   172.28.9.35   yp-test2-worker-800c11be6e97                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-sc4gg   0/1     Shutdown   0          102s   <none>        yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-ds-nl74c                  1/1     Running    0          93s    172.28.9.36   yp-test2-worker-800c11be6e97                         <none>           <none>
pod/shutdown-test-ds-szlg9                  0/1     Pending    0          3s     <none>        yp-test2-worker-0dd0512820d9                         <none>           <none>

NAME                                                      STATUS   ROLES                               AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
node/yp-test2-master-6cd3a6ac34c0                         Ready    control-plane,controlplane,master   3h20m   v1.20.4   10.22.141.156   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-master-957a196c2211                         Ready    control-plane,controlplane,master   3h21m   v1.20.4   10.22.141.164   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-master-969874052906                         Ready    control-plane,controlplane,master   3h20m   v1.20.4   10.22.141.157   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-worker-0dd0512820d9                         Ready    worker                              3h12m   v1.20.4   10.22.141.181   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-worker-800c11be6e97                         Ready    worker                              3h12m   v1.20.4   10.22.141.169   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4

Now the node is NotReady:

[test_deploy]$ kubectl -n shutdown-test get pods,nodes -o wide
NAME                                        READY   STATUS              RESTARTS   AGE    IP            NODE                                                 NOMINATED NODE   READINESS GATES
pod/shutdown-test-deploy-69cff7d9b7-6glvw   1/1     Running             0          8s     172.28.9.37   yp-test2-worker-800c11be6e97                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-blspc   0/1     ContainerCreating   0          8s     <none>        yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-bvzj9   0/1     Shutdown            0          107s   <none>        yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-gncbf   0/1     Shutdown            0          107s   <none>        yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-mv7bd   0/1     ContainerCreating   0          8s     <none>        yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-rkpkz   1/1     Running             0          107s   172.28.9.35   yp-test2-worker-800c11be6e97                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-sc4gg   0/1     Shutdown            0          107s   <none>        yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-ds-nl74c                  1/1     Running             0          98s    172.28.9.36   yp-test2-worker-800c11be6e97                         <none>           <none>
pod/shutdown-test-ds-szlg9                  0/1     ContainerCreating   0          8s     <none>        yp-test2-worker-0dd0512820d9                         <none>           <none>

NAME                                                      STATUS     ROLES                               AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
node/yp-test2-master-6cd3a6ac34c0                         Ready      control-plane,controlplane,master   3h20m   v1.20.4   10.22.141.156   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-master-957a196c2211                         Ready      control-plane,controlplane,master   3h21m   v1.20.4   10.22.141.164   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-master-969874052906                         Ready      control-plane,controlplane,master   3h20m   v1.20.4   10.22.141.157   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-worker-0dd0512820d9                         NotReady   worker                              3h12m   v1.20.4   10.22.141.181   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-worker-800c11be6e97                         Ready      worker                              3h12m   v1.20.4   10.22.141.169   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4

Containers are running again:

[test_deploy]$ kubectl -n shutdown-test get pods,nodes -o wide
NAME                                        READY   STATUS     RESTARTS   AGE    IP            NODE                                                 NOMINATED NODE   READINESS GATES
pod/shutdown-test-deploy-69cff7d9b7-6glvw   1/1     Running    0          12s    172.28.9.37   yp-test2-worker-800c11be6e97                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-blspc   1/1     Running    0          12s    172.28.8.55   yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-bvzj9   0/1     Shutdown   0          111s   <none>        yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-gncbf   0/1     Shutdown   0          111s   <none>        yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-mv7bd   1/1     Running    0          12s    172.28.8.54   yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-rkpkz   1/1     Running    0          111s   172.28.9.35   yp-test2-worker-800c11be6e97                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-sc4gg   0/1     Shutdown   0          111s   <none>        yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-ds-nl74c                  1/1     Running    0          102s   172.28.9.36   yp-test2-worker-800c11be6e97                         <none>           <none>
pod/shutdown-test-ds-szlg9                  1/1     Running    0          12s    172.28.8.56   yp-test2-worker-0dd0512820d9                         <none>           <none>

NAME                                                      STATUS     ROLES                               AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
node/yp-test2-master-6cd3a6ac34c0                         Ready      control-plane,controlplane,master   3h20m   v1.20.4   10.22.141.156   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-master-957a196c2211                         Ready      control-plane,controlplane,master   3h21m   v1.20.4   10.22.141.164   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-master-969874052906                         Ready      control-plane,controlplane,master   3h20m   v1.20.4   10.22.141.157   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-worker-0dd0512820d9                         NotReady   worker                              3h12m   v1.20.4   10.22.141.181   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-worker-800c11be6e97                         Ready      worker                              3h12m   v1.20.4   10.22.141.169   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4

Node fully powered off; it stays like this:

[test_deploy]$ kubectl -n shutdown-test get pods,nodes -o wide
NAME                                        READY   STATUS     RESTARTS   AGE     IP            NODE                                                 NOMINATED NODE   READINESS GATES
pod/shutdown-test-deploy-69cff7d9b7-6glvw   1/1     Running    0          56s     172.28.9.37   yp-test2-worker-800c11be6e97                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-blspc   1/1     Running    0          56s     172.28.8.55   yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-bvzj9   0/1     Shutdown   0          2m35s   <none>        yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-gncbf   0/1     Shutdown   0          2m35s   <none>        yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-mv7bd   1/1     Running    0          56s     172.28.8.54   yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-rkpkz   1/1     Running    0          2m35s   172.28.9.35   yp-test2-worker-800c11be6e97                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-sc4gg   0/1     Shutdown   0          2m35s   <none>        yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-ds-nl74c                  1/1     Running    0          2m26s   172.28.9.36   yp-test2-worker-800c11be6e97                         <none>           <none>
pod/shutdown-test-ds-szlg9                  1/1     Running    0          56s     172.28.8.56   yp-test2-worker-0dd0512820d9                         <none>           <none>

NAME                                                      STATUS     ROLES                               AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
node/yp-test2-master-6cd3a6ac34c0                         Ready      control-plane,controlplane,master   3h21m   v1.20.4   10.22.141.156   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-master-957a196c2211                         Ready      control-plane,controlplane,master   3h22m   v1.20.4   10.22.141.164   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-master-969874052906                         Ready      control-plane,controlplane,master   3h21m   v1.20.4   10.22.141.157   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-worker-0dd0512820d9                         NotReady   worker                              3h12m   v1.20.4   10.22.141.181   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-worker-800c11be6e97                         Ready      worker                              3h12m   v1.20.4   10.22.141.169   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4

Node started again:

[test_deploy]$ kubectl -n shutdown-test get pods,nodes -o wide
NAME                                        READY   STATUS     RESTARTS   AGE     IP            NODE                                                 NOMINATED NODE   READINESS GATES
pod/shutdown-test-deploy-69cff7d9b7-6glvw   1/1     Running    0          2m41s   172.28.9.37   yp-test2-worker-800c11be6e97                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-blspc   1/1     Running    1          2m41s   172.28.8.61   yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-bvzj9   0/1     Shutdown   0          4m20s   <none>        yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-gncbf   0/1     Shutdown   0          4m20s   <none>        yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-mv7bd   1/1     Running    1          2m41s   172.28.8.63   yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-rkpkz   1/1     Running    0          4m20s   172.28.9.35   yp-test2-worker-800c11be6e97                         <none>           <none>
pod/shutdown-test-deploy-69cff7d9b7-sc4gg   0/1     Shutdown   0          4m20s   <none>        yp-test2-worker-0dd0512820d9                         <none>           <none>
pod/shutdown-test-ds-nl74c                  1/1     Running    0          4m11s   172.28.9.36   yp-test2-worker-800c11be6e97                         <none>           <none>
pod/shutdown-test-ds-szlg9                  1/1     Running    1          2m41s   172.28.8.65   yp-test2-worker-0dd0512820d9                         <none>           <none>

NAME                                                      STATUS   ROLES                               AGE     VERSION   INTERNAL-IP     EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
node/yp-test2-master-6cd3a6ac34c0                         Ready    control-plane,controlplane,master   3h22m   v1.20.4   10.22.141.156   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-master-957a196c2211                         Ready    control-plane,controlplane,master   3h23m   v1.20.4   10.22.141.164   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-master-969874052906                         Ready    control-plane,controlplane,master   3h23m   v1.20.4   10.22.141.157   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-worker-0dd0512820d9                         Ready    worker                              3h14m   v1.20.4   10.22.141.181   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4
node/yp-test2-worker-800c11be6e97                         Ready    worker                              3h14m   v1.20.4   10.22.141.169   <none>        Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.4.4

Environment:

  • Kubernetes version (use kubectl version): 1.20.4
  • Cloud provider or hardware configuration: VMWare on-prem
  • OS (e.g: cat /etc/os-release): Ubuntu 20.04.2 LTS
  • Kernel (e.g. uname -a): 5.4.0-66-generic
  • Install tools: kubeadm, ansible
@yvespp yvespp added the kind/bug Categorizes issue or PR as related to a bug. label Mar 12, 2021
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Mar 12, 2021
@k8s-ci-robot
Contributor

@yvespp: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Mar 12, 2021
@yvespp
Contributor Author

yvespp commented Mar 12, 2021

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Mar 12, 2021
@bobbypage
Member

bobbypage commented Mar 12, 2021

Maybe it's because the node is not marked as NotReady before pods are deleted.

What version of k8s did you use here for the test?

Did the version include #98005? That PR ensured that the node is marked not ready as soon as the shutdown is initiated. The PR was only applied on the master branch (i.e. 1.21). It was backported to 1.20 in #99254 (but that PR was only merged yesterday and isn't in any released minor version yet).

Edit: I see you mentioned you were on 1.20.4. That definitely doesn't include #98005. I would retry with a version that includes that fix, i.e. either build from master or wait until 1.20.5 is cut.

@bobbypage
Member

/cc @wzshiming

@yvespp
Contributor Author

yvespp commented Mar 13, 2021

It's with 1.20.4.
OK, then this is a duplicate of #98004.
Will test again with v1.20.5, which hopefully includes the fix.
