Pod stuck in Terminating state #409

Closed
gerhard opened this issue Oct 23, 2020 · 12 comments · Fixed by #410

gerhard commented Oct 23, 2020

Having applied a change that requires all pods in the RabbitmqCluster to be updated, the last pod is stuck in Terminating state.

I am using v0.47.0 on GKE 1.18. Everything has been left in this state, waiting for you to take a closer look; reach out privately for access details.

The problem in 1 picture:

[screenshot attached to the original issue]

I suspect that the deadline exceeded errors from the readinessProbe have something to do with this:

Events:
  Type     Reason     Age                    From                                                         Message
  ----     ------     ----                   ----                                                         -------
  Normal   Killing    30m                    kubelet, gke-messaging-streaming-default-pool-baea9f60-gz8t  Stopping container rabbitmq
  Warning  Unhealthy  2m11s (x27 over 164m)  kubelet, gke-messaging-streaming-default-pool-baea9f60-gz8t  Readiness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Maybe related to #105

gerhard added the bug label on Oct 23, 2020
ChunyiLyu commented Oct 26, 2020

Had a call with @gerhard and looked into this problem a bit more. We believe that the readiness probe rabbitmqctl check_port_connectivity is causing this issue. Gerhard has seen this problem both when using the operator and when deploying directly with yaml manifests (both used rabbitmqctl check_port_connectivity as the readiness probe). When he switched to a tcp probe (see), he could no longer reproduce the issue. We don't have a definitive explanation for why running rabbitmqctl causes this, but I think the reasonable next step is to use the tcp probe, as in here.

A bit more on the issue: the pod is stuck at terminating because podExec is broken on the pod. When using kubectl exec to run a command remotely, the command runs successfully, but the session never terminates. I saw this behavior with all commands (ls, cat, not just the rabbitmq CLI ones), and with all three pods in the cluster. In the preStop hook, we run rabbitmq-upgrade await_online_quorum_plus_one, rabbitmq-upgrade await_online_synchronized_mirror and rabbitmq-upgrade drain. I ran these three commands using kubectl exec: they completed successfully, but the podExec sessions did not terminate. I think the pod is stuck at terminating because the podExec calls from the preStop hook are stuck.
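
For reference, a minimal sketch of what such a preStop hook looks like in a pod spec. The field names are standard Kubernetes; the exact wrapper around these commands in the operator's generated manifest may differ:

    # Sketch only: the operator's actual manifest may wrap these commands differently
    lifecycle:
      preStop:
        exec:
          command:
            - /bin/bash
            - -c
            - |
              rabbitmq-upgrade await_online_quorum_plus_one &&
              rabbitmq-upgrade await_online_synchronized_mirror &&
              rabbitmq-upgrade drain

If any of these commands hangs, the hook never completes and the pod stays in Terminating at least until its termination grace period expires, which is consistent with the behavior described above.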

@michaelklishin

rabbitmq-diagnostics check_port_connectivity expects a fully booted node and is not suitable for readiness probes with sequentially deployed nodes.

gerhard commented Oct 26, 2020

A readiness probe determines when a pod is ready to serve traffic. It is meant to prevent pods that cannot handle requests from being taken into service. A TCP probe that checks whether port 5672 is open sounds like a great RabbitMQ readiness probe to me. It is the equivalent of nc -z PRIVATE_IP 5672, which is both faster and lighter to run than rabbitmqctl.
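
For concreteness, a minimal sketch of such a probe; the timing values here are illustrative, not necessarily the operator's defaults:

    readinessProbe:
      tcpSocket:
        port: 5672           # amqp; succeeds once the listener accepts TCP connections
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3

Unlike an exec probe, a tcpSocket probe is performed by the kubelet itself, so no extra process is started inside the container.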

As a related question, in your opinion @michaelklishin, what is a good check to run for the startup probe? To be more specific, how do we improve on this?

        # We use a startup probe to determine when the RabbitMQ runtime, Erlang, has started.
        # We check every 10 seconds that: "the node OS process is up (the Erlang VM), registered with EPMD and CLI tools can authenticate with it"
        # We check for up to 5 minutes, or 30 times, as sometimes RabbitMQ could be performing CPU-intensive operations, and starting another Erlang VM to check on RabbitMQ may create too much CPU contention.
        # While this is rare and extreme, it does happen, so we are being extra persistent.
        #
        # While the startup probe runs, both the liveness and the readiness probes are disabled.
        # As a matter of fact, we don't use a liveness probe at all because this is more likely to take down an overloaded RabbitMQ node than solve an actual deadlock.
        #
        # Cloud environments where CPU time can be sliced (a.k.a. limited) in milliseconds are especially bad for runtimes that are natively multi-core.
        # These have many optimisations that are based on the fact that multiple cores are available to them, and CPU limits (and even sharing) can have disastrous side-effects.
        #
        # We let Erlang handle the healthiness of the node, and only inform K8S when it has started via the ping command.
        #
        # https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#when-should-you-use-a-startup-probe
        # https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-startup-probes
        # https://blog.colinbreck.com/kubernetes-liveness-and-readiness-probes-how-to-avoid-shooting-yourself-in-the-foot/
        # https://medium.com/swlh/fantastic-probes-and-how-to-configure-them-fef7e030bd2f
        startupProbe:
          exec:
            command:
            - "rabbitmq-diagnostics"
            - "ping"
          failureThreshold: 30
          periodSeconds: 10
          timeoutSeconds: 9

michaelklishin commented Oct 27, 2020

rabbitmq-diagnostics ping is optimal for now.

A node startup involves a set of boot steps and then rejoining known cluster peers. It's trivial to produce health checks for the first part but the second part both depends on other nodes to start (so, an easy chicken-and-egg deployment scenario candidate) and cannot be easily asserted, since knowing when we have synced all schema tables is not really trivial.

rabbitmq-diagnostics check_port_connectivity requires RabbitMQ on the node to be running. This won't be the case when schema table syncing is in progress IIRC. The only reason why it expects the node to be running is to discover the active listeners on the node, and then try to connect to their ports. So this is not really an nc equivalent.

If we want to check a TCP port then we can use nc. This can be perfectly sufficient for certain Kubernetes probes. Alternatively, we can introduce an --offline equivalent to said check that would not try to discover any listeners but simply try to connect using default ports. It's a good question what protocols it should cover since we cannot know if, say, MQTT is enabled at all.
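
For illustration, an exec probe along those lines, assuming nc is available in the container image (timings illustrative):

    readinessProbe:
      exec:
        # Exit 0 if something is listening on the default AMQP port;
        # unlike check_port_connectivity, this does not ask the node
        # to discover its active listeners first
        command: ["nc", "-z", "localhost", "5672"]
      periodSeconds: 10
      timeoutSeconds: 5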

@michaelklishin

A quick test suggests that when a node is syncing schema tables or waiting for a peer to come online, rabbit_networking:active_listeners/0 will return an empty list, so we won't be able to discover any ports to try to connect to.

ChunyiLyu added a commit that referenced this issue Oct 27, 2020
- rabbitmq-diagnostics check_port_connectivity as the readiness
probe causes context deadline exceeded errors, and pods
can be stuck at terminating on deletion
- reduce PeriodSeconds in readinessProbe to 10 seconds since
a tcp probe should be less expensive than running a diagnostics
command
- related issue: #409
gerhard commented Oct 27, 2020

rabbitmq-diagnostics check_port_connectivity requires RabbitMQ on the node to be running. The only reason why it expects the node to be running is to discover the active listeners on the node, and then try to connect to their ports. So this is not really an nc equivalent.

Yes, that is a great point: it also tries to connect, not just check whether there is a listener, as nc -z does.

A TCP probe does open a socket on the specified port. The AMQP port is the only one that cannot be disabled, as it belongs to the rabbit app and not to a plugin. As far as I know, we can only define a single readinessProbe, so using tcpSocket on port 5672 sounds most sensible to me.

The more interesting question is why running rabbitmqctl commands in probes results in the error captured above - Readiness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded - and why this prevents exec-ing any commands against any of the pods, as @ChunyiLyu discovered. I suspect that starting a full Erlang VM every n seconds to run these commands is not helping the situation, but I would expect K8S to handle this well, and not end up with all pods in this weird state that I do not fully understand.

For the time being, I am glad that we have settled on the TCP probe which seems to solve the immediate problem.

@karthimohan

This could be related to awslabs/amazon-eks-ami#563

@kaushiksrinivas

@gerhard @michaelklishin
We tried setting a readiness probe with tcpSocket on port 5672, and the probe fails while the RabbitMQ server is still syncing Mnesia with the other nodes during startup.

Are we missing something here?

@kaushiksrinivas

@gerhard @michaelklishin
Can you share your input on the tcpSocket test on the AMQP port for the readiness probe?

gerhard commented Jan 24, 2022

If a node takes a long time to boot for genuine reasons - and this sounds like one - then the readiness probes will fail as expected. I would suggest trying to understand why the node is slow to boot; a large number of objects such as vhosts, bindings, and users would explain a slow boot. In that case, the system works as expected, and something like startupProbes may help: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-startup-probes. The difficulty lies in picking a startup time that is normal in your case - this is highly contextual (network, disks, etc.) - and a startup time of a few hours may be normal and OK for you.
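
As an illustration, failureThreshold times periodSeconds bounds the allowed startup time, so a startup probe that tolerates up to an hour of booting could look like this (the values are examples to tune for your environment):

    startupProbe:
      exec:
        command: ["rabbitmq-diagnostics", "ping"]
      periodSeconds: 10
      timeoutSeconds: 9
      failureThreshold: 360   # 360 checks x 10s = up to 1 hour to boot before the container is restarted

While the startup probe has not yet succeeded, the readiness and liveness probes are not run, so a slow-booting node is not killed prematurely.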

This is my last reply in this thread; I cannot help further.

cc @lukebakken @mkuratczyk

@kaushiksrinivas

@gerhard thank you for the response.

@lukebakken @mkuratczyk
In a k8s environment with the operator, if for some reason a RabbitMQ node fails to find its peers and is therefore continuously retrying to sync its Mnesia tables, the readiness probes keep failing. This can block the k8s StatefulSet from deploying the remaining replicas, leading to a deadlock.

Example scenario:
Deploy a RabbitMQ StatefulSet via the operator with replicas = 2.
node-1 crashes for some reason, and then node-0 crashes before the StatefulSet brings node-1 back up.
The StatefulSet now tries to boot node-0 first, in sequence. node-0 waits for node-1 to come online to sync data, and since node-1 cannot come up until node-0 is up and ready, the Mnesia sync on node-0 never completes.
This leaves the cluster unusable, and the deployment never succeeds.

Can you please provide some input on this choice of readiness probe?

@michaelklishin

Our Cluster Formation and Monitoring guides already provide guidance. I don't have much to add besides this: a bound socket is not an indication of a system being ready to do everything it can, in particular in a cluster.
