Pod stuck in Terminating state #409

Closed
gerhard opened this issue Oct 23, 2020 · 12 comments · Fixed by #410

gerhard commented Oct 23, 2020

Having applied a change that requires all pods in the RabbitmqCluster to be updated, the last pod is stuck in Terminating state.

I am using v0.47.0 on GKE 1.18. Everything has been left in this state, waiting for you to take a closer look; reach out privately for access details.

The problem in 1 picture:

[screenshot attached to the original issue]

I suspect that the deadline exceeded errors from the readinessProbe have something to do with this:

Events:
  Type     Reason     Age                    From                                                         Message
  ----     ------     ----                   ----                                                         -------
  Normal   Killing    30m                    kubelet, gke-messaging-streaming-default-pool-baea9f60-gz8t  Stopping container rabbitmq
  Warning  Unhealthy  2m11s (x27 over 164m)  kubelet, gke-messaging-streaming-default-pool-baea9f60-gz8t  Readiness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded

Maybe related to #105

gerhard added the bug label on Oct 23, 2020
ChunyiLyu commented Oct 26, 2020

Had a call with @gerhard and looked into this problem a bit more. We believe that the readiness probe rabbitmqctl check_port_connectivity is causing this issue. Gerhard has seen this problem both when using the operator and when deploying directly with yaml manifests (both used rabbitmqctl check_port_connectivity as the readiness probe). When he switched to a tcp probe (see), he could no longer reproduce the issue. We don't have a definitive explanation for why running rabbitmqctl causes this, but I think the reasonable next step is to use the tcp probe, as in here.

A bit more on the issue: the pod is stuck at terminating because podExec is broken on the pod. When using kubectl exec to run a command remotely, the command runs successfully, but the session never terminates. I saw this behavior with all commands (ls, cat, not just the rabbitmq CLI ones), and with all three pods in the cluster. In the preStop hook, we run rabbitmq-upgrade await_online_quorum_plus_one, rabbitmq-upgrade await_online_synchronized_mirror and rabbitmq-upgrade drain. I ran these three commands using kubectl exec: they completed successfully, but the podExec sessions did not terminate. I think the pod is stuck at terminating because the podExec calls from the preStop hook are stuck.
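
For reference, a minimal sketch of what such a preStop hook looks like in a pod spec. The field names are standard Kubernetes; the exact wrapper around these commands in the operator's generated manifest may differ:

    # Sketch only: the operator's actual manifest may wrap these commands differently
    lifecycle:
      preStop:
        exec:
          command:
            - /bin/bash
            - -c
            - |
              rabbitmq-upgrade await_online_quorum_plus_one &&
              rabbitmq-upgrade await_online_synchronized_mirror &&
              rabbitmq-upgrade drain

If any of these commands hangs, the hook never completes and the pod stays in Terminating at least until its termination grace period expires, which is consistent with the behavior described above.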

@michaelklishin

rabbitmq-diagnostics check_port_connectivity expects a fully booted node and is not suitable for readiness probes with sequentially deployed nodes.

gerhard commented Oct 26, 2020

A readiness probe determines when a pod is ready to serve traffic. It is meant to prevent pods that cannot handle requests from being taken into service. A TCP probe that checks whether port 5672 is open sounds like a great RabbitMQ readiness probe to me. It is the equivalent of nc -z PRIVATE_IP 5672, which is both faster and lighter to run than rabbitmqctl.
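
For concreteness, a minimal sketch of such a probe; the timing values here are illustrative, not necessarily the operator's defaults:

    readinessProbe:
      tcpSocket:
        port: 5672           # amqp; succeeds once the listener accepts TCP connections
      initialDelaySeconds: 10
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3

Unlike an exec probe, a tcpSocket probe is performed by the kubelet itself, so no extra process is started inside the container.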

As a related question, in your opinion @michaelklishin, what is a good check to run for the startup probe? To be more specific, how do we improve on this?

        # We use a startup probe to determine when the RabbitMQ runtime, Erlang, has started.
        # We check every 10 seconds that: "the node OS process is up (the Erlang VM), registered with EPMD and CLI tools can authenticate with it"
        # We check for up to 5 minutes, or 30 times, as sometimes RabbitMQ could be performing CPU-intensive operations, and starting another Erlang VM to check on RabbitMQ may create too much CPU contention.
        # While this is rare and extreme, it does happen, so we are being extra persistent.
        #
        # While the startup probe runs, both the liveness and the readiness probes are disabled.
        # As a matter of fact, we don't use a liveness probe at all because this is more likely to take down an overloaded RabbitMQ node than solve an actual deadlock.
        #
        # Cloud environments where CPU time can be sliced (a.k.a. limited) in milliseconds are especially bad for runtimes that are natively multi-core.
        # These have many optimisations that are based on the fact that multiple cores are available to them, and CPU limits (and even sharing) can have disastrous side-effects.
        #
        # We let Erlang handle the healthiness of the node, and only inform K8S when it has started via the ping command.
        #
        # https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#when-should-you-use-a-startup-probe
        # https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-startup-probes
        # https://blog.colinbreck.com/kubernetes-liveness-and-readiness-probes-how-to-avoid-shooting-yourself-in-the-foot/
        # https://medium.com/swlh/fantastic-probes-and-how-to-configure-them-fef7e030bd2f
        startupProbe:
          exec:
            command:
            - "rabbitmq-diagnostics"
            - "ping"
          failureThreshold: 30
          periodSeconds: 10
          timeoutSeconds: 9

michaelklishin commented Oct 27, 2020

rabbitmq-diagnostics ping is optimal for now.

A node startup involves a set of boot steps and then rejoining known cluster peers. It's trivial to produce health checks for the first part but the second part both depends on other nodes to start (so, an easy chicken-and-egg deployment scenario candidate) and cannot be easily asserted, since knowing when we have synced all schema tables is not really trivial.

rabbitmq-diagnostics check_port_connectivity requires RabbitMQ on the node to be running. This won't be the case when schema table syncing is in progress IIRC. The only reason why it expects the node to be running is to discover the active listeners on the node, and then try to connect to their ports. So this is not really an nc equivalent.

If we want to check a TCP port then we can use nc. This can be perfectly sufficient for certain Kubernetes probes. Alternatively, we can introduce an --offline equivalent to said check that would not try to discover any listeners but simply try to connect using default ports. It's a good question what protocols it should cover since we cannot know if, say, MQTT is enabled at all.
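
For illustration, an exec probe along those lines, assuming nc is available in the container image (timings illustrative):

    readinessProbe:
      exec:
        # Exit 0 if something is listening on the default AMQP port;
        # unlike check_port_connectivity, this does not ask the node
        # to discover its active listeners first
        command: ["nc", "-z", "localhost", "5672"]
      periodSeconds: 10
      timeoutSeconds: 5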

@michaelklishin

A quick test suggests that when a node is syncing schema tables or waiting for a peer to come online, rabbit_networking:active_listeners/0 will return an empty list, so we won't be able to discover any ports to try to connect to.

ChunyiLyu added a commit that referenced this issue Oct 27, 2020
- rabbitmq-diagnostics check_port_connectivity as the readiness
probe causes context deadline exceeded errors, and pods
can be stuck at terminating on deletion
- reduce PeriodSeconds in readinessProbe to 10 seconds since
a tcp probe should be less expensive than running a diagnostics
command
- related issue: #409
gerhard commented Oct 27, 2020

rabbitmq-diagnostics check_port_connectivity requires RabbitMQ on the node to be running. The only reason why it expects the node to be running is to discover the active listeners on the node, and then try to connect to their ports. So this is not really an nc equivalent.

Yes, that is a great point: it also tries to connect, not just check whether there is a listener, as nc -z does.

A TCP probe does open a socket on the specified port. The AMQP port is the only one that cannot be disabled, as it belongs to the rabbit app and not to a plugin. As far as I know, we can only define a single readinessProbe, so using tcpSocket on port 5672 sounds most sensible to me.

The more interesting question is why running rabbitmqctl commands in probes results in the error captured above - Readiness probe errored: rpc error: code = DeadlineExceeded desc = context deadline exceeded - and why this prevents exec-ing any commands against any of the pods, as @ChunyiLyu discovered. I suspect that starting a full Erlang VM every n seconds to run these commands is not helping the situation, but I would expect K8S to handle this well, and not end up with all pods in this weird state that I do not fully understand.

For the time being, I am glad that we have settled on the TCP probe which seems to solve the immediate problem.

@karthimohan

This could be related to awslabs/amazon-eks-ami#563

@kaushiksrinivas

@gerhard @michaelklishin
We tried setting a readiness probe with tcpSocket on port 5672, and the probe fails while the RabbitMQ server is still syncing Mnesia with the other nodes during startup.

Are we missing something here?

@kaushiksrinivas

@gerhard @michaelklishin
Can you share your input on the tcpSocket test on the AMQP port for the readiness probe?

gerhard commented Jan 24, 2022

If a node takes a long time to boot for genuine reasons - and this sounds like one - then the readiness probes will fail as expected. I would suggest trying to understand why the node is slow to boot; a large number of objects such as vhosts, bindings, and users would explain a slow boot. In that case, the system works as expected, and something like startupProbes may help: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-startup-probes. The difficulty lies in picking a startup time that is normal in your case - this is highly contextual (network, disks, etc.) - and a startup time of a few hours may be normal and OK for you.
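
As an illustration, failureThreshold times periodSeconds bounds the allowed startup time, so a startup probe that tolerates up to an hour of booting could look like this (the values are examples to tune for your environment):

    startupProbe:
      exec:
        command: ["rabbitmq-diagnostics", "ping"]
      periodSeconds: 10
      timeoutSeconds: 9
      failureThreshold: 360   # 360 checks x 10s = up to 1 hour to boot before the container is restarted

While the startup probe has not yet succeeded, the readiness and liveness probes are not run, so a slow-booting node is not killed prematurely.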

This is my last reply in this thread; I cannot help further.

cc @lukebakken @mkuratczyk

@kaushiksrinivas

@gerhard thank you for the response.

@lukebakken @mkuratczyk
In a k8s environment with the operator, if for some reason a RabbitMQ node fails to find its peers and is therefore continuously retrying to sync its Mnesia tables, the readiness probes keep failing. This can block the k8s StatefulSet from deploying the remaining replicas, leading to a deadlock.

Example scenario:
Deploy a RabbitMQ StatefulSet via the operator with replicas = 2.
node-1 crashes for some reason, and then node-0 crashes before the StatefulSet brings node-1 back up.
The StatefulSet now tries to boot node-0 first, in sequence. node-0 waits for node-1 to come online to sync data, and since node-1 cannot come up until node-0 is up and ready, the Mnesia sync on node-0 never completes.
This leaves the cluster unusable, and the deployment never succeeds.

Can you please provide some input on this choice of readiness probe?

@michaelklishin

Our Cluster Formation and Monitoring guides already provide guidance. I don't have much to add besides this: a bound socket is not an indication of a system being ready to do everything it can, in particular in a cluster.
