Pod stuck in Terminating state #409
Had a call with @gerhard and looked into this problem a bit more. We believe that the readiness probe is part of the problem. A bit more on the issue: the pod is stuck at terminating because podExec is broken on the pod.
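A quick way to confirm that exec into the pod is broken is to run a diagnostics command through kubectl by hand (a sketch; the pod name is a placeholder and the container name is assumed to be rabbitmq, as in a typical cluster-operator deployment):

# If podExec is broken, this hangs and eventually fails instead of returning promptly with a success message.
kubectl exec my-rabbit-server-0 -c rabbitmq -- rabbitmq-diagnostics ping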
A readiness probe determines when a pod is ready to serve traffic. It is meant to mitigate against taking into service pods that cannot handle requests. A TCP probe that checks if port 5672 is open sounds like a great RabbitMQ readiness probe to me.

As a related question, in your opinion @michaelklishin, what is a good check to run for the startup probe? To be more specific, how do we improve on this?

# We use a startup probe to determine when the RabbitMQ runtime, Erlang, has started.
# We check every 10 seconds that: "the node OS process is up (the Erlang VM), registered with EPMD and CLI tools can authenticate with it"
# We check for up to 5 minutes, or 30 times, as sometimes RabbitMQ could be performing CPU-intensive operations, and starting another Erlang VM to check on RabbitMQ may create too much CPU contention.
# While this is rare and extreme, it does happen, so we are being extra persistent.
#
# While the startup probe runs, both the liveness and the readiness probes are disabled.
# As a matter of fact, we don't use a liveness probe at all because this is more likely to take down an overloaded RabbitMQ node than solve an actual deadlock.
#
# Cloud environments where CPU time can be sliced (a.k.a. limited) in milliseconds are especially bad for runtimes that are natively multi-core.
# These have many optimisations that are based on the fact that multiple cores are available to them, and CPU limits (and even sharing) can have disastrous side-effects.
#
# We let Erlang handle the healthiness of the node, and only inform K8S when it has started via the ping command.
#
# https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#when-should-you-use-a-startup-probe
# https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-startup-probes
# https://blog.colinbreck.com/kubernetes-liveness-and-readiness-probes-how-to-avoid-shooting-yourself-in-the-foot/
# https://medium.com/swlh/fantastic-probes-and-how-to-configure-them-fef7e030bd2f
startupProbe:
  exec:
    command:
      - "rabbitmq-diagnostics"
      - "ping"
  failureThreshold: 30
  periodSeconds: 10
  timeoutSeconds: 9
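For the readiness side, a TCP check on the AMQP port as discussed above might look roughly like this (a sketch: periodSeconds matches what was later adopted in this thread, the other values are illustrative defaults):

readinessProbe:
  tcpSocket:
    port: 5672
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3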
A node startup involves a set of boot steps and then rejoining known cluster peers. It is trivial to produce health checks for the first part, but the second part both depends on other nodes being up (so, an easy chicken-and-egg deployment scenario candidate) and cannot be easily asserted, since knowing when all schema tables have been synced is not trivial.
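For the first part (the local boot steps), checks can be assembled from commands that ship with RabbitMQ (a sketch; which combination makes sense depends on the deployment):

# Is the runtime (Erlang VM) up, registered with epmd, and reachable by CLI tools?
rabbitmq-diagnostics ping
# Has the rabbit application finished its boot steps on this node?
rabbitmq-diagnostics check_running
# Are any local resource alarms (memory, disk) in effect?
rabbitmq-diagnostics check_local_alarms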
If we want to check a TCP port then we can use
A quick test suggests that when a node is syncing schema tables or waiting for a peer to come online,
- rabbitmq-diagnostics check_port_connectivity as readiness probe causes context deadline exceeded errors, and pods could be stuck at terminating for deletion
- reduce periodSeconds in readinessProbe to 10 seconds since a TCP probe should be less expensive than running a diagnostics command
- related issue: #409
Yes, that is a great point: it also tries to connect, not just checks if there is a listener.

A TCP probe does open a socket on the specified port. The AMQP port is the only one that cannot be disabled, as it belongs to the rabbit app and not a plugin. As far as I know, we can only define a single readiness probe per container.

The more interesting question is why running check_port_connectivity as the readiness probe times out in the first place. For the time being, I am glad that we have settled on the TCP probe, which seems to solve the immediate problem.
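One way to dig into that is to run the same checks by hand inside the container and compare how long they take (a sketch; pod and container names are placeholders):

# An exec probe is failed once its timeoutSeconds elapses, so a slow check here
# would be consistent with the context deadline exceeded errors reported above.
kubectl exec my-rabbit-server-0 -c rabbitmq -- rabbitmq-diagnostics check_port_connectivity
kubectl exec my-rabbit-server-0 -c rabbitmq -- rabbitmq-diagnostics ping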
This could be related to awslabs/amazon-eks-ami#563
@gerhard @michaelklishin Are we missing something here?
@gerhard @michaelklishin
If a node takes a long time to boot for genuine reasons - this sounds like one - then the readiness probes will fail as expected. I would suggest trying to understand why a node is slow to boot. A large number of objects such as vhosts, bindings and users would explain a slow boot. In that case, the system works as expected.

This is my last reply to this thread, I cannot help further.
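A rough way to gauge whether object counts explain a slow boot (a sketch; run against a node that is up, and treat the numbers as approximate):

rabbitmqctl -q list_vhosts | wc -l
rabbitmqctl -q list_users | wc -l
# Bindings are listed per vhost; repeat with -p <vhost> for each vhost.
rabbitmqctl -q list_bindings | wc -l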
@gerhard thank you for the response. @lukebakken @mkuratczyk example scenario. Can you please provide some input on this choice of readiness probe?
Our Cluster Formation and Monitoring guides already provide guidance. I don't have much to add besides: a socket that is bound to is not an indication of a system being ready to do everything it can, in particular in a cluster.
Having applied a change that requires all pods in the RabbitmqCluster to be updated, the last pod is stuck in Terminating state. I am using v0.47.0 on GKE 1.18, all waiting for you to take a closer look at. Reach out privately for access details.
The problem in 1 picture:
I suspect that the deadline exceeded for the readinessProbe has something to do with this:
Maybe related to #105
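For anyone else debugging the same symptom, a few commands that help narrow down why a pod stays in Terminating (a sketch; the pod name is a placeholder):

# Probe failures, container statuses and recent events for the stuck pod.
kubectl describe pod my-rabbit-server-2
# Check for finalizers or a long termination grace period holding deletion back.
kubectl get pod my-rabbit-server-2 -o yaml | grep -E -A 5 'finalizers|terminationGracePeriodSeconds'
# Cluster events, oldest first.
kubectl get events --sort-by=.metadata.creationTimestamp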