-
I will convert this issue to a GitHub discussion. Currently GitHub will automatically close and lock the issue even though your question will be transferred and responded to elsewhere. This is to let you know that we do not intend to ignore this; it is just how the current GitHub conversion mechanism makes it look to users :(
-
The Cluster Operator adds a pre-stop hook to your Pods, where it runs a command that waits until it is safe to shut the node down. This design document explains in more detail the problem we tried to solve. If you want to disable this feature, you can adjust the corresponding setting in the RabbitmqCluster spec.
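The exact setting got lost in the conversion above. As a hedged sketch only, one field that bounds how long the pre-stop hook may delay a pod's shutdown is `terminationGracePeriodSeconds` on the RabbitmqCluster resource; whether that is the setting originally referred to is an assumption, so please verify it against the CRD of your operator version:

```yaml
# Hedged sketch: cap how long a pod shutdown may be delayed by the pre-stop hook.
# terminationGracePeriodSeconds is assumed to be the relevant field; the name
# "my-rabbit" and the value 300 are placeholders.
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: my-rabbit
spec:
  replicas: 3
  terminationGracePeriodSeconds: 300   # lower than the operator's default
```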
-
If the above doesn't answer your question, can you please provide debug-level logs from the node as it shuts down? You can enable debug logging through the cluster's configuration.
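The link that originally followed was dropped in the conversion. As a hedged example of one common way to enable debug logging with the Cluster Operator, the console log level can be raised via `spec.rabbitmq.additionalConfig` (the cluster name below is a placeholder):

```yaml
# Hedged sketch: raise the console log level so the shutdown sequence shows up
# in `kubectl logs`. additionalConfig is appended to the generated rabbitmq.conf.
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: my-rabbit
spec:
  replicas: 3
  rabbitmq:
    additionalConfig: |
      log.console.level = debug
```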
-
Thank you, yes, we have an external aggregator for log files. I forced the restart by changing a config value in the YAML file and re-applying the file.
-
Hi. I had a look at this and while I can see there is a problem, it is not clear to me what causes it. On top of that, I can see that the logs are not complete: at the end of the boot process, RabbitMQ prints (at debug level) a message that is missing from the logs you provided.
-
I also ran another restart and collected the logs one more time.
-
Thanks, this log looks better. It seems like your cluster is pretty unhappy in general. There are some interesting lines in there (I removed some repetitive bits for clarity).
So it seems like the first intentional node shutdown was initiated at 17:55:25, but before that happened there were already reports of that node being down. Then the situation repeats: different nodes report other nodes being down. Because of that (I guess) you get Mnesia errors as well. I don't think this is directly related to quorum queues. For some reason your nodes cannot reliably communicate with each other, which I guess is also why we see these errors. My suggestion would be to start by investigating why the nodes cannot communicate reliably.
-
Describe the bug
We have a cluster of 3 nodes.
All our queues are quorum queues.
If we change any of the config parameters in the YAML file, node restarts may take between 10 and 40 minutes (an example of the kind of change we apply is sketched below).
It takes a very long time to terminate a pod.
The time taken to terminate a pod varies, and the long delays are not tied to any particular pod.
It may happen on any of the 3 nodes.
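To make this concrete, here is a hedged sketch of the kind of manifest change that triggers the slow rolling restart for us; the cluster name and the `channel_max` value are placeholders standing in for whatever parameter we actually touch:

```yaml
# Hedged sketch of the manifest we re-apply; changing any value under
# spec.rabbitmq.additionalConfig causes the operator to roll the pods one by one.
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: my-rabbit            # placeholder name
spec:
  replicas: 3                # the 3-node cluster described above
  rabbitmq:
    additionalConfig: |
      channel_max = 2047     # tweaking any parameter here triggers a rolling restart
```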
Expected behavior
That a restart of all nodes takes less than 3-4 minutes.
Version and environment information
This issue has been present since earlier operator versions (1.2.0 and above).