Node restarts take a very long time #813

bdaoudtdc · 2021-08-19T01:05:02Z

bdaoudtdc
Aug 19, 2021

Describe the bug

We have a cluster of 3 nodes.
All our queues are quorum queues.
If we change any of the config parameters in the YAML file , node restarts may take between 10-40 mins
It takes a very long time to terminate a pod.
The time taken for terminating a pod is random and the long time is not happening on any particular pod.
It may happen on any of the 3 nodes.

Expected behavior
That restart takes less than 3-4 mins of all nodes

Version and environment information

RabbitMQ: [ 3.8.12]
RabbitMQ Cluster Operator: [ 1.7.0 ]
Kubernetes: [e.g. 1.17.13]
This issue has been there since earlier operator versions like 1.2.0 and above.

Zerpet · 2021-08-20T16:05:05Z

Zerpet
Aug 20, 2021
Maintainer

I will convert this issue to a GitHub discussion. Currently GitHub will automatically close and lock the issue even though your question will be transferred and responded to elsewhere. This is to let you know that we do not intend to ignore this but this is how the current GitHub conversion mechanism makes it seem for the users :(

0 replies

Zerpet · 2021-08-20T16:18:20Z

Zerpet
Aug 20, 2021
Maintainer

The Cluster Operator adds a pre-stop hook to your Pods, where it runs rabbitmq-upgrade await_online_quorum_plus_one and its equivalent for classic queues. This command will not return until the queues in the node have sufficient up to date followers to take over the queue leader. In some cases, this can take some time to happen and it greatly depends on the amount of data in-flight and in-disk in RabbitMQ. We do this to avoid losing queue availability upon a Pod restart, which we believe it's more important than fast rolling updates.

This design document explains in more detail the problem we tried to solve. If you want to disable this feature, you can set .spec.terminationGracePeriodSeconds (docs here), which we set to 1 week, i.e. indefinitely, by default, to a lower value like 30. Be aware, however, that some queues may become unavailable, and even some messages may be lost, and it's almost impossible to determine if/what messages could be lost.

0 replies

mkuratczyk · 2021-08-23T11:00:15Z

mkuratczyk
Aug 23, 2021
Maintainer

If the above doesn't answer your question, can you please provide debug-level logs from the node as it shuts down? You can use kubectl rabbitmq debug NAME to enable debug logging at runtime. Thanks

3 replies

bdaoudtdc Aug 23, 2021
Author

Thanks, I'll gather the logs.
This issue happens even if there are no messages being processed.

bdaoudtdc Aug 30, 2021
Author

Two questions ..
How can I turn the debug off..
Also how can I collect the logs for you .. I know I can use kubectl rabbitmq -n rabbitmq-system tail NAME to tail the logs ..but not sure how I can collect it for you.
Thanks,

mkuratczyk Aug 30, 2021
Maintainer

You can use rabbitmqctl set_log_level info to set the log level to info (there is no kubectl-rabbitmq command for this right now so you'd need to run it in all RabbitMQ containers). Also, if you know that the slow part is node shutdown, you can skip that part - once the node restarts it will reset the log level to what is configured.
You can just tail the logs to a file and share that file (kubectl rabbitmq tail FOO > debug.log)- that's ok for us. However, for your own purposes, I hope you have an external log aggregation system (from which you can export these logs as well) - otherwise you'll be in trouble as Kubernetes don't provide native retention for the logs.

bdaoudtdc · 2021-08-31T02:53:03Z

bdaoudtdc
Aug 31, 2021
Author

Thank you , yes we have an external aggregator for log files.
Attached is the log that resulted from a restart of a 3 node cluster.
I believe it took about 35 mins for the restart of the 3 nodes. There was no load of any sort on this cluster , I can say there was even no messages being processed.

I forced the restart by a changing a config in the yaml file & re-applying the file.
kubectl -n rabbitmq-system apply -f rmq-qastk1-cluster.yaml --validate
debug.log.gz

0 replies

mkuratczyk · 2021-09-03T09:07:51Z

mkuratczyk
Sep 3, 2021
Maintainer

Hi. I had a look at this and while I can see there is a problem, it is not clear to me what causes it. On top of that, I can see that the logs are not complete - at the end of the boot process, RabbitMQ prints (at debug level) a message Time to start RabbitMQ: ... - I don't see in your log for any of the nodes. Can you please share the complete log, from the moment you start shutting down the first node, to the moment the last one starts? I hope this will help. If you can also repeat the test with the latest RabbitMQ version (even if just for testing - without upgrading your actual environment), that would help as well. Thank you,

9 replies

kjnilsson Sep 16, 2021
Maintainer

Those logs are definitely incomplete.

mkuratczyk Sep 16, 2021
Maintainer

I've just deployed a cluster of 3 nodes, using rabbitmq:3.8.21 image, declared 1000 quorum queues and performed kubectl rollout restart statefulset rmq-server. Within 3 minutes, all nodes were up and running. This is not to say there is no issue, but the issue is not "if you have a bunch of quorum queues, the restart takes 30 minutes". There is something else going on in your environment and without being able to reproduce this issue, it's really hard for us to help. It could be the connections that you have open, even if they don't produce/consume. It could be the state on disk (you can check the size of the folders where quorum queues are stored - there are some unusual situation where you can have quite a lot of data in those logs, even though there are no messages ready in the queues, etc).

bdaoudtdc Sep 16, 2021
Author

Thank you for looking into the issue and trying to duplicate it .. can you guide me what to check in regards to the disk state.

mkuratczyk Sep 16, 2021
Maintainer

/var/lib/rabbitmq/mnesia/rabbit@/quorum// - do you have any folders with many files or many (mega/giga)bytes?

bdaoudtdc Sep 16, 2021
Author

I see this
375M 00000049.wal
200K 2F_DCT3RF3L2BY0UZE
156K 2F_DCTAMO2LQMEQ10V
156K 2F_DCTLI2BXWSKIRBC
140K 2F_DCTOHEP4EZFA37M
16K 2F_DCTXMES88DI1HPH
12K 2F_DEAIAM4W8H0TDQZ
1.8M 2F_PVH2Q7UZMQK4A8B
624K 2F_PVH32B83HN4UI13
2.2M 2F_PVH38S50LZLSWSV
4.0M 2F_PVH5X60BPZ5YTEQ
688K 2F_PVH77P1ILKK4FX4
676K 2F_PVH82OTKTR65YZP
744K 2F_PVH98RXFDZ3GUOX
692K 2F_PVHDDVUXV3QAAX9
728K 2F_PVHG5YY8ZSXHH93
740K 2F_PVHGNGRJCF0H3FP
1.8M 2F_PVHGWKT7FEUWYMJ
692K 2F_PVHI7ETBICYMYM5
580K 2F_PVHKMRFU9AWXGI9
640K 2F_PVHMREC3EN2A5VJ
716K 2F_PVHOZWR6BIJTPX7
4.0M 2F_PVHQGRDHTHQZ3PV
1.9M 2F_PVHY3IND0MVE8S7
1.9M 2F_PVHZ7KI4XBJTTC3
12K 2F_TESP42GOOAUKCX8
16K meta.dets
12K names.dets

bdaoudtdc · 2021-09-16T18:57:10Z

bdaoudtdc
Sep 16, 2021
Author

I also ran another restart & collected the logs one more time ..
debug.log.gz
.

0 replies

mkuratczyk · 2021-09-17T12:31:58Z

mkuratczyk
Sep 17, 2021
Maintainer

Thanks, this log looks better. It seems like you cluster is pretty unhappy in general. Here are some interesting lines (I removed some repetitive bits for clarity):

rmq-uat-cluster-server-1: 17:55:17.222 [error] <0.301.0> ** Node 'rabbit@rmq-uat-cluster-server-2.rmq-uat-cluster-nodes.rabbitmq-system' not responding **
rmq-uat-cluster-server-1: 17:55:17.222 [info] <0.744.0> rabbit on node 'rabbit@rmq-uat-cluster-server-2.rmq-uat-cluster-nodes.rabbitmq-system' down
rmq-uat-cluster-server-0: 17:55:22.547 [error] <0.279.0> ** Node 'rabbit@rmq-uat-cluster-server-2.rmq-uat-cluster-nodes.rabbitmq-system' not responding **
rmq-uat-cluster-server-0: 17:55:22.547 [info] <0.706.0> rabbit on node 'rabbit@rmq-uat-cluster-server-2.rmq-uat-cluster-nodes.rabbitmq-system' down
rmq-uat-cluster-server-2: 17:55:25.543 [error] <0.19130.0> ** Node 'rabbit@rmq-uat-cluster-server-1.rmq-uat-cluster-nodes.rabbitmq-system' not responding **
rmq-uat-cluster-server-2: 17:55:25.543 [info] <0.766.0> rabbit on node 'rabbit@rmq-uat-cluster-server-1.rmq-uat-cluster-nodes.rabbitmq-system' down
rmq-uat-cluster-server-2: 17:55:25.544 [error] <0.5310.1> ** Node 'rabbit@rmq-uat-cluster-server-0.rmq-uat-cluster-nodes.rabbitmq-system' not responding **
rmq-uat-cluster-server-2: 17:55:25.550 [info] <0.766.0> rabbit on node 'rabbit@rmq-uat-cluster-server-0.rmq-uat-cluster-nodes.rabbitmq-system' down
rmq-uat-cluster-server-2: 17:55:25.555 [info] <0.25586.484> RabbitMQ is asked to stop...
rmq-uat-cluster-server-2: 17:57:11.722 [error] <0.575.0> Mnesia('rabbit@rmq-uat-cluster-server-2.rmq-uat-cluster-nodes.rabbitmq-system'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@rmq-uat-cluster-server-0.rmq-uat-cluster-nodes.rabbitmq-system'}
rmq-uat-cluster-server-2: 17:57:11.722 [error] <0.575.0> Mnesia('rabbit@rmq-uat-cluster-server-2.rmq-uat-cluster-nodes.rabbitmq-system'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@rmq-uat-cluster-server-1.rmq-uat-cluster-nodes.rabbitmq-system'}
rmq-uat-cluster-server-2: 17:58:15.704 [error] <0.293.0> ** Node 'rabbit@rmq-uat-cluster-server-1.rmq-uat-cluster-nodes.rabbitmq-system' not responding **
rmq-uat-cluster-server-2: 17:58:15.704 [info] <0.807.0> rabbit on node 'rabbit@rmq-uat-cluster-server-1.rmq-uat-cluster-nodes.rabbitmq-system' down
rmq-uat-cluster-server-1: 17:58:17.234 [info] <0.744.0> rabbit on node 'rabbit@rmq-uat-cluster-server-0.rmq-uat-cluster-nodes.rabbitmq-system' down
rmq-uat-cluster-server-1: 17:58:17.234 [error] <0.24346.0> ** Node 'rabbit@rmq-uat-cluster-server-0.rmq-uat-cluster-nodes.rabbitmq-system' not responding **
rmq-uat-cluster-server-1: 17:58:17.236 [error] <0.28565.451> ** Node 'rabbit@rmq-uat-cluster-server-2.rmq-uat-cluster-nodes.rabbitmq-system' not responding **
rmq-uat-cluster-server-1: 17:58:17.243 [info] <0.744.0> rabbit on node 'rabbit@rmq-uat-cluster-server-2.rmq-uat-cluster-nodes.rabbitmq-system' down
rmq-uat-cluster-server-1: 17:58:17.244 [info] <0.29612.451> RabbitMQ is asked to stop...
rmq-uat-cluster-server-0: 17:58:22.559 [error] <0.295.0> ** Node 'rabbit@rmq-uat-cluster-server-1.rmq-uat-cluster-nodes.rabbitmq-system' not responding **
rmq-uat-cluster-server-0: 17:58:22.559 [info] <0.706.0> rabbit on node 'rabbit@rmq-uat-cluster-server-1.rmq-uat-cluster-nodes.rabbitmq-system' down
rmq-uat-cluster-server-1: 18:13:41.507 [error] <0.480.0> Mnesia('rabbit@rmq-uat-cluster-server-1.rmq-uat-cluster-nodes.rabbitmq-system'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@rmq-uat-cluster-server-0.rmq-uat-cluster-nodes.rabbitmq-system'}
rmq-uat-cluster-server-1: 18:13:41.507 [error] <0.480.0> Mnesia('rabbit@rmq-uat-cluster-server-1.rmq-uat-cluster-nodes.rabbitmq-system'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@rmq-uat-cluster-server-2.rmq-uat-cluster-nodes.rabbitmq-system'}
rmq-uat-cluster-server-1: 18:14:51.029 [error] <0.260.0> ** Node 'rabbit@rmq-uat-cluster-server-0.rmq-uat-cluster-nodes.rabbitmq-system' not responding **
rmq-uat-cluster-server-1: 18:14:51.029 [info] <0.718.0> rabbit on node 'rabbit@rmq-uat-cluster-server-0.rmq-uat-cluster-nodes.rabbitmq-system' down
rmq-uat-cluster-server-0: 18:14:52.626 [error] <0.29635.483> ** Node 'rabbit@rmq-uat-cluster-server-2.rmq-uat-cluster-nodes.rabbitmq-system' not responding **
rmq-uat-cluster-server-0: 18:14:52.628 [error] <0.18146.484> ** Node 'rabbit@rmq-uat-cluster-server-1.rmq-uat-cluster-nodes.rabbitmq-system' not responding **
rmq-uat-cluster-server-0: 18:14:52.627 [info] <0.706.0> rabbit on node 'rabbit@rmq-uat-cluster-server-2.rmq-uat-cluster-nodes.rabbitmq-system' down
rmq-uat-cluster-server-0: 18:14:52.635 [info] <0.706.0> rabbit on node 'rabbit@rmq-uat-cluster-server-1.rmq-uat-cluster-nodes.rabbitmq-system' down
rmq-uat-cluster-server-0: 18:14:52.636 [info] <0.19198.484> RabbitMQ is asked to stop...
rmq-uat-cluster-server-0: 18:14:52.942 [error] <0.27283.145> Error on AMQP connection <0.27283.145> (192.168.223.128:22734 -> 192.168.38.1:5671, vhost: '/', user: 'rabbit_uat_dct', state: running), channel 0:
rmq-uat-cluster-server-2: 18:15:00.781 [info] <0.807.0> rabbit on node 'rabbit@rmq-uat-cluster-server-0.rmq-uat-cluster-nodes.rabbitmq-system' down
rmq-uat-cluster-server-2: 18:15:00.781 [error] <0.288.0> ** Node 'rabbit@rmq-uat-cluster-server-0.rmq-uat-cluster-nodes.rabbitmq-system' not responding **
rmq-uat-cluster-server-0: 18:30:06.819 [error] <0.539.0> Mnesia('rabbit@rmq-uat-cluster-server-0.rmq-uat-cluster-nodes.rabbitmq-system'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@rmq-uat-cluster-server-1.rmq-uat-cluster-nodes.rabbitmq-system'}
rmq-uat-cluster-server-0: 18:30:06.819 [error] <0.539.0> Mnesia('rabbit@rmq-uat-cluster-server-0.rmq-uat-cluster-nodes.rabbitmq-system'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@rmq-uat-cluster-server-2.rmq-uat-cluster-nodes.rabbitmq-system'}

So it seems like the first intentional node shutdown was initiated at 17:55:25 but before that happens, there are already reports of that node being down. Then the situation repeats - different nodes report other nodes being down. Because of that (I guess) you get mnesia errors as well.

I don't think this is directly related to quorum queues. For some reason your nodes cannot reliably communicate with each other. That's I guess why we see net_tick_timeout and election timeout logs as well and why everything takes a long time.

My suggestions would be:

observe if you see errors like the above when you don't perform any restarts
check your networking
monitor load on the servers - your YAML doesn't contain any CPU/RAM specification so your cluster is running with the defaults. It could be that CPUs are overloaded and that's why nodes become unresponsive

9 replies

bdaoudtdc Oct 1, 2021
Author

Thank you , I edited the stateful set and .. I see the 2 new files prestop.start & prestop.stop are created , but prestop.start is empty ..

on rmq-uat-cluster-server-2
the timestamp of prestop.start from the OS is 23:40

cat prestop.stop

23:42:42.224727645

on rmq-uat-cluster-server-1
the timestamp of prestop.start from the OS is 23:43

cat prestop.stop

23:45:01.358850145

on rmq-uat-cluster-server-0
the timestamp of prestop.start from the OS is 23:57

cat prestop.stop

23:59:09.422568273

Here is an idea about the age difference
NAME READY STATUS RESTARTS AGE
rmq-uat-cluster-server-0 1/1 Running 0 13m
rmq-uat-cluster-server-1 1/1 Running 0 30m
rmq-uat-cluster-server-2 1/1 Running 0 44m
debug.log.gz

Gsantomaggio Oct 1, 2021
Maintainer

Thank you @bdaoudtdc we can't reproduce the issue internally yet.

We are almost sure that the problem is on rabbitmq-upgrade await_online_quorum_plus_one ... etc

Could you edit the preStop hook and removing the following code:

- if [ ! -z "$(cat /etc/pod-info/skipPreStopChecks)" ]; then exit 0;
fi; rabbitmq-upgrade await_online_quorum_plus_one -t 604800; rabbitmq-upgrade
await_online_synchronized_mirror -t 604800; rabbitmq-upgrade drain
-t 604800

NOTE: this change makes the update less safe, so it would be better to execute in a test environment

bdaoudtdc Oct 7, 2021
Author

Sorry for the late update ..
I removed the part you mentioned .. the prestop looks like this now

lifecycle:
preStop:
exec:
command:
- /bin/bash
- -c

After editing we still see same behavior ... see difference in pods startup
rmq-uat-cluster-server-0 1/1 Running 0 6m24s
rmq-uat-cluster-server-1 1/1 Running 0 22m
rmq-uat-cluster-server-2 1/1 Running 0 39m

The restart started at Thu Oct 7 17:53:4
debug.log.gz

mkuratczyk Oct 8, 2021
Maintainer

Did you restart it twice? Remember that when you edit the statefulset definition, pods still have the previous definition. Therefore, you need to change preStop hook, restart the pods to apply this change and only the next restart will be actually executed by pods with the customised preStop.

bdaoudtdc Oct 8, 2021
Author

So when I edited the deployment the statefulset restarted and the results were what I shared yesterday.

I ran 2 restarts today .. and the gap is still there

rmq-uat-cluster-server-0 1/1 Running 0 77m
rmq-uat-cluster-server-1 1/1 Running 1 103m
rmq-uat-cluster-server-2 1/1 Running 0 104m

rmq-uat-cluster-server-0 1/1 Running 0 7m36s
rmq-uat-cluster-server-1 1/1 Running 0 23m
rmq-uat-cluster-server-2 1/1 Running 0 39m

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Node restarts take a very long time #813

{{title}}

Replies: 7 comments 21 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Node restarts take a very long time #813

bdaoudtdc Aug 19, 2021

Describe the bug

Version and environment information

Replies: 7 comments · 21 replies

Zerpet Aug 20, 2021 Maintainer

Zerpet Aug 20, 2021 Maintainer

mkuratczyk Aug 23, 2021 Maintainer

bdaoudtdc Aug 23, 2021 Author

bdaoudtdc Aug 30, 2021 Author

mkuratczyk Aug 30, 2021 Maintainer

bdaoudtdc Aug 31, 2021 Author

mkuratczyk Sep 3, 2021 Maintainer

kjnilsson Sep 16, 2021 Maintainer

mkuratczyk Sep 16, 2021 Maintainer

bdaoudtdc Sep 16, 2021 Author

mkuratczyk Sep 16, 2021 Maintainer

bdaoudtdc Sep 16, 2021 Author

bdaoudtdc Sep 16, 2021 Author

mkuratczyk Sep 17, 2021 Maintainer

bdaoudtdc Oct 1, 2021 Author

cat prestop.stop

cat prestop.stop

cat prestop.stop

Gsantomaggio Oct 1, 2021 Maintainer

bdaoudtdc Oct 7, 2021 Author

mkuratczyk Oct 8, 2021 Maintainer

bdaoudtdc Oct 8, 2021 Author

bdaoudtdc
Aug 19, 2021

Replies: 7 comments 21 replies

Zerpet
Aug 20, 2021
Maintainer

Zerpet
Aug 20, 2021
Maintainer

mkuratczyk
Aug 23, 2021
Maintainer

bdaoudtdc Aug 23, 2021
Author

bdaoudtdc Aug 30, 2021
Author

mkuratczyk Aug 30, 2021
Maintainer

bdaoudtdc
Aug 31, 2021
Author

mkuratczyk
Sep 3, 2021
Maintainer

kjnilsson Sep 16, 2021
Maintainer

mkuratczyk Sep 16, 2021
Maintainer

bdaoudtdc Sep 16, 2021
Author

mkuratczyk Sep 16, 2021
Maintainer

bdaoudtdc Sep 16, 2021
Author

bdaoudtdc
Sep 16, 2021
Author

mkuratczyk
Sep 17, 2021
Maintainer

bdaoudtdc Oct 1, 2021
Author

Gsantomaggio Oct 1, 2021
Maintainer

bdaoudtdc Oct 7, 2021
Author

mkuratczyk Oct 8, 2021
Maintainer

bdaoudtdc Oct 8, 2021
Author