
An example of a production cluster configuration. #401


Merged · 4 commits merged into main on Nov 4, 2020

Conversation

@mkuratczyk (Collaborator)

This example provides a starting point for production deployments. I'm not sure whether the name "production" is a good choice - feel free to suggest something better, given that this is not suitable for all production use cases and we may want to provide other examples in the future (e.g. production-XXL).

Closes #398

@mkuratczyk mkuratczyk requested a review from gerhard October 20, 2020 15:24
@Zerpet (Member) left a comment

Looks good to me. TKG in AWS provisions nodes with label topology.kubernetes.io/zone, in case you want to "flavour" the example 😉
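
For illustration, a minimal sketch of what spreading replicas across that label could look like in a pod spec (the label selector value below is hypothetical and not taken from the example in this PR):

```yaml
# Hypothetical sketch: spread RabbitMQ pods evenly across availability zones
# using the topology.kubernetes.io/zone node label mentioned above.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: my-rabbitmq   # hypothetical label value
```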

Should this PR close #398? I think @ChunyiLyu was working on this issue as well.

Edit: I think this should close #397 instead?

@gerhard (Contributor) commented Oct 21, 2020

This is a welcome step in the right direction, thank you @mkuratczyk for taking it.

I feel that some changes are required, and more of the reasoning behind them needs to be shared, before this can be merged.

This is what I'm thinking for my next steps:

  1. Add more detail to the private thread that kicked this off. Those who have access to it might want to read it for full context.
  2. Commit the latest changes to the StatefulSet that we are using as the sample for the production-ready deployment.
  3. Make specific suggestions to the example proposed in this commit, so that we can merge it. I imagine discussions arising at the previous points, and until we share specific learnings and reach consensus, we cannot complete this step.

Let me know if you think that there is a different approach that would make more sense.

@Zerpet Zerpet linked an issue Oct 22, 2020 that may be closed by this pull request
@ChunyiLyu (Contributor) left a comment

Please see comments

@mkuratczyk mkuratczyk requested review from gerhard and Zerpet November 3, 2020 13:36
@gerhard (Contributor) left a comment

Looks great!

@gerhard gerhard merged commit 3c93b46 into main Nov 4, 2020
@gerhard gerhard deleted the production-example branch November 4, 2020 16:17
@mboutet commented Nov 30, 2020

I have four questions regarding the production-ready example:

  1. Why is there no PDB? It seems a little odd not to include one, since it's paramount to ensure that a majority of the nodes remains available in the event of node draining, eviction, upgrade, etc.
  2. What's the reasoning behind cluster_partition_handling = ignore? My understanding after reading the documentation is that pause_minority would be more appropriate. According to Which Mode to Pick?, pause_minority is ideal "when clustering across racks or availability zones in a single region", which is what the production-ready example does.
  3. What's the reasoning behind vm_memory_high_watermark_paging_ratio = 0.99? It seems to me that it won't give RabbitMQ much of a chance to page to disk before producers are blocked by the memory alarm.
  4. What's the reasoning behind disk_free_limit.relative = 1.0 when the production checklist recommends disk_free_limit.relative = 1.5?

@mkuratczyk (Collaborator, Author)

For the PodDisruptionBudget - no good reason, I've just added it: #510
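
For reference, a minimal sketch of what such a PodDisruptionBudget could look like for a three-node cluster (names and labels here are hypothetical, not necessarily what #510 adds):

```yaml
# Hypothetical sketch: allow at most one voluntary disruption at a time,
# so a majority of a 3-node RabbitMQ cluster stays available during
# node drains and upgrades.
apiVersion: policy/v1beta1   # policy/v1 on Kubernetes 1.21+
kind: PodDisruptionBudget
metadata:
  name: my-rabbitmq          # hypothetical name
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: my-rabbitmq   # hypothetical label value
```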

@lukebakken (Contributor)

What's the reasoning behind cluster_partition_handling = ignore?

The RabbitMQ core eng team had a long discussion about this the other day. From what I can remember, in the context of k8s ignore is the "least bad" option. @gerhard , @kjnilsson or @michaelklishin will have a better memory of that discussion.

What's the reasoning behind vm_memory_high_watermark_paging_ratio = 0.99?

I don't know off the top of my head, but my guess is that it is a performance setting (@gerhard?).

What's the reasoning behind disk_free_limit.relative = 1.0?

1.0 is the minimum recommended value from the checklist.

@michaelklishin (Contributor)

@mboutet

The very first sentence of this PR goes like this: «This example provides a starting point for production deployments…»

There is no way our team can know the realities and needs of your specific production deployment, so just like the Production Checklist guide, these are basic guidelines to ensure a reasonable degree of safety and optimal disk I/O for at least some workloads.

  1. There is no One True Default there. The actual solution would be to switch to a Raft-based schema data store (which we have in prototype) and do away with all partition recovery strategies (the system will recover much like any Raft-based system would).
  2. The paging ratio is compared to UsedProcessMemory/Watermark. A value of 0.99 delays paging for as long as possible. Again, no One True Default here. The default in RabbitMQ itself is 0.5.
  3. I'd say it should be 1.5 but that would potentially greatly overprovision disks for nodes with more memory available.
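
Putting the discussed settings together, a rough rabbitmq.conf sketch with the reasoning above as comments (values are those quoted in this thread; the actual example in the repository may differ):

```ini
# "Least bad" option on Kubernetes until a Raft-based schema data store
# makes partition recovery strategies unnecessary.
cluster_partition_handling = ignore

# Paging to disk starts when memory use exceeds paging_ratio * watermark.
# The RabbitMQ default of 0.5 starts paging at half the watermark;
# 0.99 delays paging until just below the memory alarm.
vm_memory_high_watermark_paging_ratio = 0.99

# 1.0 is the minimum recommended by the Production Checklist; 1.5 is safer
# but can greatly over-provision disks on nodes with a lot of memory.
disk_free_limit.relative = 1.0
```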


Successfully merging this pull request may close these issues.

Set topologySpreadConstraints by default
Document a good starting point for a production deployment
8 participants