
PodDisruptionBudget #512


Closed
ansd opened this issue Dec 3, 2020 · 10 comments
Labels: closed-stale (issue or PR closed due to long period of inactivity), enhancement (new feature or request), stale (issue or PR with long period of inactivity)

Comments

ansd (Member) commented Dec 3, 2020

In #510 we added a PodDisruptionBudget to the production examples.

Instead of having the user manually create and delete this object, let's have the cluster operator manage this object.

The PodDisruptionBudget feature has been in beta (and therefore enabled by default) since K8s v1.5. Are there users who have beta features disabled? If so, we could hide this feature behind a new top-level property in the RabbitMQ spec.

See also #401 (comment).
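For reference, a minimal sketch (in Go, using the policy/v1beta1 API that was current while PDBs were in beta) of the kind of object being discussed; the "-pdb" name suffix and the label selector are illustrative assumptions, not the operator's actual conventions:

package example

import (
    policyv1beta1 "k8s.io/api/policy/v1beta1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
)

// disruptionBudget sketches the PodDisruptionBudget the operator could own for
// a RabbitmqCluster: at most one pod may be voluntarily evicted at a time.
func disruptionBudget(instanceName, namespace string) *policyv1beta1.PodDisruptionBudget {
    maxUnavailable := intstr.FromInt(1)
    return &policyv1beta1.PodDisruptionBudget{
        ObjectMeta: metav1.ObjectMeta{
            Name:      instanceName + "-pdb", // hypothetical naming convention
            Namespace: namespace,
        },
        Spec: policyv1beta1.PodDisruptionBudgetSpec{
            MaxUnavailable: &maxUnavailable,
            Selector: &metav1.LabelSelector{
                // assumed label; the operator would reuse its own pod labels here
                MatchLabels: map[string]string{"app.kubernetes.io/name": instanceName},
            },
        },
    }
}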

ansd added the enhancement label Dec 3, 2020
mkuratczyk (Collaborator) commented:

Should we set the value of maxUnavailable as a percentage (if so, should that be 51%?), calculate it based on replicas or just always set it to 1?

ansd (Member, Author) commented Dec 7, 2020

In a 5 node cluster, a quorum queue might only have 3 replicas. If we allowed 2 of these 3 replicas to be unavailable, the quorum would be lost.

On the one hand, setting maxUnavailable to 51% is safe since commands like kubectl drain will execute the preStop hook, which will block if a quorum would be lost:

fmt.Sprintf(" rabbitmq-upgrade await_online_quorum_plus_one -t %d;"+
    " rabbitmq-upgrade await_online_synchronized_mirror -t %d;"+
    " rabbitmq-upgrade drain -t %d",

On the other hand, it's a better user experience if the eviction request that kubectl submits on behalf of the user is temporarily rejected rather than having the RabbitMQ pod block in the preStop hook.

Therefore, I vote to hardcode the PodDisruptionBudget's maxUnavailable to 1.

ChunyiLyu (Contributor) commented Dec 8, 2020

There are two different questions regarding supporting PDBs. One is whether to create a default PDB for every RabbitmqCluster. The second is whether we should have a top-level PDB property in our CRD spec. I'm responding to the first question in this comment.

I don't think we should create a default PDB for all RabbitmqClusters. The first reason is that, unlike something like a pod topology spread constraint, a PDB has a blocking effect. When unsatisfied, it can block eviction of pods and therefore block regular K8s worker node activities such as drain and upgrade. This is problematic because our operator has no control over how RabbitmqCluster pods are placed, since it's completely up to users to specify hard requirements on pod scheduling such as affinity/anti-affinity rules. For example, with maxUnavailable set to 1 as suggested in this issue, if two RabbitMQ pods from the same cluster are located on the same K8s worker node, that worker node becomes undrainable, since according to the PDB you cannot evict both pods at the same time. Another example: given a dev K8s cluster with just one worker node, that single-node K8s cluster cannot be drained properly either if a RabbitmqCluster is deployed with a PDB specifying maxUnavailable.

The second reason I don't think it's a good idea to have a default PDB is that we do not restrict the number of replicas users can specify, and we do not have strong recommendations or limits on how many RabbitMQ clusters can be created in a single K8s cluster. A reasonable PDB for a three-node RabbitMQ cluster is not going to be the same as for a five-node RabbitMQ cluster. With no limit on the number of replicas users can have for a RabbitMQ cluster, having maxUnavailable set to one is not going to make sense for any RabbitMQ cluster that has more than 4 nodes.

In summary: given that we have no hard requirements on how users can deploy a RabbitmqCluster and no control over the K8s clusters that users run on, it's extremely difficult to come up with a default PDB that is guaranteed to work for the majority of users, and a default PDB will likely limit the use cases that the operator supports. I don't think we should do it by default. I think leaving it to the user is the right option until we actually decide to limit the types of RabbitMQ deployments that the operator supports.

Thoughts? @ansd @mkuratczyk

mkuratczyk (Collaborator) commented:

Looking at other operators, it seems like at least some of them set a default of maxUnavailable: 1, for example:
https://github.com/percona/percona-server-mongodb-operator/blob/22f6aa65d0cf4e724af0444de642c49f39ea04ec/pkg/apis/psmdb/v1/psmdb_defaults.go#L455

or include the PDB in the chart:
https://github.com/helm/charts/blob/master/stable/cockroachdb/values.yaml#L144

I feel like the value of maxUnavailable: 1 is useful for the vast majority of deployments if the number of replicas is higher than 1. It's most likely the right setting even if the number of nodes is higher than 3, because the replication factor is likely 3 or less for many queues. If we expose it as an int at the top level, it would be easy to set it or delete it as the user prefers.

Given all the arguments I've seen so far, if we create a PDB at all, then the logic could look like this (roughly sketched in Go below the list):

  1. if replicas: 1, then we don't create a podDisruptionBudget
  2. if replicas > 1 and maxUnavailable is not set, then we add maxUnavailable: 1 to the spec
  3. if spec.maxUnavailable is present (whether set by the user explicitly or injected by the operator) and lower than replicas, then we create the podDisruptionBudget accordingly
  4. implicitly, if the user doesn't want a PDB created, they can set maxUnavailable equal to replicas, which is an explicit acknowledgement that the cluster may potentially go down

Is this something people would be comfortable with?
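A rough sketch of that decision logic; the function and its inputs are hypothetical, not part of the operator's actual API:

// Sketch of the proposed defaulting rules; the signature does not reflect the
// operator's actual CRD or codebase.
func desiredMaxUnavailable(replicas int32, specMaxUnavailable *int32) (maxUnavailable int32, createPDB bool) {
    // 1. replicas: 1 -> no PodDisruptionBudget at all.
    if replicas <= 1 {
        return 0, false
    }
    // 2. replicas > 1 and nothing set by the user -> default to maxUnavailable: 1.
    if specMaxUnavailable == nil {
        return 1, true
    }
    // 3. An explicit value lower than replicas -> create the PDB accordingly.
    if *specMaxUnavailable < replicas {
        return *specMaxUnavailable, true
    }
    // 4. maxUnavailable >= replicas -> the user has explicitly opted out of
    //    disruption protection, so no PDB is created.
    return 0, false
}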

Zerpet (Member) commented Dec 11, 2020

I agree with @ChunyiLyu here. We should not automate the creation of PDBs. On top of what Chunyi shared, enabling the Operator to create PDBs implies allowing the Operator to create/delete/update/watch PDBs across the whole cluster, given how we configure the Operator RBAC. I can foresee many human operators not being happy with this. We reached a similar conclusion with regard to Pod Security Policies.

Allowing the Operator to perform CRUD operations on PDBs cluster-wide may also allow privilege escalation for the end user. Alana (Alana who? 😄) might (likely) enable Cody (Cody who? 😄) to create RabbitmqCluster objects; it would be surprising if this gave Cody the power to create PDBs through the Operator, even if it's just one PDB with a static configuration, as suggested by Michal.

We can instead document, with an example, how to leverage PDBs and set some recommendations. Perhaps we could even update the production-ready example with a PDB with some sensible defaults. I definitely oppose automating and owning PDBs in the Operator.
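For context, a minimal sketch of what granting that access would look like in the controller code, via a kubebuilder RBAC marker (the exact verb list is an assumption):

// Hypothetical kubebuilder marker; controller-gen would turn this into a
// cluster-wide ClusterRole rule on PodDisruptionBudgets.
//+kubebuilder:rbac:groups=policy,resources=poddisruptionbudgets,verbs=get;list;watch;create;update;patch;delete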


ansd (Member, Author) commented Dec 15, 2020

@ChunyiLyu regarding your 1st reason:

When unsatisfied, it can block eviction of pods and therefore block regular K8s worker node activities such as drain and upgrade.

That's desired behaviour for this feature.

For example, with maxUnavailable set to 1 as suggested in this issue, if two RabbitMQ pods from the same cluster are located on the same K8s worker node, that worker node becomes undrainable, since according to the PDB you cannot evict both pods at the same time.

Our topology spread constraint implementation in

TopologySpreadConstraints: []corev1.TopologySpreadConstraint{
    {
        MaxSkew: 1,
        // "topology.kubernetes.io/zone" is a well-known label.
        // It is automatically set by kubelet if the cloud provider provides the zone information.
        // See: https://kubernetes.io/docs/reference/kubernetes-api/labels-annotations-taints/#topologykubernetesiozone
        TopologyKey:       "topology.kubernetes.io/zone",
        WhenUnsatisfiable: corev1.ScheduleAnyway,
        LabelSelector: &metav1.LabelSelector{
            MatchLabels: metadata.LabelSelector(builder.Instance.Name),
        },
    },
},
schedules different RabbitMQ pods within the same RabbitMQ cluster across different zones, i.e. onto different K8s nodes. If, for some reason, spreading pods across different zones is not supported, the user can override maxUnavailable as suggested by @mkuratczyk. Even if the user didn't override maxUnavailable, the K8s worker node will not be undrainable, since the user can either manually evict pods (as described here) or run kubectl drain --disable-eviction=true, which bypasses checking PodDisruptionBudgets.

Another example: given a dev K8s cluster with just one worker node, that single-node K8s cluster cannot be drained properly either if a RabbitmqCluster is deployed with a PDB specifying maxUnavailable.

Making the RabbitMQ cluster operator production-safe by removing the burden on users to manually create and delete PodDisruptionBudgets feels more important to me than optimising for draining a dev cluster. Since it's only a dev cluster, a user can easily run kubectl drain --disable-eviction=true.

Regarding your 2nd reason:

A reasonable PDB for a three-node RabbitMQ cluster is not going to be the same as for a five-node RabbitMQ cluster. With no limit on the number of replicas users can have for a RabbitMQ cluster, having maxUnavailable set to one is not going to make sense for any RabbitMQ cluster that has more than 4 nodes.

Why doesn't it make sense to set the default PodDisruptionBudget's maxUnavailable to 1 for a 5-node RabbitMQ cluster?
Setting it to 1 sounds like a safe default. If the worry is that there are 5 pods of the same RabbitMQ cluster distributed across only 3 K8s nodes, the user can override the PodDisruptionBudget.

@Zerpet I'm not sure I understand why it gives the user (Cody) privilege escalation if the operator creates a PodDisruptionBudget object for a particular RabbitMQ cluster?

In summary, I think it's a good idea to have the operator create PodDisruptionBudgets by default, allowing the user to override the values. The Kafka operator is doing the same. See here and here.
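To make "create by default, let the user override" concrete, here is a reconcile-side sketch using controller-runtime's CreateOrUpdate; the "-pdb" name, the pod label, the hard-coded default, and the import path for the RabbitmqCluster type are assumptions rather than existing operator code:

package controllers

import (
    "context"

    policyv1beta1 "k8s.io/api/policy/v1beta1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/apimachinery/pkg/util/intstr"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

    rabbitmqv1beta1 "github.com/rabbitmq/cluster-operator/api/v1beta1"
)

type PDBReconciler struct {
    Client client.Client
    Scheme *runtime.Scheme
}

// reconcilePDB creates or updates a default PodDisruptionBudget for the given
// RabbitmqCluster. A real implementation would skip this for replicas <= 1 and
// honour a user-provided override instead of the hard-coded default below.
func (r *PDBReconciler) reconcilePDB(ctx context.Context, rmq *rabbitmqv1beta1.RabbitmqCluster) error {
    pdb := &policyv1beta1.PodDisruptionBudget{
        ObjectMeta: metav1.ObjectMeta{
            Name:      rmq.Name + "-pdb", // hypothetical naming convention
            Namespace: rmq.Namespace,
        },
    }
    _, err := controllerutil.CreateOrUpdate(ctx, r.Client, pdb, func() error {
        maxUnavailable := intstr.FromInt(1) // the safe default discussed above
        pdb.Spec.MaxUnavailable = &maxUnavailable
        pdb.Spec.Selector = &metav1.LabelSelector{
            // assumed label; the operator would reuse its own pod labels here
            MatchLabels: map[string]string{"app.kubernetes.io/name": rmq.Name},
        }
        // Owner reference so the PDB is deleted together with the RabbitmqCluster.
        return controllerutil.SetControllerReference(rmq, pdb, r.Scheme)
    })
    return err
}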

Zerpet (Member) commented Jan 25, 2021

@Zerpet I'm not sure I understand why it gives the user (Cody) privilege escalation if the operator creates a PodDisruptionBudget object for a particular RabbitMQ cluster?

The platform administrator (Alana) enables Cody to create RabbitmqCluster objects, which then create a few other resources (e.g. StatefulSet, Service) that are essential to run RabbitMQ. Alana might not want Cody to create PDBs (which is sensible, since PDBs are not essential to run RabbitMQ). By automating the creation of a PDB, we are creating an object that Alana might not want Cody to create. I believe automatically creating PDBs is a "worse offense" because Cody can reuse the PDB in their own applications, against Alana's intentions (perhaps it is ok to have RabbitMQ with a PDB, but not ok to have other apps using it).

The same reasoning can be applied to other resources, e.g. Service, which could be used to expose certain applications that Alana does not wish to expose; however, the Service object is essential to run RabbitMQ (an inaccessible RabbitMQ is almost the same as one that's not running).

Making the RabbitMQ cluster operator production-safe by removing the burden on users to manually create and delete PodDisruptionBudgets feels more important to me than optimising for draining a dev cluster.

I disagree with this statement. We should not break any workflow, regardless of whether the cluster is labelled "dev" or "prod".

In summary, I think it's a good idea to have the operator create PodDisruptionBudgets by default, allowing the user to override the values.

I would prefer to deliver this as part of the kubectl rabbitmq plugin, something like kubectl rabbitmq create-pdb <instance>. I oppose automatically creating PDB objects as part of the reconciliation.

github-actions (bot) commented:
This issue has been marked as stale due to 60 days of inactivity. Stale issues will be closed after a further 30 days of inactivity; please remove the stale label in order to prevent this occurring.

github-actions bot added the stale label Mar 27, 2021
github-actions (bot) commented:
Closing stale issue due to further inactivity.

github-actions bot added the closed-stale label Apr 26, 2021