cluster operator fails to deploy rabbitmq instance #537

Closed · thorion3006 opened this issue Jan 3, 2021 · 6 comments
Labels: bug (Something isn't working)

thorion3006 commented Jan 3, 2021

Describe the bug

RabbitMQ Cluster Operator v1.3.0 fails to deploy a RabbitMQ instance in an EKS v1.18 cluster.

To Reproduce

Steps to reproduce the behavior:

  1. kubectl apply -f https://github.com/rabbitmq/cluster-operator/releases/download/v1.3.0/cluster-operator.yml
  2. kubectl apply -f rabbitmq.yaml -n resources, with the following rabbitmq.yaml:
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app.kubernetes.io/name: rabbitmq
            topologyKey: topology.kubernetes.io/zone
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: rabbitmq
          topologyKey: kubernetes.io/hostname
  override:
    service:
      spec:
        ports:
          - name: management-http
            protocol: TCP
            port: 15672
            targetPort: 15672
          - name: amqp-tcp
            protocol: TCP
            port: 5672
            targetPort: 5672
    statefulSet:
      spec:
        template:
          spec:
            containers:
              - name: rabbitmq
                env:
                  - name: RABBITMQ_QUORUM_DIR
                    value: /var/lib/rabbitmq/quorum-segments
                volumeMounts:
                  - mountPath: /etc/rabbitmq/rabbitmq_definitions.json
                    name: definitions
                  - mountPath: /var/lib/rabbitmq/quorum-segments
                    name: quorum-segments
                  - mountPath: /var/lib/rabbitmq/quorum-wal
                    name: quorum-wal
            nodeSelector:
              ebs-optimized: 'true'
            volumes:
              - name: definitions
                configMap:
                  name: rabbitmq-definitions # Name of the ConfigMap which contains definitions you wish to import
        volumeClaimTemplates:
          - apiVersion: v1
            kind: PersistentVolumeClaim
            metadata:
              name: quorum-wal
            spec:
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 10Gi
          - apiVersion: v1
            kind: PersistentVolumeClaim
            metadata:
              name: quorum-segments
            spec:
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 10Gi
  persistence:
    storage: 10Gi
  rabbitmq:
    additionalConfig: |
      channel_max = 200
      disk_free_limit.relative = 1.5
      load_definitions = /etc/rabbitmq/rabbitmq_definitions.json # Path to the mounted definitions file
      management.path_prefix = /rabbitmq
      vm_memory_high_watermark.relative = 0.7
      vm_memory_high_watermark_paging_ratio = 0.9
    additionalPlugins:
      - rabbitmq_top
    advancedConfig: |
      [
        {ra, [
              {wal_data_dir, '/var/lib/rabbitmq/quorum-wal'}
          ]},
        {rabbit, [
          {quorum_cluster_size, 5},
          {quorum_commands_soft_limit, 1024} % maximum number of unconfirmed messages a channel accepts before entering flow
        ]}
      ].
  replicas: 3
  resources:
    requests:
      cpu: '1'
      memory: 2Gi
    limits:
      cpu: '1'
      memory: 2Gi
  3. Error log from the operator:
2021-01-03T15:50:45.239Z	ERROR	controller	Reconciler error	{"reconcilerGroup": "rabbitmq.com", "reconcilerKind": "RabbitmqCluster", "controller": "rabbitmqcluster", "name": "rabbitmq", "namespace": "resources", "error": "failed setting controller reference: cluster-scoped resource must not have a namespace-scoped owner, owner's namespace resources"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:246
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:197
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90

Expected behavior
A RabbitMQ instance should be deployed to the resources namespace.

Version and environment information

  • RabbitMQ: 3.8.9
  • RabbitMQ Cluster Operator: 1.3.0
  • Kubernetes: 1.18
  • Cloud provider or hardware configuration: EKS
thorion3006 added the bug (Something isn't working) label Jan 3, 2021
mkuratczyk (Collaborator) commented:

Thank you. I was able to reproduce the issue and we'll look into it shortly.

mkuratczyk (Collaborator) commented:

I've found two issues:

  1. You don't specify a namespace for the quorum-wal and quorum-segments PVCs. I can't tell yet whether such YAML should be considered invalid or whether it's a bug in the operator, which should be able to handle that. We'll discuss this tomorrow.
  2. You don't specify the persistence PVC. Because the override works as a YAML patch, when you specify any PVCs you have to specify all of them; otherwise the default persistence PVC gets overwritten/removed. The multiple-disks example defines all of them.

After adding the namespace and the missing PVC, I was able to deploy a cluster.

I also have some questions/observations, unrelated to the issue:

  1. Why do you override port names in the service? What is the benefit of that?
  2. You tune disk-level access by separating the volumes, but at the same time you assign 1 CPU and 2 GB of RAM to this instance, which is very low (we set that as the default to make it easy to get started but definitely don't recommend it for any real use). Have you done any testing to validate this setup?

Full YAML that works for me:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app.kubernetes.io/name: rabbitmq
            topologyKey: topology.kubernetes.io/zone
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: rabbitmq
          topologyKey: kubernetes.io/hostname
  override:
    service:
      spec:
        ports:
          - name: management-http
            protocol: TCP
            port: 15672
            targetPort: 15672
          - name: amqp-tcp
            protocol: TCP
            port: 5672
            targetPort: 5672
    statefulSet:
      spec:
        template:
          spec:
            containers:
              - name: rabbitmq
                env:
                  - name: RABBITMQ_QUORUM_DIR
                    value: /var/lib/rabbitmq/quorum-segments
                volumeMounts:
                  - mountPath: /etc/rabbitmq/rabbitmq_definitions.json
                    name: definitions
                  - mountPath: /var/lib/rabbitmq/quorum-segments
                    name: quorum-segments
                  - mountPath: /var/lib/rabbitmq/quorum-wal
                    name: quorum-wal
            nodeSelector:
              ebs-optimized: "true"
            volumes:
              - name: definitions
                configMap:
                  name: rabbitmq-definitions # Name of the ConfigMap which contains definitions you wish to import
        volumeClaimTemplates:
          - apiVersion: v1
            kind: PersistentVolumeClaim
            metadata:
              name: persistence
              namespace: default
            spec:
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 10Gi
              volumeMode: Filesystem
          - apiVersion: v1
            kind: PersistentVolumeClaim
            metadata:
              name: quorum-wal
              namespace: default
            spec:
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 10Gi
          - apiVersion: v1
            kind: PersistentVolumeClaim
            metadata:
              name: quorum-segments
              namespace: default
            spec:
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 10Gi
  persistence:
    storage: 10Gi
  rabbitmq:
    additionalConfig: |
      channel_max = 200
      disk_free_limit.relative = 1.5
      load_definitions = /etc/rabbitmq/rabbitmq_definitions.json # Path to the mounted definitions file
      management.path_prefix = /rabbitmq
      vm_memory_high_watermark.relative = 0.7
      vm_memory_high_watermark_paging_ratio = 0.9
    additionalPlugins:
      - rabbitmq_top
    advancedConfig: |
      [
        {ra, [
              {wal_data_dir, '/var/lib/rabbitmq/quorum-wal'}
          ]},
        {rabbit, [
          {quorum_cluster_size, 5},
          {quorum_commands_soft_limit, 1024} % maximum number of unconfirmed messages a channel accepts before entering flow
        ]}
      ].
  replicas: 3
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      cpu: "1"
      memory: 2Gi

ChunyiLyu added a commit that referenced this issue Jan 6, 2021:

- this is related to the bug reported in #537
- k8s allows a PVC template in a StatefulSet to omit the namespace and assumes it is the same namespace as the StatefulSet
- the operator needs to set the namespace because a controller reference can only be set when both the object name and namespace are specified
thorion3006 (Author) commented:

@mkuratczyk thanks, your YAML does work.

> You don't specify a namespace for the quorum-wal and quorum-segments PVCs. I can't tell yet whether such YAML should be considered invalid or whether it's a bug in the operator, which should be able to handle that. We'll discuss this tomorrow.

Isn't the default behaviour for volumeClaimTemplates, or for that matter any namespace-scoped resource, to inherit the namespace from the parent resource?

> Why do you override port names in the service? What is the benefit of that?

I did it to be in line with Istio's best practices. Istio needs the service port names to include the protocol they use.

> You tune disk-level access by separating the volumes, but at the same time you assign 1 CPU and 2 GB of RAM to this instance, which is very low. Have you done any testing to validate this setup?

I arrived at 2 GB from this doc: https://www.rabbitmq.com/quorum-queues.html#resource-use

> Because memory deallocation may take some time, we recommend that the RabbitMQ node is allocated at least 3 times the memory of the default WAL file size limit. More will be required in high-throughput systems. 4 times is a good starting point for those.

But I do think I'll have to up it to 4 GB before using this in a prod environment. I just wanted to see if having a separate WAL volume reduces memory usage.
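
For reference, a minimal sketch (a fragment of spec.rabbitmq) of how that sizing rule could be tuned from the cluster spec. This assumes the raft.wal_max_size_bytes key described in the quorum queue docs and its default of roughly 512 MB, which is where the ~2 GB at 4x comes from; it is an illustration, not a recommendation:

  rabbitmq:
    additionalConfig: |
      # Assumption: raft.wal_max_size_bytes caps the size of a single WAL file.
      # The default is ~512 MB, so 4x is ~2 GB of RAM headroom; 256 MiB brings 4x down to ~1 GiB.
      raft.wal_max_size_bytes = 268435456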

thorion3006 (Author) commented Jan 6, 2021

How do you rename the ports for the {cluster-name}-nodes service?

mkuratczyk (Collaborator) commented:

  1. It was indeed a bug in the Operator: "Always set PVC override namespace to sts namespace" (#545). If I understand correctly, while the PVC would inherit the namespace once deployed, we have to set it explicitly before deploying in order to set the owner reference; otherwise controller-runtime throws an error. Either way, it's fixed. Thanks @ChunyiLyu!

  2. Regarding ports: we may look into renaming them by default if that's the best practice for Istio. However, I'm not sure what the value is, to be honest - e.g. we are working on RabbitMQ Streams, which uses a custom protocol; I'm not even sure it has a name, and it certainly doesn't have a well-recognized one.

  3. You can't change the port names in the headless service - we don't expose it in the override right now. What would you call them if you could?

On a general note: we expose the override feature to allow unusual customizations and as a way to take advantage of Kubernetes features we did not anticipate. When you see a need to use this feature and believe it's a common use case (as with Istio's best practices), please report an issue. You can still use the override as a stepping stone (or in case we decide not to implement a given feature), but our goal is definitely not to force users to maintain extensive override values.

thorion3006 (Author) commented Jan 6, 2021

Overriding the headless service's port names isn't important; I just wanted to do it so Istio doesn't show a warning during validation. As for the Istio best practices that are currently missing (a sketch of both follows below):

  1. Service port entries should be named protocol-name, e.g. http-management.
  2. Pods should have labels with app and version keys.

Should I create a new issue for this?
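
A minimal sketch of what such an override could look like, assuming the statefulSet override exposes the pod template metadata in this operator version; the app/version label values below are hypothetical examples:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
spec:
  override:
    service:
      spec:
        ports:
          - name: http-management # Istio convention: protocol first
            protocol: TCP
            port: 15672
            targetPort: 15672
          - name: tcp-amqp
            protocol: TCP
            port: 5672
            targetPort: 5672
    statefulSet:
      spec:
        template:
          metadata:
            labels:
              app: rabbitmq # hypothetical label values for Istio's app/version convention
              version: "3.8.9"

As noted above, the headless {cluster-name}-nodes service is not covered by the override, so its port names stay as generated.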
