cluster operator fails to deploy rabbitmq instance #537

Closed · thorion3006 opened this issue Jan 3, 2021 · 6 comments
Labels: bug (Something isn't working)

thorion3006 commented Jan 3, 2021

Describe the bug

RabbitMQ Cluster Operator v1.3.0 fails to deploy a RabbitMQ instance in an EKS v1.18 cluster.

To Reproduce

Steps to reproduce the behavior:

  1. kubectl apply -f https://github.com/rabbitmq/cluster-operator/releases/download/v1.3.0/cluster-operator.yml
  2. kubectl apply -f rabbitmq.yaml -n resources, with the following rabbitmq.yaml:
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app.kubernetes.io/name: rabbitmq
            topologyKey: topology.kubernetes.io/zone
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: rabbitmq
          topologyKey: kubernetes.io/hostname
  override:
    service:
      spec:
        ports:
          - name: management-http
            protocol: TCP
            port: 15672
            targetPort: 15672
          - name: amqp-tcp
            protocol: TCP
            port: 5672
            targetPort: 5672
    statefulSet:
      spec:
        template:
          spec:
            containers:
              - name: rabbitmq
                env:
                  - name: RABBITMQ_QUORUM_DIR
                    value: /var/lib/rabbitmq/quorum-segments
                volumeMounts:
                  - mountPath: /etc/rabbitmq/rabbitmq_definitions.json
                    name: definitions
                  - mountPath: /var/lib/rabbitmq/quorum-segments
                    name: quorum-segments
                  - mountPath: /var/lib/rabbitmq/quorum-wal
                    name: quorum-wal
            nodeSelector:
              ebs-optimized: 'true'
            volumes:
              - name: definitions
                configMap:
                  name: rabbitmq-definitions # Name of the ConfigMap which contains definitions you wish to import
        volumeClaimTemplates:
          - apiVersion: v1
            kind: PersistentVolumeClaim
            metadata:
              name: quorum-wal
            spec:
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 10Gi
          - apiVersion: v1
            kind: PersistentVolumeClaim
            metadata:
              name: quorum-segments
            spec:
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 10Gi
  persistence:
    storage: 10Gi
  rabbitmq:
    additionalConfig: |
      channel_max = 200
      disk_free_limit.relative = 1.5
      load_definitions = /etc/rabbitmq/rabbitmq_definitions.json # Path to the mounted definitions file
      management.path_prefix = /rabbitmq
      vm_memory_high_watermark.relative = 0.7
      vm_memory_high_watermark_paging_ratio = 0.9
    additionalPlugins:
      - rabbitmq_top
    advancedConfig: |
      [
        {ra, [
              {wal_data_dir, '/var/lib/rabbitmq/quorum-wal'}
          ]},
        {rabbit, [
          {quorum_cluster_size, 5},
          {quorum_commands_soft_limit, 1024} % maximum number of unconfirmed messages a channel accepts before entering flow
        ]}
      ].
  replicas: 3
  resources:
    requests:
      cpu: '1'
      memory: 2Gi
    limits:
      cpu: '1'
      memory: 2Gi
  3. Error log from the operator:
2021-01-03T15:50:45.239Z	ERROR	controller	Reconciler error	{"reconcilerGroup": "rabbitmq.com", "reconcilerKind": "RabbitmqCluster", "controller": "rabbitmqcluster", "name": "rabbitmq", "namespace": "resources", "error": "failed setting controller reference: cluster-scoped resource must not have a namespace-scoped owner, owner's namespace resources"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:246
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:218
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:197
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/[email protected]/pkg/util/wait/wait.go:90

Expected behavior
A RabbitMQ instance should be deployed to the resources namespace.

Version and environment information

  • RabbitMQ: 3.8.9
  • RabbitMQ Cluster Operator: 1.3.0
  • Kubernetes: 1.18
  • Cloud provider or hardware configuration: EKS
thorion3006 added the bug (Something isn't working) label Jan 3, 2021
mkuratczyk (Collaborator) commented:

Thank you. I was able to reproduce the issue and we'll look into it shortly.

mkuratczyk (Collaborator) commented:

I've found two issues:

  1. You don't specify a namespace for the quorum-wal and quorum-segments PVCs. I can't tell yet whether such YAML should be considered invalid or whether it's a bug in the operator, which should be able to handle that. We'll discuss this tomorrow.
  2. You don't specify the persistence PVC. Because the override works as a YAML patch, when you specify any PVCs you have to specify all of them; otherwise the default persistence PVC gets overwritten/removed. The multiple-disks example defines all of them.

After adding the namespace and the missing PVC, I was able to deploy a cluster.

I also have some questions/observations, unrelated to the issue:

  1. Why do you override port names in the service? What is the benefit of that?
  2. You tune disk-level access by separating the volumes, but at the same time you assign 1 CPU and 2 GB of RAM to this instance, which is very low (we set that as the default to make it easy to get started but definitely don't recommend it for any real use). Have you done any testing to validate this setup?

Full YAML that works for me:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app.kubernetes.io/name: rabbitmq
            topologyKey: topology.kubernetes.io/zone
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: rabbitmq
          topologyKey: kubernetes.io/hostname
  override:
    service:
      spec:
        ports:
          - name: management-http
            protocol: TCP
            port: 15672
            targetPort: 15672
          - name: amqp-tcp
            protocol: TCP
            port: 5672
            targetPort: 5672
    statefulSet:
      spec:
        template:
          spec:
            containers:
              - name: rabbitmq
                env:
                  - name: RABBITMQ_QUORUM_DIR
                    value: /var/lib/rabbitmq/quorum-segments
                volumeMounts:
                  - mountPath: /etc/rabbitmq/rabbitmq_definitions.json
                    name: definitions
                  - mountPath: /var/lib/rabbitmq/quorum-segments
                    name: quorum-segments
                  - mountPath: /var/lib/rabbitmq/quorum-wal
                    name: quorum-wal
            nodeSelector:
              ebs-optimized: "true"
            volumes:
              - name: definitions
                configMap:
                  name: rabbitmq-definitions # Name of the ConfigMap which contains definitions you wish to import
        volumeClaimTemplates:
          - apiVersion: v1
            kind: PersistentVolumeClaim
            metadata:
              name: persistence
              namespace: default
            spec:
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 10Gi
              volumeMode: Filesystem
          - apiVersion: v1
            kind: PersistentVolumeClaim
            metadata:
              name: quorum-wal
              namespace: default
            spec:
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 10Gi
          - apiVersion: v1
            kind: PersistentVolumeClaim
            metadata:
              name: quorum-segments
              namespace: default
            spec:
              accessModes:
                - ReadWriteOnce
              resources:
                requests:
                  storage: 10Gi
  persistence:
    storage: 10Gi
  rabbitmq:
    additionalConfig: |
      channel_max = 200
      disk_free_limit.relative = 1.5
      load_definitions = /etc/rabbitmq/rabbitmq_definitions.json # Path to the mounted definitions file
      management.path_prefix = /rabbitmq
      vm_memory_high_watermark.relative = 0.7
      vm_memory_high_watermark_paging_ratio = 0.9
    additionalPlugins:
      - rabbitmq_top
    advancedConfig: |
      [
        {ra, [
              {wal_data_dir, '/var/lib/rabbitmq/quorum-wal'}
          ]},
        {rabbit, [
          {quorum_cluster_size, 5},
          {quorum_commands_soft_limit, 1024} % maximum number of unconfirmed messages a channel accepts before entering flow
        ]}
      ].
  replicas: 3
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
    limits:
      cpu: "1"
      memory: 2Gi

ChunyiLyu added a commit that referenced this issue Jan 6, 2021:

- this is related to the bug reported in #537
- k8s allows a PVC template in a StatefulSet to omit the namespace and assumes it is the same namespace as the StatefulSet
- the operator needs to set the namespace because a controller reference can only be set when both the object name and namespace are specified
thorion3006 (Author) commented:

@mkuratczyk thanks, your YAML does work.

> You don't specify a namespace for the quorum-wal and quorum-segments PVCs. I can't tell yet whether such YAML should be considered invalid or whether it's a bug in the operator, which should be able to handle that. We'll discuss this tomorrow.

Isn't the default behaviour for volumeClaimTemplates, or for that matter any namespace-scoped resource, to inherit the namespace from the parent resource?

> Why do you override port names in the service? What is the benefit of that?

I did it to be in line with Istio's best practices. Istio needs the service port names to include the protocol they use.

> You tune disk-level access by separating the volumes, but at the same time you assign 1 CPU and 2 GB of RAM to this instance, which is very low. Have you done any testing to validate this setup?

I arrived at 2 GB from this doc: https://www.rabbitmq.com/quorum-queues.html#resource-use

> Because memory deallocation may take some time, we recommend that the RabbitMQ node is allocated at least 3 times the memory of the default WAL file size limit. More will be required in high-throughput systems. 4 times is a good starting point for those.

But I do think I'll have to up it to 4 GB before using this in a prod environment. I just wanted to see if having a separate WAL volume reduces memory usage.
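
For reference, a minimal sketch (a fragment of spec.rabbitmq) of how that sizing rule could be tuned from the cluster spec. This assumes the raft.wal_max_size_bytes key described in the quorum queue docs and its default of roughly 512 MB, which is where the ~2 GB at 4x comes from; it is an illustration, not a recommendation:

  rabbitmq:
    additionalConfig: |
      # Assumption: raft.wal_max_size_bytes caps the size of a single WAL file.
      # The default is ~512 MB, so 4x is ~2 GB of RAM headroom; 256 MiB brings 4x down to ~1 GiB.
      raft.wal_max_size_bytes = 268435456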

thorion3006 (Author) commented Jan 6, 2021

How do you rename the ports for the {cluster-name}-nodes service?

mkuratczyk (Collaborator) commented:

  1. It was indeed a bug in the Operator: "Always set PVC override namespace to sts namespace" (#545). If I understand correctly, while the PVC would inherit the namespace once deployed, we have to set it explicitly before deploying in order to set the owner reference; otherwise controller-runtime throws an error. Either way, it's fixed. Thanks @ChunyiLyu!

  2. Regarding ports: we may look into renaming them by default if that's the best practice for Istio. However, I'm not sure what the value is, to be honest - e.g. we are working on RabbitMQ Streams, which uses a custom protocol; I'm not even sure it has a name, and it certainly doesn't have a well-recognized one.

  3. You can't change the port names in the headless service - we don't expose it in the override right now. What would you call them if you could?

On a general note: we expose the override feature to allow unusual customizations and as a way to take advantage of Kubernetes features we did not anticipate. When you see a need to use this feature and believe it's a common use case (as with Istio's best practices), please report an issue. You can still use the override as a stepping stone (or in case we decide not to implement a given feature), but our goal is definitely not to force users to maintain extensive override values.

thorion3006 (Author) commented Jan 6, 2021

Overriding the headless service's port names isn't important; I just wanted to do it so Istio doesn't show a warning during validation. As for the Istio best practices that are currently missing (a sketch of both follows below):

  1. Service port entries should be named protocol-name, e.g. http-management.
  2. Pods should have labels with app and version keys.

Should I create a new issue for this?
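
A minimal sketch of what such an override could look like, assuming the statefulSet override exposes the pod template metadata in this operator version; the app/version label values below are hypothetical examples:

apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq
spec:
  override:
    service:
      spec:
        ports:
          - name: http-management # Istio convention: protocol first
            protocol: TCP
            port: 15672
            targetPort: 15672
          - name: tcp-amqp
            protocol: TCP
            port: 5672
            targetPort: 5672
    statefulSet:
      spec:
        template:
          metadata:
            labels:
              app: rabbitmq # hypothetical label values for Istio's app/version convention
              version: "3.8.9"

As noted above, the headless {cluster-name}-nodes service is not covered by the override, so its port names stay as generated.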
