
olm install - operatorhubio-catalog Readiness probe failed #5945


Closed

fredcsfaria opened this issue Jul 12, 2022 · 7 comments
Labels: lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale.), triage/support (Indicates an issue that is a support question.)
Milestone: Backlog

Comments

@fredcsfaria

Bug Report

What did you do?

Installed OLM by running "operator-sdk olm install".

What did you expect to see?

The pods in the olm namespace in the Ready state.

What did you see instead? Under which circumstances?

The operatorhubio-catalog pod is not ready:

$ kubectl get pods -n olm
(screenshot of the pod list in the original issue)

$ kubectl describe pod operatorhubio-catalog-sxpwz -n olm
(screenshot of the pod description in the original issue)

Environment

Minikube 1.25.2

Operator type:

/language go

Kubernetes cluster type:

$ operator-sdk version

operator-sdk version: "v1.22.1", commit: "46ab175459a775d2fb9f0454d0b4a8850dd745ed", kubernetes version: "1.24.1", go version: "go1.18.3", GOOS: "linux", GOARCH: "amd64"

$ go version (if language is Go)

go version go1.18 linux/amd64

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"clean", BuildDate:"2021-02-18T16:12:00Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.3", GitCommit:"816c97ab8cff8a1c72eccca1026f7820e93e0d25", GitTreeState:"clean", BuildDate:"2022-01-25T21:19:12Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"linux/amd64"}

Possible Solution

Additional context

@varshaprasad96
Member

@fredcsfaria we would need more context on the cluster being used. Usually this error occurs when there are not enough hosts available to schedule the pods on, or when there is a networking issue on the node where the particular pod is scheduled. Are there any logs from the pod that starts but never becomes healthy, or any errors from the API server that could be used to dig into this?
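
For reference, commands along these lines would gather that information (the pod name below is the one from the report above; substitute the failing pod in your cluster):

$ kubectl logs -n olm operatorhubio-catalog-sxpwz           # registry-server container logs
$ kubectl describe pod -n olm operatorhubio-catalog-sxpwz   # probe failures and pod events
$ kubectl get events -n olm --sort-by=.lastTimestamp        # recent events in the olm namespace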

@jberkhahn jberkhahn added the triage/support Indicates an issue that is a support question. label Jul 18, 2022
@jberkhahn jberkhahn added this to the Backlog milestone Jul 18, 2022
@murphd40

murphd40 commented Aug 9, 2022

I'm also seeing this behaviour:

$ kubectl get po

NAME                                READY   STATUS             RESTARTS   AGE
catalog-operator-7f6fdbc478-6b8j2   1/1     Running            0          110m
olm-operator-7f5fbcf9f9-bhljf       1/1     Running            0          110m
operatorhubio-catalog-v5w7b         0/1     CrashLoopBackOff   7          14m
operatorhubio-catalog-x2hbh         0/1     CrashLoopBackOff   19         49m
packageserver-7cf6d4dc5d-4w7xc      1/1     Running            0          109m
packageserver-7cf6d4dc5d-lr8qv      1/1     Running            0          109m
$ kubectl describe po operatorhubio-catalog-v5w7b
Name:         operatorhubio-catalog-v5w7b
Namespace:    olm
Priority:     0
Node:         node1/10.10.168.144
Start Time:   Tue, 09 Aug 2022 00:12:37 +0000
Labels:       olm.catalogSource=operatorhubio-catalog
              olm.pod-spec-hash=6bfddc89d8
Annotations:  cluster-autoscaler.kubernetes.io/safe-to-evict: true
              cni.projectcalico.org/podIP: 10.233.90.178/32
              cni.projectcalico.org/podIPs: 10.233.90.178/32
Status:       Running
IP:           10.233.90.178
IPs:
  IP:  10.233.90.178
Containers:
  registry-server:
    Container ID:   docker://abc02bb38b14eb1fd7b93f14a8209223eeb587d057d75ccefe8c919740309600
    Image:          quay.io/operatorhubio/catalog:latest
    Image ID:       docker-pullable://quay.io/operatorhubio/catalog@sha256:b6492a9fd6fc07ac0975939cd5828151e47b730eda130c75a109bcee3b93a969
    Port:           50051/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Tue, 09 Aug 2022 00:27:19 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Tue, 09 Aug 2022 00:21:31 +0000
      Finished:     Tue, 09 Aug 2022 00:22:07 +0000
    Ready:          False
    Restart Count:  8
    Requests:
      cpu:        10m
      memory:     50Mi
    Liveness:     exec [grpc_health_probe -addr=:50051] delay=10s timeout=5s period=10s #success=1 #failure=3
    Readiness:    exec [grpc_health_probe -addr=:50051] delay=5s timeout=5s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from operatorhubio-catalog-token-xrjtx (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  operatorhubio-catalog-token-xrjtx:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  operatorhubio-catalog-token-xrjtx
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node-role.kubernetes.io/master:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 120s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 120s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  15m                   default-scheduler  Successfully assigned olm/operatorhubio-catalog-v5w7b to node1
  Normal   Pulled     15m                   kubelet            Successfully pulled image "quay.io/operatorhubio/catalog:latest" in 405.142183ms
  Normal   Created    14m (x2 over 15m)     kubelet            Created container registry-server
  Normal   Pulled     14m                   kubelet            Successfully pulled image "quay.io/operatorhubio/catalog:latest" in 361.608421ms
  Normal   Started    14m (x2 over 15m)     kubelet            Started container registry-server
  Warning  Unhealthy  13m (x7 over 14m)     kubelet            Readiness probe failed: timeout: failed to connect service ":50051" within 1s
  Normal   Pulling    13m (x3 over 15m)     kubelet            Pulling image "quay.io/operatorhubio/catalog:latest"
  Warning  Unhealthy  13m (x6 over 14m)     kubelet            Liveness probe failed: timeout: failed to connect service ":50051" within 1s
  Normal   Killing    13m (x2 over 14m)     kubelet            Container registry-server failed liveness probe, will be restarted
  Normal   Pulled     13m                   kubelet            Successfully pulled image "quay.io/operatorhubio/catalog:latest" in 514.682595ms
  Warning  BackOff    4m56s (x28 over 11m)  kubelet            Back-off restarting failed container
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"132a687512d7fb058d0f5890f07d4121b3f0a2e2", GitTreeState:"clean", BuildDate:"2021-05-12T12:40:09Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"132a687512d7fb058d0f5890f07d4121b3f0a2e2", GitTreeState:"clean", BuildDate:"2021-05-12T12:32:49Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}

I ran the following experiment, manually running the container's entrypoint:

$ docker run --rm -it --entrypoint=sh quay.io/operatorhubio/catalog:latest
sh-4.4$ /bin/opm serve /configs/
WARN[0000] unable to set termination log path            error="open /dev/termination-log: permission denied"

INFO[0048] serving registry                              configs=/configs/ port=50051

It took around 50 seconds for the "serving registry" log message to appear. I believe that is the source of the issue I am experiencing: the liveness/readiness probes fail because the pod takes too long to initialize, and the pod gets stuck in a restart loop because the probes' initialDelaySeconds (5s for readiness, 10s for liveness, per the describe output above) is too short to cover that startup time.
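
For illustration only (not necessarily what the upstream fix does), a startup probe along these lines on the registry-server container would give the catalog time to come up before the liveness/readiness checks begin; the fields are standard Kubernetes Pod spec probe fields, and the values are guesses sized to the ~50 s startup observed above:

# hypothetical probe settings for the registry-server container (illustrative values)
startupProbe:
  exec:
    command: ["grpc_health_probe", "-addr=:50051"]
  periodSeconds: 10
  failureThreshold: 30     # allow up to ~5 minutes before the container is restarted
livenessProbe:
  exec:
    command: ["grpc_health_probe", "-addr=:50051"]
  periodSeconds: 10
  timeoutSeconds: 5

With a startup probe in place, Kubernetes holds off the liveness and readiness probes until it succeeds, so a slow-starting catalog is not killed during initialization.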

@murphd40

Looks like my issue will be fixed in the next release: This PR tackles the issue 👍 Thanks

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 9, 2022
@jmrodri
Member

jmrodri commented Nov 22, 2022

@jmrodri jmrodri closed this as completed Nov 22, 2022
@sybadm

sybadm commented Nov 17, 2023

K8S version: v1.27.3

I'm still facing this issue

$ kubectl create -f https://raw.githubusercontent.com/operator-framework/operator-lifecycle-manager/master/deploy/upstream/quickstart/crds.yaml
$ kubectl create -f https://raw.githubusercontent.com/operator-framework/operator-lifecycle-manager/master/deploy/upstream/quickstart/olm.yaml

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  75s                default-scheduler  Successfully assigned olm/operatorhubio-catalog-n2cp2 to aks-systempool-27289948-vmss00000c
  Normal   Pulled     73s                kubelet            Successfully pulled image "quay.io/operatorhubio/catalog:latest" in 716.734417ms (716.844818ms including waiting)
  Normal   Pulling    34s (x2 over 74s)  kubelet            Pulling image "quay.io/operatorhubio/catalog:latest"
  Normal   Killing    34s                kubelet            Container registry-server failed liveness probe, will be restarted
  Warning  Unhealthy  34s                kubelet            Readiness probe errored: rpc error: code = NotFound desc = failed to exec in container: failed to load task: no running task found: task d796b71bb52dc8516ad4f1b347934703c842b488d855fde20918bd1021c7172a not found: not found
  Normal   Created    33s (x2 over 73s)  kubelet            Created container registry-server
  Normal   Started    33s (x2 over 73s)  kubelet            Started container registry-server
  Normal   Pulled     33s                kubelet            Successfully pulled image "quay.io/operatorhubio/catalog:latest" in 605.963458ms (605.976558ms including waiting)
  Warning  Unhealthy  4s (x7 over 64s)   kubelet            Readiness probe failed: timeout: failed to connect service ":50051" within 1s
  Warning  Unhealthy  4s (x5 over 54s)   kubelet            Liveness probe failed: timeout: failed to connect service ":50051" within 1s

@tpalkot

tpalkot commented Jun 19, 2024

I have the same issue. I'm just trying to do a basic setup to get through a tutorial. MicroK8s, in case it matters.

operator-sdk version: "v1.35.0", commit: "e95abdbd5ccb7ca0fd586e0c6f578e491b0a025b", kubernetes version: "v1.28.0", go version: "go1.21.11", GOOS: "linux", GOARCH: "amd64"

Normal Started 3m55s (x2 over 5m36s) kubelet Started container registry-server
Warning Unhealthy 37s (x30 over 5m26s) kubelet Startup probe failed: timeout: failed to connect service ":50051" within 1s
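
The local timing check murphd40 ran earlier in this thread is one way to confirm whether slow catalog startup is what's tripping the startup probe here too; a sketch of that check, assuming Docker is available on the workstation:

$ docker run --rm -it --entrypoint=sh quay.io/operatorhubio/catalog:latest
sh-4.4$ /bin/opm serve /configs/    # note how long it takes for the "serving registry" line to appear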
