
olm install - operatorhubio-catalog Readiness probe failed #5945


Closed

fredcsfaria opened this issue Jul 12, 2022 · 7 comments
Labels: lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale.), triage/support (Indicates an issue that is a support question.)
Milestone: Backlog

Comments

@fredcsfaria

Bug Report

What did you do?

Installed OLM by running "operator-sdk olm install".

What did you expect to see?

The pods in the olm namespace in the Ready state.

What did you see instead? Under which circumstances?

The operatorhubio-catalog pod is not ready:

$ kubectl get pods -n olm
(screenshot of the pod list in the original issue)

$ kubectl describe pod operatorhubio-catalog-sxpwz -n olm
(screenshot of the pod description in the original issue)

Environment

Minikube 1.25.2

Operator type:

/language go

Kubernetes cluster type:

$ operator-sdk version

operator-sdk version: "v1.22.1", commit: "46ab175459a775d2fb9f0454d0b4a8850dd745ed", kubernetes version: "1.24.1", go version: "go1.18.3", GOOS: "linux", GOARCH: "amd64"

$ go version (if language is Go)

go version go1.18 linux/amd64

$ kubectl version

Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"clean", BuildDate:"2021-02-18T16:12:00Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.3", GitCommit:"816c97ab8cff8a1c72eccca1026f7820e93e0d25", GitTreeState:"clean", BuildDate:"2022-01-25T21:19:12Z", GoVersion:"go1.17.6", Compiler:"gc", Platform:"linux/amd64"}

Possible Solution

Additional context

@varshaprasad96
Member

@fredcsfaria we would need more context on the cluster being used. Usually this error occurs when there are not enough hosts available to schedule the pods on, or when there is a networking issue on the node where the particular pod is scheduled. Are there any logs from the pod that starts but never becomes healthy, or any errors from the API server that could be used to dig into this?
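
For reference, commands along these lines would gather that information (the pod name below is the one from the report above; substitute the failing pod in your cluster):

$ kubectl logs -n olm operatorhubio-catalog-sxpwz           # registry-server container logs
$ kubectl describe pod -n olm operatorhubio-catalog-sxpwz   # probe failures and pod events
$ kubectl get events -n olm --sort-by=.lastTimestamp        # recent events in the olm namespace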

@jberkhahn jberkhahn added the triage/support Indicates an issue that is a support question. label Jul 18, 2022
@jberkhahn jberkhahn added this to the Backlog milestone Jul 18, 2022
@murphd40

murphd40 commented Aug 9, 2022

I'm also seeing this behaviour:

$ kubectl get po

NAME                                READY   STATUS             RESTARTS   AGE
catalog-operator-7f6fdbc478-6b8j2   1/1     Running            0          110m
olm-operator-7f5fbcf9f9-bhljf       1/1     Running            0          110m
operatorhubio-catalog-v5w7b         0/1     CrashLoopBackOff   7          14m
operatorhubio-catalog-x2hbh         0/1     CrashLoopBackOff   19         49m
packageserver-7cf6d4dc5d-4w7xc      1/1     Running            0          109m
packageserver-7cf6d4dc5d-lr8qv      1/1     Running            0          109m
$ kubectl describe po operatorhubio-catalog-v5w7b
Name:         operatorhubio-catalog-v5w7b
Namespace:    olm
Priority:     0
Node:         node1/10.10.168.144
Start Time:   Tue, 09 Aug 2022 00:12:37 +0000
Labels:       olm.catalogSource=operatorhubio-catalog
              olm.pod-spec-hash=6bfddc89d8
Annotations:  cluster-autoscaler.kubernetes.io/safe-to-evict: true
              cni.projectcalico.org/podIP: 10.233.90.178/32
              cni.projectcalico.org/podIPs: 10.233.90.178/32
Status:       Running
IP:           10.233.90.178
IPs:
  IP:  10.233.90.178
Containers:
  registry-server:
    Container ID:   docker://abc02bb38b14eb1fd7b93f14a8209223eeb587d057d75ccefe8c919740309600
    Image:          quay.io/operatorhubio/catalog:latest
    Image ID:       docker-pullable://quay.io/operatorhubio/catalog@sha256:b6492a9fd6fc07ac0975939cd5828151e47b730eda130c75a109bcee3b93a969
    Port:           50051/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Tue, 09 Aug 2022 00:27:19 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Tue, 09 Aug 2022 00:21:31 +0000
      Finished:     Tue, 09 Aug 2022 00:22:07 +0000
    Ready:          False
    Restart Count:  8
    Requests:
      cpu:        10m
      memory:     50Mi
    Liveness:     exec [grpc_health_probe -addr=:50051] delay=10s timeout=5s period=10s #success=1 #failure=3
    Readiness:    exec [grpc_health_probe -addr=:50051] delay=5s timeout=5s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from operatorhubio-catalog-token-xrjtx (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  operatorhubio-catalog-token-xrjtx:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  operatorhubio-catalog-token-xrjtx
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
Tolerations:     node-role.kubernetes.io/master:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 120s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 120s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  15m                   default-scheduler  Successfully assigned olm/operatorhubio-catalog-v5w7b to node1
  Normal   Pulled     15m                   kubelet            Successfully pulled image "quay.io/operatorhubio/catalog:latest" in 405.142183ms
  Normal   Created    14m (x2 over 15m)     kubelet            Created container registry-server
  Normal   Pulled     14m                   kubelet            Successfully pulled image "quay.io/operatorhubio/catalog:latest" in 361.608421ms
  Normal   Started    14m (x2 over 15m)     kubelet            Started container registry-server
  Warning  Unhealthy  13m (x7 over 14m)     kubelet            Readiness probe failed: timeout: failed to connect service ":50051" within 1s
  Normal   Pulling    13m (x3 over 15m)     kubelet            Pulling image "quay.io/operatorhubio/catalog:latest"
  Warning  Unhealthy  13m (x6 over 14m)     kubelet            Liveness probe failed: timeout: failed to connect service ":50051" within 1s
  Normal   Killing    13m (x2 over 14m)     kubelet            Container registry-server failed liveness probe, will be restarted
  Normal   Pulled     13m                   kubelet            Successfully pulled image "quay.io/operatorhubio/catalog:latest" in 514.682595ms
  Warning  BackOff    4m56s (x28 over 11m)  kubelet            Back-off restarting failed container
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"132a687512d7fb058d0f5890f07d4121b3f0a2e2", GitTreeState:"clean", BuildDate:"2021-05-12T12:40:09Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"132a687512d7fb058d0f5890f07d4121b3f0a2e2", GitTreeState:"clean", BuildDate:"2021-05-12T12:32:49Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"linux/amd64"}

I ran the following experiment, manually running the container's entrypoint:

$ docker run --rm -it --entrypoint=sh quay.io/operatorhubio/catalog:latest
sh-4.4$ /bin/opm serve /configs/
WARN[0000] unable to set termination log path            error="open /dev/termination-log: permission denied"

INFO[0048] serving registry                              configs=/configs/ port=50051

It took around 50 seconds for the "serving registry" log message to appear. I believe that is the source of the issue I am experiencing: the liveness/readiness probes fail because the pod takes too long to initialize, and the pod gets stuck in a restart loop because the probes' initialDelaySeconds (5s for readiness, 10s for liveness, per the describe output above) is too short to cover that startup time.
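
For illustration only (not necessarily what the upstream fix does), a startup probe along these lines on the registry-server container would give the catalog time to come up before the liveness/readiness checks begin; the fields are standard Kubernetes Pod spec probe fields, and the values are guesses sized to the ~50 s startup observed above:

# hypothetical probe settings for the registry-server container (illustrative values)
startupProbe:
  exec:
    command: ["grpc_health_probe", "-addr=:50051"]
  periodSeconds: 10
  failureThreshold: 30     # allow up to ~5 minutes before the container is restarted
livenessProbe:
  exec:
    command: ["grpc_health_probe", "-addr=:50051"]
  periodSeconds: 10
  timeoutSeconds: 5

With a startup probe in place, Kubernetes holds off the liveness and readiness probes until it succeeds, so a slow-starting catalog is not killed during initialization.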

@murphd40

Looks like my issue will be fixed in the next release: This PR tackles the issue 👍 Thanks

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 9, 2022
@jmrodri
Member

jmrodri commented Nov 22, 2022

@jmrodri jmrodri closed this as completed Nov 22, 2022
@sybadm

sybadm commented Nov 17, 2023

K8S version: v1.27.3

I'm still facing this issue

$ kubectl create -f https://raw.githubusercontent.com/operator-framework/operator-lifecycle-manager/master/deploy/upstream/quickstart/crds.yaml
$ kubectl create -f https://raw.githubusercontent.com/operator-framework/operator-lifecycle-manager/master/deploy/upstream/quickstart/olm.yaml

Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  75s                default-scheduler  Successfully assigned olm/operatorhubio-catalog-n2cp2 to aks-systempool-27289948-vmss00000c
  Normal   Pulled     73s                kubelet            Successfully pulled image "quay.io/operatorhubio/catalog:latest" in 716.734417ms (716.844818ms including waiting)
  Normal   Pulling    34s (x2 over 74s)  kubelet            Pulling image "quay.io/operatorhubio/catalog:latest"
  Normal   Killing    34s                kubelet            Container registry-server failed liveness probe, will be restarted
  Warning  Unhealthy  34s                kubelet            Readiness probe errored: rpc error: code = NotFound desc = failed to exec in container: failed to load task: no running task found: task d796b71bb52dc8516ad4f1b347934703c842b488d855fde20918bd1021c7172a not found: not found
  Normal   Created    33s (x2 over 73s)  kubelet            Created container registry-server
  Normal   Started    33s (x2 over 73s)  kubelet            Started container registry-server
  Normal   Pulled     33s                kubelet            Successfully pulled image "quay.io/operatorhubio/catalog:latest" in 605.963458ms (605.976558ms including waiting)
  Warning  Unhealthy  4s (x7 over 64s)   kubelet            Readiness probe failed: timeout: failed to connect service ":50051" within 1s
  Warning  Unhealthy  4s (x5 over 54s)   kubelet            Liveness probe failed: timeout: failed to connect service ":50051" within 1s

@tpalkot

tpalkot commented Jun 19, 2024

I have the same issue. I'm just trying to do a basic setup to get through a tutorial. MicroK8s, in case it matters.

operator-sdk version: "v1.35.0", commit: "e95abdbd5ccb7ca0fd586e0c6f578e491b0a025b", kubernetes version: "v1.28.0", go version: "go1.21.11", GOOS: "linux", GOARCH: "amd64"

Normal Started 3m55s (x2 over 5m36s) kubelet Started container registry-server
Warning Unhealthy 37s (x30 over 5m26s) kubelet Startup probe failed: timeout: failed to connect service ":50051" within 1s
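
The local timing check murphd40 ran earlier in this thread is one way to confirm whether slow catalog startup is what's tripping the startup probe here too; a sketch of that check, assuming Docker is available on the workstation:

$ docker run --rm -it --entrypoint=sh quay.io/operatorhubio/catalog:latest
sh-4.4$ /bin/opm serve /configs/    # note how long it takes for the "serving registry" line to appear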
