e2e "ClusterServiceVersion when a csv exists specifying two replicas with one max unavailable remains in phase Succeeded when only one pod is available" failure #2553

Closed
akihikokuroda opened this issue Jan 7, 2022 · 2 comments · Fixed by #2554
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@akihikokuroda
Member

akihikokuroda commented Jan 7, 2022

Bug Report

e2e test failure in CI. This flake is listed in #2401.
Recent occurrence: https://github.com/operator-framework/operator-lifecycle-manager/runs/4718817543?check_suite_focus=true#step:4:9568

ClusterServiceVersion when a csv exists specifying two replicas with one max unavailable
  remains in phase Succeeded when only one pod is available
  /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/csv_e2e_test.go:504

[BeforeEach] ClusterServiceVersion
  /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/csv_e2e_test.go:55
[BeforeEach] when a csv exists specifying two replicas with one max unavailable
  /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/csv_e2e_test.go:413
[It] remains in phase Succeeded when only one pod is available
  /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/csv_e2e_test.go:504
[AfterEach] ClusterServiceVersion
  /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/csv_e2e_test.go:60
cleaning up ephemeral test resources...
deleting test subscriptions...
deleting test installplans...
deleting test catalogsources...
deleting test crds...
deleting test csvs...
test resources deleted


• Failure [4.748 seconds]
ClusterServiceVersion
/home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/csv_e2e_test.go:43
  when a csv exists specifying two replicas with one max unavailable
  /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/csv_e2e_test.go:404
    remains in phase Succeeded when only one pod is available [It]
    /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/csv_e2e_test.go:504

    Failed after 1.049s.
    Expected
        <v1alpha1.ClusterServiceVersionPhase>: InstallReady
    to equal
        <v1alpha1.ClusterServiceVersionPhase>: Succeeded

    /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/csv_e2e_test.go:513

    Full Stack Trace
    github.com/operator-framework/operator-lifecycle-manager/test/e2e.glob..func6.5.2()
        /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/csv_e2e_test.go:513 +0x4c3
    github.com/operator-framework/operator-lifecycle-manager/test/e2e.TestEndToEnd(0xc0007ce780)
        /home/runner/work/operator-lifecycle-manager/operator-lifecycle-manager/test/e2e/e2e_test.go:59 +0x2d6
    testing.tRunner(0xc0007ce780, 0x3580d80)
        /opt/hostedtoolcache/go/1.16.12/x64/src/testing/testing.go:1193 +0xef
    created by testing.(*T).Run
        /opt/hostedtoolcache/go/1.16.12/x64/src/testing/testing.go:1238 +0x2b3

What did you do?
CI e2e test failed

What did you expect to see?
No failure

What did you see instead? Under which circumstances?
The CSV was in phase InstallReady when the test expected Succeeded (see the failure output above).

Environment

  • operator-lifecycle-manager version:
  • Kubernetes version information:
  • Kubernetes cluster kind:

Possible Solution

Additional context

@akihikokuroda akihikokuroda added the kind/bug Categorizes issue or PR as related to a bug. label Jan 7, 2022
@akihikokuroda
Member Author

Here is the olm-operator log with some additional log entries (marked with ZZZZ).
My comments are prefixed with ####

#### CSV is in Succeeded phase - the e2e test waits for this condition before it deletes one of the pods of the deployment
time="2022-01-06T20:35:16Z" level=debug msg="syncing CSV" csv=test-csv7mcjr id=hiOs1 namespace=operators phase=Succeeded
time="2022-01-06T20:35:16Z" level=debug msg="annotations correct" annotationTargets= opgroupTargets=
time="2022-01-06T20:35:16Z" level=debug msg="csv in operatorgroup" csv=test-csv7mcjr id=AkFIK namespace=operators opgroup=global-operators phase=Succeeded
time="2022-01-06T20:35:16Z" level=debug msg="no intersecting operatorgroups provide the same apis" apis= csv=test-csv7mcjr id=AkFIK namespace=operators phase=Succeeded
time="2022-01-06T20:35:16Z" level=info msg="checking test-csv7mcjr"
time="2022-01-06T20:35:16Z" level=debug msg="checking if csv is replacing an older version"
#### The deployment still has only one available replica (pod)
time="2022-01-06T20:35:16Z" level=debug msg="ZZZZ: DeploymentStatus: {ObservedGeneration:2 Replicas:2 UpdatedReplicas:2 ReadyReplicas:1 AvailableReplicas:1 UnavailableReplicas:1 Conditions:[{Type:Available Status:True LastUpdateTime:2022-01-06 20:35:16 +0000 UTC LastTransitionTime:2022-01-06 20:35:16 +0000 UTC Reason:MinimumReplicasAvailable Message:Deployment has minimum availability.} {Type:Progressing Status:True LastUpdateTime:2022-01-06 20:35:16 +0000 UTC LastTransitionTime:2022-01-06 20:35:10 +0000 UTC Reason:ReplicaSetUpdated Message:ReplicaSet \"deployment-646788556d\" is progressing.}] CollisionCount:<nil>}"
time="2022-01-06T20:35:16Z" level=debug msg="failed to get the owner subscription csv=test-csv7mcjr"
time="2022-01-06T20:35:16Z" level=debug msg="no env var to inject into csv" csv=test-csv7mcjr
time="2022-01-06T20:35:16Z" level=debug msg="copying CSV" csv=test-csv7mcjr id=v+1Qv namespace=operators phase=Succeeded
time="2022-01-06T20:35:16Z" level=debug msg="copying csv to targets" csv=test-csv7mcjr id=v+1Qv namespace=operators phase=Succeeded targetNamespaces=
time="2022-01-06T20:35:16Z" level=debug msg="opgroup is global" csv=test-csv7mcjr opgroup=global-operators
time="2022-01-06T20:35:16Z" level=debug msg="global operator has correct global permissions" csv=test-csv7mcjr opgroup=global-operators
time="2022-01-06T20:35:16Z" level=debug msg="done syncing CSV" csv=test-csv7mcjr id=hiOs1 namespace=operators phase=Succeeded
time="2022-01-06T20:35:16Z" level=debug msg="requeueing owner csvs from owner label" name=deployment namespace=operators self=
time="2022-01-06T20:35:16Z" level=debug msg="requeueing owner csvs" name=deployment namespace=operators self=
time="2022-01-06T20:35:16Z" level=debug msg="syncing CSV" csv=test-csv7mcjr id=ooBk5 namespace=operators phase=Succeeded
time="2022-01-06T20:35:16Z" level=debug msg="annotations correct" annotationTargets= opgroupTargets=
time="2022-01-06T20:35:16Z" level=debug msg="csv in operatorgroup" csv=test-csv7mcjr id=8jLOl namespace=operators opgroup=global-operators phase=Succeeded
time="2022-01-06T20:35:16Z" level=debug msg="no intersecting operatorgroups provide the same apis" apis= csv=test-csv7mcjr id=8jLOl namespace=operators phase=Succeeded
time="2022-01-06T20:35:16Z" level=info msg="checking test-csv7mcjr"
time="2022-01-06T20:35:16Z" level=debug msg="checking if csv is replacing an older version"
2022-01-06T20:35:16.937Z	DEBUG	controllers.adoption	reconciling csv	{"request": "operators/test-csv7mcjr"}
time="2022-01-06T20:35:16Z" level=debug msg="copying CSV" csv=test-csv7mcjr id=hCYe4 namespace=operators phase=Succeeded
time="2022-01-06T20:35:16Z" level=debug msg="copying csv to targets" csv=test-csv7mcjr id=hCYe4 namespace=operators phase=Succeeded targetNamespaces=
2022-01-06T20:35:16.958Z	DEBUG	controllers.adoption	reconciling csv	{"request": "operators/test-csv7mcjr"}
time="2022-01-06T20:35:16Z" level=debug msg="requeueing owner csvs from owner label" name=deployment namespace=operators self=
time="2022-01-06T20:35:16Z" level=debug msg="requeueing owner csvs" name=deployment namespace=operators self=
#### The e2e test deletes one of the pods (replicas) of the deployment. The number of available replicas becomes 0.
time="2022-01-06T20:35:16Z" level=debug msg="ZZZZ: DeploymentStatus: {ObservedGeneration:2 Replicas:2 UpdatedReplicas:2 ReadyReplicas:0 AvailableReplicas:0 UnavailableReplicas:2 Conditions:[{Type:Available Status:False LastUpdateTime:2022-01-06 20:35:16 +0000 UTC LastTransitionTime:2022-01-06 20:35:16 +0000 UTC Reason:MinimumReplicasUnavailable Message:Deployment does not have minimum availability.} {Type:Progressing Status:True LastUpdateTime:2022-01-06 20:35:16 +0000 UTC LastTransitionTime:2022-01-06 20:35:10 +0000 UTC Reason:ReplicaSetUpdated Message:ReplicaSet \"deployment-646788556d\" is progressing.}] CollisionCount:<nil>}"
#### This causes the "operator not installed" condition.
time="2022-01-06T20:35:16Z" level=debug msg="operator not installed" error="waiting for deployment deployment to become ready: deployment \"deployment\" not available: Deployment does not have minimum availability."
time="2022-01-06T20:35:16Z" level=warning msg="unhealthy component: waiting for deployment deployment to become ready: deployment \"deployment\" not available: Deployment does not have minimum availability." csv=test-csv7mcjr id=8jLOl namespace=operators phase=Succeeded strategy=deployment
I0106 20:35:16.977736       1 event.go:282] Event(v1.ObjectReference{Kind:"ClusterServiceVersion", Namespace:"operators", Name:"test-csv7mcjr", UID:"1a51d7bb-de6a-4585-8743-ae2cb80b8448", APIVersion:"operators.coreos.com/v1alpha1", ResourceVersion:"172612", FieldPath:""}): type: 'Warning' reason: 'ComponentUnhealthy' installing: waiting for deployment deployment to become ready: deployment "deployment" not available: Deployment does not have minimum availability.
time="2022-01-06T20:35:17Z" level=debug msg="opgroup is global" csv=test-csv7mcjr opgroup=global-operators
time="2022-01-06T20:35:17Z" level=debug msg="global operator has correct global permissions" csv=test-csv7mcjr opgroup=global-operators
time="2022-01-06T20:35:17Z" level=debug msg="done syncing CSV" csv=test-csv7mcjr id=ooBk5 namespace=operators phase=Succeeded
time="2022-01-06T20:35:17Z" level=debug msg="syncing CSV" csv=test-csv7mcjr id=BznIV namespace=operators phase=Failed
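For reference, the "minimum availability" check that flips between the two ZZZZ status snapshots above can be sketched as a pure function. This is a simplified illustration, not OLM or Kubernetes source; the helper names are made up:

```go
package main

import "fmt"

// minAvailable returns the minimum number of pods a Deployment with the
// given replica count and maxUnavailable setting must keep available to
// satisfy the MinimumReplicasAvailable condition.
func minAvailable(replicas, maxUnavailable int32) int32 {
	return replicas - maxUnavailable
}

// meetsMinimumAvailability mirrors the two DeploymentStatus snapshots in
// the log above: 1 available replica passes, 0 available replicas fails.
func meetsMinimumAvailability(available, replicas, maxUnavailable int32) bool {
	return available >= minAvailable(replicas, maxUnavailable)
}

func main() {
	// replicas=2, maxUnavailable=1, as in the test's deployment spec.
	fmt.Println(meetsMinimumAvailability(1, 2, 1)) // first snapshot: true
	fmt.Println(meetsMinimumAvailability(0, 2, 1)) // second snapshot: false
}
```

With two replicas and one max unavailable, the deployment tolerates losing exactly one pod; deleting a pod before the second replica is available drops the count to zero and fails the check.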

@akihikokuroda
Member Author

There is a timing gap between the CSV reaching the Succeeded phase and both pods (replicas) actually being up and ready. When the e2e test deletes a pod in this gap, the CSV enters the Failed phase.
