wait for catalogsource status ready before creating subscription #2601

Merged

2 changes: 2 additions & 0 deletions test/e2e/subscription_e2e_test.go
@@ -95,6 +95,8 @@ var _ = Describe("Subscription", func() {
}

_, teardown = createInternalCatalogSource(ctx.Ctx().KubeClient(), ctx.Ctx().OperatorClient(), "test-catalog", generatedNamespace.GetName(), packages, crds, csvs)
_, err := fetchCatalogSourceOnStatus(ctx.Ctx().OperatorClient(), "test-catalog", generatedNamespace.GetName(), catalogSourceRegistryPodSynced)
Member

Nice catch - I was under the impression that we had made a clean sweep of every place where we instantiate a grpc-based CatalogSource and then create a Subscription, but this one feels easy to miss given the setup isn't super readable. It would be nice to avoid hardcoding "test-catalog" in two places here, but I won't block the PR for this.

Member

Actually, I'm somewhat second-guessing this the more I think about it. I haven't played around with this locally, but looking at that test case failure output, it's not immediately clear to me why simply waiting for the CatalogSource to report a "ready" state is what we need here. Were you able to reproduce this test case failure locally?

Member Author

I didn't reproduce this error locally, but I saw the following in the catalog-operator log of the CI e2e failure:

2022-01-19T20:00:24.250526731Z stderr F time="2022-01-19T20:00:24Z" level=debug msg="syncing catsrc" id=Zfz5K source=test-catalog
2022-01-19T20:00:24.250530131Z stderr F time="2022-01-19T20:00:24Z" level=debug msg="checking catsrc configmap state" id=Zfz5K source=test-catalog
2022-01-19T20:00:24.251445279Z stderr F time="2022-01-19T20:00:24Z" level=debug msg="check registry server healthy: true" id=Zfz5K source=test-catalog
2022-01-19T20:00:24.25145768Z stderr F time="2022-01-19T20:00:24Z" level=debug msg="registry state good" id=Zfz5K source=test-catalog
2022-01-19T20:00:28.931802007Z stderr F time="2022-01-19T20:00:28Z" level=debug msg="Got source event: grpc.SourceState{Key:registry.CatalogKey{Name:\"test-catalog\", Namespace:\"subscription-e2e-gcqhv\"}, State:3}"
2022-01-19T20:00:28.931816007Z stderr F time="2022-01-19T20:00:28Z" level=info msg="state.Key.Namespace=subscription-e2e-gcqhv state.Key.Name=test-catalog state.State=TRANSIENT_FAILURE"
2022-01-19T20:00:28.931824208Z stderr F time="2022-01-19T20:00:28Z" level=debug msg="syncing catsrc" id=j7VvG source=test-catalog
2022-01-19T20:00:28.931827808Z stderr F time="2022-01-19T20:00:28Z" level=debug msg="checking catsrc configmap state" id=j7VvG source=test-catalog
2022-01-19T20:00:28.939247402Z stderr F time="2022-01-19T20:00:28Z" level=debug msg="check registry server healthy: true" id=j7VvG source=test-catalog
2022-01-19T20:00:28.939260203Z stderr F time="2022-01-19T20:00:28Z" level=debug msg="registry state good" id=j7VvG source=test-catalog
2022-01-19T20:00:28.956912641Z stderr F I0119 20:00:28.955396       1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"", Name:"subscription-e2e-gcqhv", UID:"dfe83254-ba15-438f-badf-dd3b79c12036", APIVersion:"v1", ResourceVersion:"815", FieldPath:""}): type: 'Warning' reason: 'ResolutionFailed' [error using catalog test-catalog (in namespace subscription-e2e-gcqhv): failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.96.185.140:50051: connect: connection refused", error using catalog operatorhubio-catalog (in namespace operator-lifecycle-manager): failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.96.27.145:50051: connect: connection refused"]
2022-01-19T20:00:28.956949443Z stderr F I0119 20:00:28.956796       1 event.go:282] Event(v1.ObjectReference{Kind:"Namespace", Namespace:"", Name:"subscription-e2e-gcqhv", UID:"dfe83254-ba15-438f-badf-dd3b79c12036", APIVersion:"v1", ResourceVersion:"815", FieldPath:""}): type: 'Warning' reason: 'ResolutionFailed' [error using catalog test-catalog (in namespace subscription-e2e-gcqhv): failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.96.185.140:50051: connect: connection refused", error using catalog operatorhubio-catalog (in namespace operator-lifecycle-manager): failed to list bundles: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.96.27.145:50051: connect: connection refused"]

This shows that the latest gRPC connection state is TRANSIENT_FAILURE, while the CatalogSource sync still reports "check registry server healthy: true" and "registry state good".
The Subscription is then created, issues the list bundles request, and that request fails.

The CatalogSource sync checks whether the registry pod is up and whether the resources for the registry (service, service account, role, rolebinding, etc.) are OK.
The gRPC connection state is tracked separately.

Member

Thanks - that explanation sounds reasonable to me. In any case, this change is harmless, so we can always re-open this issue if we misdiagnosed the root cause.

/approve
/lgtm

Expect(err).NotTo(HaveOccurred())

createSubscriptionForCatalog(ctx.Ctx().OperatorClient(), generatedNamespace.GetName(), "test-subscription", "test-catalog", "root", "channel-root", "", operatorsv1alpha1.ApprovalAutomatic)
})