Remove extra registry pod when it is created #2614

Conversation

akihikokuroda
Member

Signed-off-by: akihikokuroda [email protected]

Description of the change:

This PR adds a mutex block around the creation of the ConfigMap registry pod. I reproduced this error locally a couple of times and saw two registry pods created with the same configMapResourceVersion: two queue workers had created the pod at the same time. The mutex block prevents two workers from checking for and creating the registry pod at the same time. A simplified sketch of the idea is shown below.
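
As an illustration only (this is not the actual OLM reconciler; the type, fields, and client wiring below are simplified stand-ins), the guarded check-then-create looks roughly like this:

```go
package registry

import (
	"context"
	"sync"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

type reconciler struct {
	client kubernetes.Interface
	muPod  sync.Mutex // serializes the check-and-create of the registry pod
}

func (r *reconciler) ensurePod(ctx context.Context, namespace string, pod *v1.Pod) error {
	// Hold the lock across both the existence check and the create so two
	// workers cannot both observe "no pod" and create a duplicate.
	r.muPod.Lock()
	defer r.muPod.Unlock()

	_, err := r.client.CoreV1().Pods(namespace).Get(ctx, pod.GetName(), metav1.GetOptions{})
	if err == nil {
		return nil // registry pod already exists
	}
	if !apierrors.IsNotFound(err) {
		return err
	}
	_, err = r.client.CoreV1().Pods(namespace).Create(ctx, pod, metav1.CreateOptions{})
	return err
}
```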

Motivation for the change:
Closes #2613
Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive

@openshift-ci openshift-ci bot requested a review from awgreene February 3, 2022 20:40
@openshift-ci

openshift-ci bot commented Feb 3, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: akihikokuroda
To complete the pull request process, please assign dinhxuanvu after the PR has been reviewed.
You can assign the PR to them by writing /assign @dinhxuanvu in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot requested a review from timflannagan February 3, 2022 20:40
@akihikokuroda
Member Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 4, 2022
@akihikokuroda
Member Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 4, 2022
@akihikokuroda
Member Author

/hold
The mutex block doesn't seem to be enough because of the cached pod instances.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 4, 2022
@akihikokuroda
Member Author

/unhold
Changed the currentPods function not to use the cache to get the list of pods.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 4, 2022
Member

@njhale njhale left a comment


Thanks for submitting this. I have a few questions/comments before we merge:

@@ -166,6 +167,7 @@ type ConfigMapRegistryReconciler struct {
 	Lister   operatorlister.OperatorLister
 	OpClient operatorclient.ClientInterface
 	Image    string
+	MuPod    sync.RWMutex
Member


I don't think this mutex needs to be exported.

Member Author


It has been taken out.

@@ -354,6 +361,8 @@ func (c *ConfigMapRegistryReconciler) ensureRoleBinding(source configMapCatalogS

 func (c *ConfigMapRegistryReconciler) ensurePod(source configMapCatalogSourceDecorator, overwrite bool) error {
 	pod := source.Pod(c.Image)
+	c.MuPod.Lock()
Member

@njhale njhale Feb 11, 2022


It's not clear to me that adding a mutex solves the problem at hand. The queues that are used key on name/namespace and should never allow the same resource to be processed by more than one worker concurrently.

IMO, an alternative explanation for the behavior you're seeing is that the resource was processed more than once, serially, before the cache that backs the Lister -- which you've swapped with a client call above -- could be updated. This could have led to cache misses, and the resource being created again.

I would feel much more comfortable if you could produce a test, unit or e2e, that fails for the original code but passes with these changes.
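
For reference, here is a small standalone illustration (not OLM code) of the workqueue behavior referred to above: a namespace/name key added twice is coalesced while queued, and a key that is currently being processed is not handed to a second worker.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.New()
	q.Add("operators/mock-ocs-main") // key = namespace/name
	q.Add("operators/mock-ocs-main") // duplicate is coalesced while queued

	fmt.Println(q.Len()) // 1

	key, _ := q.Get()
	// While a worker holds the key, a re-Add marks it dirty but does not queue it,
	// so no second worker can process the same key concurrently.
	q.Add(key)
	fmt.Println(q.Len()) // 0

	q.Done(key)
	fmt.Println(q.Len()) // 1 - the dirty key is queued again after Done
}
```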

Member Author


Thanks for the review. I saw that two pods were created, so I guessed at the cause. I'll look into it in more detail and capture what is happening. I'm not sure I can create a test that reproduces this consistently, but I'll try anyway.

@akihikokuroda
Member Author

It must be cache misses. Here is the catalog operator pod log from when the CI test failed. The catsrc id=wEdnr and catsrc id=iFe3m syncs ran back to back, and the id=iFe3m sync saw check registry server healthy: false even after id=wEdnr had already ensured the registry server.

time="2022-01-28T18:39:08Z" level=debug msg="syncing catsrc" id=wEdnr source=mock-ocs-main-w9dzd
time="2022-01-28T18:39:08Z" level=debug msg="checking catsrc configmap state" id=wEdnr source=mock-ocs-main-w9dzd
time="2022-01-28T18:39:08Z" level=debug msg="check registry server healthy: false" id=wEdnr source=mock-ocs-main-w9dzd
time="2022-01-28T18:39:08Z" level=debug msg="ensuring registry server" id=wEdnr source=mock-ocs-main-w9dzd
time="2022-01-28T18:39:08Z" level=info msg="syncing catalog source for annotation templates" catSrcName=mock-ocs-main-w9dzd catSrcNamespace=operators id=ev0O2
time="2022-01-28T18:39:08Z" level=debug msg="this catalog source is not participating in template replacement" catSrcName=mock-ocs-main-w9dzd catSrcNamespace=operators id=ev0O2
time="2022-01-28T18:39:08Z" level=debug msg="RemoveStatusConditions - request to remove status conditions did not result in any changes, so updates were not made" catSrcName=mock-ocs-main-w9dzd catSrcNamespace=operators id=ev0O2
time="2022-01-28T18:39:08Z" level=debug msg="Got source event: grpc.SourceState{Key:registry.CatalogKey{Name:\"mock-ocs-main-w9dzd\", Namespace:\"operators\"}, State:1}"
time="2022-01-28T18:39:08Z" level=info msg="state.Key.Namespace=operators state.Key.Name=mock-ocs-main-w9dzd state.State=CONNECTING"
time="2022-01-28T18:39:08Z" level=debug msg="Got source event: grpc.SourceState{Key:registry.CatalogKey{Name:\"mock-ocs-main-w9dzd\", Namespace:\"operators\"}, State:3}"
time="2022-01-28T18:39:08Z" level=info msg="state.Key.Namespace=operators state.Key.Name=mock-ocs-main-w9dzd state.State=TRANSIENT_FAILURE"
time="2022-01-28T18:39:08Z" level=debug msg="ensured registry server" id=wEdnr source=mock-ocs-main-w9dzd
time="2022-01-28T18:39:08Z" level=debug msg="syncing catsrc" id=iFe3m source=mock-ocs-main-w9dzd
time="2022-01-28T18:39:08Z" level=debug msg="checking catsrc configmap state" id=iFe3m source=mock-ocs-main-w9dzd
time="2022-01-28T18:39:08Z" level=debug msg="check registry server healthy: false" id=iFe3m source=mock-ocs-main-w9dzd
time="2022-01-28T18:39:08Z" level=debug msg="ensuring registry server" id=iFe3m source=mock-ocs-main-w9dzd
time="2022-01-28T18:39:08Z" level=info msg="syncing catalog source for annotation templates" catSrcName=mock-ocs-main-w9dzd catSrcNamespace=operators id=sY5K/
time="2022-01-28T18:39:08Z" level=debug msg="this catalog source is not participating in template replacement" catSrcName=mock-ocs-main-w9dzd catSrcNamespace=operators id=sY5K/
time="2022-01-28T18:39:08Z" level=debug msg="RemoveStatusConditions - request to remove status conditions did not result in any changes, so updates were not made" catSrcName=mock-ocs-main-w9dzd catSrcNamespace=operators id=sY5K/
time="2022-01-28T18:39:08Z" level=debug msg="ensured registry server" id=iFe3m source=mock-ocs-main-w9dzd

I changed currentPods and currentPodsWithCorrectResourceVersion not to use the Lister when checking for the registry pod. This should let the second sync see the registry pod that was already created.
I couldn't write a test that causes this failure with the original code; it seems to come down to a very small timing gap.

@akihikokuroda akihikokuroda changed the title put mutex block around registry pod creation use kubernetes api to list pods instead of Lister in currentPods method Feb 15, 2022
@@ -214,28 +214,36 @@ func (c *ConfigMapRegistryReconciler) currentRoleBinding(source configMapCatalog

 func (c *ConfigMapRegistryReconciler) currentPods(source configMapCatalogSourceDecorator, image string) []*v1.Pod {
 	podName := source.Pod(image).GetName()
-	pods, err := c.Lister.CoreV1().PodLister().Pods(source.GetNamespace()).List(labels.SelectorFromSet(source.Selector()))
+	pods, err := c.OpClient.KubernetesInterface().CoreV1().Pods(source.GetNamespace()).List(context.TODO(), metav1.ListOptions{LabelSelector: labels.SelectorFromSet(source.Selector()).String()})
Member


This may technically solve the issue, but if at all possible, talking directly to the kube api for read requests should be avoided.

The operator is already watching pods, so it will eventually know about all of these pods, and this just adds load to the apiserver.
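
As a sketch of the pattern the reviewer prefers (with illustrative names and wiring, not the OLM reconciler's actual setup), reads can be served from an informer-backed lister so they hit the local cache kept current by the watch, rather than the apiserver:

```go
package example

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// listRegistryPodsFromCache lists pods from the shared informer's local cache
// instead of issuing a List request against the apiserver on every sync.
func listRegistryPodsFromCache(client kubernetes.Interface, stopCh <-chan struct{}) {
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podLister := factory.Core().V1().Pods().Lister() // registers the pod informer

	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)

	pods, err := podLister.Pods("operators").List(labels.Everything())
	if err != nil {
		return
	}
	fmt.Printf("found %d pods in the cache\n", len(pods))
}
```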

Member Author

@akihikokuroda akihikokuroda Feb 16, 2022


Thanks for the comments. OK. The second pod stays up until the next change to the CatalogSource. I can probably change the code to stop the second pod sooner, without issuing the read requests, instead of preventing the second pod's creation. Or, if multiple pods are not a concern, I can change the e2e test so it doesn't require a single pod.
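
A rough sketch of that alternative, assuming the sync already holds the list of matching registry pods (illustrative only, not the code from this PR): keep one pod and delete any extras.

```go
package registry

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// removeExtraPods keeps the first registry pod in the list and deletes the rest,
// so a duplicate created by a racing sync is cleaned up on the next pass.
func removeExtraPods(ctx context.Context, client kubernetes.Interface, namespace string, pods []*v1.Pod) error {
	if len(pods) <= 1 {
		return nil // nothing to clean up
	}
	for _, pod := range pods[1:] {
		if err := client.CoreV1().Pods(namespace).Delete(ctx, pod.GetName(), metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```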

@akihikokuroda akihikokuroda changed the title use kubernetes api to list pods instead of Lister in currentPods method Remove extra registry pod when it is created Feb 16, 2022
@akihikokuroda akihikokuroda force-pushed the registrypodcreation branch 3 times, most recently from dab650b to bacd936 Compare February 22, 2022 03:22
@perdasilva
Collaborator

Closing PR as stale. Please re-open if it's still important.

@perdasilva perdasilva closed this Feb 19, 2024

Successfully merging this pull request may close these issues.

e2e "config map update triggers registry pod rollout" failure