Remove extra registry pod when it is created #2614

Conversation

akihikokuroda
Member

Signed-off-by: akihikokuroda [email protected]

Description of the change:

This PR adds a mutex block around the creation of the ConfigMap registry pod. I reproduced this error locally a couple of times and saw two registry pods created with the same configMapResourceVersion: two queue workers had created the pod at the same time. The mutex block prevents two workers from checking for and creating the registry pod at the same time. A simplified sketch of the idea is shown below.
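
As an illustration only (this is not the actual OLM reconciler; the type, fields, and client wiring below are simplified stand-ins), the guarded check-then-create looks roughly like this:

```go
package registry

import (
	"context"
	"sync"

	v1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

type reconciler struct {
	client kubernetes.Interface
	muPod  sync.Mutex // serializes the check-and-create of the registry pod
}

func (r *reconciler) ensurePod(ctx context.Context, namespace string, pod *v1.Pod) error {
	// Hold the lock across both the existence check and the create so two
	// workers cannot both observe "no pod" and create a duplicate.
	r.muPod.Lock()
	defer r.muPod.Unlock()

	_, err := r.client.CoreV1().Pods(namespace).Get(ctx, pod.GetName(), metav1.GetOptions{})
	if err == nil {
		return nil // registry pod already exists
	}
	if !apierrors.IsNotFound(err) {
		return err
	}
	_, err = r.client.CoreV1().Pods(namespace).Create(ctx, pod, metav1.CreateOptions{})
	return err
}
```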

Motivation for the change:
Closes #2613
Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive

@openshift-ci openshift-ci bot requested a review from awgreene February 3, 2022 20:40
@openshift-ci

openshift-ci bot commented Feb 3, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: akihikokuroda
To complete the pull request process, please assign dinhxuanvu after the PR has been reviewed.
You can assign the PR to them by writing /assign @dinhxuanvu in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot requested a review from timflannagan February 3, 2022 20:40
@akihikokuroda
Member Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 4, 2022
@akihikokuroda
Member Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 4, 2022
@akihikokuroda
Member Author

/hold
The mutex block doesn't seem to be enough because of the cached pod instances.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 4, 2022
@akihikokuroda
Member Author

/unhold
Changed the currentPods function not to use the cache to get the list of pods.

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 4, 2022
Member

@njhale njhale left a comment


Thanks for submitting this. I have a few questions/comments before we merge:

@@ -166,6 +167,7 @@ type ConfigMapRegistryReconciler struct {
 	Lister   operatorlister.OperatorLister
 	OpClient operatorclient.ClientInterface
 	Image    string
+	MuPod    sync.RWMutex
Member


I don't think this mutex needs to be exported.

Member Author


It has been taken out.

@@ -354,6 +361,8 @@ func (c *ConfigMapRegistryReconciler) ensureRoleBinding(source configMapCatalogS

 func (c *ConfigMapRegistryReconciler) ensurePod(source configMapCatalogSourceDecorator, overwrite bool) error {
 	pod := source.Pod(c.Image)
+	c.MuPod.Lock()
Member

@njhale njhale Feb 11, 2022


It's not clear to me that adding a mutex solves the problem at hand. The queues that are used key on name/namespace and should never allow the same resource to be processed by more than one worker concurrently.

IMO, an alternative explanation for the behavior you're seeing is that the resource was processed more than once, serially, before the cache that backs the Lister -- which you've swapped with a client call above -- could be updated. This could have led to cache misses, and the resource being created again.

I would feel much more comfortable if you could produce a test, unit or e2e, that fails for the original code but passes with these changes.
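
For reference, here is a small standalone illustration (not OLM code) of the workqueue behavior referred to above: a namespace/name key added twice is coalesced while queued, and a key that is currently being processed is not handed to a second worker.

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	q := workqueue.New()
	q.Add("operators/mock-ocs-main") // key = namespace/name
	q.Add("operators/mock-ocs-main") // duplicate is coalesced while queued

	fmt.Println(q.Len()) // 1

	key, _ := q.Get()
	// While a worker holds the key, a re-Add marks it dirty but does not queue it,
	// so no second worker can process the same key concurrently.
	q.Add(key)
	fmt.Println(q.Len()) // 0

	q.Done(key)
	fmt.Println(q.Len()) // 1 - the dirty key is queued again after Done
}
```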

Member Author


Thanks for the review. I saw that two pods were created, so I guessed at the cause. I'll look into it in more detail and capture what is happening. I'm not sure I can create a test that reproduces this consistently, but I'll try anyway.

@akihikokuroda
Member Author

It must be cache misses. Here is the catalog operator pod log from when the CI test failed. The catsrc id=wEdnr and catsrc id=iFe3m syncs ran back to back, and the id=iFe3m sync saw check registry server healthy: false even after id=wEdnr had already ensured the registry server.

time="2022-01-28T18:39:08Z" level=debug msg="syncing catsrc" id=wEdnr source=mock-ocs-main-w9dzd
time="2022-01-28T18:39:08Z" level=debug msg="checking catsrc configmap state" id=wEdnr source=mock-ocs-main-w9dzd
time="2022-01-28T18:39:08Z" level=debug msg="check registry server healthy: false" id=wEdnr source=mock-ocs-main-w9dzd
time="2022-01-28T18:39:08Z" level=debug msg="ensuring registry server" id=wEdnr source=mock-ocs-main-w9dzd
time="2022-01-28T18:39:08Z" level=info msg="syncing catalog source for annotation templates" catSrcName=mock-ocs-main-w9dzd catSrcNamespace=operators id=ev0O2
time="2022-01-28T18:39:08Z" level=debug msg="this catalog source is not participating in template replacement" catSrcName=mock-ocs-main-w9dzd catSrcNamespace=operators id=ev0O2
time="2022-01-28T18:39:08Z" level=debug msg="RemoveStatusConditions - request to remove status conditions did not result in any changes, so updates were not made" catSrcName=mock-ocs-main-w9dzd catSrcNamespace=operators id=ev0O2
time="2022-01-28T18:39:08Z" level=debug msg="Got source event: grpc.SourceState{Key:registry.CatalogKey{Name:\"mock-ocs-main-w9dzd\", Namespace:\"operators\"}, State:1}"
time="2022-01-28T18:39:08Z" level=info msg="state.Key.Namespace=operators state.Key.Name=mock-ocs-main-w9dzd state.State=CONNECTING"
time="2022-01-28T18:39:08Z" level=debug msg="Got source event: grpc.SourceState{Key:registry.CatalogKey{Name:\"mock-ocs-main-w9dzd\", Namespace:\"operators\"}, State:3}"
time="2022-01-28T18:39:08Z" level=info msg="state.Key.Namespace=operators state.Key.Name=mock-ocs-main-w9dzd state.State=TRANSIENT_FAILURE"
time="2022-01-28T18:39:08Z" level=debug msg="ensured registry server" id=wEdnr source=mock-ocs-main-w9dzd
time="2022-01-28T18:39:08Z" level=debug msg="syncing catsrc" id=iFe3m source=mock-ocs-main-w9dzd
time="2022-01-28T18:39:08Z" level=debug msg="checking catsrc configmap state" id=iFe3m source=mock-ocs-main-w9dzd
time="2022-01-28T18:39:08Z" level=debug msg="check registry server healthy: false" id=iFe3m source=mock-ocs-main-w9dzd
time="2022-01-28T18:39:08Z" level=debug msg="ensuring registry server" id=iFe3m source=mock-ocs-main-w9dzd
time="2022-01-28T18:39:08Z" level=info msg="syncing catalog source for annotation templates" catSrcName=mock-ocs-main-w9dzd catSrcNamespace=operators id=sY5K/
time="2022-01-28T18:39:08Z" level=debug msg="this catalog source is not participating in template replacement" catSrcName=mock-ocs-main-w9dzd catSrcNamespace=operators id=sY5K/
time="2022-01-28T18:39:08Z" level=debug msg="RemoveStatusConditions - request to remove status conditions did not result in any changes, so updates were not made" catSrcName=mock-ocs-main-w9dzd catSrcNamespace=operators id=sY5K/
time="2022-01-28T18:39:08Z" level=debug msg="ensured registry server" id=iFe3m source=mock-ocs-main-w9dzd

I changed currentPods and currentPodsWithCorrectResourceVersion not to use the Lister when checking for the registry pod. This should let the second sync see the registry pod that was already created.
I couldn't write a test that causes this failure with the original code; it seems to come down to a very small timing gap.

@akihikokuroda akihikokuroda changed the title put mutex block around registry pod creation use kubernetes api to list pods instead of Lister in currentPods method Feb 15, 2022
@@ -214,28 +214,36 @@ func (c *ConfigMapRegistryReconciler) currentRoleBinding(source configMapCatalog

 func (c *ConfigMapRegistryReconciler) currentPods(source configMapCatalogSourceDecorator, image string) []*v1.Pod {
 	podName := source.Pod(image).GetName()
-	pods, err := c.Lister.CoreV1().PodLister().Pods(source.GetNamespace()).List(labels.SelectorFromSet(source.Selector()))
+	pods, err := c.OpClient.KubernetesInterface().CoreV1().Pods(source.GetNamespace()).List(context.TODO(), metav1.ListOptions{LabelSelector: labels.SelectorFromSet(source.Selector()).String()})
Member


This may technically solve the issue, but if at all possible, talking directly to the kube api for read requests should be avoided.

The operator is already watching pods, so it will eventually know about all of these pods, and this just adds load to the apiserver.
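
As a sketch of the pattern the reviewer prefers (with illustrative names and wiring, not the OLM reconciler's actual setup), reads can be served from an informer-backed lister so they hit the local cache kept current by the watch, rather than the apiserver:

```go
package example

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// listRegistryPodsFromCache lists pods from the shared informer's local cache
// instead of issuing a List request against the apiserver on every sync.
func listRegistryPodsFromCache(client kubernetes.Interface, stopCh <-chan struct{}) {
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podLister := factory.Core().V1().Pods().Lister() // registers the pod informer

	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)

	pods, err := podLister.Pods("operators").List(labels.Everything())
	if err != nil {
		return
	}
	fmt.Printf("found %d pods in the cache\n", len(pods))
}
```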

Member Author

@akihikokuroda akihikokuroda Feb 16, 2022


Thanks for the comments. OK. The second pod stays up until the next change to the CatalogSource. I can probably change the code to stop the second pod sooner, without issuing the read requests, instead of preventing the second pod's creation. Or, if multiple pods are not a concern, I can change the e2e test so it doesn't require a single pod.
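
A rough sketch of that alternative, assuming the sync already holds the list of matching registry pods (illustrative only, not the code from this PR): keep one pod and delete any extras.

```go
package registry

import (
	"context"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// removeExtraPods keeps the first registry pod in the list and deletes the rest,
// so a duplicate created by a racing sync is cleaned up on the next pass.
func removeExtraPods(ctx context.Context, client kubernetes.Interface, namespace string, pods []*v1.Pod) error {
	if len(pods) <= 1 {
		return nil // nothing to clean up
	}
	for _, pod := range pods[1:] {
		if err := client.CoreV1().Pods(namespace).Delete(ctx, pod.GetName(), metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```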

@akihikokuroda akihikokuroda changed the title use kubernetes api to list pods instead of Lister in currentPods method Remove extra registry pod when it is created Feb 16, 2022
@akihikokuroda akihikokuroda force-pushed the registrypodcreation branch 3 times, most recently from dab650b to bacd936 Compare February 22, 2022 03:22
@perdasilva
Collaborator

Closing PR as stale. Please re-open if it's still important.

@perdasilva perdasilva closed this Feb 19, 2024

Successfully merging this pull request may close these issues.

e2e "config map update triggers registry pod rollout" failure