
InformerEventSource cannot find resource after some time #1723

Closed
gyfora opened this issue Jan 19, 2023 · 8 comments · Fixed by #1725 or #1726
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@gyfora

gyfora commented Jan 19, 2023

Bug Report

What did you do?

We are using a simple label-selector-based informer in the Flink Kubernetes Operator: https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/EventSourceUtils.java#L45

It happened in some cases that after a while, the informer could not find the target object (Deployment) anymore, while it definitely existed in Kubernetes (verified manually). Restarting the operator solved the problem.

Based on this we suspect that the informer simply stopped receiving new events after a while and never recovered.
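The manual check described above (the Deployment exists in Kubernetes but the informer cannot find it) can be expressed as a small drift check. This is only a minimal sketch: `cachedLookup` and `liveLookup` are hypothetical functions standing in for "read from the informer cache" and "read directly from the API server", and neither is a JOSDK or fabric8 API.

```java
import java.util.Optional;
import java.util.function.Function;

// Illustrative sketch only; the class and lookup functions are hypothetical.
final class CacheDriftCheck {

    enum Drift { CONSISTENT, MISSING_FROM_CACHE, MISSING_FROM_CLUSTER }

    // Compares what the cache returns for a key with what a live read returns.
    static <T> Drift check(String key,
                           Function<String, Optional<T>> cachedLookup,
                           Function<String, Optional<T>> liveLookup) {
        boolean inCache = cachedLookup.apply(key).isPresent();
        boolean inCluster = liveLookup.apply(key).isPresent();
        if (inCache == inCluster) {
            return Drift.CONSISTENT;
        }
        // The case reported in this issue: present in the cluster, absent from the cache.
        return inCluster ? Drift.MISSING_FROM_CACHE : Drift.MISSING_FROM_CLUSTER;
    }

    public static void main(String[] args) {
        Drift drift = CacheDriftCheck.<String>check(
                "default/my-deployment",
                key -> Optional.empty(),               // cache no longer has the object
                key -> Optional.of("deployment-spec")  // live read still finds it
        );
        System.out.println(drift);
    }
}
```

Logging such a result at the moment the lookup fails would help distinguish a stale cache from a resource that is really gone.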

Environment

JOSDK: 4.1.1
Java: 11

@csviri
Collaborator

csviri commented Jan 19, 2023

So if I understand correctly, the resource was found before. It is not the case that the informer never received the resource.

I checked, but this part of the code is very simple on our side: it basically just reads the resource from the informer cache.
I will add some logging, though, so we can be sure the problem is not in JOSDK.

@manusa @shawkins have you encountered this problem before?

@shawkins
Collaborator

This was likely captured as fabric8io/kubernetes-client#4781 as well. We can work from the upstream side first, based on the comment over there.

@csviri
Collaborator

csviri commented Jan 19, 2023

I discussed this with @gyfora before, and this seems to be a different issue. TBH I can't imagine how a resource is removed from the cache (ItemStore) without a delete event. But yes, let's continue on the fabric8 client side.

@csviri csviri linked a pull request Jan 19, 2023 that will close this issue
@shawkins
Collaborator

TBH I can't imagine how a resource is removed from the cache (ItemStore) without a delete event. But yes, let's continue on the fabric8 client side.

It shouldn't be possible for it to have existed and then not exist without a delete event being emitted, at least at the informer level. The only circumstances in which an entry is removed are a delete event from the watch, and a relist (which should be rare in an environment where bookmarks are supported) in which the item no longer exists. Is it possible that the item was known to / cached by the operator SDK but was never populated in the informer cache to begin with?
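To make the two removal paths concrete, here is a toy model of an informer store. It is a sketch under stated assumptions, not the fabric8 implementation: `ToyInformerStore` and its method names are hypothetical, and resources are reduced to a key and a resource version.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeSet;

// Toy model only; hypothetical names, NOT the fabric8 informer code.
final class ToyInformerStore {
    // key ("namespace/name") -> resource version
    private final Map<String, String> items = new LinkedHashMap<>();
    // Records why each entry left the store.
    private final List<String> removalLog = new ArrayList<>();

    void onAddOrUpdate(String key, String resourceVersion) {
        items.put(key, resourceVersion);
    }

    // Removal path 1: an explicit delete event from the watch.
    void onDelete(String key) {
        if (items.remove(key) != null) {
            removalLog.add("watch DELETE: " + key);
        }
    }

    // Removal path 2: a relist; entries absent from the fresh list are dropped.
    void onRelist(Map<String, String> freshList) {
        for (String key : new TreeSet<>(items.keySet())) {
            if (!freshList.containsKey(key)) {
                items.remove(key);
                removalLog.add("relist: " + key);
            }
        }
        items.putAll(freshList);
    }

    Optional<String> get(String key) {
        return Optional.ofNullable(items.get(key));
    }

    List<String> removalLog() {
        return List.copyOf(removalLog);
    }

    public static void main(String[] args) {
        ToyInformerStore store = new ToyInformerStore();
        store.onAddOrUpdate("default/a", "1");
        store.onAddOrUpdate("default/b", "1");
        store.onDelete("default/a");               // removed via watch event
        store.onRelist(Map.of("default/c", "7"));  // "default/b" dropped by relist
        System.out.println(store.removalLog());
    }
}
```

In this model every removal leaves a log entry, so an entry that disappears with no record would point to a path outside these two.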

@csviri
Collaborator

csviri commented Jan 19, 2023

Is it possible that the item was known / cached by the operator sdk and was never populated in the informer cache to begin with?

No, that is not possible. JOSDK reads resources from the informer cache. There is another layer, which in this case maps between the primary custom resource and the secondary resource (this is where I added logging). But if the resource was found before, that layer cannot be the cause either.
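The two layers described here can be sketched as a toy lookup that reports which layer missed. This is an illustration under assumptions, not JOSDK code: `ToySecondaryLookup`, its methods, and the `Result` values are all hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only; hypothetical names, NOT the JOSDK implementation.
final class ToySecondaryLookup {

    enum Result { FOUND, NO_MAPPING, NOT_IN_CACHE }

    // Layer 1: primary resource key -> secondary resource key.
    private final Map<String, String> primaryToSecondary = new HashMap<>();
    // Layer 2: stand-in for the informer cache of secondary resources.
    private final Map<String, String> informerCache = new HashMap<>();

    void link(String primaryKey, String secondaryKey) {
        primaryToSecondary.put(primaryKey, secondaryKey);
    }

    void cache(String secondaryKey, String resource) {
        informerCache.put(secondaryKey, resource);
    }

    void evict(String secondaryKey) {
        informerCache.remove(secondaryKey);
    }

    // Resolves the secondary resource for a primary and reports which layer failed.
    Result lookup(String primaryKey) {
        String secondaryKey = primaryToSecondary.get(primaryKey);
        if (secondaryKey == null) {
            return Result.NO_MAPPING;
        }
        return informerCache.containsKey(secondaryKey) ? Result.FOUND : Result.NOT_IN_CACHE;
    }

    public static void main(String[] args) {
        ToySecondaryLookup lookup = new ToySecondaryLookup();
        lookup.link("default/my-flink-app", "default/my-deployment");
        lookup.cache("default/my-deployment", "deployment-spec");
        System.out.println(lookup.lookup("default/my-flink-app"));
        lookup.evict("default/my-deployment");
        System.out.println(lookup.lookup("default/my-flink-app"));
    }
}
```

Logging the outcome at each layer, as in this sketch, shows whether a failed lookup comes from the mapping or from the cache itself.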

@csviri csviri reopened this Jan 20, 2023
@csviri csviri linked a pull request Jan 20, 2023 that will close this issue
@csviri
Collaborator

csviri commented Jan 24, 2023

I think we will need logs for this; I was not able to think of any scenario where this could happen.

@csviri csviri closed this as completed Jan 24, 2023
@csviri csviri reopened this Jan 24, 2023
@csviri csviri self-assigned this Jan 27, 2023
@github-actions

github-actions bot commented Apr 5, 2023

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

@github-actions github-actions bot added the lifecycle/stale label Apr 5, 2023
@github-actions

This issue was closed because it has been stalled for 14 days with no activity.

@github-actions github-actions bot closed this as not planned Apr 19, 2023