Informer stops receiving new events after some time #4781
Comments
Please check the logs for an indication that the informer has shut down, or use the stopped CompletableFuture to register a handler on completion. Also check the number of informers / watches you have running - okhttp provides no feedback when the concurrent request limit is reached; it simply stops processing new requests. This is being addressed in 6.4. If the informer does not seem to be shutting down, and you have not exhausted the concurrent request limit, we'll need a debug log or some other way of tracking down what you are seeing.
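For reference, a minimal sketch of both suggestions, assuming a fabric8 6.x client. ConfigMap stands in for the actual custom resource, the class name is made up, and the maxConcurrentRequests value is only illustrative:

```java
import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.client.Config;
import io.fabric8.kubernetes.client.ConfigBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

public class InformerStopCheck {
  public static void main(String[] args) {
    // Optionally raise the client-side concurrent request limit while investigating
    // whether watches are being starved (the value here is just an example).
    Config config = new ConfigBuilder().withMaxConcurrentRequests(128).build();
    KubernetesClient client = new KubernetesClientBuilder().withConfig(config).build();

    SharedIndexInformer<ConfigMap> informer = client.configMaps()
        .inAnyNamespace()
        .inform(new ResourceEventHandler<ConfigMap>() {
          @Override public void onAdd(ConfigMap obj) { /* handle add */ }
          @Override public void onUpdate(ConfigMap oldObj, ConfigMap newObj) { /* handle update */ }
          @Override public void onDelete(ConfigMap obj, boolean deletedFinalStateUnknown) { /* handle delete */ }
        });

    // Log if and when the informer stops; an exceptional completion indicates an abnormal shutdown.
    informer.stopped().whenComplete((v, t) -> {
      if (t != null) {
        System.err.println("Informer stopped exceptionally: " + t);
      } else {
        System.err.println("Informer stopped");
      }
    });
  }
}
```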
@gyfora can you specify which resource type this happened on?
This happened on a custom resource (FlinkDeployment).
The debug logs are not very helpful; there are no errors from the informers. The most we see from fabric8 is the following:
I see these every once in a while for different namespaces and resource types. We don't see any other logs when events stop arriving; we only notice that the usual "event received" logs are missing :)
Also, the above-mentioned logs are from fabric8.
Likely implies that bookmarks aren't supported.
This should be a full / fresh listing. It doesn't see any flinkdeployments at that time - is that expected / correct?
So by the logs the watch is running. Assuming there is another restart due to an HTTP Gone (410), do you see the items exist at that time? Or, like the previous logs, is it always a 0 count?
Sorry, this log was from a namespace which did not have any flinkdeployment resources. But other namespaces show the count:
However, these restarts seem to happen randomly and not for all namespaces - roughly once a day, or sometimes a little more frequently.
And prior to that there were events for those deployments, but they were not seen by the watch? If that's the case, it does sound similar to kubernetes/kubernetes#102464 - where the api server would stop delivering events after a while. Ideally you should be able to reproduce this for a given resource / kube version outside of the fabric8 client - would it be possible to try something like kubernetes/kubernetes#102464 (comment) for your resource to confirm that there is an api server issue? Something that should help with this is #4675, which would shorten the life of the watches - the proper fix was hard to implement given our reliance on mock expectations. I'll revisit it; it should be simpler to just have our watch manager kill the watches after a while. However, if the issue is not related to the watch timeout, this will not help.
It will take until the resourceVersion becomes "too old", which generally happens because modifications anywhere on the cluster increment the resourceVersion.
@shawkins unfortunately it seems like our environment already contains the fix for the issue you linked. It would be really great if we had a simple way to just kill/reset the informers / watches periodically if they don't get any events. I know that doesn't really help with finding the root cause, but this has a pretty big impact on us and we are not really sure how to debug it further.
If you don't see any logs from the AbstractWatchManager or WatcherWebSocketListener about reconnecting in between the "resource too old" exceptions, then a very similar issue is at play even if it's not the same one.
#4675 will limit the amount of time a watch runs to 5 - 10 minutes, but obviously won't be immediately available. You can also try to adjust your api server's watch timeout setting - see the --min-request-timeout flag at https://kubernetes.io/docs/reference/command-line-tools-reference/kube-apiserver/. It defaults to 60 minutes, which means that fabric8 watches will run for 1-2 hours until they are terminated by Kubernetes - so if the watches run for longer than 2 hours without a server-initiated reconnect and without delivering events, then it's definitely another api-server bug. Or, if it's not too much downtime, you can always just terminate / restart your entire process every hour (a rough sketch of periodically recycling informers follows below).
Perhaps just alter the Go sample attached to the kubernetes issue to run without a timeout, and see if / when it stops receiving events.
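To illustrate the periodic-restart workaround mentioned above, here is a rough caller-side sketch (not the #4675 fix itself). It assumes a fabric8 6.x client; ConfigMap again stands in for the actual custom resource, and the class name and one-hour interval are made up for the example:

```java
import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;
import io.fabric8.kubernetes.client.informers.SharedIndexInformer;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class PeriodicInformerRestart {
  public static void main(String[] args) {
    KubernetesClient client = new KubernetesClientBuilder().build();
    ResourceEventHandler<ConfigMap> handler = new ResourceEventHandler<ConfigMap>() {
      @Override public void onAdd(ConfigMap obj) { /* handle add */ }
      @Override public void onUpdate(ConfigMap oldObj, ConfigMap newObj) { /* handle update */ }
      @Override public void onDelete(ConfigMap obj, boolean deletedFinalStateUnknown) { /* handle delete */ }
    };

    AtomicReference<SharedIndexInformer<ConfigMap>> current =
        new AtomicReference<>(client.configMaps().inAnyNamespace().inform(handler));

    // Replace the informer every hour: the new informer does a fresh list + watch,
    // and the old one is stopped once the replacement is in place.
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> {
      SharedIndexInformer<ConfigMap> replacement =
          client.configMaps().inAnyNamespace().inform(handler);
      current.getAndSet(replacement).stop();
    }, 1, 1, TimeUnit.HOURS);
  }
}
```

Note that each fresh informer re-lists and re-delivers add events for the existing resources, so the event handler needs to be idempotent - which informer-based operator handlers generally already are.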
#4675 should take care of this; as soon as 6.5.0 is available, this issue should be solved. Tomorrow's SNAPSHOT build should contain the fix too.
v6.5.0 was just released. Could you please confirm whether this is working now, so we can close the issue?
Hello @gyfora |
We are in the process of upgrading to the newer version, @MichaelVoegele, so we cannot confirm yet. It may be best if you try it yourself if you can.
Yep @gyfora, I just decided to do that and use v6.5.1. Cool, looking forward to seeing this fixed :-). Thanks guys for your awesome work on this fabric8 client!
Describe the bug
We are running an operator built on top of the Java Operator SDK, which uses fabric8 6.2.0 internally.
In some cases the informers simply stop receiving new events from Kubernetes, and therefore CR changes are not picked up by our operator. There are no errors in the logs, and the informers never recover from this.
Restarting the operator process always solves the issue.
We saw this more frequently in the past while still using fabric8 5.x.x, but recently we have also seen it with fabric8 6.2.0.
Fabric8 Kubernetes Client version
6.2.0
Steps to reproduce
Cannot be deterministically reproduced.
Expected behavior
We would expect the informers to recover in these cases.
Runtime
other (please specify in additional context)
Kubernetes API Server version
other (please specify in additional context)
Environment
other (please specify in additional context)
Fabric8 Kubernetes Client Logs
No response
Additional context
Kubernetes v1.20.15