
sdk: retry Watch() call on failure. #311


Conversation


@trusch commented Jun 11, 2018

What

  • indefinitely retry getting a resource client in the Watch() call
  • add a mutex to protect the `informers` global from concurrent access
  • move informer setup into its own goroutine (a sketch of this setup follows below)

Why

  • until now, an operator dies if it is deployed before the CRD it manages has been created
  • `informers` is a global variable in the sdk package; to make it safe for concurrent use, it is now guarded by a mutex

fixes operator-framework#183
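
A minimal, self-contained sketch of the approach described above. The types and names here (`ResourceClient`, `Informer`, `getResourceClient`, the 5-second retry interval) are placeholders for illustration, not the actual patch or the operator-sdk API:

```go
package sdk

import (
	"log"
	"sync"
	"time"
)

// Placeholder types standing in for the real SDK types; only the control
// flow matters here.
type ResourceClient interface{}

type Informer struct {
	kind   string
	client ResourceClient
}

var (
	informersMu sync.Mutex // guards the informers global against concurrent Watch calls
	informers   []*Informer
)

// getResourceClient stands in for the SDK's GetResourceClient; it fails
// until the CRD for the given kind is registered with the API server.
func getResourceClient(apiVersion, kind, namespace string) (ResourceClient, error) {
	// real implementation queries the API server for the resource
	return nil, nil
}

// Watch retries until a resource client can be obtained, then registers an
// informer. Setup runs in its own goroutine so a missing CRD does not kill
// the operator at startup.
func Watch(apiVersion, kind, namespace string) {
	go func() {
		var client ResourceClient
		for {
			c, err := getResourceClient(apiVersion, kind, namespace)
			if err == nil {
				client = c
				break
			}
			log.Printf("failed to get resource client for %s %s: %v; retrying in 5s", apiVersion, kind, err)
			time.Sleep(5 * time.Second)
		}

		informersMu.Lock()
		defer informersMu.Unlock()
		informers = append(informers, &Informer{kind: kind, client: client})
	}()
}
```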
@spahl requested a review from hasbro17 June 15, 2018 18:42
@hasbro17
Contributor

@trusch I don't think we should retry if the Watch fails. The operator should just exit cleanly with an error instead of calling panic, as we discussed on issue #183 (comment).

There are a few problems with retrying like this:

  1. GetResourceClient() has other failure cases besides the CRD not being registered, e.g. the user can pass an incorrect parameter (apiVersion, kind, or namespace) to Watch(). There's no point in retrying that.
  2. By retrying a failed Watch in the background, the informer gets initialized later but never runs, because we only do a single pass over all initialized informers in sdk.Run(). So any informers created later by retrying the Watch won't run (see the sketch after this list).
  3. Since the operator is run as a Deployment, it will naturally keep retrying anyway if a Watch fails the first time and the operator pod exits cleanly.
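
For context, a self-contained sketch of the single-pass problem from point 2, assuming a Run loop roughly like the SDK's at the time; the `Informer` type and signatures are placeholders, not the real API:

```go
package sdk

import (
	"context"
	"sync"
)

// Placeholder informer; only the control flow of Run matters here.
type Informer struct{ kind string }

func (i *Informer) Run(ctx context.Context) { /* list/watch loop */ }

var (
	informersMu sync.Mutex
	informers   []*Informer
)

// Run starts only the informers that exist when it is called, then blocks.
// An informer appended later by a background-retrying Watch is never started.
func Run(ctx context.Context) {
	informersMu.Lock()
	started := make([]*Informer, len(informers))
	copy(started, informers)
	informersMu.Unlock()

	for _, inf := range started {
		go inf.Run(ctx) // single pass over the informers registered so far
	}
	<-ctx.Done()
}
```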

So I would suggest we just have the operator exit cleanly so it's visible why the Watch failed and let the Deployment restart the operator.
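
For illustration, a minimal sketch of the suggested behaviour, using a hypothetical `watch` helper that returns an error (the real SDK API may differ): on failure the operator logs the reason and exits non-zero, and the Deployment restarts the pod.

```go
package main

import "log"

// watch is a stand-in for an SDK Watch call reworked to return its error
// instead of panicking; the apiVersion/kind used below are made-up examples.
func watch(apiVersion, kind, namespace string) error {
	// real implementation would call GetResourceClient and return its error
	return nil
}

func main() {
	// If the Watch cannot be set up (e.g. the CRD is not registered yet),
	// log why and exit non-zero so the Deployment restarts the pod,
	// rather than panicking or retrying silently in the background.
	if err := watch("app.example.com/v1alpha1", "App", "default"); err != nil {
		log.Fatalf("failed to watch App resources: %v", err)
	}
	// ... start the registered informers and block ...
}
```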

@rithujohn191
Contributor

Closing out this issue. We can re-open it if a need arises.

Development

Successfully merging this pull request may close these issues.

Operator panics if it starts before creating the CRD