Reconciliation error doesn't result in exponential backoff #5945
Comments
@fabriziopandini If possible, we should address this before the release.
/milestone v1.1
@sbueringer and I debugged this issue and found that exponential backoff does not work when new conditions are added to the Cluster status: when a new condition is added, reconciliation happens instantly. In our reconcile-create functions the condition contains a new name each time, e.g. the auto-generated name of the template. This makes it a new condition rather than an update to an existing one. Adding the new condition to the Cluster status produces an update event, and that event triggers an instant reconciliation, which appears to be how controller-runtime behaves. Before this observation, we expected this sort of update to trigger reconciliation with exponential backoff, and that is still the desired behaviour for situations like this. We have two approaches to solve this:
(2) seems like the logical direction to go for the time being, as it is a quick fix which resolves the bug in at least this context and improves (or at least does not disimprove 😄 ) the UX of the error reporting. Right now we're reporting the names of objects which are never actually created, and with auto-generated names these objects will never be discoverable. It would be better to report only the deterministic part of the name in the error message. I'm going to go ahead and implement a fix to the error messages, but if there are any other approaches to fixing this it would be great to understand them.
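To illustrate why the generated name matters, here is a minimal sketch using only the Go standard library. It is not the Cluster API code: `generatedName`, `unstableMessage`, `stableMessage` and the object names are hypothetical, and the random suffix only imitates what a generated name looks like. The point is that a message embedding the generated name differs on every attempt, while one built from the deterministic part stays the same.

```go
package main

import (
	"fmt"
	"math/rand"
)

// generatedName imitates an auto-generated object name: a stable prefix
// followed by a random 5-character suffix. Hypothetical helper.
func generatedName(prefix string) string {
	const letters = "bcdfghjklmnpqrstvwxz2456789"
	suffix := make([]byte, 5)
	for i := range suffix {
		suffix[i] = letters[rand.Intn(len(letters))]
	}
	return prefix + string(suffix)
}

// unstableMessage embeds the full generated name, so the text changes
// whenever a new name is generated.
func unstableMessage(kind, name string) string {
	return fmt.Sprintf("failed to create %s %q", kind, name)
}

// stableMessage reports only a deterministic identifier (here: the
// template the object was cloned from), so retries yield the same text.
func stableMessage(kind, clonedFrom string) string {
	return fmt.Sprintf("failed to create %s cloned from %q", kind, clonedFrom)
}

func main() {
	for attempt := 1; attempt <= 3; attempt++ {
		name := generatedName("my-cluster-control-plane-")
		fmt.Println(unstableMessage("ExampleMachineTemplate", name))
		fmt.Println(stableMessage("ExampleMachineTemplate", "example-template"))
	}
}
```

If the condition message is identical across retries, writing it again should not change the status, so no new update event would be generated by the error report itself.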
@killianmuldoon and @sbueringer thanks for having triaged this issue!
Great - is the template's cloned-from annotation stable enough to use in the error message like this? That's the easiest place to get an identifiable name for the object that failed to be created. We could also pass the cloned-from object name in some other way.
Just to make sure we're talking about the same thing: we need "identifiers" we can include in errors for:
I wonder what the ideal error messages would look like. Maybe something like:
What do you think? Note: some of our current errors have hard-coded "failed to update" (e.g. MD InfrastructureMachineTemplate) even though it could have been a create, so I would use "failed to reconcile" in those cases.
I think we should keep this simple and just remove the randomly generated names by replacing KObj with KRef (failed to create object for ...)
I'm not sure if that works in all cases, but I'm fine with alternatives and happy to just take a look at the PR :)
The errors here are deeply wrapped and it's not trivial to unwrap them or sanitize them at the right level. I've opened a PR to add a predicate to the controller so that it ignores status updates (using the GenerationChangedPredicate from controller-runtime). It solves the direct issue, but I'd like to hear ideas on or responses to this solution.
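For readers unfamiliar with the idea, the following is a simplified, standard-library-only model of what a generation-based predicate does. It is not the controller-runtime implementation: the `object` type and `generationChanged` function are hypothetical stand-ins. It assumes that the object's generation is incremented on spec changes but not on status-only changes.

```go
package main

import "fmt"

// object is a hypothetical stand-in for the parts of a Kubernetes object
// that matter here.
type object struct {
	Generation int64    // assumed to change only when the spec changes
	Conditions []string // part of status; assumed not to affect Generation
}

// generationChanged models the filter applied to update events: the event
// is passed on to the reconciler only if the generation differs.
func generationChanged(oldObj, newObj object) bool {
	return oldObj.Generation != newObj.Generation
}

func main() {
	before := object{Generation: 3}
	statusOnly := object{Generation: 3, Conditions: []string{"ExampleCondition=False"}}
	specChange := object{Generation: 4}

	fmt.Println(generationChanged(before, statusOnly)) // status-only update is filtered out
	fmt.Println(generationChanged(before, specChange)) // spec change is reconciled
}
```

One trade-off to keep in mind: a filter like this drops every update that leaves the generation untouched, not just the condition changes discussed here, so any controller relying on such updates would need another trigger.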
When an error happens during Cluster reconciliation in the topology/cluster controller, it doesn't result in an exponential backoff, meaning the operation is continually retried even when the configuration is completely broken.
This makes it harder for users to understand the source of their issues, and CAPI is continually working to reconcile a completely broken config.
This was noticed when reproducing an issue with variable patching from: #5944
/area topology
/kind bug