Reconciliation error doesn't result in exponential backoff #5945

Closed
killianmuldoon opened this issue Jan 17, 2022 · 10 comments · Fixed by #5971
Labels
area/clusterclass: Issues or PRs related to clusterclass
kind/bug: Categorizes issue or PR as related to a bug.
kind/release-blocking: Issues or PRs that need to be closed before the next CAPI release
Milestone
v1.1

Comments

@killianmuldoon (Contributor) commented Jan 17, 2022

When an error occurs during Cluster reconciliation in the topology/cluster controller, it doesn't result in exponential backoff, so the operation is retried continuously even when the configuration is completely broken.

This makes it harder for users to understand the source of their issues, and CAPI keeps working to reconcile a completely broken config.

This was noticed when reproducing an issue with variable patching from: #5944

/area topology
/kind bug

@k8s-ci-robot added the area/topology and kind/bug labels Jan 17, 2022
@sbueringer (Member)

@fabriziopandini If possible, we should address this before the release.

@fabriziopandini (Member)

/milestone v1.1
/priority release-blocking
It would be great to understand the underlying problem of this issue before v1.1

@k8s-ci-robot (Contributor)

@fabriziopandini: The label(s) priority/release-blocking cannot be applied, because the repository doesn't have them.

In response to this:

/milestone v1.1
/priority release-blocking
It would be great to understand the underlying problem of this issue before v1.1


@k8s-ci-robot added this to the v1.1 milestone Jan 17, 2022
@fabriziopandini added the kind/release-blocking label Jan 17, 2022
@killianmuldoon (Contributor, Author) commented Jan 18, 2022

@sbueringer and I debugged this issue and found that exponential backoff is not working when new conditions are added to the Cluster status. When a new condition is added, reconciliation happens instantly.

In the case of our reconcile-create functions, the condition contains a new name each time, e.g. the auto-generated name of the template. This makes it a new condition rather than an update to an existing one. Adding the new condition to the Cluster status is an update event, which triggers instant reconciliation; this appears to be controller-runtime behaviour. Before this observation the expected behaviour was that this kind of update would be retried with exponential backoff, and that is still the desired behaviour for situations like this.
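
To illustrate, here's a minimal Go sketch of the pattern (not the actual CAPI code: `createTemplate`, the name prefix, and the suffix length are made up for the example):

```go
import (
	"context"
	"fmt"

	"github.com/pkg/errors"
	"k8s.io/apimachinery/pkg/util/rand"
)

// createTemplate is a hypothetical helper standing in for the real create call.
func createTemplate(ctx context.Context, name string) error { return nil }

func reconcileBootstrapTemplate(ctx context.Context) error {
	// The template name gets a fresh random suffix on every reconcile attempt.
	templateName := fmt.Sprintf("my-cluster-md-0-bootstrap-%s", rand.String(5))

	if err := createTemplate(ctx, templateName); err != nil {
		// The random suffix ends up in the error and therefore in the condition
		// message. Because the message differs on every attempt, the condition is
		// treated as new, the Cluster status is updated, and that update event
		// re-queues the object immediately instead of waiting for the backoff.
		return errors.Wrapf(err, "failed to create BootstrapTemplate %q", templateName)
	}
	return nil
}
```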

We have two approaches to solve this:

  1. Identify where in controller-runtime a status update is directed to cause instant reconciliation and remedy it, either with a change to that behaviour (if it turns out to be a bug) or with a flag.
  2. Update the topology reconciliation functions so that error messages don't include the auto-generated names of templates and other objects.

(2) seems like the logical direction for the time being: it's a quick fix which resolves the bug in at least this context and improves (or at least doesn't worsen 😄 ) the UX of the error reporting. Right now we're reporting the names of objects which are never actually created, and with auto-generated names these objects will never be discoverable. It would be better to report only the deterministic part of the name in the error message.

I'm going to go ahead and implement a fix to the error messages, but if there are any other approaches to fixing this it would be great to understand them.

@fabriziopandini (Member)

@killianmuldoon and @sbueringer thanks for triaging this issue!
I agree that 2 is the most reasonable approach.
It does not really make sense to have (random) object names in error messages if the objects were not actually created.

@killianmuldoon (Contributor, Author)

Great - is the template's cloned-from annotation stable enough to use in the error message like this? That's the easiest place to get an identifiable name for the object that failed to be created. We could also pass the cloned-from object name in some other way.

@sbueringer (Member) commented Jan 19, 2022

Just to make sure we're talking about the same thing. We need "identifiers" we can include in errors for:

  1. create MD & MHC
  2. create InfraCluster & ControlPlane via reconcileReferencedObject
  3. create InfrastructureMachineTemplate for ControlPlane via reconcileReferencedTemplate
  4. create BootstrapTemplate & InfrastructureMachineTemplate for MD via reconcileReferencedTemplate

I wonder what the ideal error messages would look like. Maybe something like:

  1. failed to create MachineDeployment/MachineHealthCheck for topology %s.
    • we could get the topology name from the MD annotation
    • the topology name is the only identifier which maps to the specific worker topology the user configured in the cluster
  2. failed to create DockerCluster/KubeadmControlPlane
    • there is only one InfraCluster / CP per cluster
  3. failed to reconcile DockerMachineTemplate for ControlPlane
    • we could create that error in reconcileControlPlane
    • only propagate the CR error in reconcileReferencedTemplate
  4. failed to reconcile DockerMachineTemplate/KubeadmConfigTemplate for MachineDeployment of topology %s
    • we could get the topology name from the MD annotation
    • the topology name is the only identifier which maps to the specific worker topology the user configured in the cluster
    • we could create that error in createMachineDeployment/updateMachineDeployment
    • only propagate the CR error in reconcileReferencedTemplate

What do you think?

Note: some of our current errors have hard-coded "failed to update" (e.g. MD InfrastructureMachineTemplate) even though it could have been a create, so I would use "failed to reconcile" in those cases.
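
For (4), a hedged sketch of what the wrapping could look like (the label key, `md`, `infraMachineTemplate`, and the surrounding reconciler are assumptions for illustration, not the actual CAPI code):

```go
// Read the stable topology name from the MachineDeployment's metadata instead of
// reporting the generated object name (the label key below is an assumption).
topologyName := md.GetLabels()["topology.cluster.x-k8s.io/deployment-name"]

if err := r.Client.Create(ctx, infraMachineTemplate); err != nil {
	return errors.Wrapf(err,
		"failed to reconcile %s for MachineDeployment of topology %q",
		infraMachineTemplate.GetKind(), topologyName)
}
```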

@fabriziopandini (Member)

I think we should keep this simple and just remove randomly generated names by replacing KObj with KRef ("failed to create object for ...").
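
A rough sketch of that idea (hypothetical; `clonedFromName` would hold the deterministic name of the source template, e.g. taken from the cloned-from annotation):

```go
// Instead of klog.KObj(template), which prints the randomly generated name,
// reference the deterministic name of the template the object is cloned from
// (klog.KRef is from k8s.io/klog/v2).
return errors.Wrapf(err, "failed to create object for %s",
	klog.KRef(template.GetNamespace(), clonedFromName))
```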

@sbueringer (Member)

I'm not sure if that works in all cases, but I'm fine with alternatives and happy to just take a look at the PR :)

@killianmuldoon (Contributor, Author)

The errors here are deeply wrapped and it's not trivial to unwrap or sanitize them at the right level. I've opened a PR that adds a predicate to the controller so it ignores status updates (using the GenerationChangedPredicate from controller-runtime). It solves the direct issue, but I'd like to hear ideas/responses to this solution.
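
For reference, a minimal sketch of that predicate approach, assuming a typical controller-runtime SetupWithManager (the reconciler name is illustrative; the actual change is in the linked PR):

```go
import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

func (r *ClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&clusterv1.Cluster{},
			// GenerationChangedPredicate skips update events whose metadata.generation
			// did not change, i.e. pure status updates such as newly added conditions,
			// so a failing reconcile backs off with the normal exponential requeue.
			builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Complete(r)
}
```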
