Reconciliation error doesn't result in exponential backoff #5945

Closed
killianmuldoon opened this issue Jan 17, 2022 · 10 comments · Fixed by #5971
Labels
area/clusterclass: Issues or PRs related to clusterclass
kind/bug: Categorizes issue or PR as related to a bug.
kind/release-blocking: Issues or PRs that need to be closed before the next CAPI release
Milestone
v1.1

Comments

@killianmuldoon (Contributor) commented Jan 17, 2022

When an error occurs during Cluster reconciliation in the topology/cluster controller, it doesn't result in exponential backoff, so the operation is retried continuously even when the configuration is completely broken.

This makes it harder for users to understand the source of their issues, and CAPI keeps working to reconcile a completely broken config.

This was noticed when reproducing an issue with variable patching from: #5944

/area topology
/kind bug

@k8s-ci-robot added the area/topology and kind/bug labels Jan 17, 2022
@sbueringer (Member)

@fabriziopandini If possible, we should address this before the release.

@fabriziopandini (Member)

/milestone v1.1
/priority release-blocking
It would be great to understand the underlying problem of this issue before v1.1

@k8s-ci-robot (Contributor)

@fabriziopandini: The label(s) priority/release-blocking cannot be applied, because the repository doesn't have them.

In response to this:

/milestone v1.1
/priority release-blocking
It would be great to understand the underlying problem of this issue before v1.1


@k8s-ci-robot added this to the v1.1 milestone Jan 17, 2022
@fabriziopandini added the kind/release-blocking label Jan 17, 2022
@killianmuldoon (Contributor, Author) commented Jan 18, 2022

@sbueringer and I debugged this issue and found that exponential backoff is not working when new conditions are added to the Cluster status. When a new condition is added, reconciliation happens instantly.

In the case of our reconcile-create functions, the condition contains a new name each time, e.g. the auto-generated name of the template. This makes it a new condition rather than an update to an existing one. Adding the new condition to the Cluster status is an update event, which triggers instant reconciliation; this appears to be controller-runtime behaviour. Before this observation the expected behaviour was that this kind of update would be retried with exponential backoff, and that is still the desired behaviour for situations like this.
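
To illustrate, here's a minimal Go sketch of the pattern (not the actual CAPI code: `createTemplate`, the name prefix, and the suffix length are made up for the example):

```go
import (
	"context"
	"fmt"

	"github.com/pkg/errors"
	"k8s.io/apimachinery/pkg/util/rand"
)

// createTemplate is a hypothetical helper standing in for the real create call.
func createTemplate(ctx context.Context, name string) error { return nil }

func reconcileBootstrapTemplate(ctx context.Context) error {
	// The template name gets a fresh random suffix on every reconcile attempt.
	templateName := fmt.Sprintf("my-cluster-md-0-bootstrap-%s", rand.String(5))

	if err := createTemplate(ctx, templateName); err != nil {
		// The random suffix ends up in the error and therefore in the condition
		// message. Because the message differs on every attempt, the condition is
		// treated as new, the Cluster status is updated, and that update event
		// re-queues the object immediately instead of waiting for the backoff.
		return errors.Wrapf(err, "failed to create BootstrapTemplate %q", templateName)
	}
	return nil
}
```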

We have two approaches to solve this:

  1. Identify where in controller-runtime a status update is directed to cause instant reconciliation and remedy it, either with a change to that behaviour (if it turns out to be a bug) or with a flag.
  2. Update the topology reconciliation functions so that error messages don't include the auto-generated names of templates and other objects.

(2) seems like the logical direction for the time being: it's a quick fix which resolves the bug in at least this context and improves (or at least doesn't worsen 😄 ) the UX of the error reporting. Right now we're reporting the names of objects which are never actually created, and with auto-generated names these objects will never be discoverable. It would be better to report only the deterministic part of the name in the error message.

I'm going to go ahead and implement a fix to the error messages, but if there are any other approaches to fixing this it would be great to understand them.

@fabriziopandini (Member)

@killianmuldoon and @sbueringer thanks for triaging this issue!
I agree that 2 is the most reasonable approach.
It does not really make sense to have (random) object names in error messages if the objects were not actually created.

@killianmuldoon (Contributor, Author)

Great - is the template's cloned-from annotation stable enough to use in the error message like this? That's the easiest place to get an identifiable name for the object that failed to be created. We could also pass the cloned-from object name in some other way.

@sbueringer (Member) commented Jan 19, 2022

Just to make sure we're talking about the same thing. We need "identifiers" we can include in errors for:

  1. create MD & MHC
  2. create InfraCluster & ControlPlane via reconcileReferencedObject
  3. create InfrastructureMachineTemplate for ControlPlane via reconcileReferencedTemplate
  4. create BootstrapTemplate & InfrastructureMachineTemplate for MD via reconcileReferencedTemplate

I wonder what the ideal error messages would look like. Maybe something like:

  1. failed to create MachineDeployment/MachineHealthCheck for topology %s.
    • we could get the topology name from the MD annotation
    • the topology name is the only identifier which maps to the specific worker topology the user configured in the cluster
  2. failed to create DockerCluster/KubeadmControlPlane
    • there is only one InfraCluster / CP per cluster
  3. failed to reconcile DockerMachineTemplate for ControlPlane
    • we could create that error in reconcileControlPlane
    • only propagate the CR error in reconcileReferencedTemplate
  4. failed to reconcile DockerMachineTemplate/KubeadmConfigTemplate for MachineDeployment of topology %s
    • we could get the topology name from the MD annotation
    • the topology name is the only identifier which maps to the specific worker topology the user configured in the cluster
    • we could create that error in createMachineDeployment/updateMachineDeployment
    • only propagate the CR error in reconcileReferencedTemplate

What do you think?

Note: some of our current errors have hard-coded "failed to update" (e.g. MD InfrastructureMachineTemplate) even though it could have been a create, so I would use "failed to reconcile" in those cases.
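
For (4), a hedged sketch of what the wrapping could look like (the label key, `md`, `infraMachineTemplate`, and the surrounding reconciler are assumptions for illustration, not the actual CAPI code):

```go
// Read the stable topology name from the MachineDeployment's metadata instead of
// reporting the generated object name (the label key below is an assumption).
topologyName := md.GetLabels()["topology.cluster.x-k8s.io/deployment-name"]

if err := r.Client.Create(ctx, infraMachineTemplate); err != nil {
	return errors.Wrapf(err,
		"failed to reconcile %s for MachineDeployment of topology %q",
		infraMachineTemplate.GetKind(), topologyName)
}
```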

@fabriziopandini (Member)

I think we should keep this simple and just remove randomly generated names by replacing KObj with KRef ("failed to create object for ...").
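
A rough sketch of that idea (hypothetical; `clonedFromName` would hold the deterministic name of the source template, e.g. taken from the cloned-from annotation):

```go
// Instead of klog.KObj(template), which prints the randomly generated name,
// reference the deterministic name of the template the object is cloned from
// (klog.KRef is from k8s.io/klog/v2).
return errors.Wrapf(err, "failed to create object for %s",
	klog.KRef(template.GetNamespace(), clonedFromName))
```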

@sbueringer (Member)

I'm not sure if that works in all cases, but I'm fine with alternatives and happy to just take a look at the PR :)

@killianmuldoon (Contributor, Author)

The errors here are deeply wrapped and it's not trivial to unwrap or sanitize them at the right level. I've opened a PR that adds a predicate to the controller so it ignores status updates (using the GenerationChangedPredicate from controller-runtime). It solves the direct issue, but I'd like to hear ideas/responses to this solution.
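
For reference, a minimal sketch of that predicate approach, assuming a typical controller-runtime SetupWithManager (the reconciler name is illustrative; the actual change is in the linked PR):

```go
import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

func (r *ClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&clusterv1.Cluster{},
			// GenerationChangedPredicate skips update events whose metadata.generation
			// did not change, i.e. pure status updates such as newly added conditions,
			// so a failing reconcile backs off with the normal exponential requeue.
			builder.WithPredicates(predicate.GenerationChangedPredicate{})).
		Complete(r)
}
```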
