-
Notifications
You must be signed in to change notification settings - Fork 1.4k
📖 Runtime Hooks for Add-on management proposal #6418
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
📖 Runtime Hooks for Add-on management proposal #6418
Conversation
a9c65ff
to
f70950f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/area runtime-sdk
/kind proposal
@killianmuldoon @ykakarap In order to gather some feedback let's ping the appointed reviewers (If someone else is interested, we should make sure to add them to this list as well). Personally I'm looking forward to start experimenting with those webhooks |
@CecileRobertMichon @enxebre @vincepri @sbueringer |
Thanks @ykakarap! Generally I think this should be designed bottom-up. Before being able to reason about this high level use case and managed topologies I think is necessary to define hooks for core primitives i.e Cluster and Machine that are the foundation for everything else. Once that's clear we can extend and layer above organically to satisfy any use case. |
@enxebre The bottom-up approach is definitely something to consider in the fullness of time. Truthfully, there are some open questions on how would we achieve lower level hooks that need to take into account the overall cluster picture. That said, that sounds something that can be a different proposal altogether, given that having topology support for hooks and add-on management doesn't preclude having lower level hooks and integrate them later. |
/lgtm |
agreed at the 11 May 2022 meeting, we are starting a 1 week lazy consensus today. if there are no objections it will merge on the 18th. |
authors: | ||
- "@killianmuldoon" | ||
- "@ykakarap" | ||
reviewers: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be great to also geat a review from @jackfrancis and/or @Jont828 since they are working on the add-on story via https://docs.google.com/document/d/1TdbfXC2_Hhg0mH7-7hXcT1Gg8h6oXKrKbnJqbpFFvjw/edit#heading=h.cbd41fsf3rhi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jackfrancis @Jont828 Would be great if you can take a look if you have some time, thx! :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 @jackfrancis @Jont828 work could really benefit from lifecycle hook to manage edge cases like
- addon to be upgraded before the cluster itself
- addon to be upgraded right after the entire CP is upgraded
- addon to be upgraded only after the entire cluster is upgraded
- "blocking" cluster upgrade before an addon upgrade is completed
- addon cleanup before the cluster deletion
and probably more
Note: we are not yet there in the prototype work, but I guess we will get there soon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good let me take a look!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The proposed hooks look good to me, but it's unclear to me from reading this proposal what the relationship is between this and the Cluster Addons proposal which also seems to be moving forward. I also think we should make it more clear that these hooks don't have to be used for addon management, they could be leveraged for anything, even if the primary motivation of the proposal is addon management.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@CecileRobertMichon Thank you for the review.
I have addressed all the comments on the PR based on the discussions on the thread. PTAL.
lgtm @ykakarap please squash |
295c087
to
d9eed85
Compare
@CecileRobertMichon commits squashed. |
/lgtm |
* A Runtime Extension should be extensively unit and e2e tested to ensure it behaves as expected. | ||
* Users should be able to manually intervene and unblock the reconciliation. | ||
|
||
As future work, we will explore more options like circuit breaker and timeout to unblock reconciliation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How much extra scope would it be to add global (configurable) timeout enforcement to all runtime hooks? There are similar baked-in defaults to various k8s runtimes (20 minute context deadline wrappers, for example) that might be worth copying here to better ensure against forward progress deadlocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the extra scope is the major issue here so much as user expectations - if users are setting RuntimeExtensions can we properly capture what users would expect from the system with a global timeout?
It's something we're thinking about, but it's hard to design a good solution without having the system in place (at least minimally) and getting a good understanding for how it might be used and how it will interact with the other processes / reconciliation loops going on in Cluster API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion one problem is that this additional timeout does not make sense for all RuntimeHooks. While it makes sense for lifecycle hooks to specify a timeout after which the hooks are ignored, it wouldn't make sense for the topology mutation hook (external patching). So I think at this point it's hard to generalize this in the correct way.
It would be good to revisit this once we have more experience with the first implementation of the lifecycle hooks.
Additional notes:
- admission webhooks today also don't have a timeout like this, there is a failure policy (which we also have) but apart from that requests will be just blocked if the webhook is broken
- in the good case where a lifecycle hook works in general it is possible to implement a timeout inside of the hook, so that it only blocks at most for a certain period.
tl;dr I would prefer to defer introducing an additional timeout until we have some more experience with / feedback for lifecycle hooks (e.g. through the work on addon management)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the thoughtful responses!
lgtm |
I think this proposal looks great, the design seems pretty clean and straightforward. I did have a question though. It mentions that this is designed specifically for clusters created from a ClusterClass. In that case, which of the hooks/features are compatible with non-ClusterClass clusters? |
The initial hooks for addon management defined here are only designed for ClusterClass managed Clusters. We definitely think that non-ClusterClass hooks are possible, and probably really useful, but it's hard to make them fit with the lifecycle events we're tracking and interacting with for add on managment |
/lgtm |
/lgtm Should we merge given that the lazy consensus deadline was last Wednesday and we didn't have any blockers since then? |
@killianmuldoon Gotcha thanks for clarifying! Maybe non-clusterclass hooks could be something to look into later on down the line. But in any case this LGTM. /lgtm |
Lazy consensus deadline expired |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: fabriziopandini The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What this PR does / why we need it:
Proposal to address #6374
Which issue(s) this PR fixes (optional, in
fixes #<issue number>(, fixes #<issue_number>, ...)
format, will close the issue(s) when PR gets merged):Fixes #6374