Skip to content

📖 Runtime Hooks for Add-on management proposal #6418

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 25, 2022

Conversation

ykakarap
Copy link
Contributor

What this PR does / why we need it:

Proposal to address #6374

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #6374

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 15, 2022
@ykakarap ykakarap changed the title 📖 Add Runtime Hooks for Addo-on management proposal 📖 Add Runtime Hooks for Add-on management proposal Apr 15, 2022
@ykakarap ykakarap force-pushed the proposal-runtime-hooks branch from a9c65ff to f70950f Compare April 15, 2022 02:30
Copy link
Contributor

@killianmuldoon killianmuldoon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/area runtime-sdk
/kind proposal

@k8s-ci-robot k8s-ci-robot added area/runtime-sdk Issues or PRs related to Runtime SDK kind/proposal Issues or PRs related to proposals. labels Apr 15, 2022
@ykakarap ykakarap changed the title 📖 Add Runtime Hooks for Add-on management proposal 📖 Runtime Hooks for Add-on management proposal Apr 15, 2022
@fabriziopandini
Copy link
Member

@killianmuldoon @ykakarap In order to gather some feedback let's ping the appointed reviewers (If someone else is interested, we should make sure to add them to this list as well).

Personally I'm looking forward to start experimenting with those webhooks
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 26, 2022
@ykakarap
Copy link
Contributor Author

ykakarap commented Apr 27, 2022

@CecileRobertMichon @enxebre @vincepri @sbueringer
When you find some time can you please review this proposal, that would be great! Thank you 🙂

@enxebre
Copy link
Member

enxebre commented May 4, 2022

Thanks @ykakarap! Generally I think this should be designed bottom-up. Before being able to reason about this high level use case and managed topologies I think is necessary to define hooks for core primitives i.e Cluster and Machine that are the foundation for everything else. Once that's clear we can extend and layer above organically to satisfy any use case.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 6, 2022
@vincepri
Copy link
Member

vincepri commented May 9, 2022

@enxebre The bottom-up approach is definitely something to consider in the fullness of time. Truthfully, there are some open questions on how would we achieve lower level hooks that need to take into account the overall cluster picture. That said, that sounds something that can be a different proposal altogether, given that having topology support for hooks and add-on management doesn't preclude having lower level hooks and integrate them later.

@fabriziopandini
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 11, 2022
@elmiko
Copy link
Contributor

elmiko commented May 11, 2022

agreed at the 11 May 2022 meeting, we are starting a 1 week lazy consensus today. if there are no objections it will merge on the 18th.

authors:
- "@killianmuldoon"
- "@ykakarap"
reviewers:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be great to also geat a review from @jackfrancis and/or @Jont828 since they are working on the add-on story via https://docs.google.com/document/d/1TdbfXC2_Hhg0mH7-7hXcT1Gg8h6oXKrKbnJqbpFFvjw/edit#heading=h.cbd41fsf3rhi

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jackfrancis @Jont828 Would be great if you can take a look if you have some time, thx! :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 @jackfrancis @Jont828 work could really benefit from lifecycle hook to manage edge cases like

  • addon to be upgraded before the cluster itself
  • addon to be upgraded right after the entire CP is upgraded
  • addon to be upgraded only after the entire cluster is upgraded
  • "blocking" cluster upgrade before an addon upgrade is completed
  • addon cleanup before the cluster deletion
    and probably more

Note: we are not yet there in the prototype work, but I guess we will get there soon

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good let me take a look!

Copy link
Contributor

@CecileRobertMichon CecileRobertMichon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed hooks look good to me, but it's unclear to me from reading this proposal what the relationship is between this and the Cluster Addons proposal which also seems to be moving forward. I also think we should make it more clear that these hooks don't have to be used for addon management, they could be leveraged for anything, even if the primary motivation of the proposal is addon management.

@sbueringer sbueringer mentioned this pull request May 12, 2022
30 tasks
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 13, 2022
Copy link
Contributor Author

@ykakarap ykakarap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CecileRobertMichon Thank you for the review.

I have addressed all the comments on the PR based on the discussions on the thread. PTAL.

@CecileRobertMichon
Copy link
Contributor

lgtm

@ykakarap please squash

@ykakarap ykakarap force-pushed the proposal-runtime-hooks branch from 295c087 to d9eed85 Compare May 13, 2022 23:46
@ykakarap
Copy link
Contributor Author

@CecileRobertMichon commits squashed.

@sbueringer
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 14, 2022
* A Runtime Extension should be extensively unit and e2e tested to ensure it behaves as expected.
* Users should be able to manually intervene and unblock the reconciliation.

As future work, we will explore more options like circuit breaker and timeout to unblock reconciliation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much extra scope would it be to add global (configurable) timeout enforcement to all runtime hooks? There are similar baked-in defaults to various k8s runtimes (20 minute context deadline wrappers, for example) that might be worth copying here to better ensure against forward progress deadlocks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the extra scope is the major issue here so much as user expectations - if users are setting RuntimeExtensions can we properly capture what users would expect from the system with a global timeout?

It's something we're thinking about, but it's hard to design a good solution without having the system in place (at least minimally) and getting a good understanding for how it might be used and how it will interact with the other processes / reconciliation loops going on in Cluster API.

Copy link
Member

@sbueringer sbueringer May 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion one problem is that this additional timeout does not make sense for all RuntimeHooks. While it makes sense for lifecycle hooks to specify a timeout after which the hooks are ignored, it wouldn't make sense for the topology mutation hook (external patching). So I think at this point it's hard to generalize this in the correct way.

It would be good to revisit this once we have more experience with the first implementation of the lifecycle hooks.

Additional notes:

  • admission webhooks today also don't have a timeout like this, there is a failure policy (which we also have) but apart from that requests will be just blocked if the webhook is broken
  • in the good case where a lifecycle hook works in general it is possible to implement a timeout inside of the hook, so that it only blocks at most for a certain period.

tl;dr I would prefer to defer introducing an additional timeout until we have some more experience with / feedback for lifecycle hooks (e.g. through the work on addon management)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thoughtful responses!

@jackfrancis
Copy link
Contributor

lgtm

@Jont828
Copy link
Contributor

Jont828 commented May 18, 2022

I think this proposal looks great, the design seems pretty clean and straightforward. I did have a question though. It mentions that this is designed specifically for clusters created from a ClusterClass. In that case, which of the hooks/features are compatible with non-ClusterClass clusters?

@killianmuldoon
Copy link
Contributor

@Jont828

I think this proposal looks great, the design seems pretty clean and straightforward. I did have a question though. It mentions that this is designed specifically for clusters created from a ClusterClass. In that case, which of the hooks/features are compatible with non-ClusterClass clusters?

The initial hooks for addon management defined here are only designed for ClusterClass managed Clusters. We definitely think that non-ClusterClass hooks are possible, and probably really useful, but it's hard to make them fit with the lifecycle events we're tracking and interacting with for add on managment

@fabriziopandini
Copy link
Member

/lgtm

@sbueringer
Copy link
Member

/lgtm

Should we merge given that the lazy consensus deadline was last Wednesday and we didn't have any blockers since then?

@Jont828
Copy link
Contributor

Jont828 commented May 24, 2022

@killianmuldoon Gotcha thanks for clarifying! Maybe non-clusterclass hooks could be something to look into later on down the line. But in any case this LGTM.

/lgtm

@fabriziopandini
Copy link
Member

fabriziopandini commented May 25, 2022

Lazy consensus deadline expired
/approve
/lgtm

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabriziopandini

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 25, 2022
@k8s-ci-robot k8s-ci-robot merged commit e2e6c92 into kubernetes-sigs:main May 25, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.2 milestone May 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/runtime-sdk Issues or PRs related to Runtime SDK cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/proposal Issues or PRs related to proposals. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RFE: Runtime Hooks for Add-on Management