
WIP: change config webhook policy #3583

Open
wants to merge 1 commit into base: release-1.35
Conversation


@dsimansk dsimansk commented Apr 2, 2025

Fixes JIRA #

Proposed Changes


dsimansk commented Apr 2, 2025

/test all


maschmid commented Apr 2, 2025

Well, the webhook does not start without the configmap...

{"severity":"EMERGENCY","timestamp":"2025-04-02T18:05:38.461816322Z","logger":"webhook","caller":"sharedmain/main.go:284","message":"Failed to start configuration manager","commit":"7523ce2","knative.dev/pod":"webhook-6b9cbc9fc6-c8rqq","error":"configmap \"config-features\" not found","stacktrace":"knative.dev/pkg/injection/sharedmain.MainWithConfig\n\t/workspace/vendor/knative.dev/pkg/injection/sharedmain/main.go:284\nknative.dev/pkg/injection/sharedmain.MainWithContext\n\t/workspace/vendor/knative.dev/pkg/injection/sharedmain/main.go:209\nmain.main\n\t/workspace/cmd/webhook/main.go:172\nruntime.main\n\t/usr/lib/golang/src/runtime/proc.go:271"}

I guess we may instead either delete the config.webhook.serving.knative.dev validatingwebhookconfiguration during KnativeServing reconciliation if it already exists but the webhook deployment (or the CMs?) doesn't,

or force it back into the initial state as defined in the manifest (again, only if the webhook deployment or the CMs don't exist?).

(Maybe not only when they don't exist... basically, whenever we need to touch the webhook deployment (which triggers a new pod to be created), we need to reset the config.webhook.serving.knative.dev validatingwebhookconfiguration, since the webhook updates it when it starts, and we need to reset it before we touch the configmaps, if we're touching them too.)
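A minimal sketch of that reset step, assuming client-go and a hypothetical helper in the reconciler (not the operator's actual code):

```go
package reconciler // illustrative placement only

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// resetConfigWebhook deletes config.webhook.serving.knative.dev so that a
// restarted webhook pod recreates it from the manifest defaults. Per the
// comment above, this would run before the reconciler touches the webhook
// deployment or the config-* ConfigMaps.
func resetConfigWebhook(ctx context.Context, kube kubernetes.Interface) error {
	err := kube.AdmissionregistrationV1().ValidatingWebhookConfigurations().
		Delete(ctx, "config.webhook.serving.knative.dev", metav1.DeleteOptions{})
	if apierrors.IsNotFound(err) {
		// Nothing to reset; the webhook creates the configuration on startup.
		return nil
	}
	return err
}
```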


maschmid commented Apr 2, 2025

Alternatively, we may just drop the config.webhook.serving.knative.dev validatingwebhookconfiguration altogether, as we manage the CMs in the operator anyway? (And if we really need to control that, the SO operator should then be the webhook that guards them...)

(Or the logic of those validations could be done directly in the code that transforms KnativeServing .spec.config to configmaps, so the error could be handled directly in the operator...)
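A rough sketch of doing that validation in the operator itself; the validateFeatures helper is a hypothetical stand-in for whatever parser Serving provides for config-features:

```go
package operator // illustrative only; the real logic would live in the operator's transform code

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// validateFeatures is a hypothetical stand-in for the real per-ConfigMap
// validation of config-features values.
func validateFeatures(data map[string]string) error {
	for k, v := range data {
		if v != "enabled" && v != "disabled" && v != "allowed" {
			return fmt.Errorf("feature %q has unsupported value %q", k, v)
		}
	}
	return nil
}

// transformConfig turns a KnativeServing .spec.config entry into a ConfigMap,
// rejecting invalid values in the operator instead of relying on the webhook.
func transformConfig(name string, data map[string]string) (*corev1.ConfigMap, error) {
	if name == "config-features" {
		if err := validateFeatures(data); err != nil {
			return nil, fmt.Errorf("invalid .spec.config for %s: %w", name, err)
		}
	}
	return &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: "knative-serving"},
		Data:       data,
	}, nil
}
```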


openshift-ci bot commented Apr 3, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dsimansk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Apr 3, 2025
@dsimansk dsimansk force-pushed the pr/release-1.35-tests-patch branch from f26ec6d to 464607d on April 3, 2025 11:59
@dsimansk dsimansk changed the title from "WIP: add ConfigMap to webhook dependant resources" to "WIP: change config webhook policy" on Apr 3, 2025
@@ -9164,7 +9164,7 @@ webhooks:
- key: app.kubernetes.io/component
operator: In
values: ["autoscaler", "controller", "logging", "networking", "observability", "tracing", "net-certmanager"]
-timeoutSeconds: 10
+timeoutSeconds: 30
Contributor

I thought we wanted to decrease it (so that the delay is not that long if the issue occurs? )

Contributor

(10s is probably enough in normal cases where the webhook is working.. )

Contributor Author

Oops, my bad then, I misinterpreted the meaning of this timeout.


dsimansk commented Apr 3, 2025

@maschmid per the Slack thread, I've changed the failurePolicy and timeout on the validating webhook.
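For reference, both knobs live on the validating webhook configuration; a hedged illustration follows (failurePolicy: Ignore is only an assumption based on the PR title, and the final timeout value is not visible in this thread):

```go
package example // illustrative only

import (
	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// configWebhookTuning shows where failurePolicy and timeoutSeconds sit on the
// config validating webhook. Ignore is an assumption based on the PR title;
// the diff above only shows the timeout being changed.
func configWebhookTuning() *admissionregistrationv1.ValidatingWebhookConfiguration {
	policy := admissionregistrationv1.Ignore
	timeout := int32(10) // the review above suggests keeping a short timeout
	return &admissionregistrationv1.ValidatingWebhookConfiguration{
		ObjectMeta: metav1.ObjectMeta{Name: "config.webhook.serving.knative.dev"},
		Webhooks: []admissionregistrationv1.ValidatingWebhook{{
			Name:           "config.webhook.serving.knative.dev",
			FailurePolicy:  &policy,
			TimeoutSeconds: &timeout,
			// ClientConfig, rules, etc. omitted; they come from the Serving manifest.
		}},
	}
}
```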

@dsimansk dsimansk force-pushed the pr/release-1.35-tests-patch branch from 464607d to 9d293ae on April 3, 2025 12:03

openshift-ci bot commented Apr 3, 2025

@dsimansk: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/417-test-upgrade-aws-417
Commit: 9d293ae
Required: true
Rerun command: /test 417-test-upgrade-aws-417

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


skonto commented Apr 7, 2025

@maschmid @dsimansk In the upstream operator we can deploy the webhook deployments first and wait for them to be up, then apply the webhook cfgs, and only after that the rest of the resources. That way, even in corner cases, there will be no resources that can trigger the "webhook not found" error. Ideally you don't want to create the CMs without validation. Right now we only cover certificates, but as I discussed with @dsimansk, if CMs are required for the deployments to work then we need some strict ordering. The reason I didn't cover CMs upstream is that I thought people would remove stuff properly before re-installing, even when S-O is used by other operators. The certificate issue was independent because it was happening upstream with a fresh install, making reconciliation return early with an error (although transient). I will work on that fix upstream first.


dsimansk commented Apr 7, 2025

@skonto isn't that already done in CheckDeployments? At least that was my impression from https://github.com/knative/operator/blob/main/pkg/reconciler/common/deployments.go#L48


skonto commented Apr 8, 2025

@dsimansk Before we call checkDeployments here https://github.com/knative/operator/blob/main/pkg/reconciler/knativeserving/knativeserving.go#L126 we call manifest.Install https://github.com/knative/operator/blob/main/pkg/reconciler/common/install.go#L43, which installs everything except the mutating/validating webhook cfgs and the certificate. My proposal would be:
a) Install the webhook deployments together with the related webhook CMs.
b) Wait for the webhook deployments to be ready.
c) Install the rest of the manifest.
d) Install the webhook cfgs and the dependent resources.

In step a) we can remove the webhook cfg before we start installing things, to avoid the webhook failure mentioned by @maschmid.
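A minimal sketch of that ordering; all helper names are hypothetical stand-ins for the real operator functions:

```go
package reconciler // illustrative only; the real logic lives in the operator reconciler

import "context"

// installer abstracts the operator actions used below; every method name is a
// hypothetical stand-in, not an existing operator API.
type installer interface {
	deleteValidatingWebhookConfig(ctx context.Context, name string) error
	installWebhookDeploymentsAndConfigMaps(ctx context.Context) error
	waitForDeployments(ctx context.Context, names ...string) error
	installRestOfManifest(ctx context.Context) error
	installWebhookConfigsAndDependents(ctx context.Context) error
}

// reconcileOrdered sketches the proposed order: a) webhook deployments plus
// their ConfigMaps (clearing any stale webhook cfg first), b) wait for the
// webhook deployments, c) the rest of the manifest, d) the webhook cfgs and
// dependent resources.
func reconcileOrdered(ctx context.Context, c installer) error {
	// Step a) Reset any leftover cfg so a half-installed state cannot block admission,
	// then install the webhook deployments together with the webhook ConfigMaps.
	if err := c.deleteValidatingWebhookConfig(ctx, "config.webhook.serving.knative.dev"); err != nil {
		return err
	}
	if err := c.installWebhookDeploymentsAndConfigMaps(ctx); err != nil {
		return err
	}
	// Step b) Wait until the webhook pods are actually serving.
	if err := c.waitForDeployments(ctx, "webhook"); err != nil {
		return err
	}
	// Step c) Install the remainder of the manifest.
	if err := c.installRestOfManifest(ctx); err != nil {
		return err
	}
	// Step d) Finally install the webhook cfgs and their dependent resources.
	return c.installWebhookConfigsAndDependents(ctx)
}
```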


skonto commented Apr 8, 2025

So far we are facing the following issues:
a) A fresh install emits a warning due to the certificate being deployed too early. I fixed this upstream.
b) The user does a non-clean uninstall and webhook cfgs are left around. A re-install would then fail because the certificate is deployed too early. Fixed with the PR for a).
c) The user does a non-clean uninstall and webhook cfgs are left around. A re-install would then fail because the CMs are deployed too early.

I suggest we fix c) as commented above.
