Skip to content

feat: Add preflight checks framework #1129

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

dlipovetsky
Copy link
Contributor

@dlipovetsky dlipovetsky commented May 20, 2025

What problem does this PR solve?:
Adds a framework for preflight checks. A preflight check is a type of validation that typically requires access to an infrastructure API.

A validating webhook on the Cluster resource executes all preflight checks, and returns failures, and warnings to the client.

Which issue(s) this PR fixes:
Fixes #

How Has This Been Tested?:

Special notes for your reviewer:

Previously, helm variables were used for only the first webhook
configuration.
@dlipovetsky dlipovetsky force-pushed the dlipovetsky/preflight-checks-framework branch from b768947 to 2c69dd0 Compare May 20, 2025 21:06
Let checker decide whether it should run
Copy link
Contributor

@dkoshkin dkoshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for keeping the PR small and targeted!

close(resultCh)

// Collect check results.
for result := range resultCh {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very clean, on any Error to run a Check or when a check returns a Allowed==false the resp will always be set to resp.Allowed = false and will keep appending to the causes.

(just mainly thinking out loud)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right. We want to keep track of whether we had a rejection, or an error, but collect all the check results, regardless.

Remove side effects from the initialization. That is, the checker
initialization still decides which checks apply, but we defer side
effects, and potential errors, to the checks themselves. This allows us
to execute all checks that apply, and get the results to the client.

Previously, if initialization failed, the checker returned no checks.
@dlipovetsky
Copy link
Contributor Author

@dkoshkin Thanks for reviewing!

As I'm working on #1130, I'm making some changes here.

In hindsight, I should have marked this a draft PR. Sorry that I didn't. I won't force-push any changes here. I hope that will make it easier to review the new changes. 🙏

Address gocritic linter error
Each check returns list of causes
Do not wait for all checkers to initialize before running checks
Derive cause type from name in the check result
Add 'cluster' to webhook path and name
Comment on lines 85 to 108
checkerWG := &sync.WaitGroup{}
resultCh := make(chan CheckResult)
for _, checker := range h.checkers {
checkerWG.Add(1)

go func(ctx context.Context, checker Checker, resultCh chan CheckResult) {
// Initialize the checker.
checks := checker.Init(ctx, h.client, cluster)

// Run its checks in parallel.
checksWG := &sync.WaitGroup{}
for _, check := range checks {
checksWG.Add(1)
go func(ctx context.Context, check Check, resultCh chan CheckResult) {
result := check(ctx)
resultCh <- result
checksWG.Done()
}(ctx, check, resultCh)
}
checksWG.Wait()

checkerWG.Done()
}(ctx, checker, resultCh)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the side effects of using channels like this is that results would be non-deterministically ordered, as well as check results from different checkers being interleaved. I would prefer deterministic ordering of results.

How about gathering all checks prior to running them (i.e. run checker.Init for each checker in a loop to get all checks in order) and then run each check asynchronously sending check index back to channel along with result, gathering results in a heap/priority queue using the check index as the priority?

This would enable using a buffered channel for results (single loop with known number of checks to run) so non-blocking and return results in deterministic order, as well as keeping checks originating from a single checker together (non-interleaved).

Copy link
Contributor Author

@dlipovetsky dlipovetsky May 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! +1 to deterministically ordered results. (Update: Deterministically ordered results are not required for correctness, and don't matter to clients that interpret the results, but are easier to read for a human)

How about gathering all checks prior to running them (i.e. run checker.Init for each checker in a loop to get all checks in order)

Since checker.Init may call out to an external API, I wanted to do initialization concurrently.

gathering results in a heap/priority queue using the check index as the priority?

Sounds good.

Copy link
Contributor Author

@dlipovetsky dlipovetsky May 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I need to initialize checkers concurrently, I end up needing to order by checker index and check index.

One priority queue won't work. I considered using one priority queue for checkers, and another for checks. I then decided to trade off space for complexity, and stored results in a two-dimensional slice, shared among all the goroutines. WDYT?

(Update: Sorry, my mistake. I could use one priority queue with two indices (checker, check). I still think the two-dimensional slice shared among all goroutines is a good solution. There's no synchronization required.)


for _, cause := range result.Causes {
resp.Result.Details.Causes = append(resp.Result.Details.Causes, metav1.StatusCause{
Type: metav1.CauseType(fmt.Sprintf("FailedPreflight%s", result.Name)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any concern around unbounded cardinality here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What concerns do you have? Do you want to impose a limit?

There is no documented limit to the number of causes that can be returned, and I find nothing in the API server's webhook dispatcher code, either.

Comment on lines 138 to 151
if len(resp.Result.Details.Causes) == 0 {
return resp
}

// Because we have some causes, we construct the response message and code.
resp.Result.Message = "preflight checks failed"
resp.Result.Code = http.StatusForbidden
resp.Result.Reason = metav1.StatusReasonForbidden
if internalError {
// Internal errors take precedence over check failures.
resp.Result.Code = http.StatusInternalServerError
resp.Result.Reason = metav1.StatusReasonInternalError
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't checking if causes is empty is enough here, does it handle the error response correctly? How about a switch instead to cover error and allowed state? Also think it should use StatusReasonInvalid rather than StatusReasonForbidden.

Suggested change
if len(resp.Result.Details.Causes) == 0 {
return resp
}
// Because we have some causes, we construct the response message and code.
resp.Result.Message = "preflight checks failed"
resp.Result.Code = http.StatusForbidden
resp.Result.Reason = metav1.StatusReasonForbidden
if internalError {
// Internal errors take precedence over check failures.
resp.Result.Code = http.StatusInternalServerError
resp.Result.Reason = metav1.StatusReasonInternalError
}
switch {
case internalError:
// Internal errors take precedence over check failures.
resp.Result.Code = http.StatusInternalServerError
resp.Result.Reason = metav1.StatusReasonInternalError
case !resp.Allowed:
// Because the response is not allowed, preflights must have failed.
resp.Result.Message = "preflight checks failed"
resp.Result.Code = http.StatusInvalid
resp.Result.Reason = metav1.StatusReasonInvalid
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed!

Comment on lines 111 to 114
go func(wg *sync.WaitGroup, resultCh chan CheckResult) {
wg.Wait()
close(resultCh)
}(checkerWG, resultCh)
Copy link
Contributor

@faiq faiq May 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can the function return before this go routing closes the channel?

what happens with slow checks?

(EDIT) goroutines will run even after the function returns

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still would like to know how slow checks end up getting propogated up to the user!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced this code when I switched approaches, but your question is still valid. And it's a good question 😄

still would like to know how slow checks end up getting propogated up to the user!

The short answer is: they don't.

The webhook configuration has a configurable timeout, up to 30 seconds. The API server abandons the request when the time is up. No results are returned at all, and the user merely sees a "context deadline exceeded" error.

We want to return the results of some checks to the user, even if we can't complete all the checks in time. To do that, the framework will need to (a) impose its own timeout (and it must be shorter than the one configured for the webhook), (b) pre-empt checks that fail to return within the timeout.

@dkoshkin dkoshkin self-requested a review May 29, 2025 19:24
Deterministically order results, and fix status reporting
Remove unnecessary copying of slice
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants