feat: Add preflight checks framework #1129

dlipovetsky · 2025-05-20T00:22:08Z

What problem does this PR solve?:
Adds a framework for preflight checks. A preflight check is a type of validation that typically requires access to an infrastructure API.

A validating webhook on the Cluster resource executes all preflight checks and returns failures and warnings to the client.

Which issue(s) this PR fixes:
Fixes #

How Has This Been Tested?:

Special notes for your reviewer:

Previously, helm variables were used for only the first webhook configuration.

Extend timeout to 30s

Run checks in parallel

Add unit tests

Let checker decide whether it should run

Rename Checks to Init

dkoshkin

Thanks for keeping the PR small and targeted!

cmd/main.go

dkoshkin · 2025-05-21T23:39:13Z

pkg/webhook/preflight/preflight.go

+	close(resultCh)
+
+	// Collect check results.
+	for result := range resultCh {


Very clean. If running a check fails with an error, or a check returns Allowed == false, resp.Allowed is always set to false, and the handler keeps appending to the causes.

(just mainly thinking out loud)

That's right. We want to keep track of whether we had a rejection, or an error, but collect all the check results, regardless.
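The collection rule described above can be sketched as follows. This is an illustration with plain stand-in types, not the handler in this PR: `Response`, `Cause`, and `collect` are hypothetical names, and only the `FailedPreflight%s` cause type format is taken from the diff.

```go
package main

import "fmt"

// CheckResult mirrors the fields discussed in this thread; the exact field
// set is an assumption.
type CheckResult struct {
	Name    string
	Allowed bool
	Error   bool
	Causes  []string
}

// Cause and Response are stand-ins for the admission response types.
type Cause struct {
	Type    string
	Message string
}

type Response struct {
	Allowed       bool
	InternalError bool
	Causes        []Cause
}

// collect records whether any check rejected or errored, and keeps the
// causes of every check regardless of the outcome of the others.
func collect(results []CheckResult) Response {
	resp := Response{Allowed: true}
	for _, result := range results {
		if result.Error {
			resp.InternalError = true
		}
		if result.Error || !result.Allowed {
			resp.Allowed = false
		}
		for _, message := range result.Causes {
			resp.Causes = append(resp.Causes, Cause{
				Type:    fmt.Sprintf("FailedPreflight%s", result.Name),
				Message: message,
			})
		}
	}
	return resp
}

func main() {
	resp := collect([]CheckResult{
		{Name: "A", Allowed: true},
		{Name: "B", Allowed: false, Causes: []string{"image not found"}},
		{Name: "C", Error: true, Causes: []string{"API unreachable"}},
	})
	fmt.Printf("allowed=%t internalError=%t causes=%d\n",
		resp.Allowed, resp.InternalError, len(resp.Causes))
}
```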

Remove side effects from the initialization. That is, the checker initialization still decides which checks apply, but we defer side effects, and potential errors, to the checks themselves. This allows us to execute all checks that apply, and get the results to the client. Previously, if initialization failed, the checker returned no checks.

dlipovetsky · 2025-05-22T00:05:20Z

@dkoshkin Thanks for reviewing!

As I'm working on #1130, I'm making some changes here.

In hindsight, I should have marked this a draft PR. Sorry that I didn't. I won't force-push any changes here. I hope that will make it easier to review the new changes. 🙏

Address gocritic linter error

Each check returns list of causes

Do not wait for all checkers to initialize before running checks

Derive cause type from name in the check result

Add 'cluster' to webhook path and name

Handle create events only

Make Init godoc more clear

jimmidyson · 2025-05-29T08:47:03Z

pkg/webhook/preflight/preflight.go

+	checkerWG := &sync.WaitGroup{}
+	resultCh := make(chan CheckResult)
+	for _, checker := range h.checkers {
+		checkerWG.Add(1)
+
+		go func(ctx context.Context, checker Checker, resultCh chan CheckResult) {
+			// Initialize the checker.
+			checks := checker.Init(ctx, h.client, cluster)
+
+			// Run its checks in parallel.
+			checksWG := &sync.WaitGroup{}
+			for _, check := range checks {
+				checksWG.Add(1)
+				go func(ctx context.Context, check Check, resultCh chan CheckResult) {
+					result := check(ctx)
+					resultCh <- result
+					checksWG.Done()
+				}(ctx, check, resultCh)
+			}
+			checksWG.Wait()
+
+			checkerWG.Done()
+		}(ctx, checker, resultCh)
+	}


One of the side effects of using channels like this is that results would be non-deterministically ordered, as well as check results from different checkers being interleaved. I would prefer deterministic ordering of results.

How about gathering all checks prior to running them (i.e. run checker.Init for each checker in a loop to get all checks in order) and then run each check asynchronously sending check index back to channel along with result, gathering results in a heap/priority queue using the check index as the priority?

This would allow a buffered channel for results (a single loop with a known number of checks to run), so sends never block. It would also return results in deterministic order, and keep checks that originate from a single checker together (non-interleaved).

Good catch! +1 to deterministically ordered results. (Update: Deterministically ordered results are not required for correctness, and don't matter to clients that interpret the results, but they are easier for a human to read.)

> How about gathering all checks prior to running them (i.e. run checker.Init for each checker in a loop to get all checks in order)

Since checker.Init may call out to an external API, I wanted to do initialization concurrently.

> gathering results in a heap/priority queue using the check index as the priority?

Sounds good.

Because I need to initialize checkers concurrently, I end up needing to order by checker index and check index.

One priority queue won't work. I considered using one priority queue for checkers, and another for checks. I then decided to trade space for simplicity, and stored the results in a two-dimensional slice, shared among all the goroutines. WDYT?

(Update: Sorry, my mistake. I could use one priority queue with two indices (checker, check). I still think the two-dimensional slice shared among all goroutines is a good solution. There's no synchronization required.)

jimmidyson · 2025-05-29T08:58:01Z

pkg/webhook/preflight/preflight.go

+
+		for _, cause := range result.Causes {
+			resp.Result.Details.Causes = append(resp.Result.Details.Causes, metav1.StatusCause{
+				Type:    metav1.CauseType(fmt.Sprintf("FailedPreflight%s", result.Name)),


Is there any concern around unbounded cardinality here?

What concerns do you have? Do you want to impose a limit?

There is no documented limit to the number of causes that can be returned, and I find nothing in the API server's webhook dispatcher code, either.

jimmidyson · 2025-05-29T09:12:54Z

pkg/webhook/preflight/preflight.go

+	if len(resp.Result.Details.Causes) == 0 {
+		return resp
+	}
+
+	// Because we have some causes, we construct the response message and code.
+	resp.Result.Message = "preflight checks failed"
+	resp.Result.Code = http.StatusForbidden
+	resp.Result.Reason = metav1.StatusReasonForbidden
+	if internalError {
+		// Internal errors take precedence over check failures.
+		resp.Result.Code = http.StatusInternalServerError
+		resp.Result.Reason = metav1.StatusReasonInternalError
+	}
+


I don't think checking whether causes is empty is enough here. Does it handle the error response correctly? How about a switch instead, to cover the error and allowed states? I also think it should use StatusReasonInvalid rather than StatusReasonForbidden.

Suggested change

-	if len(resp.Result.Details.Causes) == 0 {
-		return resp
-	}
-
-	// Because we have some causes, we construct the response message and code.
-	resp.Result.Message = "preflight checks failed"
-	resp.Result.Code = http.StatusForbidden
-	resp.Result.Reason = metav1.StatusReasonForbidden
-	if internalError {
-		// Internal errors take precedence over check failures.
-		resp.Result.Code = http.StatusInternalServerError
-		resp.Result.Reason = metav1.StatusReasonInternalError
-	}
+	switch {
+	case internalError:
+		// Internal errors take precedence over check failures.
+		resp.Result.Code = http.StatusInternalServerError
+		resp.Result.Reason = metav1.StatusReasonInternalError
+	case !resp.Allowed:
+		// Because the response is not allowed, preflights must have failed.
+		resp.Result.Message = "preflight checks failed"
+		// net/http has no StatusInvalid constant; StatusReasonInvalid corresponds to 422.
+		resp.Result.Code = http.StatusUnprocessableEntity
+		resp.Result.Reason = metav1.StatusReasonInvalid
+	}

Thanks, fixed!

faiq · 2025-05-29T13:39:18Z

pkg/webhook/preflight/preflight.go

+	go func(wg *sync.WaitGroup, resultCh chan CheckResult) {
+		wg.Wait()
+		close(resultCh)
+	}(checkerWG, resultCh)


Can the function return before this goroutine closes the channel?

What happens with slow checks?

(EDIT) Goroutines will run even after the function returns.

I would still like to know how slow checks end up getting propagated up to the user!

I replaced this code when I switched approaches, but your question is still valid. And it's a good question 😄

> still would like to know how slow checks end up getting propagated up to the user!

The short answer is: they don't.

The webhook configuration has a configurable timeout, up to 30 seconds. The API server abandons the request when the time is up. No results are returned at all, and the user merely sees a "context deadline exceeded" error.

We want to return the results of some checks to the user, even if we can't complete all the checks in time. To do that, the framework will need to (a) impose its own timeout (which must be shorter than the one configured for the webhook), and (b) pre-empt checks that fail to return within that timeout.

Deterministically order results, and fix status reporting

Remove unnecessary copying of slice

dlipovetsky added 2 commits May 19, 2025 16:42

fix: Use helm variables for all webhook configurations

7511f49

Previously, helm variables were used for only the first webhook configuration.

feat: Add preflight checks framework

a40fd92

github-actions bot added feature and removed feature labels May 20, 2025

dlipovetsky mentioned this pull request May 20, 2025

feat: Nutanix VM image preflight check #1130

Draft

dlipovetsky added 3 commits May 20, 2025 11:27

fixup! feat: Add preflight checks framework

24c9d45

Extend timeout to 30s

fixup! feat: Add preflight checks framework

858e6fd

Run checks in parallel

fixup! feat: Add preflight checks framework

2c69dd0

Add unit tests

dlipovetsky force-pushed the dlipovetsky/preflight-checks-framework branch from b768947 to 2c69dd0 Compare May 20, 2025 21:06

dlipovetsky added 2 commits May 20, 2025 16:07

fixup! feat: Add preflight checks framework

7e687a2

Let checker decide whether it should run

fixup! feat: Add preflight checks framework

306a852

Rename Checks to Init

dkoshkin approved these changes May 21, 2025

dlipovetsky added 7 commits May 21, 2025 17:38

fixup! feat: Add preflight checks framework

690a89e

Address gocritic linter error

fixup! feat: Add preflight checks framework

47c6ad8

Each check returns list of causes

fixup! feat: Add preflight checks framework

3211e7a

Do not wait for all checkers to initialize before running checks

fixup! feat: Add preflight checks framework

5eb6d26

Derive cause type from name in the check result

fixup! feat: Add preflight checks framework

4a518b3

Add 'cluster' to webhook path and name

fixup! feat: Add preflight checks framework

438495b

Handle create events only

fixup! feat: Add preflight checks framework

b377301

Make Init godoc more clear

jimmidyson reviewed May 29, 2025

faiq reviewed May 29, 2025

dkoshkin self-requested a review May 29, 2025 19:24

dlipovetsky added 2 commits May 29, 2025 18:10

fixup! feat: Add preflight checks framework

ec20c1e

Deterministically order results, and fix status reporting

fixup! feat: Add preflight checks framework

d766f3a

Remove unnecessary copying of slice
