-
Notifications
You must be signed in to change notification settings - Fork 7
feat: Add preflight checks framework #1129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Previously, helm variables were used for only the first webhook configuration.
Extend timeout to 30s
Run checks in parallel
Add unit tests
b768947
to
2c69dd0
Compare
Let checker decide whether it should run
Rename Checks to Init
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for keeping the PR small and targeted!
pkg/webhook/preflight/preflight.go
Outdated
close(resultCh) | ||
|
||
// Collect check results. | ||
for result := range resultCh { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very clean, on any Error to run a Check or when a check returns a Allowed==false
the resp
will always be set to resp.Allowed = false
and will keep appending to the causes.
(just mainly thinking out loud)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right. We want to keep track of whether we had a rejection, or an error, but collect all the check results, regardless.
Remove side effects from the initialization. That is, the checker initialization still decides which checks apply, but we defer side effects, and potential errors, to the checks themselves. This allows us to execute all checks that apply, and get the results to the client. Previously, if initialization failed, the checker returned no checks.
Address gocritic linter error
Each check returns list of causes
Do not wait for all checkers to initialize before running checks
Derive cause type from name in the check result
Add 'cluster' to webhook path and name
Handle create events only
Make Init godoc more clear
pkg/webhook/preflight/preflight.go
Outdated
checkerWG := &sync.WaitGroup{} | ||
resultCh := make(chan CheckResult) | ||
for _, checker := range h.checkers { | ||
checkerWG.Add(1) | ||
|
||
go func(ctx context.Context, checker Checker, resultCh chan CheckResult) { | ||
// Initialize the checker. | ||
checks := checker.Init(ctx, h.client, cluster) | ||
|
||
// Run its checks in parallel. | ||
checksWG := &sync.WaitGroup{} | ||
for _, check := range checks { | ||
checksWG.Add(1) | ||
go func(ctx context.Context, check Check, resultCh chan CheckResult) { | ||
result := check(ctx) | ||
resultCh <- result | ||
checksWG.Done() | ||
}(ctx, check, resultCh) | ||
} | ||
checksWG.Wait() | ||
|
||
checkerWG.Done() | ||
}(ctx, checker, resultCh) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the side effects of using channels like this is that results would be non-deterministically ordered, as well as check results from different checkers being interleaved. I would prefer deterministic ordering of results.
How about gathering all checks prior to running them (i.e. run checker.Init
for each checker in a loop to get all checks in order) and then run each check asynchronously sending check index back to channel along with result, gathering results in a heap/priority queue using the check index as the priority?
This would enable using a buffered channel for results (single loop with known number of checks to run) so non-blocking and return results in deterministic order, as well as keeping checks originating from a single checker together (non-interleaved).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! +1 to deterministically ordered results. (Update: Deterministically ordered results are not required for correctness, and don't matter to clients that interpret the results, but are easier to read for a human)
How about gathering all checks prior to running them (i.e. run checker.Init for each checker in a loop to get all checks in order)
Since checker.Init
may call out to an external API, I wanted to do initialization concurrently.
gathering results in a heap/priority queue using the check index as the priority?
Sounds good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I need to initialize checkers concurrently, I end up needing to order by checker index and check index.
One priority queue won't work. I considered using one priority queue for checkers, and another for checks. I then decided to trade off space for complexity, and stored results in a two-dimensional slice, shared among all the goroutines. WDYT?
(Update: Sorry, my mistake. I could use one priority queue with two indices (checker, check)
. I still think the two-dimensional slice shared among all goroutines is a good solution. There's no synchronization required.)
pkg/webhook/preflight/preflight.go
Outdated
|
||
for _, cause := range result.Causes { | ||
resp.Result.Details.Causes = append(resp.Result.Details.Causes, metav1.StatusCause{ | ||
Type: metav1.CauseType(fmt.Sprintf("FailedPreflight%s", result.Name)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any concern around unbounded cardinality here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What concerns do you have? Do you want to impose a limit?
There is no documented limit to the number of causes that can be returned, and I find nothing in the API server's webhook dispatcher code, either.
pkg/webhook/preflight/preflight.go
Outdated
if len(resp.Result.Details.Causes) == 0 { | ||
return resp | ||
} | ||
|
||
// Because we have some causes, we construct the response message and code. | ||
resp.Result.Message = "preflight checks failed" | ||
resp.Result.Code = http.StatusForbidden | ||
resp.Result.Reason = metav1.StatusReasonForbidden | ||
if internalError { | ||
// Internal errors take precedence over check failures. | ||
resp.Result.Code = http.StatusInternalServerError | ||
resp.Result.Reason = metav1.StatusReasonInternalError | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't checking if causes is empty is enough here, does it handle the error response correctly? How about a switch instead to cover error and allowed state? Also think it should use StatusReasonInvalid
rather than StatusReasonForbidden
.
if len(resp.Result.Details.Causes) == 0 { | |
return resp | |
} | |
// Because we have some causes, we construct the response message and code. | |
resp.Result.Message = "preflight checks failed" | |
resp.Result.Code = http.StatusForbidden | |
resp.Result.Reason = metav1.StatusReasonForbidden | |
if internalError { | |
// Internal errors take precedence over check failures. | |
resp.Result.Code = http.StatusInternalServerError | |
resp.Result.Reason = metav1.StatusReasonInternalError | |
} | |
switch { | |
case internalError: | |
// Internal errors take precedence over check failures. | |
resp.Result.Code = http.StatusInternalServerError | |
resp.Result.Reason = metav1.StatusReasonInternalError | |
case !resp.Allowed: | |
// Because the response is not allowed, preflights must have failed. | |
resp.Result.Message = "preflight checks failed" | |
resp.Result.Code = http.StatusInvalid | |
resp.Result.Reason = metav1.StatusReasonInvalid | |
} | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, fixed!
pkg/webhook/preflight/preflight.go
Outdated
go func(wg *sync.WaitGroup, resultCh chan CheckResult) { | ||
wg.Wait() | ||
close(resultCh) | ||
}(checkerWG, resultCh) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can the function return before this go routing closes the channel?
what happens with slow checks?
(EDIT) goroutines will run even after the function returns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still would like to know how slow checks end up getting propogated up to the user!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replaced this code when I switched approaches, but your question is still valid. And it's a good question 😄
still would like to know how slow checks end up getting propogated up to the user!
The short answer is: they don't.
The webhook configuration has a configurable timeout, up to 30 seconds. The API server abandons the request when the time is up. No results are returned at all, and the user merely sees a "context deadline exceeded" error.
We want to return the results of some checks to the user, even if we can't complete all the checks in time. To do that, the framework will need to (a) impose its own timeout (and it must be shorter than the one configured for the webhook), (b) pre-empt checks that fail to return within the timeout.
Deterministically order results, and fix status reporting
Remove unnecessary copying of slice
What problem does this PR solve?:
Adds a framework for preflight checks. A preflight check is a type of validation that typically requires access to an infrastructure API.
A validating webhook on the Cluster resource executes all preflight checks, and returns failures, and warnings to the client.
Which issue(s) this PR fixes:
Fixes #
How Has This Been Tested?:
Special notes for your reviewer: