diagnostics: missing logging project shouldn't be fatal error #18714
Conversation
```diff
@@ -149,7 +148,7 @@ func (d *AggregatedLogging) Requirements() (client bool, host bool) {
 	return true, false
 }
 
-func (d *AggregatedLogging) Complete(logger *log.Logger) error {
+func (d *AggregatedLogging) selectProject() error {
```
I think `Complete()` is still a good place for this logic. Just don't return an error if the project is not found. Then in `CanRun()`, return false and the error if `d.Project` is still `nil`. That way you get a nice skip message, not an error.

Returning an error in `Complete()` should be reserved for truly unexpected error conditions that call into question the integrity of the source code. Use the logger to inform the user without halting execution.
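A minimal sketch of that pattern, assuming a string `Project` field and a hypothetical `projectExists` helper in place of the real project API lookup:

```go
package diagnostics

import (
	"fmt"
	"log"
)

// Pared-down stand-in for the real diagnostic type; projectExists is a
// hypothetical helper in place of the actual project API call.
type AggregatedLogging struct {
	Project       string
	projectExists func(name string) bool
}

// Complete selects a logging project if one exists, but never aborts the run.
func (d *AggregatedLogging) Complete(logger *log.Logger) error {
	for _, name := range []string{"logging", "openshift-logging"} {
		if d.projectExists(name) {
			d.Project = name
			return nil
		}
		logger.Printf("project %q not found", name)
	}
	// Deliberately not an error: CanRun() decides whether to skip.
	return nil
}

// CanRun turns a missing project into a skip, not a fatal error.
func (d *AggregatedLogging) CanRun() (bool, error) {
	if d.Project == "" {
		// (false, err) yields a skip message; other diagnostics still run.
		return false, fmt.Errorf("no default logging project found")
	}
	return true, nil
}
```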
Well, I see `Complete()` has been used to kick out bogus flag values too. The point is, execution stops right then if it returns an error. Be mindful that the user may be running all the diagnostics and would likely want to run the rest even if one is irrelevant.
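For illustration, a hypothetical flag check of that kind (the flag and helper are invented, not from the origin code):

```go
// Hypothetical flag validation during the build phase: returning an error
// here halts the entire diagnostics run immediately, not just this diagnostic.
func (d *AggregatedLogging) validateFlags() error {
	if d.Timeout < 0 { // d.Timeout is an invented example flag
		return fmt.Errorf("invalid --timeout value %d: must be non-negative", d.Timeout)
	}
	return nil
}
```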
Force-pushed from d2c4e25 to 00f3654
/lgtm

/retest
```diff
@@ -163,17 +163,20 @@ func (d *AggregatedLogging) Complete(logger *log.Logger) error {
 				d.Debug("AGL0032", fmt.Sprintf("Project %q not found", project))
 				continue
 			}
-			return fmt.Errorf("failed fetching one of the default logging projects %q: %v", project, err)
```
Removing this error seems wrong. If we fail to fetch the default logging project, we will incorrectly print 'Found default logging project...'. Returning an error in this case is reasonable.

The other problem we need to deal with is that when `oadm diagnostics all` is called, one diagnostic's failure should not block execution of the others.

I think the problem is in `pkg/oc/admin/diagnostics/diagnostics.go`:

- `buildDiagnostics()` can ignore a diagnostic if its `Complete()` fails (only capture the error)
- `RunDiagnostics()` can surface the captured errors and run any valid diagnostics:
```go
func (o DiagnosticsOptions) RunDiagnostics() error {
	// returns only valid diagnostics (ignores all diags where Complete() has failed)
	diagnostics, failure := o.buildDiagnostics()
	if failure != nil {
		// Log failure
	}
	if len(diagnostics) > 0 {
		return util.RunDiagnostics(o.Logger(), diagnostics)
	}
	return nil
}
```
> If we fail to fetch the default logging project, we will incorrectly print 'Found default logging project...'?

You're right, I didn't inspect the logic closely enough here. It deals correctly with the projects not being there, but not with any other error that could happen. Although I'd say the correct response here would be to log the error and return `nil` (so that other diagnostics could still run).

> buildDiagnostics() can ignore diagnostic if Complete() fails (only capture the error)

I think you're probably right that this needs adjustment. The current "abort everything" `Complete()` behavior comes from it being run during the "diagnostics build" phase, where anything that goes wrong is considered a critical error. I would like to think through how to clearly separate indicators of what exactly is wrong (the diagnostic runner, the individual diagnostic, its requirements, the user flags, the environment, the thing being diagnosed), all while making the default behavior as helpful as possible. It probably actually simplifies things that you can only run all or one now.
> Removing this error seems wrong. If we fail to fetch the default logging project, we will incorrectly print 'Found default logging project...'? Returning error in this case is reasonable.

The way I understand `Complete()` is that it just finishes the setup in a best-effort kind of way. The diagnostics are run in `Check()`, and they will give the user the desired information about the state of the cluster. Keeping the error there will result in aborting all diagnostics if fetching one of the default logging projects fails for some reason; removing it as suggested here will result in informing the user about the error in `Check()` while still allowing other diagnostics to run.

> I think the problem is in pkg/oc/admin/diagnostics/diagnostics.go

@sosiouxme should I try to change this as @pravisankar suggests here, or is it being done as part of #18709? It doesn't feel correct to refactor diagnostics in this PR; I would rather fix the broken behavior while complying with the current implementation as best we can. Once the diagnostics implementation is changed, we can refactor this as well.

Unfortunately, we don't have another lead for the default logging project than to try a couple of namespaces: `logging` and, more recently, `openshift-logging`. Another approach to logging diagnostics could be to run `Check()` for each, and if one comes back with no issues, consider the diagnostics to be without issues; otherwise display issues for both.
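A rough sketch of that alternative, with a hypothetical per-project `checkProject` helper:

```go
// Run the check against each candidate project; report success if any one
// is healthy, otherwise surface the issues collected for both.
func (d *AggregatedLogging) checkCandidates() []error {
	var issues []error
	for _, name := range []string{"logging", "openshift-logging"} {
		projectIssues := d.checkProject(name) // hypothetical per-project check
		if len(projectIssues) == 0 {
			return nil // one healthy candidate is enough
		}
		issues = append(issues, projectIssues...)
	}
	return issues
}
```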
@wozniakjan your understanding seems correct to me. We just didn't want the `Complete()` logic to fall through to log "Found default logging project..." when really no project was found. To proceed with what you have now, just `Complete()` with an empty project and put this "failed fetching the logging project" error in `CanRun()` so the skip logic is engaged and other diagnostics proceed.

The decision on how to handle `Complete()` failures better is IMHO outside the scope of this PR. BTW, #18709 doesn't address this at all; it's mainly a cosmetic reorg of the code.
Got it, thanks! About #18709, I wanted to clarify what the intentions there were; the name says 'diagnostic reorg', so I just wanted to make sure the PRs won't overlap in an undesired way.

Updated the PR to return `nil` if fetching one of the default logging projects errors, and to also log the error so it is not completely lost (even though, given that `CanRun()` skips because no logging project has been selected, it won't reach the user unless the diagnostics algorithm changes).
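Roughly, the updated loop behaves like this sketch (the `AGL0033` message ID, the client call, and the `d.Error` signature are assumptions based on the surrounding diff, not the exact PR code):

```go
for _, project := range loggingProjects {
	_, err := d.ProjectClient.Projects().Get(project, metav1.GetOptions{}) // assumed client call
	if err != nil {
		if kapierrors.IsNotFound(err) {
			d.Debug("AGL0032", fmt.Sprintf("Project %q not found", project))
			continue
		}
		// Log the failure instead of returning it, so the remaining
		// diagnostics still run; CanRun() then skips this diagnostic
		// because no logging project was selected.
		d.Error("AGL0033", err, fmt.Sprintf("failed fetching one of the default logging projects %q: %v", project, err)) // assumed signature
		return nil
	}
	d.Project = project
	return nil
}
```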
Considering the current limitations, that is a decent approach. Thanks!

I could add that just because a known logging project name exists doesn't mean it's intended for logging. If the admin hasn't used it, anybody can create a project with that name.
Force-pushed from 00f3654 to c2d489d
Flake - `Get https://registry.svc.ci.openshift.org/v2/ci/clonerefs/manifests/latest: dial tcp 35.184.16.217:443: i/o timeout`. /retest
This is a change in diagnostics and the diagnostics unit tests pass: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18714/test_pull_request_origin_unit/10379/. The failure is in jUnit, a potential flake?

/retest
yep, that's this flake: #18497
Looks like the error handling is fine now. For a follow-up, I'd like to note that if the user specifies a project that's not there, the message they get is "There was an error retrieving project '%s' which is most likely a transient error"; it would be nice to check whether that's a "not there" error and message accordingly. But that need not block this, so for now...
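For that follow-up, one way to distinguish the two cases would be the apimachinery not-found helper; a sketch, not the PR's code:

```go
package diagnostics

import (
	"fmt"

	kapierrors "k8s.io/apimachinery/pkg/api/errors"
)

// projectFetchMessage distinguishes a missing project from a (likely
// transient) fetch failure when reporting to the user.
func projectFetchMessage(project string, err error) string {
	if kapierrors.IsNotFound(err) {
		return fmt.Sprintf("project %q does not exist", project)
	}
	return fmt.Sprintf("there was an error retrieving project %q which is most likely a transient error: %v", project, err)
}
```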
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sosiouxme, wozniakjan. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
/test all

[submit-queue is verifying that this PR is safe to merge]
LGTM
Automatic merge from submit-queue.
👍 just needed your blessing @pravisankar
/cherrypick release-3.9
@sosiouxme: new pull request created: #18873
Fix for #18705.

This still reports a missing logging project as an error, but not a fatal one.