diagnostics: missing logging project shouldn't be fatal error #18714

Merged
@@ -163,17 +163,22 @@ func (d *AggregatedLogging) Complete(logger *log.Logger) error {
 d.Debug("AGL0032", fmt.Sprintf("Project %q not found", project))
 continue
 }
-return fmt.Errorf("failed fetching one of the default logging projects %q: %v", project, err)

Removing this error seems wrong. If we fail to fetch the default logging project, we will incorrectly print 'Found default logging project...'? Returning error in this case is reasonable.
Another problem we need to deal with: when 'oadm diagnostics all' is called, the failure of one diagnostic should not block execution of the others.
I think the problem is in pkg/oc/admin/diagnostics/diagnostics.go

  • buildDiagnostics() can ignore diagnostic if Complete() fails (only capture the error)
  • RunDiagnostics can throw captured errors and run any valid diagnostics
func (o DiagnosticsOptions) RunDiagnostics() error {
    // buildDiagnostics returns only valid diagnostics
    // (it ignores all diagnostics where Complete() has failed).
    diagnostics, failure := o.buildDiagnostics()
    if failure != nil {
        // Log failure
    }
    if len(diagnostics) > 0 {
        return util.RunDiagnostics(o.Logger(), diagnostics)
    }
    return nil
}
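
A self-contained sketch of the suggestion above, for illustration only. The `Diagnostic` interface, `fakeDiag`, and the slice-returning signatures are assumptions made up for this example and are not the actual types in pkg/oc/admin/diagnostics. It shows one way to capture Complete() failures while still running the diagnostics that completed.

```go
package main

import (
	"errors"
	"fmt"
)

// Diagnostic is a hypothetical, pared-down interface for this sketch.
type Diagnostic interface {
	Name() string
	Complete() error
	Check() error
}

// fakeDiag is a test double; a non-empty completeMsg makes Complete() fail.
type fakeDiag struct {
	name        string
	completeMsg string
}

func (f fakeDiag) Name() string { return f.name }

func (f fakeDiag) Complete() error {
	if f.completeMsg != "" {
		return errors.New(f.completeMsg)
	}
	return nil
}

func (f fakeDiag) Check() error { return nil }

// buildDiagnostics keeps every diagnostic whose Complete() succeeded and
// collects the failures instead of aborting on the first one.
func buildDiagnostics(candidates []Diagnostic) ([]Diagnostic, []error) {
	var valid []Diagnostic
	var failures []error
	for _, d := range candidates {
		if err := d.Complete(); err != nil {
			failures = append(failures, fmt.Errorf("%s: %v", d.Name(), err))
			continue
		}
		valid = append(valid, d)
	}
	return valid, failures
}

// runDiagnostics logs the captured failures, then runs whatever is left.
// It returns the names of the diagnostics that ran and all errors seen.
func runDiagnostics(candidates []Diagnostic) ([]string, []error) {
	valid, failures := buildDiagnostics(candidates)
	for _, err := range failures {
		fmt.Println("skipping diagnostic:", err)
	}
	var ran []string
	for _, d := range valid {
		if err := d.Check(); err != nil {
			failures = append(failures, fmt.Errorf("%s: %v", d.Name(), err))
		}
		ran = append(ran, d.Name())
	}
	return ran, failures
}

func main() {
	ran, failures := runDiagnostics([]Diagnostic{
		fakeDiag{name: "AggregatedLogging", completeMsg: "failed fetching project"},
		fakeDiag{name: "OtherDiagnostic"},
	})
	fmt.Println("ran:", ran)
	fmt.Println("failures:", len(failures))
}
```

The point of the split is that a failure in one diagnostic's setup is recorded and reported, but never decides whether the remaining diagnostics get to run.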

Member

If we fail to fetch the default logging project, we will incorrectly print 'Found default logging project...'?

You're right, I didn't inspect the logic closely enough here. It deals correctly with the projects not being there, but not any other error that could happen. Although I'd say the correct response here would be to log the error and return nil (so that other diagnostics could still run).

buildDiagnostics() can ignore diagnostic if Complete() fails (only capture the error)

I think you're probably right that this needs adjustment. The current "abort everything" Complete() behavior comes from it being run during the "diagnostics build" phase where anything that goes wrong is considered a critical error. I would like to think through how to clearly separate indicators of what exactly is wrong (the diagnostic runner, the individual diagnostic, its requirements, the user flags, the environment, the thing being diagnosed), all while making the default behavior as helpful as possible. It probably actually simplifies things that you can only run all or one now.

Author

Removing this error seems wrong. If we fail to fetch the default logging project, we will incorrectly print 'Found default logging project...'? Returning error in this case is reasonable.

The way I understand it, Complete() just finishes the setup on a best-effort basis. The diagnostics themselves run in Check(), which gives the user the desired information about the state of the cluster. Keeping the error there aborts all diagnostics if fetching one of the default logging projects fails for some reason. Removing it, as suggested here, informs the user about the error in Check() while still allowing other diagnostics to run.

I think the problem is in pkg/oc/admin/diagnostics/diagnostics.go

@sosiouxme should I try to change this as @pravisankar suggests here, or is it being done as part of #18709? It doesn't feel right to refactor diagnostics in this PR; I would rather fix the broken behavior while complying with the current implementation as best we can. Once the diagnostics implementation is changed, we can refactor this as well.

Unfortunately, we have no other lead on the default logging project than to try a couple of namespaces: logging and, more recently, openshift-logging. Another approach to logging diagnostics could be to run Check() for each of them: if one comes back with no issues, consider the diagnostic clean; otherwise display the issues for both.
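
The alternative described above could look roughly like the following sketch. `checkFunc` and `checkCandidates` are hypothetical names invented for this example; the real Check() does not return a list of strings. Only the two namespace names come from the discussion.

```go
package main

import "fmt"

// checkFunc is a hypothetical stand-in for running the logging Check()
// against one namespace; it returns the issues found there.
type checkFunc func(project string) []string

// checkCandidates runs check against every candidate project. If any
// candidate is clean, the diagnostic is considered clean and nil is
// returned; otherwise the issues for all candidates are returned.
func checkCandidates(projects []string, check checkFunc) map[string][]string {
	issues := make(map[string][]string)
	for _, p := range projects {
		found := check(p)
		if len(found) == 0 {
			return nil
		}
		issues[p] = found
	}
	return issues
}

func main() {
	candidates := []string{"logging", "openshift-logging"}

	// Assumed scenario: only openshift-logging hosts a healthy deployment.
	issues := checkCandidates(candidates, func(p string) []string {
		if p == "openshift-logging" {
			return nil
		}
		return []string{"project not found"}
	})
	fmt.Println("issues reported:", len(issues))
}
```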

Member

@wozniakjan your understanding seems correct to me. We just didn't want the Complete() logic to fall through and log 'Found default logging project...' when really no project was found. To proceed with what you have now, just let Complete() finish with an empty project and put this "failed fetching the logging project" error in CanRun(), so the skip logic is engaged and other diagnostics proceed.

The decision on how to handle Complete() failures better is IMHO outside the scope of this PR. BTW #18709 doesn't address this at all, it's mainly a cosmetic reorg of code.
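
A minimal sketch of the Complete()/CanRun() split proposed above, under stated assumptions: `loggingDiag`, the `lookup` callback, and `errLookup` are hypothetical stand-ins for the real AggregatedLogging type and its project client. Complete() never fails on a lookup error; it leaves Project empty, and CanRun() turns the empty project into a skip.

```go
package main

import (
	"errors"
	"fmt"
)

// errLookup is a hypothetical error used to simulate a failed project fetch.
var errLookup = errors.New("lookup failed")

// loggingDiag is a pared-down, hypothetical stand-in for AggregatedLogging.
type loggingDiag struct {
	Project string
}

// Complete tries each candidate project in order. A lookup error is logged
// and swallowed, so the setup phase never aborts the other diagnostics.
func (d *loggingDiag) Complete(candidates []string, lookup func(string) (bool, error)) error {
	for _, p := range candidates {
		found, err := lookup(p)
		if err != nil {
			fmt.Printf("fetching project %q returned with error: %v\n", p, err)
			return nil
		}
		if !found {
			continue
		}
		d.Project = p
		return nil
	}
	// No known logging project exists; CanRun() reports it.
	return nil
}

// CanRun engages the skip logic when no project was selected.
func (d *loggingDiag) CanRun() (bool, error) {
	if len(d.Project) == 0 {
		return false, errors.New("Logging project does not exist")
	}
	return true, nil
}

func main() {
	d := &loggingDiag{}
	failing := func(string) (bool, error) { return false, errLookup }
	if err := d.Complete([]string{"logging", "openshift-logging"}, failing); err != nil {
		fmt.Println("unexpected fatal error:", err)
		return
	}
	if ok, reason := d.CanRun(); !ok {
		fmt.Println("skipping diagnostic:", reason)
	}
}
```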

Author

@wozniakjan wozniakjan Mar 6, 2018

Got it, thanks! As for #18709, I wanted to clarify what the intentions were there. Its title says 'diagnostic reorg', so I just wanted to make sure the PRs won't overlap in an undesired way.

Updated the PR to return nil if fetching one of the default logging projects returns an error, and to log that error so it is not completely lost. (Because CanRun() skips the diagnostic when no logging project has been selected, the logged error won't reach the user unless the diagnostics algorithm changes.)

Considering the current limitations, that is a decent approach. Thanks!

+d.Error("AGL0034", err, fmt.Sprintf("Fetching project %q returned with error", project))
+return nil
 }

 d.Debug("AGL0033", fmt.Sprintf("Found default logging project %q", project))
 d.Project = project
 return nil
 }
-return fmt.Errorf("default logging project not found, use '--%s' to specify logging project", flagLoggingProject)
+//tried to complete here but no known logging project exists, will be checked in CanRun()
+return nil
 }

 func (d *AggregatedLogging) CanRun() (bool, error) {
+if len(d.Project) == 0 {
+return false, errors.New("Logging project does not exist")
+}
 if d.OAuthClientClient == nil || d.ProjectClient == nil || d.RouteClient == nil || d.CRBClient == nil || d.DCClient == nil {
 return false, errors.New("Config must include a cluster-admin context to run this diagnostic")
 }