
diagnostics: missing logging project shouldn't be fatal error #18714

Conversation


@wozniakjan wozniakjan commented Feb 22, 2018

Fix for #18705

This still reports a missing logging project as an error, but not a fatal error.

@openshift-ci-robot openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Feb 22, 2018
@@ -149,7 +148,7 @@ func (d *AggregatedLogging) Requirements() (client bool, host bool) {
 	return true, false
 }
 
-func (d *AggregatedLogging) Complete(logger *log.Logger) error {
+func (d *AggregatedLogging) selectProject() error {
Member
I think Complete() is still a good place for this logic. Just don't return an error if the project is not found. Then in CanRun(), return false and the error if d.Project is still nil. That way you get a nice skip message, not an error.

Returning an error in Complete() should be reserved for truly unexpected error conditions that call into question the integrity of the source code. Use the logger to inform the user without halting execution.
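For illustration, the suggested shape could look roughly like this (a sketch only; the d.Project string field and the exact message are assumptions, not code from this PR):

func (d *AggregatedLogging) CanRun() (bool, error) {
	// If Complete() could not select a logging project, skip this diagnostic
	// with an explanatory reason instead of halting the whole diagnostics run.
	if d.Project == "" {
		return false, fmt.Errorf("no default logging project (%q or %q) was found", "logging", "openshift-logging")
	}
	return true, nil
}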

Member

Well, I see Complete() has been used to kick out bogus flag values too. The point is that execution stops right then if it returns an error. Be mindful that the user may be running all the diagnostics and would likely want to run the rest even if one is irrelevant.

@wozniakjan wozniakjan force-pushed the diagnostics/logging/18705/missing_project_fatal branch 3 times, most recently from d2c4e25 to 00f3654 Compare February 23, 2018 16:51
@sosiouxme
Member

/lgtm
getting past tests is going to suck today :(

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 23, 2018
@sosiouxme
Member

/retest

@@ -163,17 +163,20 @@ func (d *AggregatedLogging) Complete(logger *log.Logger) error {
 		d.Debug("AGL0032", fmt.Sprintf("Project %q not found", project))
 		continue
 	}
-	return fmt.Errorf("failed fetching one of the default logging projects %q: %v", project, err)


Removing this error seems wrong. If we fail to fetch the default logging project, we will incorrectly print 'Found default logging project...'? Returning an error in this case is reasonable.
The other problem we need to deal with is that when 'oadm diagnostics all' is called, one diag failure should not block execution of the other diags.
I think the problem is in pkg/oc/admin/diagnostics/diagnostics.go

  • buildDiagnostics() can ignore a diagnostic if its Complete() fails (only capture the error)
  • RunDiagnostics() can report the captured errors and run any valid diagnostics
func (o DiagnosticsOptions) RunDiagnostics() error {
	// buildDiagnostics() returns only the valid diagnostics (ignores all diags where Complete() has failed)
	diagnostics, failure := o.buildDiagnostics()
	if failure != nil {
		// log the captured failure(s) instead of aborting
	}
	if len(diagnostics) > 0 {
		return util.RunDiagnostics(o.Logger(), diagnostics)
	}
	return nil
}
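For the first bullet, buildDiagnostics() could capture the Complete() failures instead of aborting. A sketch under the assumption that every requested diagnostic exposes the Complete(logger) hook, that a hypothetical requestedDiagnostics() helper returns the configured diagnostics, and that kutilerrors aliases k8s.io/apimachinery/pkg/util/errors:

func (o DiagnosticsOptions) buildDiagnostics() ([]types.Diagnostic, error) {
	var valid []types.Diagnostic
	var failures []error
	// requestedDiagnostics() is a hypothetical helper returning the configured diagnostics.
	for _, d := range o.requestedDiagnostics() {
		if err := d.Complete(o.Logger()); err != nil {
			// Capture the failure instead of aborting the whole run.
			failures = append(failures, fmt.Errorf("skipping %s: %v", d.Name(), err))
			continue
		}
		valid = append(valid, d)
	}
	return valid, kutilerrors.NewAggregate(failures)
}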

Member

If we fail to fetch the default logging project, we will incorrectly print 'Found default logging project...'?

You're right, I didn't inspect the logic closely enough here. It deals correctly with the projects not being there, but not with any other error that could happen. Although I'd say the correct response here would be to log the error and return nil (so that other diagnostics could still run).

buildDiagnostics() can ignore diagnostic if Complete() fails (only capture the error)

I think you're probably right that this needs adjustment. The current "abort everything" Complete() behavior comes from it being run during the "diagnostics build" phase where anything that goes wrong is considered a critical error. I would like to think through how to clearly separate indicators of what exactly is wrong (the diagnostic runner, the individual diagnostic, its requirements, the user flags, the environment, the thing being diagnosed), all while making the default behavior as helpful as possible. It probably actually simplifies things that you can only run all or one now.

Author

Removing this error seems wrong. If we fail to fetch the default logging project, we will incorrectly print 'Found default logging project...'? Returning error in this case is reasonable.

The way I understand Complete(), it is just there to finish the setup in a best-effort kind of way. The diagnostics are run in Check(), and that is what gives the user the desired information about the state of the cluster. Keeping the error here will abort all diagnostics if fetching one of the default logging projects fails for some reason; removing it as suggested will instead inform the user about the error in Check() while still allowing the other diagnostics to run.

I think the problem is in pkg/oc/admin/diagnostics/diagnostics.go

@sosiouxme should I try to change this as @pravisankar suggests here, or is it being done as part of #18709? It doesn't feel right to refactor diagnostics in this PR; I would rather fix the broken behavior while still complying with the current implementation as best as we can. Once the diagnostics implementation is changed, we can refactor this as well.

Unfortunately, we don't have any better lead for the default logging project than to try a couple of namespaces: logging and, newly, openshift-logging. Another approach to the logging diagnostics could be to run the Check() for each and, if one completes with no issues, consider the diagnostic to be without issues; otherwise, display the issues for both.

Member

@wozniakjan your understanding seems correct to me. We just didn't want the Complete() logic to fall through to log 'Found default logging project...' when really no project was found. To proceed with what you have now, just Complete() with an empty project and put this "failed fetching the logging project" error in CanRun() so the skip logic is engaged and the other diagnostics proceed.

The decision on how to handle Complete() failures better is IMHO outside the scope of this PR. BTW #18709 doesn't address this at all, it's mainly a cosmetic reorg of code.

Author
@wozniakjan wozniakjan Mar 6, 2018

Got it, thanks! And about #18709, I wanted to clarify what the intentions were there; the name said 'diagnostic reorg', so I just wanted to make sure the PRs won't overlap in an undesired way.

Updated the PR to return nil if fetching one of the default logging projects errors, and also to log the error so it is not completely lost (even though, given that CanRun() skips because no logging project has been selected, it won't reach the user unless the diagnostics algorithm changes).
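Roughly, the error handling in the project-selection loop then looks like this (a sketch of the described behavior rather than the exact diff; the loggingProjects variable, the AGL0033 id, and the client call are assumptions):

for _, project := range loggingProjects { // assumed to hold "logging" and "openshift-logging"
	_, err := d.ProjectClient.Projects().Get(project, metav1.GetOptions{})
	if kapierrors.IsNotFound(err) {
		d.Debug("AGL0032", fmt.Sprintf("Project %q not found", project))
		continue
	}
	if err != nil {
		// Log the error so it is not lost, but return nil so the other diagnostics still run;
		// CanRun() will then skip this diagnostic because no project was selected.
		d.Debug("AGL0033", fmt.Sprintf("failed fetching one of the default logging projects %q: %v", project, err))
		return nil
	}
	d.Project = project
	break
}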


Considering the current limitations, that is a decent approach. Thanks!

@pravisankar pravisankar removed the lgtm Indicates that a PR is ready to be merged. label Feb 23, 2018
@sosiouxme
Member

sosiouxme commented Feb 23, 2018

I could add that just because a known logging project name exists doesn't mean it's intended for logging. If the admin hasn't used it, anybody can create a logging or openshift-logging project and run whatever they want, and then diagnostics will complain about missing logging components forever. That's not a blocker, as I don't know of any way to determine a project's intent unless we establish an annotation for the logging project or something like that. And it's an edge case.

@wozniakjan wozniakjan force-pushed the diagnostics/logging/18705/missing_project_fatal branch from 00f3654 to c2d489d Compare March 6, 2018 12:01
@wozniakjan
Author

Flake - Get https://registry.svc.ci.openshift.org/v2/ci/clonerefs/manifests/latest: dial tcp 35.184.16.217:443: i/o timeout.

/retest

@wozniakjan
Author

this is a change in diagnostics and the diagnostics unit tests pass - https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/18714/test_pull_request_origin_unit/10379/

PASS
coverage: 5.8% of statements
ok  	github.com/openshift/origin/pkg/oc/admin/diagnostics/diagnostics/cluster	2.342s	coverage: 5.8% of statements

PASS
coverage: 61.6% of statements
ok  	github.com/openshift/origin/pkg/oc/admin/diagnostics/diagnostics/cluster/aggregated_logging	2.299s	coverage: 61.6% of statements

fail is in jUnit, potential flake?

[WARNING] While the jUnit report found no failed tests, the `go test` process failed.
[WARNING] This usually means that the unit test suite failed to compile.

/retest

@sosiouxme
Member

yep that's this flake: #18497

@sosiouxme
Member

Looks like the error handling is fine now.

For a follow-up I'd like to note that if the user specifies a project that's not there, the message they get is "There was an error retrieving project '%s' which is most likely a transient error" -- it would be nice to check if that's a "not there" error and message accordingly.
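As a hedged sketch of that follow-up (the AGL0040/AGL0041 ids, the client call, and the Error(id, err, message) helper signature are all assumptions):

_, err := d.ProjectClient.Projects().Get(d.Project, metav1.GetOptions{})
switch {
case kapierrors.IsNotFound(err):
	// Tell the user the project simply isn't there, rather than calling it transient.
	d.Error("AGL0040", err, fmt.Sprintf("The specified project %q does not exist", d.Project))
case err != nil:
	d.Error("AGL0041", err, fmt.Sprintf("There was an error retrieving project %q which is most likely a transient error: %v", d.Project, err))
}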

But that need not block this, so for now...
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 6, 2018
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sosiouxme, wozniakjan

The full list of commands accepted by this bot can be found here.

The pull request process is described here


Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot
Contributor

/test all [submit-queue is verifying that this PR is safe to merge]


@pravisankar pravisankar left a comment


LGTM

@@ -163,17 +163,20 @@ func (d *AggregatedLogging) Complete(logger *log.Logger) error {
 		d.Debug("AGL0032", fmt.Sprintf("Project %q not found", project))
 		continue
 	}
-	return fmt.Errorf("failed fetching one of the default logging projects %q: %v", project, err)


Considering the current limitations, that is a decent approach. Thanks!

@openshift-merge-robot
Contributor

Automatic merge from submit-queue.

@openshift-merge-robot openshift-merge-robot merged commit 4aca54f into openshift:master Mar 6, 2018
@sosiouxme
Member

👍 just needed your blessing @pravisankar

@sosiouxme
Member

/cherrypick release-3.9

@openshift-cherrypick-robot

@sosiouxme: new pull request created: #18873

In response to this:

/cherrypick release-3.9

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
