Unblock blocked repositories after test execution #61703

Closed

Conversation

fcofdez
Contributor

@fcofdez fcofdez commented Aug 31, 2020

If a test fails before a repository has been unblocked, the cleanup
cannot proceed, which makes all subsequent tests fail during
the cleanup phase.

Closes #61541

@fcofdez fcofdez added the >test, :Distributed Coordination/Snapshot/Restore, Team:Distributed (Obsolete), and v7.10.0 labels Aug 31, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

Contributor

@original-brownbear original-brownbear left a comment

I like the idea of doing this for shared clusters, but I have some doubts about the implementation.

@@ -101,6 +102,24 @@ protected Settings nodeSettings(int nodeOrdinal) {
return Arrays.asList(MockRepository.Plugin.class);
}

@After
Contributor

I like the idea of this, but I'm not convinced this is the right place to put it. Many tests don't reuse the test cluster, in which case the work here is unnecessary? Maybe we should make this change in the logic that cleans up the test cluster when it's reused instead?

Contributor Author

The problem is that @After hooks from AbstractSnapshotIntegTestCase are executed before the hooks in ESIntegTestCase. Is there a way to determine the scope of a cluster? We could bypass this cleanup if the cluster is not reused.

Contributor

Maybe we can do a simpler thing and simply skip the repo consistency checks in case of test failure via wrapping in:

if (getSuiteFailureMarker().wasSuccessful()) {

}

That should fix all the cases, because I think closing the repos upstream in the ESIntegTestCase cleanup logic should deal with unblocking and removing all the repos cleanly.
That would also make test failures easier to interpret, since we get rid of failed repo verifications on failed tests.
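
The guard suggested in this comment can be illustrated without the Elasticsearch test framework. The following is a minimal sketch, assuming a hypothetical `FailureMarker` class that stands in for whatever `getSuiteFailureMarker()` returns; `FailureMarker`, `GuardedCleanup`, and their methods are illustrative names and not part of the real API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the object returned by getSuiteFailureMarker().
final class FailureMarker {
    private boolean failed;

    void markFailed() {
        failed = true;
    }

    boolean wasSuccessful() {
        return !failed;
    }
}

// Hypothetical cleanup helper: runs the repository checks only if nothing has failed so far.
final class GuardedCleanup {
    private final FailureMarker marker;
    private final List<String> checkedRepositories = new ArrayList<>();

    GuardedCleanup(FailureMarker marker) {
        this.marker = marker;
    }

    // Mirrors the proposed `if (getSuiteFailureMarker().wasSuccessful()) { ... }` wrapper.
    void assertRepoConsistency(List<String> repositories) {
        if (marker.wasSuccessful()) {
            // In a real test, the consistency check for each repository would run here.
            checkedRepositories.addAll(repositories);
        }
    }

    List<String> checkedRepositories() {
        return checkedRepositories;
    }
}
```

With a marker that has not been tripped, every repository is checked; once `markFailed()` has been called, the check becomes a no-op, so a failed test no longer produces a second, misleading verification failure.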

@fcofdez
Contributor Author

fcofdez commented Aug 31, 2020

jenkins run elasticsearch-ci/packaging-sample-windows

Contributor

@original-brownbear original-brownbear left a comment

Thanks @fcofdez looks good, just one question left :)

@@ -569,6 +583,27 @@ private void afterInternal(boolean afterClass) throws Exception {
}
}

public void unblockRepositories() throws Exception {
Contributor

I'm a little unsure about this. I think we should only run this if the original test failed, shouldn't we? (This is what we do in REST tests.)
Otherwise tests could leak running/blocked snapshots in the background, and we would be quietly cleaning them up here.

Contributor Author

Yes, that makes sense. Do we have something like getSuiteFailureMarker() but for a single test? I cannot find any method like that on ESTestCase.

Contributor

Maybe org.elasticsearch.test.ESTestCase#afterIfFailed will work here (I haven't checked the exact order of things I must admit). If it's called too late maybe we can do something with a @Rule? (I must admit I'm not an expert in JUnit so the latter is a bit of a guess).

@@ -2205,4 +2240,36 @@ public static String resolveCustomDataPath(String index) {
public static boolean inFipsJvm() {
return Boolean.parseBoolean(System.getProperty(FIPS_SYSPROP));
}

protected void awaitNoMoreSnapshotRunningOperations(String viaNode) throws Exception {
Contributor

NIT: reorder this to natural word order awaitNoMoreRunningSnapshotOperations :)

@fcofdez
Contributor Author

fcofdez commented Sep 3, 2020

jenkins test this

@fcofdez
Contributor Author

fcofdez commented Sep 3, 2020

Unfortunately it seems like rules are executed after @After hooks in JUnit and that seems to be the only way to know when a test fails. @original-brownbear I've changed the approach a bit and now we track the blocked repositories and we just unblock those that are blocked after the test execution. Let me know what you think :)
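
The tracking approach described in this comment can be sketched in plain Java. This is only an illustration under stated assumptions: `BlockableRepository` and `BlockedRepositoryTracker` are hypothetical names invented here, loosely modelled on the idea of a mock repository that can be blocked, and they do not reproduce the code in this pull request.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical repository that can be blocked and unblocked.
final class BlockableRepository {
    private final String name;
    private boolean blocked;

    BlockableRepository(String name) {
        this.name = name;
    }

    String name() {
        return name;
    }

    void block() {
        blocked = true;
    }

    void unblock() {
        blocked = false;
    }

    boolean isBlocked() {
        return blocked;
    }
}

// Hypothetical tracker: remembers which repositories a test blocked,
// so that cleanup can release exactly those and nothing else.
final class BlockedRepositoryTracker {
    private final Set<BlockableRepository> tracked = new LinkedHashSet<>();

    void block(BlockableRepository repo) {
        repo.block();
        tracked.add(repo);
    }

    void unblock(BlockableRepository repo) {
        repo.unblock();
        tracked.remove(repo);
    }

    // Intended to be called from an after-test hook.
    // Returns how many repositories were still blocked and had to be released.
    int unblockAll() {
        int released = 0;
        for (BlockableRepository repo : tracked) {
            if (repo.isBlocked()) {
                repo.unblock();
                released++;
            }
        }
        tracked.clear();
        return released;
    }
}
```

A non-zero return value from `unblockAll()` indicates that the test left a repository blocked, which is the situation that would otherwise stall the shared cleanup.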

@original-brownbear
Contributor

@fcofdez sorry this kinda fell off of my radar :( I wonder if this can be dealt with in a very easy way now though because I found two things in the last two weeks:

  • Ensure MockRepository is Unblocked on Node Close #62711: we were accidentally not closing mock repositories, so they didn't get unblocked on node restarts (that should already cover a few cases of the issue here, I think?).
  • We actually have a method org.elasticsearch.test.ESTestCase#afterIfSuccessful on the test suite parent. Can't we simply do this:
diff --git a/test/framework/src/main/java/org/elasticsearch/snapshots/AbstractSnapshotIntegTestCase.java b/test/framework/src/main/java/org/elasticsearch/snapshots/AbstractSnapshotIntegTestCase.java
index cf09ca5cebf..2a835263db1 100644
--- a/test/framework/src/main/java/org/elasticsearch/snapshots/AbstractSnapshotIntegTestCase.java
+++ b/test/framework/src/main/java/org/elasticsearch/snapshots/AbstractSnapshotIntegTestCase.java
@@ -120,8 +120,9 @@ public abstract class AbstractSnapshotIntegTestCase extends ESIntegTestCase {
 
     private String skipRepoConsistencyCheckReason;
 
-    @After
-    public void assertRepoConsistency() {
+    @Override
+    public void afterIfSuccessful() throws Exception {
+        super.afterIfSuccessful();
         if (skipRepoConsistencyCheckReason == null) {
             clusterAdmin().prepareGetRepositories().get().repositories().forEach(repositoryMetadata -> {
                 final String name = repositoryMetadata.name();

and fix all our problems that way by simply not doing any repo health checks for failed tests?
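
The effect of moving the check from an `@After` method into a success-only hook can be shown with a simplified model. This sketch assumes a hypothetical base class `HookedTestCase` whose `run()` method calls exactly one of two hooks; it only imitates the shape of `afterIfSuccessful` and `afterIfFailed`, and the real `ESTestCase` signatures and call order may differ (for instance, the hooks here declare no checked exceptions).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of a base test class with success and failure hooks.
abstract class HookedTestCase {
    final List<String> events = new ArrayList<>();

    protected abstract void testBody();

    protected void afterIfSuccessful() {
        events.add("base:afterIfSuccessful");
    }

    protected void afterIfFailed(Throwable error) {
        events.add("base:afterIfFailed");
    }

    // Runs the test body, then exactly one of the two hooks.
    final boolean run() {
        try {
            testBody();
        } catch (RuntimeException | Error e) {
            afterIfFailed(e);
            return false;
        }
        afterIfSuccessful();
        return true;
    }
}

// Hypothetical snapshot test: the consistency check lives in the success-only hook.
class SnapshotLikeTest extends HookedTestCase {
    private final boolean shouldFail;

    SnapshotLikeTest(boolean shouldFail) {
        this.shouldFail = shouldFail;
    }

    @Override
    protected void testBody() {
        if (shouldFail) {
            throw new IllegalStateException("simulated test failure");
        }
        events.add("test:passed");
    }

    @Override
    protected void afterIfSuccessful() {
        super.afterIfSuccessful();
        // Stands in for the repository consistency check from the diff.
        events.add("snapshot:assertRepoConsistency");
    }
}
```

In this model a failing test never reaches the consistency check, so a repository that is still blocked cannot cause a follow-up failure in the check itself.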

@original-brownbear
Contributor

Closing this. I think this hasn't come up again, and new tests generally seem not to use shared clusters, so I hope this has become somewhat irrelevant now.

Labels
:Distributed Coordination/Snapshot/Restore, Team:Distributed (Obsolete), >test, v8.5.0
Development

Successfully merging this pull request may close these issues.

[CI] SharedClusterSnapshotRestoreIT suite times out
10 participants