
Add Snapshot Resiliency Test for Master Failover during Delete #54866

Conversation

original-brownbear (Contributor)

We only have very indirect coverage of master failovers during snapshot delete
at the moment. This commit adds a direct test of this scenario and also
an assertion that makes sure we are not leaking any snapshot completion listeners
in the snapshots service in this scenario.

This gives us better coverage of scenarios like #54256 and makes the diff
to the upcoming more consistent snapshot delete implementation in #54705
smaller.
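For illustration, a minimal sketch of that leak assertion; the snapshotsService reference and the assertAllListenersResolved() method name are assumptions here, not taken verbatim from this diff:

// Sketch: once the delete has completed, including across a master failover,
// the snapshots service should hold no leftover completion listeners.
// (method name assumed for illustration)
assertTrue(snapshotsService.assertAllListenersResolved());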

original-brownbear added the >test (Issues or PRs that are addressing/adding tests), :Distributed Coordination/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs), v8.0.0, and v7.8.0 labels on Apr 7, 2020
elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

assertThat(finalSnapshotsInProgress.entries(), empty());
final Repository repository = randomMaster.repositoriesService.repository(repoName);
Collection<SnapshotId> snapshotIds = getRepositoryData(repository).getSnapshotIds();
assertThat(snapshotIds, either(hasSize(1)).or(hasSize(0)));
Member

Shouldn't it always be hasSize(0) when waitForSnapshot is true?

Contributor Author

Right :) I shouldn't have mindlessly copied that from the concurrent snapshots branch. Thanks for spotting :)

Contributor Author

As a matter of fact, thanks to recent fixes this is always 0. Even on master fail-over the deletes are now properly retried :) Adjusted the tests accordingly. Interestingly enough, this surfaced one strange spot for a one-in-a-million seed where the cleanup logic takes multiple minutes to complete on the fake threadpool (so it's just fake minutes, but still interesting) => that's why I had to up the timeout there.
I'm investigating why that is now.
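A sketch of how the tightened assertion from the snippet above might read after that adjustment (same names as quoted there):

assertThat(finalSnapshotsInProgress.entries(), empty());
final Repository repository = randomMaster.repositoriesService.repository(repoName);
Collection<SnapshotId> snapshotIds = getRepositoryData(repository).getSnapshotIds();
// Deletes are now retried across master failover, so the repository should
// always end up with no snapshots left, regardless of waitForSnapshot.
assertThat(snapshotIds, hasSize(0));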

Contributor Author

@tlrx This was a really stupid bug ... I forgot to start the node connections service. That had some strange side effects: some transport handlers never failed, which caused some cluster state publications on the failing-over master to never complete, which in turn caused this test to only move on once the failing master was removed from the cluster again after the 1.5m publication timeout ... behaves much better now.
Should be good for review with 3370c84 now :)

Member

LGTM :) Thanks Armin

original-brownbear requested a review from tlrx on April 7, 2020 16:07
@@ -266,7 +267,7 @@ public void verifyReposThenStopServices() {
final AtomicBoolean cleanedUp = new AtomicBoolean(false);
continueOrDie(cleanupResponse, r -> cleanedUp.set(true));

runUntil(cleanedUp::get, TimeUnit.MINUTES.toMillis(1L));
runUntil(cleanedUp::get, TimeUnit.MINUTES.toMillis(5L));
Contributor Author

I'm investigating why this is taking so long; until then I'm upping the value here, because even though I've only found a seed that fails here so far, I bet there's one where this could fail in one of the other master failover tests.

original-brownbear (Contributor Author)

Jenkins run elasticsearch-ci/1 (known task cancel test failure)

original-brownbear (Contributor Author)

Jenkins run elasticsearch-ci/2

ywelsch (Contributor) left a comment

LGTM

new NodeConnectionsService(clusterService.getSettings(), threadPool, transportService));
clusterService.getClusterApplierService().start();
clusterService.getClusterApplierService().setNodeConnectionsService(nodeConnectionsService);
nodeConnectionsService.start();
Contributor

Should we also call close in the stop method?

Contributor Author

Yea let's do it to prevent stray connection checks.
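For example, the shutdown path could mirror the start() call above (a sketch; the exact placement in this harness's stop method is assumed):

// Sketch: shut the service down together with the node so no stray
// connection checks keep running against nodes that are already gone.
nodeConnectionsService.stop();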

@@ -407,6 +408,49 @@ public void testSnapshotWithNodeDisconnects() {
assertThat(snapshotIds, hasSize(1));
}

public void testSnapshotDeleteWithMasterFailOvers() {
Contributor

Failover

original-brownbear merged commit db04fd2 into elastic:master on Apr 8, 2020
original-brownbear deleted the add-snapshot-delete-master-failover-test branch on April 8, 2020 17:36
original-brownbear (Contributor Author)

Thanks Yannick!

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Apr 20, 2020
…ic#54866)

* Add Snapshot Resiliency Test for Master Failover during Delete

We only have very indirect coverage of master failovers during snapshot delete
at the moment. This commit adds a direct test of this scenario and also
an assertion that makes sure we are not leaking any snapshot completion listeners
in the snapshots service in this scenario.

This gives us better coverage of scenarios like elastic#54256 and makes the diff
to the upcoming more consistent snapshot delete implementation in elastic#54705
smaller.
original-brownbear added a commit that referenced this pull request Apr 20, 2020
… (#55456)

* Add Snapshot Resiliency Test for Master Failover during Delete

We only have very indirect coverage of master failovers during snapshot delete
at the moment. This commit adds a direct test of this scenario and also
an assertion that makes sure we are not leaking any snapshot completion listeners
in the snapshots service in this scenario.

This gives us better coverage of scenarios like #54256 and makes the diff
to the upcoming more consistent snapshot delete implementation in #54705
smaller.