Add Snapshot Resiliency Test for Master Failover during Delete #54866

Conversation
We only have very indirect coverage of master failovers during snapshot delete at the moment. This commit adds a direct test of this scenario as well as an assertion that makes sure we are not leaking any snapshot completion listeners in the snapshots service in this scenario. This gives us better coverage of scenarios like elastic#54256 and makes the diff to the upcoming, more consistent snapshot delete implementation in elastic#54705 smaller.
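As a rough illustration of the listener-leak assertion described above, here is a minimal sketch; the accessor pendingCompletionListenerCount() and the randomMaster.snapshotsService field are assumptions standing in for the internal check the test actually performs on SnapshotsService:

```java
// Hypothetical sketch only: "pendingCompletionListenerCount()" is an assumed
// accessor, not a real SnapshotsService API. Once the fail-over and the delete
// have finished, no snapshot completion listeners should remain registered.
final SnapshotsService snapshotsService = randomMaster.snapshotsService;
assertThat("leaked snapshot completion listeners",
    snapshotsService.pendingCompletionListenerCount(), equalTo(0));
```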
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
assertThat(finalSnapshotsInProgress.entries(), empty());
final Repository repository = randomMaster.repositoriesService.repository(repoName);
Collection<SnapshotId> snapshotIds = getRepositoryData(repository).getSnapshotIds();
assertThat(snapshotIds, either(hasSize(1)).or(hasSize(0)));
Shouldn't it always be hasSize(0) when waitForSnapshot is true?
Right :) I shouldn't have mindlessly copied that from the concurrent snapshots branch. Thanks for spotting :)
As a matter of fact, thanks to recent fixes this is always 0. Even on master fail-over the deletes are now properly retried :) Adjusted the tests accordingly. Interestingly enough, this created one strange spot for a one-in-a-million seed where the cleanup logic would take multiple minutes to complete on the fake threadpool (so it's just fake minutes, but still interesting) => that's why I had to up the timeout there.
I'm investigating why that is now.
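For reference, the tightened assertion from the snippet above then becomes roughly the following (a sketch using the matchers already present in the test):

```java
// With deletes retried across master fail-overs, the repository must no longer
// contain the deleted snapshot once the test cluster has stabilised.
final Collection<SnapshotId> snapshotIds = getRepositoryData(repository).getSnapshotIds();
assertThat(snapshotIds, empty());
```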
@tlrx This was a really stupid bug ... I forgot to start the node connections service. This had some strange side effects: some transport handlers never failed, which caused some CS publications on the failing-over master to never complete, so this test only moved on once the failing master was removed from the cluster again after the 1.5m publication timeout ... behaves much better now.
Should be good for review with 3370c84 now :)
LGTM :) Thanks Armin
@@ -266,7 +267,7 @@ public void verifyReposThenStopServices() {
     final AtomicBoolean cleanedUp = new AtomicBoolean(false);
     continueOrDie(cleanupResponse, r -> cleanedUp.set(true));

-    runUntil(cleanedUp::get, TimeUnit.MINUTES.toMillis(1L));
+    runUntil(cleanedUp::get, TimeUnit.MINUTES.toMillis(5L));
I'm investigating why this is taking so long; until then I'm upping the value here, because even though I only found a seed that fails here right now, I bet there's one where this could fail in one of the other master failover tests as well.
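As context for why "fake minutes" are cheap here, a rough sketch of what a runUntil helper over the deterministic task queue looks like; this is illustrative and not the literal helper from SnapshotResiliencyTests:

```java
// Illustrative sketch: run every task that is currently runnable, then jump
// simulated time forward to the next scheduled task, until the condition holds
// or the simulated deadline is exceeded. Because time advances in jumps,
// "minutes" of simulated time cost almost no wall-clock time.
private void runUntil(Supplier<Boolean> condition, long timeoutMillis) {
    final long start = deterministicTaskQueue.getCurrentTimeMillis();
    while (deterministicTaskQueue.getCurrentTimeMillis() - start < timeoutMillis) {
        if (condition.get()) {
            return;
        }
        deterministicTaskQueue.runAllRunnableTasks();
        deterministicTaskQueue.advanceTime();
    }
    fail("condition not fulfilled within " + timeoutMillis + "ms of simulated time");
}
```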
Jenkins run elasticsearch-ci/1 (known task cancel test failure)
Jenkins run elasticsearch-ci/2
LGTM
new NodeConnectionsService(clusterService.getSettings(), threadPool, transportService));
clusterService.getClusterApplierService().start();
clusterService.getClusterApplierService().setNodeConnectionsService(nodeConnectionsService);
nodeConnectionsService.start();
Should we also call close in the stop method?
Yea let's do it to prevent stray connection checks.
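A minimal sketch of what that could look like in the test node's shutdown path; the method and field names here are assumptions based on the diff above, not the literal code:

```java
// Sketch, assuming the test node keeps a reference to its NodeConnectionsService:
// closing it on shutdown stops the periodic connection checks so they cannot fire
// against nodes that have already been stopped.
public void stop() {
    clusterService.close();
    nodeConnectionsService.close(); // prevent stray connection checks after shutdown
    transportService.close();
}
```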
@@ -407,6 +408,49 @@ public void testSnapshotWithNodeDisconnects() {
         assertThat(snapshotIds, hasSize(1));
     }

+    public void testSnapshotDeleteWithMasterFailOvers() {
Failover
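For readers without the full diff of the new test method open, a condensed outline of what it does; the helpers used below (setupTestCluster, createRepositoryIndexAndSnapshot, disconnectCurrentMaster, randomMaster) are approximations of the utilities in SnapshotResiliencyTests, not the literal implementation:

```java
// Condensed, approximate outline of the added test; assumes the usual
// SnapshotResiliencyTests imports and helpers.
public void testSnapshotDeleteWithMasterFailOvers() {
    // several master-eligible nodes so there is somewhere to fail over to, plus one data node
    setupTestCluster(randomFrom(3, 5), 1);

    final String repoName = "repo";
    final String snapshotName = "snapshot";
    final String index = "test";

    // 1. create repository, index and an initial snapshot, waiting for completion
    createRepositoryIndexAndSnapshot(repoName, index, snapshotName);

    // 2. start deleting the snapshot; the listener only records that the call came
    //    back at all, since the node answering may be the master that fails over
    final AtomicBoolean deleteDone = new AtomicBoolean(false);
    client().admin().cluster().prepareDeleteSnapshot(repoName, snapshotName)
        .execute(ActionListener.wrap(r -> deleteDone.set(true), e -> deleteDone.set(true)));

    // 3. fail over the master while the delete is in flight, then let the
    //    deterministic task queue run until things have settled
    disconnectCurrentMaster();
    runUntil(deleteDone::get, TimeUnit.MINUTES.toMillis(5L));

    // 4. the delete must have been retried on the new master: the repository no
    //    longer contains the snapshot and nothing is left in progress
    final Repository repository = randomMaster.repositoriesService.repository(repoName);
    assertThat(getRepositoryData(repository).getSnapshotIds(), empty());
}
```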
…e-master-failover-test
Thanks Yannick!
We only have very indirect coverage of master failovers during snapshot delete
at the moment. This commit adds a direct test of this scenario and also
an assertion that makes sure we are not leaking any snapshot completion listeners
in the snapshots service in this scenario.
This gives us better coverage of scenarios like #54256 and makes the diff
to the upcoming, more consistent snapshot delete implementation in #54705
smaller.