
Improve Snapshot Abort Behavior #54256


Merged

Conversation

original-brownbear
Contributor

@original-brownbear original-brownbear commented Mar 26, 2020

This commit improves the behavior of aborting snapshots and thereby fixes
some extremely rare test failures.

Improvements:

  1. When aborting a snapshot while it is in the INIT stage we do not need
    to ever delete anything from the repository because nothing is written to the
    repo during INIT any more (in the past running deletes for these snapshots made
    sense because we were writing snap- and meta- blobs during the INIT step).
  2. Do not try to finalize snapshots that never moved past INIT. Same reason as
    with the first step. If we never moved past INIT no data was written to the repo
    so no need to now write a useless entry for the aborted snapshot to index-N.
    This is especially true, since the reason the snapshot was aborted during INIT was
    a delete call so the useless empty snapshot just added to index-N would be removed
    by the subsequent delete that is still waiting anyway.
  3. If, after aborting a snapshot, we wait for it to finish, we should not try deleting it
    if it failed. If the snapshot failed it means it did not become part of the most recent
    RepositoryData, so a delete for it will needlessly fail with a confusing message about
    that snapshot being missing or concurrent repository modification. I moved to throwing the
    snapshot missing exception here because that seems the most user-friendly option. This allows
    the user to simply ignore 404 returns from the delete API when using it to make sure a
    snapshot is aborted+deleted (a simplified sketch of the resulting flow follows below).
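
A simplified sketch of the resulting delete/abort flow (illustrative only: `SnapshotsInProgress.Entry`, `State.INIT`, `ActionListener` and `SnapshotMissingException` are the real types involved, but the helper method names are made up for this example and no method exists under this name in `SnapshotsService`):

private void deleteInProgressSnapshot(SnapshotsInProgress.Entry entry, ActionListener<Void> listener) {
    if (entry.state() == SnapshotsInProgress.State.INIT) {
        // Nothing has been written to the repository during INIT, so there is nothing to
        // clean up and no entry to finalize into index-N; just drop the snapshot from the
        // cluster state and resolve the delete.
        removeSnapshotFromClusterState(entry, listener);
    } else {
        // Abort the running snapshot and wait for it to complete.
        abortAndAwaitCompletion(entry, ActionListener.wrap(snapshotInfo -> {
            // Finalization succeeded: the snapshot is part of the latest RepositoryData,
            // so the normal repository delete applies.
            deleteCompletedSnapshotFromRepository(entry.snapshot(), listener);
        }, e -> {
            // Finalization failed: the snapshot never became part of RepositoryData, so a
            // repository delete would only fail confusingly. Report it as missing instead,
            // which lets clients treat the abort+delete as a 404.
            listener.onFailure(new SnapshotMissingException(
                entry.snapshot().getRepository(), entry.snapshot().getSnapshotId(), e));
        }));
    }
}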

Example test failure fixed: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-2/19730/consoleText

Marking this as a non-issue since it doesn't have any negative repercussions other than confusing exceptions on some snapshot aborts.

Closes #52843

@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@@ -1292,6 +1300,7 @@ public ClusterState execute(ClusterState currentState) {
shards = snapshotEntry.shards();
assert shards.isEmpty();
failure = "Snapshot was aborted during initialization";
abortedDuringInit = true;
Contributor Author


This whole step is kind of stupid now in 7.6+ because we don't write anything during INIT. Ideally (and I'd do that in a follow-up), we shouldn't move the snapshot to ABORTED here but instead just drop it from the cluster state right away and resolve the listener in beginSnapshot to not have the redundant CS updates from moving to ABORTED and then removing the snapshot from the CS in beginSnapshot.
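
For reference, a rough sketch of what that follow-up could look like (purely illustrative; `stateWithoutInitSnapshot` and the task source string are made up, only the `ClusterStateUpdateTask` pattern mirrors how `SnapshotsService` already submits cluster state updates):

clusterService.submitStateUpdateTask("remove snapshot aborted during INIT", new ClusterStateUpdateTask() {
    @Override
    public ClusterState execute(ClusterState currentState) {
        // Drop the INIT-stage snapshot entry directly instead of first marking it ABORTED.
        return stateWithoutInitSnapshot(currentState, snapshot);
    }

    @Override
    public void onFailure(String source, Exception e) {
        listener.onFailure(e);
    }

    @Override
    public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
        // Resolve the caller right away; beginSnapshot no longer needs a second cluster
        // state update just to remove the aborted entry.
        listener.onResponse(null);
    }
});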

Contributor


The JavaDocs on beginSnapshot should be updated as well, as it claims that the snapshot is created in the repo.

Priority priority = immediatePriority ? Priority.IMMEDIATE : Priority.NORMAL;
logger.info("deleting snapshot [{}] assuming repository generation [{}] and with priory [{}]",
Contributor Author


This change will help debug future issues and test failures more easily. If we're aborting, we now get a log sequence like the one below (without the newly added information it was impossible to tell whether the deletes were due to rerunning the delete after finishing the snapshot or due to REST client retries):

[2020-03-26T11:45:02,542][INFO ][o.e.s.SnapshotsService   ] [asyncIntegTest-0] deleting snapshot [test_repository:test_snapshot/P5-mkax-TwupQvH2i4i6Kw] assuming repository generation [-1] and with priory [NORMAL]
[2020-03-26T11:45:02,640][INFO ][o.e.s.SnapshotsService   ] [asyncIntegTest-0] snapshot [test_repository:test_snapshot/P5-mkax-TwupQvH2i4i6Kw] completed with state [SUCCESS]
[2020-03-26T11:45:02,656][INFO ][o.e.s.SnapshotsService   ] [asyncIntegTest-0] deleting snapshot [test_repository:test_snapshot/P5-mkax-TwupQvH2i4i6Kw] assuming repository generation [0] and with priory [IMMEDIATE]


} else {
logger.warn("deleted snapshot failed", e);
listener.onFailure(
new SnapshotMissingException(snapshot.getRepository(), snapshot.getSnapshotId(), e));
Contributor


I'm not sure what cases this is supposed to cover. In particular, I'm wondering about the case where the current node failed (e.g. got disconnected from the rest of the cluster) and another master completed the snapshot. How are the listeners in snapshotCompletionListeners informed?

Contributor Author


> I'm not sure what cases this is supposed to cover.

The only way I see of getting here is the one in the linked test failure.
Master tried to finalize the snapshot and ran into an IOException (or some other exception, but I don't see which one).

That said ... you're right, on master fail-over we can leak the snapshotCompletionListeners (urgh ... I wonder if that explains the odd test failure of a hanging snapshot once a month). I'll open another PR for that?
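
For illustration only (not the actual change in that follow-up PR), one way to avoid leaking those listeners would be to fail them when the local node loses the master role; `snapshotCompletionListeners` is the field discussed here, while the exception choice and the rest of the wiring are assumptions:

@Override
public void applyClusterState(ClusterChangedEvent event) {
    final boolean wasMaster = event.previousState().nodes().isLocalNodeElectedMaster();
    final boolean isMaster = event.state().nodes().isLocalNodeElectedMaster();
    if (wasMaster && isMaster == false) {
        // This node lost the master role: the new master now owns snapshot finalization,
        // so waiters registered here would otherwise never be notified.
        for (Map.Entry<Snapshot, List<ActionListener<SnapshotInfo>>> entry : snapshotCompletionListeners.entrySet()) {
            final Exception e = new SnapshotException(entry.getKey(), "no longer master");
            entry.getValue().forEach(l -> l.onFailure(e));
        }
        snapshotCompletionListeners.clear();
    }
}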

Contributor Author


#54286 should do it here but I'd like a few hours of SnapshotsResiliencyTests to be sure :)

Contributor


By wrapping the original exception here, I wonder if we potentially turn a failing master (FailedToCommitClusterStateException / NotMasterException) into a SnapshotMissingException

Contributor Author

@original-brownbear original-brownbear Mar 27, 2020


That's a good point ... I wonder if we should just pass those two exceptions (failed to commit / not master) on as they come, without wrapping. At this point, the delete has not in fact put anything into the cluster state aside from aborting the snapshot. So if we get here and run into one of those master fail-over exceptions, then retrying the delete request (the master transport action will do that here) seems like what we would actually want to happen, right?
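
Roughly along these lines (a sketch of the idea, not necessarily the exact code pushed for it):

final Throwable cause = ExceptionsHelper.unwrapCause(e);
if (cause instanceof NotMasterException || cause instanceof FailedToCommitClusterStateException) {
    // Master fail-over: the delete has not modified the repository yet, so let the
    // exception bubble up unwrapped and have the master node action retry against the
    // newly elected master.
    listener.onFailure(e);
} else {
    // Anything else means finalization genuinely failed and the snapshot never made it
    // into RepositoryData, so report it as missing.
    listener.onFailure(new SnapshotMissingException(snapshot.getRepository(), snapshot.getSnapshotId(), e));
}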

Contributor Author


Pushed 2a2422b for the above

@original-brownbear
Contributor Author

Thanks Yannick, updated the Javadocs + fixed typo.

Big thanks for spotting that bug!

@original-brownbear
Contributor Author

Jenkins test this (Jenkins locked up mid-way somehow)

@original-brownbear
Contributor Author

Jenkins run elasticsearch-ci/1 (unrelated/known failure)

} else {
logger.warn("deleted snapshot failed", e);
listener.onFailure(
new SnapshotMissingException(snapshot.getRepository(), snapshot.getSnapshotId(), e));
Contributor


I think I would rather always bubble up the original exception, marking the deletion as failed (and have the client retry). This listener here can be called in a range of situations, and I don't think that in all cases it denotes that the snapshot has been deleted or fully aborted (especially because with waitForSnapshot we are supposed to wait until the snapshot has truly completed, whether exceptional or not).

Contributor Author


> and I don't think that in all cases it denotes that the snapshot has been deleted or fully aborted

I don't think that's true. With the exception of the master fail-over exceptions now handled above, all other exceptions mean that snapshot finalization failed. Since we never retry snapshot finalization except on master fail-over, we can be sure that the snapshot will never be created at this point.

If snapshot finalization failed, then the snapshot has not been finalized in the repo (i.e. is not part of the latest `index-N`) and hence a delete for it will always throw `SnapshotMissingException` in `deleteSnapshot`.

Without this change, the situation of a failed finalization will behave differently based on timing:

1. If the finalization fails before the delete comes in, then we get the `SnapshotMissingException` / 404.
2. If it fails after the delete comes in, we throw some other `SnapshotException` wrapping the `SnapshotMissingException` and needlessly try to find the snapshot in the repo.

=> I think we can cleanly leverage the fact that recent changes made things deterministic here and not run deletes that we know will end up in a 404?

Note: The reason I'm adding these simplifications is (outside of fixing some tests) so that the changes for concurrent snapshots become more obvious. For concurrent snapshot operations we will have to leverage the now very deterministic behavior around snapshot finalizations (and them failing) as well in exactly this way.
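
In practice this determinism is what lets callers use the delete API to abort+delete a snapshot and simply treat a 404 as success. A sketch of that caller-side pattern (assuming the high-level REST client, that the surrounding method declares `throws IOException`, and that the 404 surfaces as an `ElasticsearchStatusException`):

DeleteSnapshotRequest request = new DeleteSnapshotRequest("test_repository", "test_snapshot");
try {
    client.snapshot().delete(request, RequestOptions.DEFAULT);
} catch (ElasticsearchStatusException e) {
    if (e.status() != RestStatus.NOT_FOUND) {
        throw e; // only a 404 means the snapshot was never finalized; anything else is a real failure
    }
}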

Contributor


OK, I tried to follow all paths through the code, and couldn't find an issue. The whole listener notification logic seems super brittle though.

@original-brownbear
Contributor Author

Thanks Yannick!

@original-brownbear original-brownbear merged commit e3a1f2b into elastic:master Mar 30, 2020
@original-brownbear original-brownbear deleted the repo-abort-snapshot-bug branch March 30, 2020 11:10
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Mar 30, 2020
original-brownbear added a commit that referenced this pull request Mar 30, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Apr 7, 2020
We only have very indirect coverage of master failovers during snapshot delete
at the moment. This commit adds a direct test of this scenario and also
an assertion that makes sure we are not leaking any snapshot completion listeners
in the snapshots service in this scenario.

This gives us better coverage of scenarios like elastic#54256 and makes the diff
to the upcoming more consistent snapshot delete implementation in elastic#54705
smaller.
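
Purely as an illustration of what such an assertion could look like (sketched in integration-test style; the actual resiliency test uses its own deterministic test harness, and `completionListenerCount()` is an assumed test-only accessor, not a real method on `SnapshotsService`):

// After the delete/master-failover scenario has fully completed, no snapshot
// completion listeners should be left registered on any node.
for (SnapshotsService snapshotsService : internalCluster().getInstances(SnapshotsService.class)) {
    assertEquals("leaked snapshot completion listeners", 0, snapshotsService.completionListenerCount());
}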
original-brownbear added a commit that referenced this pull request Apr 8, 2020
* Add Snapshot Resiliency Test for Master Failover during Delete

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Apr 20, 2020
…ic#54866)

original-brownbear added a commit that referenced this pull request Apr 20, 2020
… (#55456)

@original-brownbear original-brownbear restored the repo-abort-snapshot-bug branch August 6, 2020 18:23
Successfully merging this pull request may close these issues.

[CI] SLMSnapshotBlockingIntegTests.testSnapshotInProgress fails with RepositoryException