Improve Snapshot Abort Behavior #54256
Conversation
This commit improves the behavior of aborting snapshots and by that fixes some extremely rare test failures.

Improvements:

1. When aborting a snapshot while it is in the `INIT` stage we do not need to ever delete anything from the repository, because nothing is written to the repo during `INIT` any more (in the past, running deletes for these snapshots made sense because we were writing `snap-` and `meta-` blobs during the `INIT` step).
2. Do not try to finalize snapshots that never moved past `INIT`, for the same reason as in the first step. If we never moved past `INIT`, no data was written to the repo, so there is no need to write a useless entry for the aborted snapshot to `index-N`. This is especially true since the reason the snapshot was aborted during `INIT` was a delete call, so the useless empty snapshot just added to `index-N` would be removed by the subsequent delete that is still waiting anyway.
3. If, after aborting a snapshot, we wait for it to finish, we should not try deleting it if it failed (see the sketch right after this description). If the snapshot failed, it means it did not become part of the most recent `RepositoryData`, so a delete for it will needlessly fail with a confusing message about that snapshot being missing or about concurrent repository modification.
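To make improvement 3 concrete, here is a minimal, self-contained sketch; the enum and method names below are made up for illustration and are not the actual `SnapshotsService` code:

```java
// Illustrative only: decides what a pending delete should do once an aborted snapshot has finished.
enum SnapshotEndState { SUCCESS, FAILED }

final class AbortThenDeleteSketch {

    static void afterAbortedSnapshotFinished(SnapshotEndState endState,
                                             Runnable deleteFromRepository,
                                             Runnable reportSnapshotMissing) {
        if (endState == SnapshotEndState.SUCCESS) {
            // The snapshot became part of the latest RepositoryData, so the waiting delete
            // still has real work to do against the repository.
            deleteFromRepository.run();
        } else {
            // A failed snapshot never made it into RepositoryData; running the repository delete
            // would only fail with a confusing "missing snapshot" / concurrent-modification style
            // error, so report it as missing (404) to the caller instead.
            reportSnapshotMissing.run();
        }
    }
}
```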
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
@@ -1292,6 +1300,7 @@ public ClusterState execute(ClusterState currentState) {
    shards = snapshotEntry.shards();
    assert shards.isEmpty();
    failure = "Snapshot was aborted during initialization";
    abortedDuringInit = true;
This whole step is kind of stupid now in 7.6+ because we don't write anything during `INIT`. Ideally (and I'd do that in a follow-up), we shouldn't move the snapshot to `ABORTED` here but instead just drop it from the cluster state right away and resolve the listener in `beginSnapshot`, to avoid the redundant CS updates from moving to `ABORTED` and then removing the snapshot from the CS in `beginSnapshot`.
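To make that suggested follow-up a bit more concrete, here is a rough, self-contained sketch; the types and method names (`Entry`, `markAborted`, `dropInitStageSnapshot`) are made-up stand-ins for the real cluster-state structures, not the actual `SnapshotsInProgress` API:

```java
import java.util.ArrayList;
import java.util.List;

final class InitAbortSketch {
    record Entry(String snapshot, String state) {}

    // Current behavior: one cluster-state update marks the INIT entry as ABORTED ...
    static List<Entry> markAborted(List<Entry> entries, String snapshot) {
        List<Entry> updated = new ArrayList<>();
        for (Entry e : entries) {
            updated.add(e.snapshot().equals(snapshot) ? new Entry(snapshot, "ABORTED") : e);
        }
        return updated;
    }
    // ... and a second update in beginSnapshot later removes the entry again.

    // Suggested follow-up: for INIT-stage aborts, drop the entry in a single update and
    // resolve the waiting listener right away, since nothing was written to the repo yet.
    static List<Entry> dropInitStageSnapshot(List<Entry> entries, String snapshot) {
        List<Entry> updated = new ArrayList<>(entries);
        updated.removeIf(e -> e.snapshot().equals(snapshot) && e.state().equals("INIT"));
        return updated;
    }
}
```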
The JavaDocs on `beginSnapshot` should be updated as well, as they claim that the snapshot is created in the repo.
Priority priority = immediatePriority ? Priority.IMMEDIATE : Priority.NORMAL;
logger.info("deleting snapshot [{}] assuming repository generation [{}] and with priory [{}]",
This change will help debug future issues and test failures more easily. If we're aborting, we now get a log sequence like the one below (without the newly added information it was impossible to tell whether the deletes were due to rerunning the delete after finishing the snapshot or due to REST client retries):
[2020-03-26T11:45:02,542][INFO ][o.e.s.SnapshotsService ] [asyncIntegTest-0] deleting snapshot [test_repository:test_snapshot/P5-mkax-TwupQvH2i4i6Kw] assuming repository generation [-1] and with priory [NORMAL]
[2020-03-26T11:45:02,640][INFO ][o.e.s.SnapshotsService ] [asyncIntegTest-0] snapshot [test_repository:test_snapshot/P5-mkax-TwupQvH2i4i6Kw] completed with state [SUCCESS]
[2020-03-26T11:45:02,656][INFO ][o.e.s.SnapshotsService ] [asyncIntegTest-0] deleting snapshot [test_repository:test_snapshot/P5-mkax-TwupQvH2i4i6Kw] assuming repository generation [0] and with priory [IMMEDIATE]
} else {
    logger.warn("deleted snapshot failed", e);
    listener.onFailure(
        new SnapshotMissingException(snapshot.getRepository(), snapshot.getSnapshotId(), e));
I'm not sure what cases this is supposed to cover. In particular, I'm wondering about the case where the current node failed (e.g. got disconnected from the rest of the cluster) and another master completed the snapshot. How are the listeners in `snapshotCompletionListeners` informed?
> I'm not sure what cases this is supposed to cover.

The only way I see of getting here is the one in the linked test failure: master tried to finalize the snapshot and ran into an `IOException` (or some other exception, but I don't see which one).

That said ... you're right, on master fail-over we can leak the `snapshotCompletionListeners` (urgh ... I wonder if that explains the odd test failure of a hanging snapshot once a month). I'll open another PR for that?
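As a purely illustrative sketch of the leak being discussed (the map and methods below are made-up stand-ins, not the real `SnapshotsService` internals): listeners stay keyed by snapshot until someone completes them, so if the node stops being master without failing them, the waiting abort/delete callers hang forever.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

final class CompletionListenersSketch {
    // Stand-in for snapshotCompletionListeners: callers waiting for a snapshot to finish.
    private final Map<String, List<Consumer<Exception>>> listeners = new ConcurrentHashMap<>();

    void addListener(String snapshot, Consumer<Exception> listener) {
        listeners.computeIfAbsent(snapshot, k -> new CopyOnWriteArrayList<>()).add(listener);
    }

    // Without an explicit hook like this when the node loses mastership, the entries are never
    // completed on the old master and the waiting callers never hear back - the leak above.
    void failAllOnMasterFailOver(Exception cause) {
        listeners.values().forEach(list -> list.forEach(l -> l.accept(cause)));
        listeners.clear();
    }
}
```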
#54286 should do it but I'd like a few hours of `SnapshotsResiliencyTests` to be sure :)
By wrapping the original exception here, I wonder if we potentially turn a failing master (`FailedToCommitClusterStateException` / `NotMasterException`) into a `SnapshotMissingException`.
That's a good point ... I wonder if we should just pass those two exceptions (failed to commit / not master) along as they come, without wrapping. At this point, the delete has not in fact put anything into the cluster state aside from aborting the snapshot. So if we get here and run into one of those master fail-over exceptions, then retrying the delete request (the master transport action will do that here) seems like what we would actually want to happen, right?
Pushed 2a2422b for the above
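A minimal sketch of the pass-through-vs-wrap handling described above (hypothetical helper names, not the actual code pushed in 2a2422b): master fail-over exceptions are passed through unwrapped so the master-node transport action can retry the delete, while any other finalization failure is reported as a missing snapshot.

```java
import java.util.function.Consumer;

final class DeleteFailureHandlingSketch {

    // Stand-in check for: e instanceof NotMasterException || e instanceof FailedToCommitClusterStateException
    static boolean isMasterFailOverException(Exception e) {
        String name = e.getClass().getSimpleName();
        return name.equals("NotMasterException") || name.equals("FailedToCommitClusterStateException");
    }

    static void handleAbortedSnapshotFailure(Exception e, Consumer<Exception> deleteListener) {
        if (isMasterFailOverException(e)) {
            // Pass through unwrapped so the transport master node action retries against the new master.
            deleteListener.accept(e);
        } else {
            // Finalization failed, so the snapshot never made it into RepositoryData; a repository
            // delete could only 404, so report the snapshot as missing right away.
            deleteListener.accept(new RuntimeException("snapshot is missing", e)); // stand-in for SnapshotMissingException
        }
    }
}
```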
Thanks Yannick, updated the Javadocs + fixed typo. Big thanks for spotting that bug!

Jenkins test this (Jenkins locked up mid-way somehow)

Jenkins run elasticsearch-ci/1 (unrelated/known failure)
} else {
    logger.warn("deleted snapshot failed", e);
    listener.onFailure(
        new SnapshotMissingException(snapshot.getRepository(), snapshot.getSnapshotId(), e));
I think I would rather always bubble up the original exception, marking the deletion as failed (and have the client retry). This listener here can be called in a range of situations, and I don't think that in all cases it denotes that the snapshot has been deleted or fully aborted (especially because with `waitForSnapshot` we are supposed to wait until the snapshot has truly completed, whether exceptional or not).
> and I don't think that in all cases it denotes that the snapshot has been deleted or fully aborted

I don't think that's true. With the exception of the master fail-over exceptions now handled above, all other exceptions mean that snapshot finalization failed. Since we never retry the snapshot finalization except on master fail-over, we can be sure that the snapshot will never be created at this point.

If snapshot finalization failed, then the snapshot has not been finalized in the repo (i.e. is not part of the latest `index-N`) and hence will always throw `SnapshotMissingException` in `deleteSnapshot`.

Without this change, the situation of a failed finalization will behave differently based on timing:

- If the finalization fails before the delete comes in, then we get the `SnapshotMissingException` / 404.
- If it fails after the delete comes in, we throw some other `SnapshotException` wrapping the `SnapshotMissingException` and needlessly try to find the snapshot in the repo.

=> I think we can cleanly leverage the fact that recent changes made things deterministic here and not run deletes that we know will end up in a 404?
Note: The reason I'm adding these simplifications is (outside of fixing some tests) so that the changes for concurrent snapshots become more obvious. For concurrent snapshot operations we will have to leverage the now very deterministic behavior around snapshot finalizations (and them failing) as well in exactly this way.
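A tiny sketch of the timing point above (hypothetical names only): before this change, the exception a delete caller saw depended on whether finalization failed before or after the delete arrived; after it, both orderings surface the same 404-style result.

```java
final class FailedFinalizationTimingSketch {

    static String beforeChange(boolean finalizationFailedBeforeDeleteArrived) {
        return finalizationFailedBeforeDeleteArrived
            ? "SnapshotMissingException (404)"                        // delete finds nothing in RepositoryData
            : "SnapshotException wrapping SnapshotMissingException";  // delete waited, then ran a pointless repo delete
    }

    static String afterChange(boolean finalizationFailedBeforeDeleteArrived) {
        // Ordering no longer matters: a failed finalization always means the snapshot is absent
        // from the latest index-N, so the caller always gets the plain "missing snapshot" / 404.
        return "SnapshotMissingException (404)";
    }
}
```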
OK, I tried to follow all paths through the code, and couldn't find an issue. The whole listener notification logic seems super brittle though.
Thanks Yannick!
Add Snapshot Resiliency Test for Master Failover during Delete (#54866) (#55456)

We only have very indirect coverage of master fail-overs during snapshot delete at the moment. This commit adds a direct test of this scenario and also an assertion that makes sure we are not leaking any snapshot completion listeners in the snapshots service in this scenario. This gives us better coverage of scenarios like #54256 and makes the diff to the upcoming more consistent snapshot delete implementation in #54705 smaller.
This commit improves the behavior of aborting snapshots and by that fixes some extremely rare test failures.

Improvements:

1. When aborting a snapshot while it is in the `INIT` stage we do not need to ever delete anything from the repository, because nothing is written to the repo during `INIT` any more (in the past, running deletes for these snapshots made sense because we were writing `snap-` and `meta-` blobs during the `INIT` step).
2. Do not try to finalize snapshots that never moved past `INIT`, for the same reason as in the first step. If we never moved past `INIT`, no data was written to the repo, so there is no need to write a useless entry for the aborted snapshot to `index-N`. This is especially true since the reason the snapshot was aborted during `INIT` was a delete call, so the useless empty snapshot just added to `index-N` would be removed by the subsequent delete that is still waiting anyway.
3. If, after aborting a snapshot, we wait for it to finish, we should not try deleting it if it failed. If the snapshot failed, it means it did not become part of the most recent `RepositoryData`, so a delete for it will needlessly fail with a confusing message about that snapshot being missing or about concurrent repository modification. I moved to throwing the snapshot missing exception here because that seems the most user friendly: it allows the user to simply ignore `404` returns from the delete API when using it to make sure a snapshot is aborted+deleted.

Example test failure fixed: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-2/19730/consoleText

Marking this as a non-issue since it doesn't have any negative repercussions other than confusing exceptions on some snapshot aborts.

Closes #52843