
Improve Snapshot Abort Behavior #54256


Merged

Conversation

original-brownbear
Contributor

@original-brownbear original-brownbear commented Mar 26, 2020

This commit improves the behavior of aborting snapshots and thereby fixes
some extremely rare test failures.

Improvements:

  1. When aborting a snapshot while it is in the INIT stage we do not need
    to ever delete anything from the repository because nothing is written to the
    repo during INIT any more (in the past running deletes for these snapshots made
    sense because we were writing snap- and meta- blobs during the INIT step).
  2. Do not try to finalize snapshots that never moved past INIT. Same reason as
    with the first step. If we never moved past INIT no data was written to the repo
    so no need to now write a useless entry for the aborted snapshot to index-N.
    This is especially true, since the reason the snapshot was aborted during INIT was
    a delete call so the useless empty snapshot just added to index-N would be removed
    by the subsequent delete that is still waiting anyway.
  3. If, after aborting a snapshot, we wait for it to finish, we should not try deleting it
    if it failed. If the snapshot failed it means it did not become part of the most recent
    RepositoryData, so a delete for it will needlessly fail with a confusing message about
    that snapshot being missing or concurrent repository modification. I moved to throwing the
    snapshot missing exception here because that seems the most user-friendly option. This allows
    the user to simply ignore 404 returns from the delete API when using it to make sure a
    snapshot is aborted+deleted (a simplified sketch of the resulting flow follows below).
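
A simplified sketch of the resulting delete/abort flow (illustrative only: `SnapshotsInProgress.Entry`, `State.INIT`, `ActionListener` and `SnapshotMissingException` are the real types involved, but the helper method names are made up for this example and no method exists under this name in `SnapshotsService`):

private void deleteInProgressSnapshot(SnapshotsInProgress.Entry entry, ActionListener<Void> listener) {
    if (entry.state() == SnapshotsInProgress.State.INIT) {
        // Nothing has been written to the repository during INIT, so there is nothing to
        // clean up and no entry to finalize into index-N; just drop the snapshot from the
        // cluster state and resolve the delete.
        removeSnapshotFromClusterState(entry, listener);
    } else {
        // Abort the running snapshot and wait for it to complete.
        abortAndAwaitCompletion(entry, ActionListener.wrap(snapshotInfo -> {
            // Finalization succeeded: the snapshot is part of the latest RepositoryData,
            // so the normal repository delete applies.
            deleteCompletedSnapshotFromRepository(entry.snapshot(), listener);
        }, e -> {
            // Finalization failed: the snapshot never became part of RepositoryData, so a
            // repository delete would only fail confusingly. Report it as missing instead,
            // which lets clients treat the abort+delete as a 404.
            listener.onFailure(new SnapshotMissingException(
                entry.snapshot().getRepository(), entry.snapshot().getSnapshotId(), e));
        }));
    }
}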

Example test failure fixed: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-2/19730/consoleText

Marking this as a non-issue since it doesn't have any negative repercussions other than confusing exceptions on some snapshot aborts.

Closes #52843

@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@@ -1292,6 +1300,7 @@ public ClusterState execute(ClusterState currentState) {
shards = snapshotEntry.shards();
assert shards.isEmpty();
failure = "Snapshot was aborted during initialization";
abortedDuringInit = true;
Contributor Author


This whole step is kind of stupid now in 7.6+ because we don't write anything during INIT. Ideally (and I'd do that in a follow-up), we shouldn't move the snapshot to ABORTED here but instead just drop it from the cluster state right away and resolve the listener in beginSnapshot to not have the redundant CS updates from moving to ABORTED and then removing the snapshot from the CS in beginSnapshot.
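
For reference, a rough sketch of what that follow-up could look like (purely illustrative; `stateWithoutInitSnapshot` and the task source string are made up, only the `ClusterStateUpdateTask` pattern mirrors how `SnapshotsService` already submits cluster state updates):

clusterService.submitStateUpdateTask("remove snapshot aborted during INIT", new ClusterStateUpdateTask() {
    @Override
    public ClusterState execute(ClusterState currentState) {
        // Drop the INIT-stage snapshot entry directly instead of first marking it ABORTED.
        return stateWithoutInitSnapshot(currentState, snapshot);
    }

    @Override
    public void onFailure(String source, Exception e) {
        listener.onFailure(e);
    }

    @Override
    public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
        // Resolve the caller right away; beginSnapshot no longer needs a second cluster
        // state update just to remove the aborted entry.
        listener.onResponse(null);
    }
});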

Contributor


The JavaDocs on beginSnapshot should be updated as well, as it claims that the snapshot is created in the repo.

Priority priority = immediatePriority ? Priority.IMMEDIATE : Priority.NORMAL;
logger.info("deleting snapshot [{}] assuming repository generation [{}] and with priory [{}]",
Contributor Author


This change will help debug future issues and test failures more easily. If we're aborting, we now get a log sequence like the one below (without the newly added information it was impossible to tell whether the deletes were due to rerunning the delete after finishing the snapshot or due to REST client retries):

[2020-03-26T11:45:02,542][INFO ][o.e.s.SnapshotsService   ] [asyncIntegTest-0] deleting snapshot [test_repository:test_snapshot/P5-mkax-TwupQvH2i4i6Kw] assuming repository generation [-1] and with priory [NORMAL]
[2020-03-26T11:45:02,640][INFO ][o.e.s.SnapshotsService   ] [asyncIntegTest-0] snapshot [test_repository:test_snapshot/P5-mkax-TwupQvH2i4i6Kw] completed with state [SUCCESS]
[2020-03-26T11:45:02,656][INFO ][o.e.s.SnapshotsService   ] [asyncIntegTest-0] deleting snapshot [test_repository:test_snapshot/P5-mkax-TwupQvH2i4i6Kw] assuming repository generation [0] and with priory [IMMEDIATE]


} else {
logger.warn("deleted snapshot failed", e);
listener.onFailure(
new SnapshotMissingException(snapshot.getRepository(), snapshot.getSnapshotId(), e));
Contributor


I'm not sure what cases this is supposed to cover. In particular, I'm wondering about the case where the current node failed (e.g. got disconnected from the rest of the cluster) and another master completed the snapshot. How are the listeners in snapshotCompletionListeners informed?

Contributor Author


> I'm not sure what cases this is supposed to cover.

The only way I see of getting here is the one in the linked test failure.
Master tried to finalize the snapshot and ran into an IOException (or some other exception, but I don't see which one).

That said ... you're right, on master fail-over we can leak the snapshotCompletionListeners (urgh ... I wonder if that explains the odd test failure of a hanging snapshot once a month). I'll open another PR for that?
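
For illustration only (not the actual change in that follow-up PR), one way to avoid leaking those listeners would be to fail them when the local node loses the master role; `snapshotCompletionListeners` is the field discussed here, while the exception choice and the rest of the wiring are assumptions:

@Override
public void applyClusterState(ClusterChangedEvent event) {
    final boolean wasMaster = event.previousState().nodes().isLocalNodeElectedMaster();
    final boolean isMaster = event.state().nodes().isLocalNodeElectedMaster();
    if (wasMaster && isMaster == false) {
        // This node lost the master role: the new master now owns snapshot finalization,
        // so waiters registered here would otherwise never be notified.
        for (Map.Entry<Snapshot, List<ActionListener<SnapshotInfo>>> entry : snapshotCompletionListeners.entrySet()) {
            final Exception e = new SnapshotException(entry.getKey(), "no longer master");
            entry.getValue().forEach(l -> l.onFailure(e));
        }
        snapshotCompletionListeners.clear();
    }
}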

Contributor Author


#54286 should do it here but I'd like a few hours of SnapshotsResiliencyTests to be sure :)

Contributor


By wrapping the original exception here, I wonder if we potentially turn a failing master (FailedToCommitClusterStateException / NotMasterException) into a SnapshotMissingException

Contributor Author

@original-brownbear original-brownbear Mar 27, 2020


That's a good point ... I wonder if we should just pass those two exceptions (failed to commit / not master) on as they come, without wrapping. At this point, the delete has not in fact put anything into the cluster state aside from aborting the snapshot. So if we get here and run into one of those master fail-over exceptions, then retrying the delete request (the master transport action will do that here) seems like what we would actually want to happen, right?
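
Roughly along these lines (a sketch of the idea, not necessarily the exact code pushed for it):

final Throwable cause = ExceptionsHelper.unwrapCause(e);
if (cause instanceof NotMasterException || cause instanceof FailedToCommitClusterStateException) {
    // Master fail-over: the delete has not modified the repository yet, so let the
    // exception bubble up unwrapped and have the master node action retry against the
    // newly elected master.
    listener.onFailure(e);
} else {
    // Anything else means finalization genuinely failed and the snapshot never made it
    // into RepositoryData, so report it as missing.
    listener.onFailure(new SnapshotMissingException(snapshot.getRepository(), snapshot.getSnapshotId(), e));
}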

Contributor Author


Pushed 2a2422b for the above

@original-brownbear
Contributor Author

Thanks Yannick, updated the Javadocs + fixed typo.

Big thanks for spotting that bug!

@original-brownbear
Contributor Author

Jenkins test this (Jenkins locked up mid-way somehow)

@original-brownbear
Contributor Author

Jenkins run elasticsearch-ci/1 (unrelated/known failure)

} else {
logger.warn("deleted snapshot failed", e);
listener.onFailure(
new SnapshotMissingException(snapshot.getRepository(), snapshot.getSnapshotId(), e));
Contributor


I think I would rather always bubble up the original exception, marking the deletion as failed (and have the client retry). This listener here can be called in a range of situations, and I don't think that in all cases it denotes that the snapshot has been deleted or fully aborted (especially because with waitForSnapshot we are supposed to wait until the snapshot has truly completed, whether exceptional or not).

Contributor Author


> and I don't think that in all cases it denotes that the snapshot has been deleted or fully aborted

I don't think that's true. With the exception of the master fail-over exceptions now handled above, all other exceptions mean that snapshot finalization failed. Since we never retry snapshot finalization except on master fail-over, we can be sure that the snapshot will never be created at this point.

If snapshot finalization failed, then the snapshot has not been finalized in the repo (i.e. is not part of the latest `index-N`) and hence a delete for it will always throw `SnapshotMissingException` in `deleteSnapshot`.

Without this change, the situation of a failed finalization will behave differently based on timing:

1. If the finalization fails before the delete comes in, then we get the `SnapshotMissingException` / 404.
2. If it fails after the delete comes in, we throw some other `SnapshotException` wrapping the `SnapshotMissingException` and needlessly try to find the snapshot in the repo.

=> I think we can cleanly leverage the fact that recent changes made things deterministic here and not run deletes that we know will end up in a 404?

Note: The reason I'm adding these simplifications is (outside of fixing some tests) so that the changes for concurrent snapshots become more obvious. For concurrent snapshot operations we will have to leverage the now very deterministic behavior around snapshot finalizations (and them failing) as well in exactly this way.
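
In practice this determinism is what lets callers use the delete API to abort+delete a snapshot and simply treat a 404 as success. A sketch of that caller-side pattern (assuming the high-level REST client, that the surrounding method declares `throws IOException`, and that the 404 surfaces as an `ElasticsearchStatusException`):

DeleteSnapshotRequest request = new DeleteSnapshotRequest("test_repository", "test_snapshot");
try {
    client.snapshot().delete(request, RequestOptions.DEFAULT);
} catch (ElasticsearchStatusException e) {
    if (e.status() != RestStatus.NOT_FOUND) {
        throw e; // only a 404 means the snapshot was never finalized; anything else is a real failure
    }
}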

Contributor


OK, I tried to follow all paths through the code, and couldn't find an issue. The whole listener notification logic seems super brittle though.

@original-brownbear
Contributor Author

Thanks Yannick!

@original-brownbear original-brownbear merged commit e3a1f2b into elastic:master Mar 30, 2020
@original-brownbear original-brownbear deleted the repo-abort-snapshot-bug branch March 30, 2020 11:10
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Mar 30, 2020
original-brownbear added a commit that referenced this pull request Mar 30, 2020
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Apr 7, 2020
We only have very indirect coverage of master failovers during snapshot delete
at the moment. This commit adds a direct test of this scenario and also
an assertion that makes sure we are not leaking any snapshot completion listeners
in the snapshots service in this scenario.

This gives us better coverage of scenarios like elastic#54256 and makes the diff
to the upcoming more consistent snapshot delete implementation in elastic#54705
smaller.
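
Purely as an illustration of what such an assertion could look like (sketched in integration-test style; the actual resiliency test uses its own deterministic test harness, and `completionListenerCount()` is an assumed test-only accessor, not a real method on `SnapshotsService`):

// After the delete/master-failover scenario has fully completed, no snapshot
// completion listeners should be left registered on any node.
for (SnapshotsService snapshotsService : internalCluster().getInstances(SnapshotsService.class)) {
    assertEquals("leaked snapshot completion listeners", 0, snapshotsService.completionListenerCount());
}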
original-brownbear added a commit that referenced this pull request Apr 8, 2020
* Add Snapshot Resiliency Test for Master Failover during Delete

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Apr 20, 2020
…ic#54866)

original-brownbear added a commit that referenced this pull request Apr 20, 2020
… (#55456)

@original-brownbear original-brownbear restored the repo-abort-snapshot-bug branch August 6, 2020 18:23
Successfully merging this pull request may close these issues.

[CI] SLMSnapshotBlockingIntegTests.testSnapshotInProgress fails with RepositoryException