
Commit 76806b0

Avoid concurrent snapshot finalization when deleting an initializing snapshot
With the current snapshot/restore logic, a newly created snapshot is added by the SnapshotsService.createSnapshot() method as a SnapshotsInProgress entry in the cluster state, in the INIT state. Once the cluster state update is processed, the beginSnapshot() method is executed using the SNAPSHOT thread pool.

The beginSnapshot() method starts the initialization of the snapshot using the initializeSnapshot() method. This method reads the repository data and then writes the global metadata file and one index metadata file per index to be snapshotted. These operations can take some time to complete (many minutes). At this stage, if the master node is disconnected, the snapshot can be stuck in the INIT state on versions 5.6.4/6.0.0 or lower (pull request elastic#27214 fixed this on 5.6.5/6.0.1 and higher).

If the snapshot is not stuck but the initialization takes some time and the user decides to abort the snapshot, a delete snapshot request can sneak in. The deletion updates the cluster state to check the state of the SnapshotsInProgress entry. When the snapshot is in INIT, it executes the endSnapshot() method (which returns immediately) and then the snapshot's state is updated to ABORTED in the cluster state. The deletion then listens for the snapshot completion in order to continue with the deletion of the snapshot. But before returning, the endSnapshot() method adds a new Runnable to the SNAPSHOT thread pool that forces the finalization of the initializing snapshot. This finalization writes the snapshot metadata file and updates the index-N file in the repository.

At this stage two things can potentially be executed concurrently: the initialization of the snapshot and its finalization. When initializeSnapshot() terminates, the cluster state is updated to start the snapshot and move it to the STARTED state (this is before elastic#27931, which prevents an ABORTED snapshot from being started at all). The snapshot is started and shards start to be snapshotted, but they quickly fail because the snapshot was ABORTED by the deletion. All shards are reported as FAILED to the master node, which executes endSnapshot() too (using SnapshotStateExecutor).

Then many things can happen, depending on the execution of tasks by the SNAPSHOT thread pool and the time taken by each read/write/delete operation of the repository implementation. This is especially true on S3, where operations can take time (disconnections, retries, timeouts) and where the data consistency model allows reading stale data or requires some time for objects to be replicated. Here are some scenarios seen in cluster logs:

a) The snapshot is finalized by the snapshot deletion. The snapshot metadata file now exists in the repository, so the later finalization by the snapshot creation fails with a "fail to finalize snapshot" message in the logs. The deletion process continues.

b) The snapshot is finalized by the snapshot creation. The snapshot metadata file now exists in the repository, so the later finalization by the snapshot deletion fails with a "fail to finalize snapshot" message in the logs. The deletion process continues.

c) Both finalizations are executed concurrently, and things can fail at different read or write operations. Shard failures can be lost, as can the final snapshot state, depending on which SnapshotsInProgress.Entry is used to finalize the snapshot.

d) The snapshot is finalized by the snapshot deletion, and the snapshot in progress is removed from the cluster state, triggering the execution of the completion listeners.
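To make the race more concrete, here is a small, self-contained Java sketch of the pattern (hypothetical names, with a plain ExecutorService standing in for the SNAPSHOT thread pool; this only illustrates scenarios a) and b) and is not Elasticsearch code):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class ConcurrentFinalizationSketch {
    public static void main(String[] args) throws InterruptedException {
        // Stands in for the SNAPSHOT thread pool.
        ExecutorService snapshotPool = Executors.newFixedThreadPool(2);
        // Stands in for the snapshot metadata blob in the repository.
        AtomicBoolean metadataWritten = new AtomicBoolean(false);

        // Finalization submitted by the snapshot creation path.
        Runnable finalizeFromCreation = () -> {
            if (metadataWritten.compareAndSet(false, true)) {
                System.out.println("creation path: finalized snapshot");
            } else {
                System.out.println("creation path: fail to finalize snapshot (metadata already written)");
            }
        };
        // Finalization submitted by the snapshot deletion path.
        Runnable finalizeFromDeletion = () -> {
            if (metadataWritten.compareAndSet(false, true)) {
                System.out.println("deletion path: finalized snapshot");
            } else {
                System.out.println("deletion path: fail to finalize snapshot (metadata already written)");
            }
        };

        // Whichever task runs second loses the race, which is what the
        // "fail to finalize snapshot" log messages correspond to.
        snapshotPool.execute(finalizeFromCreation);
        snapshotPool.execute(finalizeFromDeletion);
        snapshotPool.shutdown();
        snapshotPool.awaitTermination(10, TimeUnit.SECONDS);
    }
}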
Once the completion listeners have been notified, the deletion process continues and deleteSnapshotFromRepository() is executed using the SNAPSHOT thread pool. This method reads the repository data, the snapshot metadata, and the index metadata for all indices included in the snapshot before updating the index-N file in the repository. It can also take some time, and I think these operations could potentially be executed concurrently with the finalization of the snapshot by the snapshot creation, leading to corrupted data.

This commit does not solve all the issues reported here, but it removes the finalization of the snapshot by the snapshot deletion. This way, the deletion marks the snapshot as ABORTED in the cluster state and waits for the snapshot completion. It is the responsibility of the snapshot execution to detect the abortion and terminate itself correctly. This avoids concurrent snapshot finalizations and also orders the operations: the deletion aborts the snapshot and waits for the snapshot completion; the creation detects the abortion, stops by itself, and finalizes the snapshot; then the deletion resumes and continues the deletion process.
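As a rough illustration of that ordering, here is a hypothetical Java sketch with made-up class and method names (the real logic lives in SnapshotsService and is driven by cluster state updates and listeners, not a latch):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicReference;

public class AbortThenDeleteSketch {
    enum State { INIT, STARTED, ABORTED, FAILED }

    private final AtomicReference<State> state = new AtomicReference<>(State.INIT);
    private final CountDownLatch completion = new CountDownLatch(1);

    // Deletion path: abort the snapshot and wait; it never finalizes the snapshot itself.
    void deleteSnapshot() throws InterruptedException {
        state.set(State.ABORTED);         // mark the in-progress snapshot as ABORTED
        completion.await();               // wait for the creation path to complete the snapshot
        deleteSnapshotFromRepository();   // only then start deleting repository files
    }

    // Creation path: detect the abortion, finalize exactly once, then notify listeners.
    void beginSnapshot() {
        initializeSnapshot();             // slow: reads repository data, writes metadata files
        if (state.get() == State.ABORTED) {
            finalizeSnapshotAsFailed();   // the single finalization, done by the creation path
            completion.countDown();       // the waiting deletion can now resume
        }
    }

    private void initializeSnapshot() { /* omitted */ }
    private void finalizeSnapshotAsFailed() { /* omitted */ }
    private void deleteSnapshotFromRepository() { /* omitted */ }
}

The key point is that only the creation path writes the final snapshot files; the deletion path merely flips the state and blocks until it is notified.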
1 parent fd45a46 commit 76806b0

File tree

2 files changed: +37, -22 lines


core/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java

Lines changed: 34 additions & 19 deletions
@@ -381,8 +381,13 @@ public ClusterState execute(ClusterState currentState) {
     SnapshotsInProgress snapshots = currentState.custom(SnapshotsInProgress.TYPE);
     List<SnapshotsInProgress.Entry> entries = new ArrayList<>();
     for (SnapshotsInProgress.Entry entry : snapshots.entries()) {
-        if (entry.snapshot().equals(snapshot.snapshot()) && entry.state() != State.ABORTED) {
-            // Replace the snapshot that was just created
+        if (entry.snapshot().equals(snapshot.snapshot()) == false) {
+            entries.add(entry);
+            continue;
+        }
+
+        if (entry.state() != State.ABORTED) {
+            // Replace the snapshot that was just intialized
             ImmutableOpenMap<ShardId, SnapshotsInProgress.ShardSnapshotStatus> shards = shards(currentState, entry.indices());
             if (!partial) {
                 Tuple<Set<String>, Set<String>> indicesWithMissingShards = indicesWithMissingShards(shards, currentState.metaData());
@@ -409,11 +414,13 @@ public ClusterState execute(ClusterState currentState) {
                 }
                 updatedSnapshot = new SnapshotsInProgress.Entry(entry, State.STARTED, shards);
                 entries.add(updatedSnapshot);
-                if (!completed(shards.values())) {
+                if (completed(shards.values()) == false) {
                     accepted = true;
                 }
             } else {
-                entries.add(entry);
+                failure = "snapshot state changed to " + entry.state() + " during initialization";
+                updatedSnapshot = entry;
+                entries.add(updatedSnapshot);
             }
         }
         return ClusterState.builder(currentState)
@@ -750,6 +757,11 @@ public ClusterState execute(ClusterState currentState) throws Exception {
                 }
                 entries.add(updatedSnapshot);
             } else if (snapshot.state() == State.INIT && newMaster) {
+                changed = true;
+                // Mark the snapshot as aborted as it failed to start from the previous master
+                updatedSnapshot = new SnapshotsInProgress.Entry(snapshot, State.ABORTED, snapshot.shards());
+                entries.add(updatedSnapshot);
+
                 // Clean up the snapshot that failed to start from the old master
                 deleteSnapshot(snapshot.snapshot(), new DeleteSnapshotListener() {
                     @Override
@@ -935,7 +947,7 @@ private Tuple<Set<String>, Set<String>> indicesWithMissingShards(ImmutableOpenMa
     *
     * @param entry snapshot
     */
-    void endSnapshot(SnapshotsInProgress.Entry entry) {
+    void endSnapshot(final SnapshotsInProgress.Entry entry) {
        endSnapshot(entry, null);
    }

@@ -1144,24 +1156,26 @@ public ClusterState execute(ClusterState currentState) throws Exception {
            } else {
                // This snapshot is currently running - stopping shards first
                waitForSnapshot = true;
-               ImmutableOpenMap<ShardId, ShardSnapshotStatus> shards;
-               if (snapshotEntry.state() == State.STARTED && snapshotEntry.shards() != null) {
-                   // snapshot is currently running - stop started shards
-                   ImmutableOpenMap.Builder<ShardId, ShardSnapshotStatus> shardsBuilder = ImmutableOpenMap.builder();
+
+               final ImmutableOpenMap<ShardId, ShardSnapshotStatus> shards;
+
+               final State state = snapshotEntry.state();
+               if (state == State.INIT) {
+                   // snapshot is still initializing, mark it as aborted
+                   shards = snapshotEntry.shards();
+
+               } else if (snapshotEntry.state() == State.STARTED) {
+                   // snapshot is started - mark every non completed shard as aborted
+                   final ImmutableOpenMap.Builder<ShardId, ShardSnapshotStatus> shardsBuilder = ImmutableOpenMap.builder();
                    for (ObjectObjectCursor<ShardId, ShardSnapshotStatus> shardEntry : snapshotEntry.shards()) {
                        ShardSnapshotStatus status = shardEntry.value;
-                       if (!status.state().completed()) {
-                           shardsBuilder.put(shardEntry.key, new ShardSnapshotStatus(status.nodeId(), State.ABORTED,
-                               "aborted by snapshot deletion"));
-                       } else {
-                           shardsBuilder.put(shardEntry.key, status);
+                       if (status.state().completed() == false) {
+                           status = new ShardSnapshotStatus(status.nodeId(), State.ABORTED, "aborted by snapshot deletion");
                        }
+                       shardsBuilder.put(shardEntry.key, status);
                    }
                    shards = shardsBuilder.build();
-               } else if (snapshotEntry.state() == State.INIT) {
-                   // snapshot hasn't started yet - end it
-                   shards = snapshotEntry.shards();
-                   endSnapshot(snapshotEntry);
+
                } else {
                    boolean hasUncompletedShards = false;
                    // Cleanup in case a node gone missing and snapshot wasn't updated for some reason
@@ -1178,7 +1192,8 @@ public ClusterState execute(ClusterState currentState) throws Exception {
                    logger.debug("trying to delete completed snapshot - should wait for shards to finalize on all nodes");
                    return currentState;
                } else {
-                   // no shards to wait for - finish the snapshot
+                   // no shards to wait for but a node is gone - this is the only case
+                   // where we force to finish the snapshot
                    logger.debug("trying to delete completed snapshot with no finalizing shards - can delete immediately");
                    shards = snapshotEntry.shards();
                    endSnapshot(snapshotEntry);

core/src/test/java/org/elasticsearch/snapshots/SharedClusterSnapshotRestoreIT.java

Lines changed: 3 additions & 3 deletions
@@ -3151,7 +3151,7 @@ public void testSnapshottingWithMissingSequenceNumbers() {
         assertThat(shardStats.getSeqNoStats().getMaxSeqNo(), equalTo(15L));
     }

-    @AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/27974")
+    @TestLogging("org.elasticsearch.snapshots:TRACE")
     public void testAbortedSnapshotDuringInitDoesNotStart() throws Exception {
         final Client client = client();

@@ -3163,11 +3163,11 @@ public void testAbortedSnapshotDuringInitDoesNotStart() throws Exception {
         ));

         createIndex("test-idx");
-        final int nbDocs = scaledRandomIntBetween(1, 100);
+        final int nbDocs = scaledRandomIntBetween(100, 500);
         for (int i = 0; i < nbDocs; i++) {
             index("test-idx", "_doc", Integer.toString(i), "foo", "bar" + i);
         }
-        refresh();
+        flushAndRefresh("test-idx");
         assertThat(client.prepareSearch("test-idx").setSize(0).get().getHits().getTotalHits(), equalTo((long) nbDocs));

         // Create a snapshot
