Fix Two Races that Lead to Stuck Snapshots #37686

Merged
85 commits
2efffab
SNAPSHOT: Keep SnapshotsInProgress State in Sync with Routing Table (…
original-brownbear Nov 22, 2018
715ae9f
SQL: Implement NVL(expr1, expr2) (#35794)
matriv Nov 22, 2018
3f79476
Forbid negative scores in functon_score query (#35709)
mayya-sharipova Nov 22, 2018
9293189
Revert "Revert "[RCI] Check blocks while having index shard permit in…
tlrx Nov 22, 2018
92390c5
Mute test
albertzaharovits Nov 22, 2018
d4701a4
Mute test InternalEngineTests
albertzaharovits Nov 22, 2018
ca1b3c6
[TEST] escape brackets
martijnvg Nov 22, 2018
121a886
Upgrade to lucene-8.0.0-snapshot-67cdd21996 (#35816)
jimczi Nov 22, 2018
f8a7bf6
Remove unnecessary throws IOException in CompressedXContent.string() …
dimitris-athanasiou Nov 22, 2018
9870c74
[ML] Add docs for ML info endpoint (#35783)
droberts195 Nov 22, 2018
b9cba85
[Tests] Fix creating ExplainLifecycleRequest with no indices (#35828)
Nov 23, 2018
51351c5
[Docs] Correct template example description #35829
scampi Nov 23, 2018
c17fa7f
Fixed response classes in hlrc docs
martijnvg Nov 23, 2018
d0b5006
[HLRC][ML] Add ML find file structure API (#35833)
droberts195 Nov 23, 2018
1cf9436
Expose all permits acquisition in IndexShard and TransportReplication…
tlrx Nov 23, 2018
9b96fc8
Fix analyzed prefix query in query_string (#35756)
jimczi Nov 23, 2018
d3db6c6
Copy checkpoint atomically when rolling generation (#35407)
DaveCTurner Nov 23, 2018
fb09f20
Merge remote-tracking branch 'elastic/master' into feature/snapshot-r…
original-brownbear Nov 23, 2018
ce4d520
Merge remote-tracking branch 'elastic/master' into feature/snapshot-r…
original-brownbear Nov 23, 2018
e3f4a99
Merge remote-tracking branch 'elastic/master' into feature/snapshot-r…
original-brownbear Nov 24, 2018
6734384
Merge remote-tracking branch 'elastic/master' into feature/snapshot-r…
original-brownbear Nov 24, 2018
fab6896
Merge remote-tracking branch 'elastic/master' into feature/snapshot-r…
original-brownbear Nov 25, 2018
e72070e
Merge remote-tracking branch 'elastic/master' into feature/snapshot-r…
original-brownbear Nov 28, 2018
2fff632
Merge remote-tracking branch 'elastic/master' into feature/snapshot-r…
original-brownbear Nov 29, 2018
a7d9523
Merge remote-tracking branch 'elastic/master' into feature/snapshot-r…
original-brownbear Dec 6, 2018
87454cb
start
original-brownbear Jan 19, 2019
39337e0
bck
original-brownbear Jan 19, 2019
3b19373
works but gets stuck on recovery
original-brownbear Jan 19, 2019
e5d73b3
reproducer
original-brownbear Jan 20, 2019
263c525
Merge remote-tracking branch 'elastic/master' into snapshot-interrupt…
original-brownbear Jan 20, 2019
d1f36d6
Merge remote-tracking branch 'elastic/feature/snapshot-resilience' in…
original-brownbear Jan 20, 2019
ecdd36c
fixed checkstyle
original-brownbear Jan 20, 2019
9e389bd
Revert "Merge remote-tracking branch 'elastic/feature/snapshot-resili…
original-brownbear Jan 20, 2019
4b618a6
bck
original-brownbear Jan 20, 2019
683c985
nicer
original-brownbear Jan 20, 2019
a520aa1
proper disconnects
original-brownbear Jan 20, 2019
6237813
Revert "Revert "Merge remote-tracking branch 'elastic/feature/snapsho…
original-brownbear Jan 20, 2019
1c0a3b9
Revert "Revert "Revert "Merge remote-tracking branch 'elastic/feature…
original-brownbear Jan 20, 2019
d2232ce
nicer
original-brownbear Jan 20, 2019
e493b41
still passes
original-brownbear Jan 20, 2019
3a9a25a
nicer
original-brownbear Jan 20, 2019
ffd86bd
Merge remote-tracking branch 'elastic/master' into snapshot-interrupt…
original-brownbear Jan 21, 2019
801376f
relocation it
original-brownbear Jan 21, 2019
f969046
bck
original-brownbear Jan 21, 2019
070cb0e
Merge remote-tracking branch 'elastic/master' into snapshot-interrupt…
original-brownbear Jan 21, 2019
cea40ea
reproduced
original-brownbear Jan 21, 2019
86b1f32
fix 1
original-brownbear Jan 21, 2019
a8b20eb
Revert "Revert "Revert "Revert "Merge remote-tracking branch 'elastic…
original-brownbear Jan 21, 2019
c3836f3
bck
original-brownbear Jan 21, 2019
a3bc9fc
Revert "Revert "Revert "Revert "Revert "Merge remote-tracking branch …
original-brownbear Jan 21, 2019
b0d4c99
bck
original-brownbear Jan 21, 2019
db2fd56
tests pass
original-brownbear Jan 21, 2019
cc902e4
nicer
original-brownbear Jan 21, 2019
9ee5ddd
Merge remote-tracking branch 'elastic/master' into snapshot-interrupt…
original-brownbear Jan 21, 2019
54ad3df
nicer
original-brownbear Jan 21, 2019
592d7ca
nicer
original-brownbear Jan 21, 2019
7173c80
another reproducer
original-brownbear Jan 21, 2019
09c7554
another reproducer
original-brownbear Jan 21, 2019
3bd2904
Revert "Revert "Revert "Revert "Revert "Revert "Merge remote-tracking…
original-brownbear Jan 22, 2019
438c40a
Revert "Revert "Revert "Revert "Revert "Revert "Revert "Merge remote-…
original-brownbear Jan 22, 2019
578473c
Merge remote-tracking branch 'elastic/master' into snapshot-interrupt…
original-brownbear Jan 22, 2019
72a9619
nicer
original-brownbear Jan 22, 2019
776c56d
nicer
original-brownbear Jan 22, 2019
013aea1
fix hitting dead node
original-brownbear Jan 22, 2019
0a3ed43
fix hitting dead node
original-brownbear Jan 22, 2019
ff7cfdf
nicer
original-brownbear Jan 22, 2019
634a715
nicer
original-brownbear Jan 22, 2019
d813ac5
Merge remote-tracking branch 'elastic/master' into snapshot-interrupt…
original-brownbear Jan 22, 2019
03e1984
nicer
original-brownbear Jan 22, 2019
a46abfc
Merge remote-tracking branch 'elastic/master' into snapshot-interrupt…
original-brownbear Jan 22, 2019
bc450e8
CR: no shards when init
original-brownbear Jan 22, 2019
803003f
nicer writable register getter
original-brownbear Jan 22, 2019
808c99a
CR: cache failed notifications
original-brownbear Jan 22, 2019
b73527a
Merge remote-tracking branch 'elastic/master' into snapshot-interrupt…
original-brownbear Jan 23, 2019
1366efc
CR: return false from terminate
original-brownbear Jan 23, 2019
326cd56
CR: lower timeout
original-brownbear Jan 23, 2019
23301a7
Merge remote-tracking branch 'elastic/master' into snapshot-interrupt…
original-brownbear Jan 24, 2019
e70fe23
Merge remote-tracking branch 'elastic/master' into snapshot-interrupt…
original-brownbear Jan 28, 2019
8439256
Merge remote-tracking branch 'elastic/master' into snapshot-interrupt…
original-brownbear Jan 28, 2019
3c566c6
Merge remote-tracking branch 'elastic/master' into snapshot-interrupt…
original-brownbear Jan 29, 2019
1b8e8f9
Merge remote-tracking branch 'elastic/master' into snapshot-interrupt…
original-brownbear Jan 30, 2019
af35d2f
nicer noop ActionListener
original-brownbear Jan 30, 2019
8b25eac
remove noisy change
original-brownbear Jan 30, 2019
69f8f7d
CR: renaming + remove noop listener
original-brownbear Jan 31, 2019
708bcf6
Merge remote-tracking branch 'elastic/master' into snapshot-interrupt…
original-brownbear Jan 31, 2019
@@ -67,6 +67,10 @@
 import org.elasticsearch.repositories.IndexId;
 import org.elasticsearch.repositories.Repository;
 import org.elasticsearch.threadpool.ThreadPool;
+import org.elasticsearch.transport.EmptyTransportResponseHandler;
+import org.elasticsearch.transport.TransportException;
+import org.elasticsearch.transport.TransportRequestDeduplicator;
+import org.elasticsearch.transport.TransportResponse;
 import org.elasticsearch.transport.TransportService;

 import java.io.IOException;
@@ -85,7 +89,6 @@
 import static java.util.Collections.emptyMap;
 import static java.util.Collections.unmodifiableMap;
 import static org.elasticsearch.cluster.SnapshotsInProgress.completed;
-import static org.elasticsearch.transport.EmptyTransportResponseHandler.INSTANCE_SAME;

 /**
  * This service runs on data and master nodes and controls currently snapshotted shards on these nodes. It is responsible for
@@ -112,6 +115,10 @@ public class SnapshotShardsService extends AbstractLifecycleComponent implements

     private volatile Map<Snapshot, Map<ShardId, IndexShardSnapshotStatus>> shardSnapshots = emptyMap();

+    // A map of snapshots to the shardIds that we already reported to the master as failed
+    private final TransportRequestDeduplicator<UpdateIndexShardSnapshotStatusRequest> remoteFailedRequestDeduplicator =
+        new TransportRequestDeduplicator<>();
+
     private final SnapshotStateExecutor snapshotStateExecutor = new SnapshotStateExecutor();
     private final UpdateSnapshotStatusAction updateSnapshotStatusHandler;

@@ -272,12 +279,11 @@ private void processIndexShardSnapshots(ClusterChangedEvent event) {
             // Abort all running shards for this snapshot
             Map<ShardId, IndexShardSnapshotStatus> snapshotShards = shardSnapshots.get(entry.snapshot());
             if (snapshotShards != null) {
-                final String failure = "snapshot has been aborted";
                 for (ObjectObjectCursor<ShardId, ShardSnapshotStatus> shard : entry.shards()) {
-
                     final IndexShardSnapshotStatus snapshotStatus = snapshotShards.get(shard.key);
                     if (snapshotStatus != null) {
-                        final IndexShardSnapshotStatus.Copy lastSnapshotStatus = snapshotStatus.abortIfNotCompleted(failure);
+                        final IndexShardSnapshotStatus.Copy lastSnapshotStatus =
+                            snapshotStatus.abortIfNotCompleted("snapshot has been aborted");
                         final Stage stage = lastSnapshotStatus.getStage();
                         if (stage == Stage.FINALIZE) {
                             logger.debug("[{}] trying to cancel snapshot on shard [{}] that is finalizing, " +
@@ -295,6 +301,15 @@ private void processIndexShardSnapshots(ClusterChangedEvent event) {
                         }
                     }
                 }
+            } else {
+                final Snapshot snapshot = entry.snapshot();
+                for (ObjectObjectCursor<ShardId, ShardSnapshotStatus> curr : entry.shards()) {
+                    // due to CS batching we might have missed the INIT state and straight went into ABORTED
+                    // notify master that abort has completed by moving to FAILED
+                    if (curr.value.state() == State.ABORTED) {
+                        notifyFailedSnapshotShard(snapshot, curr.key, localNodeId, curr.value.reason());
+                    }
+                }
             }
         }
     }
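
The new `else` branch covers a node that first sees a snapshot entry when it is already aborted, so it has no local status map for that snapshot. Below is a minimal sketch of that decision, not the Elasticsearch code: `ShardState`, the string shard ids, and `shardsToReportFailed` are hypothetical stand-ins, and a `null` set models the missing local status map.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AbortedShardSketch {

    /** Hypothetical stand-in for the shard states that matter here. */
    public enum ShardState { INIT, ABORTED, FAILED, SUCCESS }

    /**
     * Returns the shards that should be reported to the master as FAILED for an aborted entry.
     * locallyTracked == null models "no local status map exists", i.e. this node never
     * processed the INIT state of the snapshot.
     */
    public static List<String> shardsToReportFailed(Map<String, ShardState> entryShards,
                                                    Set<String> locallyTracked) {
        List<String> toReport = new ArrayList<>();
        if (locallyTracked != null) {
            // Normal path: running shards are aborted locally and report their own outcome.
            return toReport;
        }
        for (Map.Entry<String, ShardState> shard : entryShards.entrySet()) {
            // Nothing is running locally, so the abort is trivially complete for this shard.
            if (shard.getValue() == ShardState.ABORTED) {
                toReport.add(shard.getKey());
            }
        }
        return toReport;
    }

    public static void main(String[] args) {
        Map<String, ShardState> shards = new LinkedHashMap<>();
        shards.put("index-a[0]", ShardState.ABORTED);
        shards.put("index-a[1]", ShardState.SUCCESS);
        System.out.println(shardsToReportFailed(shards, null)); // prints [index-a[0]]
    }
}
```

Since a check like this would run on every cluster state change, the same failure could be reported repeatedly while the entry stays aborted. That is presumably why the PR also routes status updates through a request deduplicator.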
@@ -515,12 +530,33 @@ void notifyFailedSnapshotShard(final Snapshot snapshot, final ShardId shardId, f

     /** Updates the shard snapshot status by sending a {@link UpdateIndexShardSnapshotStatusRequest} to the master node */
     void sendSnapshotShardUpdate(final Snapshot snapshot, final ShardId shardId, final ShardSnapshotStatus status) {
-        try {
-            UpdateIndexShardSnapshotStatusRequest request = new UpdateIndexShardSnapshotStatusRequest(snapshot, shardId, status);
-            transportService.sendRequest(transportService.getLocalNode(), UPDATE_SNAPSHOT_STATUS_ACTION_NAME, request, INSTANCE_SAME);
-        } catch (Exception e) {
-            logger.warn(() -> new ParameterizedMessage("[{}] [{}] failed to update snapshot state", snapshot, status), e);
-        }
+        remoteFailedRequestDeduplicator.executeOnce(
+            new UpdateIndexShardSnapshotStatusRequest(snapshot, shardId, status),
+            new ActionListener<Void>() {
+                @Override
+                public void onResponse(Void aVoid) {
+                    logger.trace("[{}] [{}] updated snapshot state", snapshot, status);
+                }
+
+                @Override
+                public void onFailure(Exception e) {
+                    logger.warn(
+                        () -> new ParameterizedMessage("[{}] [{}] failed to update snapshot state", snapshot, status), e);
+                }
+            },
+            (req, reqListener) -> transportService.sendRequest(transportService.getLocalNode(), UPDATE_SNAPSHOT_STATUS_ACTION_NAME, req,
+                new EmptyTransportResponseHandler(ThreadPool.Names.SAME) {
+                    @Override
+                    public void handleResponse(TransportResponse.Empty response) {
+                        reqListener.onResponse(null);
+                    }
+
+                    @Override
+                    public void handleException(TransportException exp) {
+                        reqListener.onFailure(exp);
+                    }
+                })
+        );
     }

     /**
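
The diff calls `TransportRequestDeduplicator.executeOnce` but does not show how that class is implemented. The sketch below is one plausible way such a deduplicator could work, not the actual Elasticsearch class. `RequestDeduplicatorSketch`, its `Listener` interface, and the `BiConsumer` sender signature are all hypothetical, and requests are assumed to implement `equals` and `hashCode`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BiConsumer;

public class RequestDeduplicatorSketch {

    /** Hypothetical success/failure callback; not the Elasticsearch ActionListener. */
    public interface Listener {
        void onResponse();
        void onFailure(Exception e);
    }

    /** Sends each distinct in-flight request once and fans the result out to all callers. */
    public static final class Deduplicator<T> {
        private final Map<T, List<Listener>> inFlight = new ConcurrentHashMap<>();

        public void executeOnce(T request, Listener listener, BiConsumer<T, Listener> sender) {
            final boolean[] first = {false};
            // compute() is atomic per key: either we create the waiting list or we join it.
            inFlight.compute(request, (key, waiting) -> {
                List<Listener> list = waiting;
                if (list == null) {
                    list = new ArrayList<>();
                    first[0] = true;
                }
                list.add(listener);
                return list;
            });
            if (first[0]) {
                // Only the first caller sends; later callers wait for the same result.
                sender.accept(request, new Listener() {
                    @Override
                    public void onResponse() {
                        complete(request, null);
                    }

                    @Override
                    public void onFailure(Exception e) {
                        complete(request, e);
                    }
                });
            }
        }

        private void complete(T request, Exception failure) {
            // Removing the entry lets a later identical request be sent again.
            List<Listener> waiting = inFlight.remove(request);
            if (waiting == null) {
                return;
            }
            for (Listener l : waiting) {
                if (failure == null) {
                    l.onResponse();
                } else {
                    l.onFailure(failure);
                }
            }
        }
    }

    public static void main(String[] args) {
        Deduplicator<String> deduplicator = new Deduplicator<>();
        List<Listener> pending = new ArrayList<>();
        AtomicInteger sends = new AtomicInteger();
        AtomicInteger responses = new AtomicInteger();
        Listener counting = new Listener() {
            @Override
            public void onResponse() {
                responses.incrementAndGet();
            }

            @Override
            public void onFailure(Exception e) {
            }
        };
        // The fake sender records the callback instead of talking to a master node.
        BiConsumer<String, Listener> sender = (req, cb) -> {
            sends.incrementAndGet();
            pending.add(cb);
        };
        deduplicator.executeOnce("shard-0 FAILED", counting, sender);
        deduplicator.executeOnce("shard-0 FAILED", counting, sender);
        pending.get(0).onResponse();
        System.out.println("sends=" + sends.get() + " responses=" + responses.get()); // prints sends=1 responses=2
    }
}
```

In this sketch the in-flight map is keyed by the request itself, so two updates for the same snapshot, shard, and status collapse into one send. Once the response arrives the entry is removed and a retry can go out.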
@@ -1210,7 +1210,10 @@ public ClusterState execute(ClusterState currentState) throws Exception {
             if (state == State.INIT) {
                 // snapshot is still initializing, mark it as aborted
                 shards = snapshotEntry.shards();
-
+                assert shards.isEmpty();
+                // No shards in this snapshot, we delete it right away since the SnapshotShardsService
+                // has no work to do.
+                endSnapshot(snapshotEntry);
             } else if (state == State.STARTED) {
                 // snapshot is started - mark every non completed shard as aborted
                 final ImmutableOpenMap.Builder<ShardId, ShardSnapshotStatus> shardsBuilder = ImmutableOpenMap.builder();