Snapshot failover/retry on failed shard if a good copy is available #15940

ppf2 · 2016-01-12T21:19:20Z

The following scenario was reported from the field.

Periodically, a snapshot ends up partial because it fails against a specific shard.

[2015-11-10 07:20:37,413][WARN ][snapshots ] [node_name] [[index_name][1]] [snapshot:20151110t071646z] failed to create snapshot 
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: [index_name][1] Failed to snapshot 
at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.snapshot(IndexShardSnapshotAndRestoreService.java:100) 
at org.elasticsearch.snapshots.SnapshotsService$5.run(SnapshotsService.java:871) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745) 
Caused by: org.elasticsearch.index.engine.FlushFailedEngineException: [index_name][1] Flush failed 
at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:715) 
at org.elasticsearch.index.engine.InternalEngine.snapshotIndex(InternalEngine.java:846) 
at org.elasticsearch.index.shard.IndexShard.snapshotIndex(IndexShard.java:772) 
at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.snapshot(IndexShardSnapshotAndRestoreService.java:83) 
... 4 more 
Caused by: org.apache.lucene.index.CorruptIndexException: [index_name][1] Preexisting corrupted index [corrupted_JwkJ91qoSs2cbwcrhNb0iA] caused by: CorruptIndexException[verification failed : calculated=14wqrat stored=n88qsu] 
org.apache.lucene.index.CorruptIndexException: verification failed : calculated=14wqrat stored=n88qsu 
at org.elasticsearch.index.store.Store$VerifyingIndexInput.verify(Store.java:1507) 
at org.elasticsearch.index.store.Store.verify(Store.java:505) 
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$SnapshotContext.snapshotFile(BlobStoreIndexShardRepository.java:568) 
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$SnapshotContext.snapshot(BlobStoreIndexShardRepository.java:507) 
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.snapshot(BlobStoreIndexShardRepository.java:140) 
at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.snapshot(IndexShardSnapshotAndRestoreService.java:85) 
at org.elasticsearch.snapshots.SnapshotsService$5.run(SnapshotsService.java:871) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
at java.lang.Thread.run(Thread.java:745)

at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:602) 
at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:583) 
at org.elasticsearch.index.store.Store.readLastCommittedSegmentsInfo(Store.java:150) 
at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:709) 
... 7 more

This tends to happen when a large segment is corrupted. Because checking for corruption requires reading the entire segment, large segments are currently only checked when one of the following operations runs: snapshot, merge, relocation, or peer recovery (when we copy segments over from a primary shard to a replica shard). So this can happen while the cluster is green: the snapshot detects that a segment is bad, and the recovery process then kicks in to try to recover the shard from a replica.
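To illustrate why this check is expensive for large segments, here is a minimal Python sketch of whole-file checksum verification. It is not Elasticsearch's implementation: the names `crc32_of_stream` and `verify` are hypothetical, and CRC32 is used only for illustration, since the stored checksum format is not described in this issue. The point is that every byte has to be read before the calculated value can be compared with the stored one.

```python
import io
import zlib


def crc32_of_stream(stream, chunk_size=1 << 20):
    """Return the CRC32 of everything in `stream`, read in fixed-size chunks.

    Hypothetical helper: the cost is proportional to the segment size,
    because no byte can be skipped.
    """
    crc = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def verify(stream, stored_crc):
    """Raise ValueError if the calculated checksum differs from the stored one."""
    calculated = crc32_of_stream(stream)
    if calculated != stored_crc:
        raise ValueError(
            f"verification failed : calculated={calculated} stored={stored_crc}"
        )


if __name__ == "__main__":
    contents = b"example segment contents"
    stored = crc32_of_stream(io.BytesIO(contents))
    verify(io.BytesIO(contents), stored)  # passes silently: the bytes match
```

A single flipped bit anywhere in the stream makes `verify` raise, but only after the full read, which is why a check like this is deferred to operations that already read the whole segment.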

If a good copy is available, the snapshot of that shard will succeed on the next scheduled snapshot run. However, for the snapshot operation that already failed, there is currently no failover/retry mechanism to retry the snapshot once recovery succeeds, or to snapshot from another copy of the shard instead.
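The requested behaviour could look roughly like the Python sketch below. This is only an illustration of the idea, not a proposal for the actual implementation: `ShardSnapshotError`, `snapshot_with_failover`, and the `snapshot_copy` callback are all hypothetical names, and the sketch ignores the coordination with recovery that makes the real problem complex.

```python
class ShardSnapshotError(Exception):
    """Hypothetical error raised when snapshotting one shard copy fails."""


def snapshot_with_failover(copies, snapshot_copy):
    """Try each shard copy in order and return the first successful result.

    `copies` is an ordered list of candidate copies (for example the primary
    first, then replicas). `snapshot_copy` is a caller-supplied function that
    snapshots a single copy and raises ShardSnapshotError on failure.
    """
    failures = []
    for copy in copies:
        try:
            return snapshot_copy(copy)
        except ShardSnapshotError as exc:
            # Remember the failure and fall through to the next copy.
            failures.append((copy, str(exc)))
    raise ShardSnapshotError(f"all {len(copies)} copies failed: {failures}")


if __name__ == "__main__":
    def fake_snapshot(copy):
        # Pretend the copy on node_a holds the corrupted segment.
        if copy == "node_a":
            raise ShardSnapshotError("corrupted segment")
        return f"snapshot-of-{copy}"

    print(snapshot_with_failover(["node_a", "node_b"], fake_snapshot))
```

With this shape, the failed shard is retried within the same snapshot operation instead of staying partial until the next scheduled run.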

Discussed with @imotov: a solution to this will be complex, and we can revisit it once the task management API has been implemented, so that we can keep track of long-running jobs.

tlrx · 2018-03-22T15:01:18Z

This feature request is interesting but since its opening we have not seen enough feedback that it is a feature we should pursue. We prefer to close this issue as a clear indication that we are not going to work on this at this time. We are always open to reconsidering this in the future based on compelling feedback; despite this issue being closed please feel free to leave feedback on the proposal (including +1s).

ppf2 added the >enhancement, discuss, stalled, and :Distributed Coordination/Snapshot/Restore labels on Jan 12, 2016

tlrx closed this as completed Mar 22, 2018
