Snapshot failover/retry on failed shard if a good copy is available #15940
Labels
discuss
:Distributed Coordination/Snapshot/Restore
Anything directly related to the `_snapshot/*` APIs
>enhancement
stalled
The following scenario was reported from the field.
Periodically, a snapshot fails (partial) on a specific shard.
This tends to happen when a large segment is corrupted. Checking for corruption requires reading the entire segment, so for large segments we currently check only when one of the following operations is performed: snapshot, merge, relocation, or peer recovery (when we copy segments over from a primary shard to a replica shard). As a result, this can happen while the cluster is green: the snapshot detects that a segment is bad, and the recovery process then kicks in to try to recover the shard, for example from a replica.
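To illustrate why this check is expensive, a whole-file checksum can only be verified by reading every byte. Below is a minimal Python sketch, assuming a plain CRC32 over the file contents. The function name and chunk size are made up for illustration, and this is not the actual on-disk segment format.

```python
import zlib


def file_crc32(path, chunk_size=1 << 20):
    """Return the CRC32 of the whole file, streaming it in chunks."""
    crc = 0
    with open(path, "rb") as f:
        # Every byte has to pass through the checksum, so the cost
        # grows linearly with the size of the file.
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF
```

Comparing the returned value with a checksum stored at write time would reveal corruption, but only after the full read.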
If a good copy is available, the snapshot of that shard will succeed on the next scheduled snapshot run. However, for the snapshot operation that was already issued, there is currently no failover/retry mechanism to retry the shard once recovery succeeds, or to snapshot from another copy of the shard instead.
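Until something like this exists server-side, a caller could approximate the retry on the client. The sketch below is only an illustration of that idea, not a proposed implementation. It assumes the snapshot info is a dict with a `failures` list whose entries carry an `index` field, and that `take_snapshot` and `wait_for_green` are hypothetical hooks the caller wires up to the `_snapshot/*` and cluster health APIs.

```python
def failed_indices(snapshot_info):
    """Return the sorted names of indices with at least one failed shard."""
    return sorted({f["index"] for f in snapshot_info.get("failures", [])})


def snapshot_with_retry(take_snapshot, wait_for_green, max_attempts=3):
    """Take a snapshot, then re-snapshot only the failed indices after recovery.

    take_snapshot(name, indices) -> snapshot info dict (indices=None means all).
    wait_for_green() -> blocks until shard recovery has finished.
    Both callables are hypothetical hooks supplied by the caller.
    """
    indices = None  # first attempt covers everything
    results = []
    for attempt in range(1, max_attempts + 1):
        info = take_snapshot(f"attempt-{attempt}", indices)
        results.append(info)
        indices = failed_indices(info)
        if not indices:
            break  # no shard failures, nothing left to retry
        if attempt < max_attempts:
            wait_for_green()  # give recovery a chance before retrying
    return results
```

Note that this produces additional snapshots covering the failed indices rather than repairing the original partial snapshot, which is part of why a proper fix belongs on the server side.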
I discussed this with @imotov: a solution will be complex, and we can revisit it once the task management API has been implemented, so that we can keep track of long-running jobs.