Fail restore when the shard allocations max retries count is reached #27493
Conversation
// check if the maximum number of attempts to restore the shard has been reached. If so, we can fail
// the restore and leave the shards unassigned.
IndexMetaData indexMetaData = metaData.getIndexSafe(unassignedShard.getKey().getIndex());
int maxRetry = MaxRetryAllocationDecider.SETTING_ALLOCATION_MAX_RETRY.get(indexMetaData.getSettings());
I think the solution needs to be more generic than depending on the settings of specific allocation deciders. I think we can use unassignedInfo.getLastAllocationStatus for that and check if it is DECIDERS_NO.
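For illustration, a minimal sketch of that check inside the restore-tracking RoutingChangesObserver callback (the helper name and exact wiring are assumptions, not the final code; the actual change appears in the diffs further down):

@Override
public void unassignedInfoUpdated(ShardRouting unassignedShard, UnassignedInfo newUnassignedInfo) {
    RecoverySource recoverySource = unassignedShard.recoverySource();
    if (recoverySource != null && recoverySource.getType() == RecoverySource.Type.SNAPSHOT
            && newUnassignedInfo.getLastAllocationStatus() == UnassignedInfo.AllocationStatus.DECIDERS_NO) {
        // no decider allows this shard to be allocated anymore: record the restore of this shard as failed
        Snapshot snapshot = ((RecoverySource.SnapshotRecoverySource) recoverySource).snapshot();
        markShardRestoreFailed(snapshot, unassignedShard.shardId()); // hypothetical helper
    }
}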
That's a good suggestion, as I suppose a restore can also be stuck because the deciders cannot assign the shard (not enough space on disk, awareness rules forbid allocation, etc.). I also like it to be more generic.
I think I can give it a try by reverting that portion of code and overriding unassignedInfoUpdated()... I'll push something if it works.
LGTM. Thanks a lot for fixing it!
Thanks for your reviews. I think @ywelsch is right, we can be more generic by checking the last allocation status and failing the restore if needed. I updated the code to revert some changes I did and use the last allocation status instead. I also added another test based on shard allocation filtering settings that prevent the shards from being assigned. Please let me know what you think :)
I prefer this approach over the previous one. I've left some more suggestions here and there. It will be interesting to see how this plays together with #27086, where we plan on adding a delay between retries of failed allocations. If we choose to retry infinitely (with exponential backoff), then this would be disastrous for the restore here (as it would never abort).
if (newUnassignedInfo.getLastAllocationStatus() == UnassignedInfo.AllocationStatus.DECIDERS_NO) {
    Snapshot snapshot = ((SnapshotRecoverySource) recoverySource).snapshot();
    String reason = "shard was denied allocation by all allocation deciders";
    changes(snapshot).unassignedShards.put(unassignedShard.shardId(),
I think instead of adding more types here (unassignedShards), it's better to do the reverse and fold failedShards, startedShards and unassignedShards into just "updates".
It's not worth separating them just to have this one assertion I've put there.
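A rough sketch of that folding (field and method names here are illustrative assumptions, not the actual RestoreService code):

// keep a single map of per-shard status updates per snapshot instead of
// separate failedShards, startedShards and unassignedShards collections
private final Map<Snapshot, Map<ShardId, RestoreInProgress.ShardRestoreStatus>> shardChanges = new HashMap<>();

private Map<ShardId, RestoreInProgress.ShardRestoreStatus> changes(final Snapshot snapshot) {
    return shardChanges.computeIfAbsent(snapshot, s -> new HashMap<>());
}

// callers then record any kind of update the same way, e.g.:
// changes(snapshot).put(shardId, new ShardRestoreStatus(nodeId, RestoreInProgress.State.FAILURE, reason));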
Pff I should have seen that... I agree, that would be better, thanks :)
if (recoverySource.getType() == RecoverySource.Type.SNAPSHOT) {
    if (newUnassignedInfo.getLastAllocationStatus() == UnassignedInfo.AllocationStatus.DECIDERS_NO) {
        Snapshot snapshot = ((SnapshotRecoverySource) recoverySource).snapshot();
        String reason = "shard was denied allocation by all allocation deciders";
I would just put "shard could not be allocated on any of the nodes"
    .setSettings(Settings.builder().put("location", repositoryLocation)));

// create a test index
final String indexName = randomAlphaOfLength(10).toLowerCase(Locale.ROOT);
just choose a fixed index name, no need for randomization here :)
// attempt to restore the snapshot
RestoreSnapshotResponse restoreResponse = client().admin().cluster().prepareRestoreSnapshot("test-repo", "test-snap")
    .setMasterNodeTimeout("30s")
isn't this the default?
It is, thanks
assertBusy(() -> {
    Client client = client();
    for (int shardId = 0; shardId < numShards.numPrimaries; shardId++) {
        ClusterAllocationExplainResponse allocationResponse = client.admin().cluster().prepareAllocationExplain()
no need for allocation explain API. you can check all this directly on the cluster state that you get below. I also think that assertBusy won't be needed then.
Right, we can use the cluster state in this case and it's even easier. But it seems that the assertBusy() is still needed, as the cluster state change can take a few milliseconds to propagate.
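A minimal sketch of that cluster-state based check, assuming the indexName and numShards variables used earlier in the test (not the exact test code):

assertBusy(() -> {
    ClusterState state = client().admin().cluster().prepareState().get().getState();
    IndexRoutingTable indexRoutingTable = state.getRoutingTable().index(indexName);
    for (int shardId = 0; shardId < numShards.numPrimaries; shardId++) {
        ShardRouting primary = indexRoutingTable.shard(shardId).primaryShard();
        // the primary should still be unassigned, rejected by the allocation deciders
        assertThat(primary.state(), equalTo(ShardRoutingState.UNASSIGNED));
        assertThat(primary.unassignedInfo().getLastAllocationStatus(),
            equalTo(UnassignedInfo.AllocationStatus.DECIDERS_NO));
    }
});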
// Option 2: delete the index and restore again
assertAcked(client().admin().indices().prepareDelete(indexName));
restoreResponse = client().admin().cluster().prepareRestoreSnapshot("test-repo", "test-snap")
    .setMasterNodeTimeout("30s")
idem
}

// Wait for the shards to be assigned
waitForRelocation();
just ensureGreen()
 * Test that restoring an index with shard allocation filtering settings that prevent
 * its allocation does not hang indefinitely.
 */
public void testUnrestorableIndexDuringRestore() throws Exception {
I think it's possible to share most of the code with the previous test by calling a generic method with two parameters (see the sketch below):
- restoreIndexSettings (which would set maxRetries in the first case and filters in the second case)
- fixupAction (the action to run to fix the issue)
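A hedged sketch of what such a shared method could look like (the method name, repository/snapshot names and overall flow are illustrative assumptions, not the actual test code):

private void assertUnrestorableUseCase(final Settings restoreIndexSettings,
                                       final Runnable fixupAction) throws Exception {
    final String indexName = "test-idx";
    // 1. create the index, index a few documents and snapshot it into "test-repo"/"test-snap"
    // 2. delete the index, then restore it with index settings that prevent the shards from being allocated
    RestoreSnapshotResponse restoreResponse = client().admin().cluster()
        .prepareRestoreSnapshot("test-repo", "test-snap")
        .setIndexSettings(restoreIndexSettings)
        .setWaitForCompletion(false)
        .get();
    assertThat(restoreResponse.status(), equalTo(RestStatus.ACCEPTED));
    // 3. assert that the restore ends up failed instead of hanging around in the cluster state
    // 4. run the action that fixes the allocation problem (remove the filtering settings, delete the index, ...)
    fixupAction.run();
    // 5. restore or wait again and verify the index eventually becomes green
    ensureGreen(indexName);
}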
I tried to do something like that, please let me know what you think.
Thanks for the reviews! I updated the code, can you please have another look?
Right, and this is what this pull request tries to avoid. In case of infinite retries we could maybe add a timeout/maximum execution time at the restore request level and fail the restore if this timeout is reached.
Force-pushed from 42bb551 to b32842d
@ywelsch I added the allocation decider we talked about and I had to rebase and adapt an existing test. The CI is still failing but I'm not sure it is related to my change; I'm still digging. Anyway, I'd be happy if you could have a look, thanks!
Change looks great. I've left some minor comments
if (recoverySource.getType() == RecoverySource.Type.SNAPSHOT) {
    if (newUnassignedInfo.getLastAllocationStatus() == UnassignedInfo.AllocationStatus.DECIDERS_NO) {
        Snapshot snapshot = ((SnapshotRecoverySource) recoverySource).snapshot();
        String reason = "shard could not be allocated on any of the nodes";
allocated to
final RecoverySource recoverySource = shardRouting.recoverySource();
if (recoverySource == null || recoverySource.getType() != RecoverySource.Type.SNAPSHOT) {
    return allocation.decision(Decision.YES, NAME, "shard is not being restored");
I would use the following message: "ignored as shard is not being recovered from a snapshot", and not have an explicit check for shardRouting.primary() == false. That case is automatically handled here too, as replica shards are never recovered from snapshot (their recovery source is always PEER).
Right, thanks
    }
}
return allocation.decision(Decision.NO, NAME, "shard has failed to be restored from the snapshot [%s] because of [%s] - " +
    "manually delete the index [%s] in order to retry to restore the snapshot again or use the Reroute API to force the " +
"close or delete the index"
I would also lowercase the "reroute API"
if (restoresInProgress != null) {
    for (RestoreInProgress.Entry restoreInProgress : restoresInProgress.entries()) {
        if (restoreInProgress.snapshot().equals(snapshot)) {
            assert restoreInProgress.state().completed() == false : "completed restore should have been removed from cluster state";
this assertion is not correct I think.
If a restore for a shard fails 5 times, it's marked as completed only in one of the next cluster state updates (see cleanupRestoreState)
The assertion asserts that the restore in progress for the current allocation is not completed, so I think it's good? It will be marked later as you noticed.
Note: we talked about this and Yannick is right, this assertion can be problematic on busier clusters if a reroute kicks in between the moment the restore completes and the moment the restore is removed from the cluster state by the CleanRestoreStateTaskExecutor.
if (restoreInProgress.snapshot().equals(snapshot)) {
    assert restoreInProgress.state().completed() == false : "completed restore should have been removed from cluster state";
    RestoreInProgress.ShardRestoreStatus shardRestoreStatus = restoreInProgress.shards().get(shardRouting.shardId());
    if (shardRestoreStatus.state() != RestoreInProgress.State.FAILURE) {
just wondering if it's possible for shardRestoreStatus to be null. I think it can be if you restore from a snapshot, then the restore fails, and you retry another restore with a different subset of indices from that same snapshot.
I would write this check as if (shardRestoreStatus.state().completed() == false) { and then add an assertion that shardRestoreStatus.state() != SUCCESS (as the shard should have been moved to started and the recovery source cleaned up at that point).
just wondering if it's possible for shardRestoreStatus to be null.
I think it can be if you restore from a snapshot, then the restore fails, and you retry another restore with a different subset of indices from that same snapshot.
Good catch, thanks!
}

@Override
public Decision canAllocate(final ShardRouting shardRouting, final RoutingNode node, final RoutingAllocation allocation) {
can you also add
@Override
public Decision canForceAllocatePrimary(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
assert shardRouting.primary() : "must not call canForceAllocatePrimary on a non-primary shard " + shardRouting;
return canAllocate(shardRouting, node, allocation);
}
as this is a hard constraint with no exceptions
Good catch
        RestoreInProgress.State.FAILURE, "recovery source type changed from snapshot to " + initializedShard.recoverySource()));
    }
}

@Override
public void unassignedInfoUpdated(ShardRouting unassignedShard, UnassignedInfo newUnassignedInfo) {
    if (unassignedShard.primary()) {
only primaries can have a snapshot recovery source, so no need for this extra check here.
@ywelsch Thanks for the review! I updated the code.
LGTM
if (restoreInProgress.snapshot().equals(snapshot)) {
    RestoreInProgress.ShardRestoreStatus shardRestoreStatus = restoreInProgress.shards().get(shardRouting.shardId());
    if (shardRestoreStatus != null && shardRestoreStatus.state().completed() == false) {
        assert shardRestoreStatus.state() != RestoreInProgress.State.SUCCESS : "expected initializing shard";
can you also add the shardRestoreStatus state and the shard routing to the failure message here?
Sure, will do once the current CI build is finished
When the allocation of a shard has been retried too many times, the MaxRetryDecider is engaged to prevent any future allocation of the failed shard. If it happens while restoring a snapshot, the restore hangs and never completes because it stays around waiting for the shards to be assigned. It also blocks future attempts to restore the snapshot again.

This commit changes the current behavior in order to fail the restore if a shard reached the maximum number of allocation attempts without being successfully assigned.

This is the second part of the elastic#26865 issue. closes elastic#26865
Force-pushed from b8476f6 to 8ced615
LGTM. Thanks!
…27493) This commit changes the RestoreService so that it now fails the snapshot restore if one of the shards to restore has failed to be allocated. It also adds a new RestoreInProgressAllocationDecider that forbids such shards to be allocated again. This way, when a restore is impossible or has failed too many times, the user is forced to take a manual action (like deleting the index whose shards failed) in order to try to restore it again.

This behaviour has been implemented because when the allocation of a shard has been retried too many times, the MaxRetryDecider is engaged to prevent any future allocation of the failed shard. If it happens while restoring a snapshot, the restore hung and was never completed because it stayed around waiting for the shards to be assigned (and that won't happen). It also blocked future attempts to restore the snapshot again. With this commit, the restore does not hang and is marked as failed, leaving failed shards around for investigation.

This is the second part of the #26865 issue. Closes #26865
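For reference, a condensed sketch of such a decider, assembled from the diff fragments discussed above (the constructor, imports and exact decision messages are assumptions, not the verbatim source):

public class RestoreInProgressAllocationDecider extends AllocationDecider {

    public static final String NAME = "restore_in_progress";

    public RestoreInProgressAllocationDecider(Settings settings) {
        super(settings);
    }

    @Override
    public Decision canAllocate(final ShardRouting shardRouting, final RoutingNode node, final RoutingAllocation allocation) {
        final RecoverySource recoverySource = shardRouting.recoverySource();
        if (recoverySource == null || recoverySource.getType() != RecoverySource.Type.SNAPSHOT) {
            return allocation.decision(Decision.YES, NAME, "ignored as shard is not being recovered from a snapshot");
        }
        final Snapshot snapshot = ((RecoverySource.SnapshotRecoverySource) recoverySource).snapshot();
        final RestoreInProgress restoresInProgress = allocation.custom(RestoreInProgress.TYPE);
        if (restoresInProgress != null) {
            for (RestoreInProgress.Entry restoreInProgress : restoresInProgress.entries()) {
                if (restoreInProgress.snapshot().equals(snapshot)) {
                    RestoreInProgress.ShardRestoreStatus shardRestoreStatus = restoreInProgress.shards().get(shardRouting.shardId());
                    // a restore entry that is still in progress for this shard allows allocation
                    if (shardRestoreStatus != null && shardRestoreStatus.state().completed() == false) {
                        assert shardRestoreStatus.state() != RestoreInProgress.State.SUCCESS : "expected initializing shard";
                        return allocation.decision(Decision.YES, NAME, "shard is currently being restored");
                    }
                    break;
                }
            }
        }
        // the restore of this shard has failed (or is gone): forbid any further allocation attempt
        return allocation.decision(Decision.NO, NAME, "shard has failed to be restored from the snapshot - close or delete the " +
            "index and retry the restore, or use the reroute API to force the allocation");
    }

    @Override
    public Decision canForceAllocatePrimary(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
        assert shardRouting.primary() : "must not call canForceAllocatePrimary on a non-primary shard " + shardRouting;
        return canAllocate(shardRouting, node, allocation);
    }
}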
Backported to 6.x, 6.1, 6.0 and 5.6
When the allocation of a shard has been retried too many times, the MaxRetryDecider is engaged to prevent any future allocation of the failed shard. If it happens while restoring a snapshot, the restore hangs and never completes because it stays around waiting for the shards to be assigned. It also blocks future attempts to restore the snapshot again.

This commit changes the current behaviour in order to fail the restore if a shard reached the maximum number of allocation attempts without being successfully assigned.

This is the second part of the #26865 issue.