# Fix SnapshotShardStatus Reporting for Failed Shard #48556
```diff
@@ -298,8 +298,10 @@ public void onResponse(String newGeneration) {

     @Override
     public void onFailure(Exception e) {
+        final String failure = ExceptionsHelper.stackTrace(e);
+        snapshotStatus.moveToFailed(threadPool.absoluteTimeInMillis(), failure);
         logger.warn(() -> new ParameterizedMessage("[{}][{}] failed to snapshot shard", shardId, snapshot), e);
-        notifyFailedSnapshotShard(snapshot, shardId, ExceptionsHelper.stackTrace(e));
+        notifyFailedSnapshotShard(snapshot, shardId, failure);
     }
 });
```

Review thread on `final String failure = ExceptionsHelper.stackTrace(e);`:

> **Reviewer:** Ew. Can we follow up with a change that keeps the exception as an exception rather than converting it to a `String`?
>
> **Author:** Jup, I'm happy to try :)
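The shape of the fix can be illustrated outside of Elasticsearch. The sketch below is a hypothetical miniature, not the real Elasticsearch API (`stackTrace`, `moveToFailed`, and `notifyFailed` here are illustrative stand-ins): the stack trace is rendered once, the local shard status is moved to failed with it, and the very same string is passed to the master notification, so the two reports cannot diverge.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Minimal sketch of the pattern in the fix: render the failure once,
// update local status with it, then notify with the same string.
// All names below are illustrative, not the Elasticsearch API.
public class FailureReporting {

    // Rough stand-in for ExceptionsHelper.stackTrace(e)
    static String stackTrace(Exception e) {
        StringWriter sw = new StringWriter();
        e.printStackTrace(new PrintWriter(sw));
        return sw.toString();
    }

    static String lastLocalFailure;    // what the local shard status recorded
    static String lastNotifiedFailure; // what was reported to the master

    static void moveToFailed(long timeMillis, String failure) {
        lastLocalFailure = failure;
    }

    static void notifyFailed(String failure) {
        lastNotifiedFailure = failure;
    }

    static void onFailure(Exception e) {
        final String failure = stackTrace(e);           // computed once
        moveToFailed(System.currentTimeMillis(), failure);
        notifyFailed(failure);                          // same string, no second stackTrace(e) call
    }

    public static void main(String[] args) {
        onFailure(new RuntimeException("boom"));
        System.out.println(lastLocalFailure.equals(lastNotifiedFailure)); // prints "true"
    }
}
```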
|
```diff
@@ -1219,6 +1219,12 @@ public void testDataNodeRestartWithBusyMasterDuringSnapshot() throws Exception {
         disruption.startDisrupting();
         logger.info("--> restarting data node, which should cause primary shards to be failed");
         internalCluster().restartNode(dataNode, InternalTestCluster.EMPTY_CALLBACK);
+
+        logger.info("--> wait for shard snapshots to show as failed");
+        assertBusy(() -> assertThat(
+            client().admin().cluster().prepareSnapshotStatus("test-repo").setSnapshots("test-snap").get().getSnapshots()
+                .get(0).getShardsStats().getFailedShards(), greaterThanOrEqualTo(1)), 60L, TimeUnit.SECONDS);
+
         unblockNode("test-repo", dataNode);
         disruption.stopDisrupting();
         // check that snapshot completes
```

Review thread on the new `assertBusy(...)` block:

> **Reviewer:** Before this change we would sometimes unblock the node and stop the disruption before the first shard failure. I think this change makes the test weaker. I'm guessing it's invalid to do this after …
>
> **Author:** I'd argue that's a good thing :) The whole point of this test was to test this situation (failure on the data node before CS updates resume). The case where we stop disrupting before anything fails is probably practically impossible, and even if it wasn't, it's something that's covered in …
>
> **Reviewer:** I'm not convinced yet. Practically impossible is not impossible enough for me :) Do you think that the failure in #48526 is also captured, rarely, by the …
>
> **Author:** The situation we're running into is perfectly covered by …
>
> **Reviewer:** I see, yes, …
>
> **Reviewer:** I'd still rather adjust this test to have some reproducible testing for the concrete bug here and then enhance the …
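The `assertBusy(...)` call the new assertion relies on is essentially a retry loop: keep re-evaluating a condition until it holds or a timeout expires. A minimal stand-alone sketch of that polling pattern (a simplified assumption, not the actual test-framework implementation):

```java
import java.util.function.BooleanSupplier;

// Simplified sketch of the assertBusy(...) retry pattern: repeatedly check
// a condition until it becomes true or the timeout elapses.
public class BusyWait {

    static boolean busyWait(BooleanSupplier condition, long timeoutMillis) {
        final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (System.nanoTime() < deadline) {
            if (condition.getAsBoolean()) {
                return true; // condition satisfied before the deadline
            }
            try {
                Thread.sleep(10); // brief back-off between checks
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return condition.getAsBoolean(); // one final check at the deadline
    }

    public static void main(String[] args) {
        final long start = System.currentTimeMillis();
        // Simulated condition that only becomes true after ~100 ms, like a
        // failed-shard count that takes a moment to show up in the status API.
        boolean ok = busyWait(() -> System.currentTimeMillis() - start > 100, 5_000);
        System.out.println(ok); // prints "true"
    }
}
```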
|
> **Reviewer:** One fewer `instanceof` in the world ❤️