CI: SearchableSnapshotsIntegTests testCreateAndRestoreSearchableSnapshot failing #65302
Pinging @elastic/es-distributed (Team:Distributed)
I can reproduce this locally at a low(ish) rate with:
There are a number of funny races in the functionality around the pre-fetching here; they can be reproduced by applying the following diff and running the failing test:

```diff
diff --git a/x-pack/plugin/searchable-snapshots/src/main/java/org/elasticsearch/index/store/SearchableSnapshotDirectory.java b/x-pack/plugin/searchable-snapshots/src/main/java/org/elasticsearch/index/store/SearchableSnapshotDirectory.java
index acf506cec7b..0217da0af81 100644
--- a/x-pack/plugin/searchable-snapshots/src/main/java/org/elasticsearch/index/store/SearchableSnapshotDirectory.java
+++ b/x-pack/plugin/searchable-snapshots/src/main/java/org/elasticsearch/index/store/SearchableSnapshotDirectory.java
@@ -457,6 +457,7 @@ public class SearchableSnapshotDirectory extends BaseDirectory {
                 logger.trace("{} warming cache for [{}] part [{}/{}]", shardId, file.physicalName(), part + 1, numberOfParts);
                 final long startTimeInNanos = statsCurrentTimeNanosSupplier.getAsLong();
+                TimeUnit.SECONDS.sleep(1L);
                 ((CachedBlobContainerIndexInput) input).prefetchPart(part);
                 recoveryState.getIndex().addRecoveredBytesToFile(file.physicalName(), file.partBytes(part));
```

leading to failures via:
If you put that wait after the prefetching, you get the test failure that we see here (or some variation of it):

And last but not least, you might also get the following if you put the wait after the update of the recovery stats:

I'm mainly spelling these out here to make it clear that these failures all more or less come from the same set of race conditions. => I'm on it and looking into a fix :)
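To make the shape of that race concrete, here is a minimal, self-contained sketch (plain Java, not the actual Elasticsearch code; all names are illustrative): a background warming thread records recovered bytes only after each part has been prefetched, while the test thread asserts on the stats without waiting for warming to finish:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class RecoveryStatsRace {
    public static void main(String[] args) throws InterruptedException {
        final AtomicLong recoveredBytes = new AtomicLong(); // stand-in for the recovery stats
        final long totalBytes = 3 * 1024L;

        // Background "cache warming" thread: prefetch each part, then record
        // the recovered bytes -- mirroring the prefetchPart(...) followed by
        // addRecoveredBytesToFile(...) sequence in the diff above.
        Thread warmer = new Thread(() -> {
            for (int part = 0; part < 3; part++) {
                try {
                    TimeUnit.MILLISECONDS.sleep(100L); // stands in for the injected sleep
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                recoveredBytes.addAndGet(1024L); // one part "prefetched"
            }
        });
        warmer.start();

        // Test thread: reads the stats immediately, without waiting for the
        // warmer to finish. Depending on timing this intermittently "fails".
        if (recoveredBytes.get() != totalBytes) {
            System.out.println("assertion tripped: recovered=" + recoveredBytes.get() + " of " + totalBytes);
        }
        warmer.join();
    }
}
```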
If we fail to create the `FileChannelReference` (e.g. because the directory it should be created in was deleted in a test), we have to remove the listener from the `listeners` set so as not to trip internal consistency assertions. Relates #65302 (this does not fix the issue, but it reduces noise from failures by removing secondary tripped assertions after the test fails).
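As a hedged illustration of that cleanup pattern (hypothetical names; the real code lives around `FileChannelReference` in the searchable-snapshots cache layer): register the listener first, and if creating the underlying resource throws, unregister it again before propagating the failure:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical sketch; ChannelRegistry, openChannel, and createChannel are
// stand-ins for the real code around FileChannelReference.
class ChannelRegistry {
    private final Set<Consumer<Object>> listeners = new HashSet<>();

    void openChannel(Consumer<Object> listener) {
        synchronized (listeners) {
            listeners.add(listener);
        }
        final Object channel;
        try {
            channel = createChannel(); // may fail, e.g. if the directory was deleted by the test
        } catch (RuntimeException e) {
            // Remove the listener again so that later internal consistency
            // assertions over the listeners set do not trip.
            synchronized (listeners) {
                listeners.remove(listener);
            }
            throw e;
        }
        listener.accept(channel);
    }

    private Object createChannel() {
        throw new RuntimeException("simulated: directory was concurrently deleted");
    }
}
```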
The recovery stats assertions in this test ran without waiting for the recoveries to actually finish. The fact that they ran after the concurrent-searches checks generally meant that they would pass (because the searches warm the caches, and searches are comparatively slow), but there are no hard guarantees that this works reliably: the pre-fetch threads that update the recovery state might still be randomly slow to do so, causing the assertions to trip. Closes #65302 (fixed in #65343).
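A sketch of the shape of that fix, under the assumption that the test should poll until the recovery reports itself done before asserting on the stats (Elasticsearch tests usually express this with `ESTestCase#assertBusy`; the helper below is a self-contained stand-in):

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

final class WaitUntil {
    // Self-contained stand-in for ESTestCase#assertBusy: retry a condition
    // until it holds or the timeout expires.
    static void assertBusy(BooleanSupplier condition, long timeoutMillis) throws InterruptedException {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (condition.getAsBoolean() == false) {
            if (System.nanoTime() > deadline) {
                throw new AssertionError("condition not met within " + timeoutMillis + "ms");
            }
            TimeUnit.MILLISECONDS.sleep(50L);
        }
    }
}

// Hypothetical usage in the test, before any recovery stats assertions:
//   WaitUntil.assertBusy(() -> recoveryState.getStage() == RecoveryState.Stage.DONE, 30_000L);
//   // ...only then assert on recovered bytes / percentages.
```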
Might be related to #63586
Build scan:
https://gradle-enterprise.elastic.co/s/bmttlqvqd7ris
Repro line:
Reproduces locally?:
No.
Applicable branches:
Master
Failure excerpt: