
CI: SearchableSnapshotsIntegTests testCreateAndRestoreSearchableSnapshot failing #65302


Closed
costin opened this issue Nov 20, 2020 · 3 comments · Fixed by #65343
Assignees
Labels
:Distributed Coordination/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs), Team:Distributed (Obsolete) (meta label for the distributed team, replaced by Distributed Indexing/Coordination), >test-failure (triaged test failures from CI)

Comments

costin (Member) commented Nov 20, 2020

Might be related to #63586

Build scan:
https://gradle-enterprise.elastic.co/s/bmttlqvqd7ris

Repro line:

Execution failed for task ':x-pack:plugin:searchable-snapshots:internalClusterTest'.

REPRODUCE WITH: ./gradlew ':x-pack:plugin:searchable-snapshots:internalClusterTest' --tests "org.elasticsearch.xpack.searchablesnapshots.SearchableSnapshotsIntegTests.testCreateAndRestoreSearchableSnapshot" \
  -Dtests.seed=D2761B7AE386973C \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-KW \
  -Dtests.timezone=America/Ojinaga \
  -Druntime.java=11


Reproduces locally?:

No.

Applicable branches:
Master

Failure excerpt:

10:45:30   1> [2020-11-20T01:45:10,835][INFO ][o.e.n.Node               ] [testCreateAndRestoreSearchableSnapshot] stopping ...
10:45:30   1> [2020-11-20T01:45:10,836][INFO ][o.e.c.s.MasterService    ] [node_s2] node-left[{node_s1}{6JtW_0TvT_-Ck2oA4xoG8A}{SXhlp6xcQriwvmoTqJgmvg}{127.0.0.1}{127.0.0.1:44875}{imr}{xpack.installed=true} reason: disconnected], term: 1, version: 104, delta: removed {{node_s1}{6JtW_0TvT_-Ck2oA4xoG8A}{SXhlp6xcQriwvmoTqJgmvg}{127.0.0.1}{127.0.0.1:44875}{imr}{xpack.installed=true}}
10:45:30   1> [2020-11-20T01:45:10,814][WARN ][o.e.c.NodeConnectionsService] [node_s2] failed to connect to {node_s1}{6JtW_0TvT_-Ck2oA4xoG8A}{SXhlp6xcQriwvmoTqJgmvg}{127.0.0.1}{127.0.0.1:44875}{imr}{xpack.installed=true} (tried [1] times)
10:45:30   1> org.elasticsearch.transport.ConnectTransportException: [node_s1][127.0.0.1:44875] connect_exception
10:45:30   1> 	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:963) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:198) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
10:45:30   1> 	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?]
10:45:30   1> 	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
10:45:30   1> 	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[?:?]
10:45:30   1> 	at org.elasticsearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:57) ~[elasticsearch-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at org.elasticsearch.nio.SocketChannelContext.connect(SocketChannelContext.java:142) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at org.elasticsearch.nio.EventHandler.handleConnect(EventHandler.java:117) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at org.elasticsearch.transport.nio.TestEventHandler.handleConnect(TestEventHandler.java:123) ~[framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at org.elasticsearch.nio.NioSelector.attemptConnect(NioSelector.java:428) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at org.elasticsearch.nio.NioSelector.registerChannel(NioSelector.java:451) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at org.elasticsearch.nio.NioSelector.setUpNewChannels(NioSelector.java:440) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at org.elasticsearch.nio.NioSelector.preSelect(NioSelector.java:261) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:155) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:131) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at java.lang.Thread.run(Thread.java:834) [?:?]
10:45:30   1> Caused by: java.net.ConnectException: Connection refused
10:45:30   1> 	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
10:45:30   1> 	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:779) ~[?:?]
10:45:30   1> 	at org.elasticsearch.nio.SocketChannelContext.connect(SocketChannelContext.java:139) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	... 9 more
costin added the :Distributed Coordination/Snapshot/Restore and >test-failure labels Nov 20, 2020
elasticmachine added the Team:Distributed (Obsolete) label Nov 20, 2020
elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed (Team:Distributed)

original-brownbear self-assigned this Nov 20, 2020
original-brownbear (Contributor) commented

I can reproduce this locally at a low(ish) rate with:

REPRODUCE WITH: ./gradlew ':x-pack:plugin:searchable-snapshots:internalClusterTest' --tests "org.elasticsearch.xpack.searchablesnapshots.SearchableSnapshotsIntegTests.testCreateAndRestoreSearchableSnapshot" -Dtests.seed=ABBA66EA22263E7E -Dtests.locale=hi -Dtests.timezone=Canada/Mountain -Druntime.java=15

original-brownbear (Contributor) commented

There are a number of funny races in the functionality around the pre-fetching here that can be reproduced by applying the following diff and running the failing test:

diff --git a/x-pack/plugin/searchable-snapshots/src/main/java/org/elasticsearch/index/store/SearchableSnapshotDirectory.java b/x-pack/plugin/searchable-snapshots/src/main/java/org/elasticsearch/index/store/SearchableSnapshotDirectory.java
index acf506cec7b..0217da0af81 100644
--- a/x-pack/plugin/searchable-snapshots/src/main/java/org/elasticsearch/index/store/SearchableSnapshotDirectory.java
+++ b/x-pack/plugin/searchable-snapshots/src/main/java/org/elasticsearch/index/store/SearchableSnapshotDirectory.java
@@ -457,6 +457,7 @@ public class SearchableSnapshotDirectory extends BaseDirectory {
 
                         logger.trace("{} warming cache for [{}] part [{}/{}]", shardId, file.physicalName(), part + 1, numberOfParts);
                         final long startTimeInNanos = statsCurrentTimeNanosSupplier.getAsLong();
+                        TimeUnit.SECONDS.sleep(1L);
                         ((CachedBlobContainerIndexInput) input).prefetchPart(part);
                         recoveryState.getIndex().addRecoveredBytesToFile(file.physicalName(), file.partBytes(part));

leading to failures like:

Nov 23, 2020 9:06:06 AM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[elasticsearch[node_s3][searchable_snapshots_cache_prewarming][T#9],5,TGRP-SearchableSnapshotsIntegTests]
java.lang.AssertionError: listener already exists CacheFileReference{cacheKey='[snapshotId=qnkalgsyjp/fmGfEaHpRNGQszG3JWNO0w, indexId=[cphavvfupq/gD0DHtaRRvO4emlhKT1JPQ], shardId=[wpuxikeoah][1], fileName='_2_5.fnm']', fileLength=1129, acquired=false}
	at __randomizedtesting.SeedInfo.seed([D2761B7AE386973C]:0)
	at org.elasticsearch.index.store.cache.CacheFile.acquire(CacheFile.java:152)
	at org.elasticsearch.index.store.cache.CachedBlobContainerIndexInput$CacheFileReference.get(CachedBlobContainerIndexInput.java:781)
	at org.elasticsearch.index.store.cache.CachedBlobContainerIndexInput.prefetchPart(CachedBlobContainerIndexInput.java:441)
	at org.elasticsearch.index.store.SearchableSnapshotDirectory.lambda$prewarmCache$11(SearchableSnapshotDirectory.java:461)
	at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:45)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:733)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	at java.base/java.lang.Thread.run(Thread.java:832)

Nov 23, 2020 9:06:06 AM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[elasticsearch[node_s3][searchable_snapshots_cache_prewarming][T#1],5,TGRP-SearchableSnapshotsIntegTests]
java.lang.AssertionError: listener already exists CacheFileReference{cacheKey='[snapshotId=qnkalgsyjp/fmGfEaHpRNGQszG3JWNO0w, indexId=[cphavvfupq/gD0DHtaRRvO4emlhKT1JPQ], shardId=[wpuxikeoah][1], fileName='_6.cfe']', fileLength=411, acquired=false}
	at __randomizedtesting.SeedInfo.seed([D2761B7AE386973C]:0)
	at org.elasticsearch.index.store.cache.CacheFile.acquire(CacheFile.java:152)
	at org.elasticsearch.index.store.cache.CachedBlobContainerIndexInput$CacheFileReference.get(CachedBlobContainerIndexInput.java:781)
	at org.elasticsearch.index.store.cache.CachedBlobContainerIndexInput.prefetchPart(CachedBlobContainerIndexInput.java:441)
	at org.elasticsearch.index.store.SearchableSnapshotDirectory.lambda$prewarmCache$11(SearchableSnapshotDirectory.java:461)
	at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:45)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:733)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	at java.base/java.lang.Thread.run(Thread.java:832)

If you put that wait after the prefetching instead, you get the test failure that we see here (or some variation of it); a sketch of that placement follows the failure output:


java.lang.AssertionError: {
  "id" : 1,
  "type" : "SNAPSHOT",
  "stage" : "FINALIZE",
  "primary" : true,
  "start_time" : "2020-11-22T22:09:24.151Z",
  "start_time_in_millis" : 1606082964151,
  "total_time" : "495ms",
  "total_time_in_millis" : 495,
  "source" : {
    "repository" : "ywUJTtzmUr",
    "snapshot" : "qnkalgsyjp",
    "version" : "8.0.0",
    "index" : "cphavvfupq",
    "restoreUUID" : "aWZR-4ZjQYO0Vjm8XeYbrw"
  },
  "target" : {
    "id" : "I7VmkNOeR9Ol4l8xrNIPlw",
    "host" : "127.0.0.1",
    "transport_address" : "127.0.0.1:36753",
    "ip" : "127.0.0.1",
    "name" : "node_s3"
  },
  "index" : {
    "size" : {
      "total" : "918.5kb",
      "total_in_bytes" : 940544,
      "reused" : "3.4kb",
      "reused_in_bytes" : 3543,
      "recovered" : "0b",
      "recovered_in_bytes" : 0,
      "percent" : "0.0%"
    },
    "files" : {
      "total" : 26,
      "reused" : 8,
      "recovered" : 0,
      "percent" : "0.0%"
    },
    "total_time" : "495ms",
    "total_time_in_millis" : 495,
    "source_throttle_time" : "-1",
    "source_throttle_time_in_millis" : 0,
    "target_throttle_time" : "-1",
    "target_throttle_time_in_millis" : 0
  },
  "translog" : {
    "recovered" : 0,
    "total" : 0,
    "percent" : "100.0%",
    "total_on_start" : 0,
    "total_time" : "202ms",
    "total_time_in_millis" : 202
  },
  "verify_index" : {
    "check_index_time" : "0s",
    "check_index_time_in_millis" : 0,
    "total_time" : "0s",
    "total_time_in_millis" : 0
  }
}
Expected: <18>
     but: was <0>
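For reference, that variant is just the same instrumentation with the sleep moved below the prefetch call (a hypothetical sketch mirroring the diff above, not taken verbatim from the report):

                         ((CachedBlobContainerIndexInput) input).prefetchPart(part);
+                        TimeUnit.SECONDS.sleep(1L); // sketch: wait after prefetching instead of before
                         recoveryState.getIndex().addRecoveredBytesToFile(file.physicalName(), file.partBytes(part));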

And last but not least, you might also get the following if you put the wait after the update of the recovery stats (again, a sketch of that placement follows the stack trace):

java.lang.AssertionError: 
Expected: <DONE>
     but: was <FINALIZE>


	at __randomizedtesting.SeedInfo.seed([D2761B7AE386973C:58156C017011C68F]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.junit.Assert.assertThat(Assert.java:923)
	at org.elasticsearch.xpack.searchablesnapshots.SearchableSnapshotsIntegTests.assertRecoveryStats(SearchableSnapshotsIntegTests.java:731)
	at org.elasticsearch.xpack.searchablesnapshots.SearchableSnapshotsIntegTests.testCreateAndRestoreSearchableSnapshot(SearchableSnapshotsIntegTests.java:211)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:564)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
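For that case, the sketch simply moves the sleep once more, after the recovery stats update (again hypothetical, mirroring the diff above):

                         ((CachedBlobContainerIndexInput) input).prefetchPart(part);
                         recoveryState.getIndex().addRecoveredBytesToFile(file.physicalName(), file.partBytes(part));
+                        TimeUnit.SECONDS.sleep(1L); // sketch: wait after the recovery stats have been updated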

Mainly spelling these out here to make it clear that these failures all more or less come from the same set of race conditions. => I'm on it, looking into a fix :)

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 23, 2020
If we fail to create the `FileChannelReference` (e.g. because the directory it should be created in
was deleted in a test) we have to remove the listener from the `listeners` set to not trip internal
consistency assertions.

Relates elastic#65302 (does not fix it though, but reduces noise from failures by removing secondary
tripped assertions after the test fails)
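As an aside, a hypothetical, simplified illustration of the pattern this commit message describes (not the actual CacheFile code; the names are illustrative): if opening the file channel fails after the listener was registered, the registration has to be rolled back, otherwise a later acquire() trips the "listener already exists" assertion seen above.

    synchronized void acquire(EvictionListener listener) throws IOException {
        final boolean added = listeners.add(listener);
        assert added : "listener already exists " + this;
        try {
            if (channelRef == null) {
                // may throw, e.g. if the directory the channel should be created in was deleted by the test
                channelRef = openFileChannelReference();
            }
        } catch (IOException e) {
            // roll back the registration so the internal consistency assertions don't trip later
            listeners.remove(listener);
            throw e;
        }
    }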
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 23, 2020
The recovery stats assertions in this test ran without any waiting for
the recoveries to actually finish. The fact that they ran after the concurrent
searches checks generally meant that they would pass (because of searches warming caches
+ the general relative slowness of searches), but there are no hard guarantees that this works
reliably, as the pre-fetch threads which update the recovery state might still be slow
to do so randomly, causing the assertions to trip.

closes elastic#65302
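For illustration, a rough sketch of the kind of wait this commit message describes (not the actual change in #65343; it assumes ESTestCase.assertBusy, the indices recoveries API, and a placeholder restoredIndexName for the test index):

    // assumed imports: java.util.List, org.elasticsearch.indices.recovery.RecoveryState,
    // and a static import of org.hamcrest.Matchers.equalTo
    assertBusy(() -> {
        for (List<RecoveryState> shardRecoveries : client().admin()
            .indices()
            .prepareRecoveries(restoredIndexName)
            .get()
            .shardRecoveryStates()
            .values()) {
            for (RecoveryState recoveryState : shardRecoveries) {
                // only once the recovery (including pre-warming) has reached DONE are the
                // recovered bytes/files stats final, so the existing stats assertions can run safely
                assertThat(recoveryState.getStage(), equalTo(RecoveryState.Stage.DONE));
            }
        }
    });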
original-brownbear added a commit that referenced this issue Nov 23, 2020
If we fail to create the `FileChannelReference` (e.g. because the directory it should be created in
was deleted in a test) we have to remove the listener from the `listeners` set to not trip internal
consistency assertions.

Relates #65302 (does not fix it though, but reduces noise from failures by removing secondary
tripped assertions after the test fails)
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 23, 2020
If we fail to create the `FileChannelReference` (e.g. because the directory it should be created in
was deleted in a test) we have to remove the listener from the `listeners` set to not trip internal
consistency assertions.

Relates elastic#65302 (does not fix it though, but reduces noise from failures by removing secondary
tripped assertions after the test fails)
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 23, 2020
If we fail to create the `FileChannelReference` (e.g. because the directory it should be created in
was deleted in a test) we have to remove the listener from the `listeners` set to not trip internal
consistency assertions.

Relates elastic#65302 (does not fix it though, but reduces noise from failures by removing secondary
tripped assertions after the test fails)
original-brownbear added a commit that referenced this issue Nov 23, 2020
If we fail to create the `FileChannelReference` (e.g. because the directory it should be created in
was deleted in a test) we have to remove the listener from the `listeners` set to not trip internal
consistency assertions.

Relates #65302 (does not fix it though, but reduces noise from failures by removing secondary
tripped assertions after the test fails)
original-brownbear added a commit that referenced this issue Nov 23, 2020
If we fail to create the `FileChannelReference` (e.g. because the directory it should be created in
was deleted in a test) we have to remove the listener from the `listeners` set to not trip internal
consistency assertions.

Relates #65302 (does not fix it though, but reduces noise from failures by removing secondary
tripped assertions after the test fails)
original-brownbear added a commit that referenced this issue Nov 23, 2020
…hot (#65343)

The recovery stats assertions in this test ran without any waiting for
the recoveries to actually finish. The fact that they ran after the concurrent
searches checks generally meant that they would pass (because of searches warming caches
+ the general relative slowness of searches), but there are no hard guarantees that this works
reliably, as the pre-fetch threads which update the recovery state might still be slow
to do so randomly, causing the assertions to trip.

closes #65302
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 23, 2020
…hot (elastic#65343)

The recovery stats assertions in this test ran without any waiting for
the recoveries to actually finish. The fact that they ran after the concurrent
searches checks generally meant that they would pass (because of searches warming caches
+ the general relative slowness of searches), but there are no hard guarantees that this works
reliably, as the pre-fetch threads which update the recovery state might still be slow
to do so randomly, causing the assertions to trip.

closes elastic#65302
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 23, 2020
…hot (elastic#65343)

The recovery stats assertions in this test ran without any waiting for
the recoveries to actually finish. The fact that they ran after the concurrent
searches checks generally meant that they would pass (because of searches warming caches
+ the general relative slowness of searches), but there are no hard guarantees that this works
reliably, as the pre-fetch threads which update the recovery state might still be slow
to do so randomly, causing the assertions to trip.

closes elastic#65302
original-brownbear added a commit that referenced this issue Nov 23, 2020
…hot (#65343) (#65351)

The recovery stats assertions in this test ran without any waiting for
the recoveries to actually finish. The fact that they ran after the concurrent
searches checks generally meant that they would pass (because of searches warming caches
+ the general relative slowness of searches), but there are no hard guarantees that this works
reliably, as the pre-fetch threads which update the recovery state might still be slow
to do so randomly, causing the assertions to trip.

closes #65302
original-brownbear added a commit that referenced this issue Nov 23, 2020
…hot (#65343) (#65350)

The recovery stats assertions in this test ran without any waiting for
the recoveries to actually finish. The fact that they ran after the concurrent
searches checks generally meant that they would pass (because of searches warming caches
+ the general relative slowness of searches), but there are no hard guarantees that this works
reliably, as the pre-fetch threads which update the recovery state might still be slow
to do so randomly, causing the assertions to trip.

closes #65302