
CI: SearchableSnapshotsIntegTests testCreateAndRestoreSearchableSnapshot failing #65302


Closed
costin opened this issue Nov 20, 2020 · 3 comments · Fixed by #65343
Assignees
Labels
:Distributed Coordination/Snapshot/Restore (anything directly related to the `_snapshot/*` APIs), Team:Distributed (Obsolete) (meta label for the distributed team, replaced by Distributed Indexing/Coordination), >test-failure (triaged test failures from CI)

Comments

costin (Member) commented Nov 20, 2020

Might be related to #63586

Build scan:
https://gradle-enterprise.elastic.co/s/bmttlqvqd7ris

Repro line:

Execution failed for task ':x-pack:plugin:searchable-snapshots:internalClusterTest'.

REPRODUCE WITH: ./gradlew ':x-pack:plugin:searchable-snapshots:internalClusterTest' --tests "org.elasticsearch.xpack.searchablesnapshots.SearchableSnapshotsIntegTests.testCreateAndRestoreSearchableSnapshot" \
  -Dtests.seed=D2761B7AE386973C \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-KW \
  -Dtests.timezone=America/Ojinaga \
  -Druntime.java=11


Reproduces locally?:

No.

Applicable branches:
Master

Failure excerpt:

10:45:30   1> [2020-11-20T01:45:10,835][INFO ][o.e.n.Node               ] [testCreateAndRestoreSearchableSnapshot] stopping ...
10:45:30   1> [2020-11-20T01:45:10,836][INFO ][o.e.c.s.MasterService    ] [node_s2] node-left[{node_s1}{6JtW_0TvT_-Ck2oA4xoG8A}{SXhlp6xcQriwvmoTqJgmvg}{127.0.0.1}{127.0.0.1:44875}{imr}{xpack.installed=true} reason: disconnected], term: 1, version: 104, delta: removed {{node_s1}{6JtW_0TvT_-Ck2oA4xoG8A}{SXhlp6xcQriwvmoTqJgmvg}{127.0.0.1}{127.0.0.1:44875}{imr}{xpack.installed=true}}
10:45:30   1> [2020-11-20T01:45:10,814][WARN ][o.e.c.NodeConnectionsService] [node_s2] failed to connect to {node_s1}{6JtW_0TvT_-Ck2oA4xoG8A}{SXhlp6xcQriwvmoTqJgmvg}{127.0.0.1}{127.0.0.1:44875}{imr}{xpack.installed=true} (tried [1] times)
10:45:30   1> org.elasticsearch.transport.ConnectTransportException: [node_s1][127.0.0.1:44875] connect_exception
10:45:30   1> 	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:963) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:198) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]
10:45:30   1> 	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?]
10:45:30   1> 	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]
10:45:30   1> 	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088) ~[?:?]
10:45:30   1> 	at org.elasticsearch.common.concurrent.CompletableContext.completeExceptionally(CompletableContext.java:57) ~[elasticsearch-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at org.elasticsearch.nio.SocketChannelContext.connect(SocketChannelContext.java:142) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at org.elasticsearch.nio.EventHandler.handleConnect(EventHandler.java:117) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at org.elasticsearch.transport.nio.TestEventHandler.handleConnect(TestEventHandler.java:123) ~[framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at org.elasticsearch.nio.NioSelector.attemptConnect(NioSelector.java:428) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at org.elasticsearch.nio.NioSelector.registerChannel(NioSelector.java:451) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at org.elasticsearch.nio.NioSelector.setUpNewChannels(NioSelector.java:440) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at org.elasticsearch.nio.NioSelector.preSelect(NioSelector.java:261) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:155) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:131) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	at java.lang.Thread.run(Thread.java:834) [?:?]
10:45:30   1> Caused by: java.net.ConnectException: Connection refused
10:45:30   1> 	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
10:45:30   1> 	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:779) ~[?:?]
10:45:30   1> 	at org.elasticsearch.nio.SocketChannelContext.connect(SocketChannelContext.java:139) ~[elasticsearch-nio-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
10:45:30   1> 	... 9 more
costin added the :Distributed Coordination/Snapshot/Restore and >test-failure labels Nov 20, 2020
elasticmachine added the Team:Distributed (Obsolete) label Nov 20, 2020
elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed (Team:Distributed)

original-brownbear self-assigned this Nov 20, 2020
original-brownbear (Contributor) commented

I can reproduce this locally at a low(ish) rate with:

REPRODUCE WITH: ./gradlew ':x-pack:plugin:searchable-snapshots:internalClusterTest' --tests "org.elasticsearch.xpack.searchablesnapshots.SearchableSnapshotsIntegTests.testCreateAndRestoreSearchableSnapshot" -Dtests.seed=ABBA66EA22263E7E -Dtests.locale=hi -Dtests.timezone=Canada/Mountain -Druntime.java=15

original-brownbear (Contributor) commented

There are a number of funny races in the functionality around the pre-fetching here that can be reproduced by applying the following diff and running the failing test:

diff --git a/x-pack/plugin/searchable-snapshots/src/main/java/org/elasticsearch/index/store/SearchableSnapshotDirectory.java b/x-pack/plugin/searchable-snapshots/src/main/java/org/elasticsearch/index/store/SearchableSnapshotDirectory.java
index acf506cec7b..0217da0af81 100644
--- a/x-pack/plugin/searchable-snapshots/src/main/java/org/elasticsearch/index/store/SearchableSnapshotDirectory.java
+++ b/x-pack/plugin/searchable-snapshots/src/main/java/org/elasticsearch/index/store/SearchableSnapshotDirectory.java
@@ -457,6 +457,7 @@ public class SearchableSnapshotDirectory extends BaseDirectory {
 
                         logger.trace("{} warming cache for [{}] part [{}/{}]", shardId, file.physicalName(), part + 1, numberOfParts);
                         final long startTimeInNanos = statsCurrentTimeNanosSupplier.getAsLong();
+                        TimeUnit.SECONDS.sleep(1L);
                         ((CachedBlobContainerIndexInput) input).prefetchPart(part);
                         recoveryState.getIndex().addRecoveredBytesToFile(file.physicalName(), file.partBytes(part));

leading to failures like:

Nov 23, 2020 9:06:06 AM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[elasticsearch[node_s3][searchable_snapshots_cache_prewarming][T#9],5,TGRP-SearchableSnapshotsIntegTests]
java.lang.AssertionError: listener already exists CacheFileReference{cacheKey='[snapshotId=qnkalgsyjp/fmGfEaHpRNGQszG3JWNO0w, indexId=[cphavvfupq/gD0DHtaRRvO4emlhKT1JPQ], shardId=[wpuxikeoah][1], fileName='_2_5.fnm']', fileLength=1129, acquired=false}
	at __randomizedtesting.SeedInfo.seed([D2761B7AE386973C]:0)
	at org.elasticsearch.index.store.cache.CacheFile.acquire(CacheFile.java:152)
	at org.elasticsearch.index.store.cache.CachedBlobContainerIndexInput$CacheFileReference.get(CachedBlobContainerIndexInput.java:781)
	at org.elasticsearch.index.store.cache.CachedBlobContainerIndexInput.prefetchPart(CachedBlobContainerIndexInput.java:441)
	at org.elasticsearch.index.store.SearchableSnapshotDirectory.lambda$prewarmCache$11(SearchableSnapshotDirectory.java:461)
	at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:45)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:733)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	at java.base/java.lang.Thread.run(Thread.java:832)

Nov 23, 2020 9:06:06 AM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[elasticsearch[node_s3][searchable_snapshots_cache_prewarming][T#1],5,TGRP-SearchableSnapshotsIntegTests]
java.lang.AssertionError: listener already exists CacheFileReference{cacheKey='[snapshotId=qnkalgsyjp/fmGfEaHpRNGQszG3JWNO0w, indexId=[cphavvfupq/gD0DHtaRRvO4emlhKT1JPQ], shardId=[wpuxikeoah][1], fileName='_6.cfe']', fileLength=411, acquired=false}
	at __randomizedtesting.SeedInfo.seed([D2761B7AE386973C]:0)
	at org.elasticsearch.index.store.cache.CacheFile.acquire(CacheFile.java:152)
	at org.elasticsearch.index.store.cache.CachedBlobContainerIndexInput$CacheFileReference.get(CachedBlobContainerIndexInput.java:781)
	at org.elasticsearch.index.store.cache.CachedBlobContainerIndexInput.prefetchPart(CachedBlobContainerIndexInput.java:441)
	at org.elasticsearch.index.store.SearchableSnapshotDirectory.lambda$prewarmCache$11(SearchableSnapshotDirectory.java:461)
	at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:45)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:733)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
	at java.base/java.lang.Thread.run(Thread.java:832)

If you put that wait after the prefetching instead, you get the test failure that we see here (or some variation of it); a sketch of that placement follows the failure output:


java.lang.AssertionError: {
  "id" : 1,
  "type" : "SNAPSHOT",
  "stage" : "FINALIZE",
  "primary" : true,
  "start_time" : "2020-11-22T22:09:24.151Z",
  "start_time_in_millis" : 1606082964151,
  "total_time" : "495ms",
  "total_time_in_millis" : 495,
  "source" : {
    "repository" : "ywUJTtzmUr",
    "snapshot" : "qnkalgsyjp",
    "version" : "8.0.0",
    "index" : "cphavvfupq",
    "restoreUUID" : "aWZR-4ZjQYO0Vjm8XeYbrw"
  },
  "target" : {
    "id" : "I7VmkNOeR9Ol4l8xrNIPlw",
    "host" : "127.0.0.1",
    "transport_address" : "127.0.0.1:36753",
    "ip" : "127.0.0.1",
    "name" : "node_s3"
  },
  "index" : {
    "size" : {
      "total" : "918.5kb",
      "total_in_bytes" : 940544,
      "reused" : "3.4kb",
      "reused_in_bytes" : 3543,
      "recovered" : "0b",
      "recovered_in_bytes" : 0,
      "percent" : "0.0%"
    },
    "files" : {
      "total" : 26,
      "reused" : 8,
      "recovered" : 0,
      "percent" : "0.0%"
    },
    "total_time" : "495ms",
    "total_time_in_millis" : 495,
    "source_throttle_time" : "-1",
    "source_throttle_time_in_millis" : 0,
    "target_throttle_time" : "-1",
    "target_throttle_time_in_millis" : 0
  },
  "translog" : {
    "recovered" : 0,
    "total" : 0,
    "percent" : "100.0%",
    "total_on_start" : 0,
    "total_time" : "202ms",
    "total_time_in_millis" : 202
  },
  "verify_index" : {
    "check_index_time" : "0s",
    "check_index_time_in_millis" : 0,
    "total_time" : "0s",
    "total_time_in_millis" : 0
  }
}
Expected: <18>
     but: was <0>
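For reference, that variant is just the same instrumentation with the sleep moved below the prefetch call (a hypothetical sketch mirroring the diff above, not taken verbatim from the report):

                         ((CachedBlobContainerIndexInput) input).prefetchPart(part);
+                        TimeUnit.SECONDS.sleep(1L); // sketch: wait after prefetching instead of before
                         recoveryState.getIndex().addRecoveredBytesToFile(file.physicalName(), file.partBytes(part));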

And last but not least, you might also get the following if you put the wait after the update of the recovery stats (again, a sketch of that placement follows the stack trace):

java.lang.AssertionError: 
Expected: <DONE>
     but: was <FINALIZE>


	at __randomizedtesting.SeedInfo.seed([D2761B7AE386973C:58156C017011C68F]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.junit.Assert.assertThat(Assert.java:923)
	at org.elasticsearch.xpack.searchablesnapshots.SearchableSnapshotsIntegTests.assertRecoveryStats(SearchableSnapshotsIntegTests.java:731)
	at org.elasticsearch.xpack.searchablesnapshots.SearchableSnapshotsIntegTests.testCreateAndRestoreSearchableSnapshot(SearchableSnapshotsIntegTests.java:211)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:564)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
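For that case, the sketch simply moves the sleep once more, after the recovery stats update (again hypothetical, mirroring the diff above):

                         ((CachedBlobContainerIndexInput) input).prefetchPart(part);
                         recoveryState.getIndex().addRecoveredBytesToFile(file.physicalName(), file.partBytes(part));
+                        TimeUnit.SECONDS.sleep(1L); // sketch: wait after the recovery stats have been updated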

Mainly spelling these out here to make it clear that these failures all more or less come from the same set of race conditions. => I'm on it, looking into a fix :)

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 23, 2020
If we fail to create the `FileChannelReference` (e.g. because the directory it should be created in
was deleted in a test) we have to remove the listener from the `listeners` set to not trip internal
consistency assertions.

Relates elastic#65302 (does not fix it though, but reduces noise from failures by removing secondary
tripped assertions after the test fails)
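As an aside, a hypothetical, simplified illustration of the pattern this commit message describes (not the actual CacheFile code; the names are illustrative): if opening the file channel fails after the listener was registered, the registration has to be rolled back, otherwise a later acquire() trips the "listener already exists" assertion seen above.

    synchronized void acquire(EvictionListener listener) throws IOException {
        final boolean added = listeners.add(listener);
        assert added : "listener already exists " + this;
        try {
            if (channelRef == null) {
                // may throw, e.g. if the directory the channel should be created in was deleted by the test
                channelRef = openFileChannelReference();
            }
        } catch (IOException e) {
            // roll back the registration so the internal consistency assertions don't trip later
            listeners.remove(listener);
            throw e;
        }
    }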
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 23, 2020
The recovery stats assertions in this test ran without any waiting for
the recoveries to actually finish. The fact that they ran after the concurrent
searches checks generally meant that they would pass (because of searches warming caches
+ the general relative slowness of searches), but there are no hard guarantees that this works
reliably, as the pre-fetch threads which update the recovery state might still be slow
to do so randomly, causing the assertions to trip.

closes elastic#65302
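For illustration, a rough sketch of the kind of wait this commit message describes (not the actual change in #65343; it assumes ESTestCase.assertBusy, the indices recoveries API, and a placeholder restoredIndexName for the test index):

    // assumed imports: java.util.List, org.elasticsearch.indices.recovery.RecoveryState,
    // and a static import of org.hamcrest.Matchers.equalTo
    assertBusy(() -> {
        for (List<RecoveryState> shardRecoveries : client().admin()
            .indices()
            .prepareRecoveries(restoredIndexName)
            .get()
            .shardRecoveryStates()
            .values()) {
            for (RecoveryState recoveryState : shardRecoveries) {
                // only once the recovery (including pre-warming) has reached DONE are the
                // recovered bytes/files stats final, so the existing stats assertions can run safely
                assertThat(recoveryState.getStage(), equalTo(RecoveryState.Stage.DONE));
            }
        }
    });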
original-brownbear added a commit that referenced this issue Nov 23, 2020
If we fail to create the `FileChannelReference` (e.g. because the directory it should be created in
was deleted in a test) we have to remove the listener from the `listeners` set to not trip internal
consistency assertions.

Relates #65302 (does not fix it though, but reduces noise from failures by removing secondary
tripped assertions after the test fails)
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 23, 2020
If we fail to create the `FileChannelReference` (e.g. because the directory it should be created in
was deleted in a test) we have to remove the listener from the `listeners` set to not trip internal
consistency assertions.

Relates elastic#65302 (does not fix it though, but reduces noise from failures by removing secondary
tripped assertions after the test fails)
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 23, 2020
If we fail to create the `FileChannelReference` (e.g. because the directory it should be created in
was deleted in a test) we have to remove the listener from the `listeners` set to not trip internal
consistency assertions.

Relates elastic#65302 (does not fix it though, but reduces noise from failures by removing secondary
tripped assertions after the test fails)
original-brownbear added a commit that referenced this issue Nov 23, 2020
If we fail to create the `FileChannelReference` (e.g. because the directory it should be created in
was deleted in a test) we have to remove the listener from the `listeners` set to not trip internal
consistency assertions.

Relates #65302 (does not fix it though, but reduces noise from failures by removing secondary
tripped assertions after the test fails)
original-brownbear added a commit that referenced this issue Nov 23, 2020
If we fail to create the `FileChannelReference` (e.g. because the directory it should be created in
was deleted in a test) we have to remove the listener from the `listeners` set to not trip internal
consistency assertions.

Relates #65302 (does not fix it though, but reduces noise from failures by removing secondary
tripped assertions after the test fails)
original-brownbear added a commit that referenced this issue Nov 23, 2020
…hot (#65343)

The recovery stats assertions in this test ran without any waiting for
the recoveries to actually finish. The fact that they ran after the concurrent
searches checks generally meant that they would pass (because of searches warming caches
+ the general relative slowness of searches), but there are no hard guarantees that this works
reliably, as the pre-fetch threads which update the recovery state might still be slow
to do so randomly, causing the assertions to trip.

closes #65302
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 23, 2020
…hot (elastic#65343)

The recovery stats assertions in this test ran without any waiting for
the recoveries to actually finish. The fact that they ran after the concurrent
searches checks generally meant that they would pass (because of searches warming caches
+ the general relative slowness of searches), but there are no hard guarantees that this works
reliably, as the pre-fetch threads which update the recovery state might still be slow
to do so randomly, causing the assertions to trip.

closes elastic#65302
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Nov 23, 2020
…hot (elastic#65343)

The recovery stats assertions in this test ran without any waiting for
the recoveries to actually finish. The fact that they ran after the concurrent
searches checks generally meant that they would pass (because of searches warming caches
+ the general relative slowness of searches), but there are no hard guarantees that this works
reliably, as the pre-fetch threads which update the recovery state might still be slow
to do so randomly, causing the assertions to trip.

closes elastic#65302
original-brownbear added a commit that referenced this issue Nov 23, 2020
…hot (#65343) (#65351)

The recovery stats assertions in this test ran without any waiting for
the recoveries to actually finish. The fact that they ran after the concurrent
searches checks generally meant that they would pass (because of searches warming caches
+ the general relative slowness of searches), but there are no hard guarantees that this works
reliably, as the pre-fetch threads which update the recovery state might still be slow
to do so randomly, causing the assertions to trip.

closes #65302
original-brownbear added a commit that referenced this issue Nov 23, 2020
…hot (#65343) (#65350)

The recovery stats assertions in this test ran without any waiting for
the recoveries to actually finish. The fact that they ran after the concurrent
searches checks generally meant that they would pass (because of searches warming caches
+ the general relative slowness of searches), but there are no hard guarantees that this works
reliably, as the pre-fetch threads which update the recovery state might still be slow
to do so randomly, causing the assertions to trip.

closes #65302