Skip to content

[CI] SearchableSnapshotsIntegTests.testCreateAndRestorePartialSearchableSnapshot fails #69336

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
dimitris-athanasiou opened this issue Feb 22, 2021 · 12 comments · Fixed by #69379
Assignees
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test-failure Triaged test failures from CI

Comments

@dimitris-athanasiou
Copy link
Contributor

Build scan: https://gradle-enterprise.elastic.co/s/6mvappg7o6hyo

Repro line:

./gradlew ':x-pack:plugin:searchable-snapshots:internalClusterTest' --tests "org.elasticsearch.xpack.searchablesnapshots.SearchableSnapshotsIntegTests.testCreateAndRestorePartialSearchableSnapshot" -Dtests.seed=1A6AFBD847C6A586 -Dtests.security.manager=true -Dtests.locale=is-IS -Dtests.timezone=Etc/GMT+0 -Druntime.java=8

Reproduces locally?: No

Applicable branches: master/7.x

Failure history: link

Failure excerpt:

java.lang.AssertionError: 
    Expected: <3948 hits>
         but: was null
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.junit.Assert.assertThat(Assert.java:923)
        at org.elasticsearch.xpack.searchablesnapshots.SearchableSnapshotsIntegTests.assertTotalHits(SearchableSnapshotsIntegTests.java:1366)
        at org.elasticsearch.xpack.searchablesnapshots.SearchableSnapshotsIntegTests.testCreateAndRestorePartialSearchableSnapshot(SearchableSnapshotsIntegTests.java:651)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=474, name=Thread-34, state=RUNNABLE, group=TGRP-SearchableSnapshotsIntegTests]

        Caused by:
        Failed to execute phase [query], all shards failed; shardFailures {[VSgyofNoS4mr27PoSm_LaQ][fcydaumqqm][0]: NodeNotConnectedException[[node_s3][127.0.0.1:38229] Node not connected]}{[VSgyofNoS4mr27PoSm_LaQ][fcydaumqqm][1]: NodeNotConnectedException[[node_s3][127.0.0.1:38229] Node not connected]}{[VSgyofNoS4mr27PoSm_LaQ][fcydaumqqm][2]: NodeNotConnectedException[[node_s3][127.0.0.1:38229] Node not connected]}{[VSgyofNoS4mr27PoSm_LaQ][fcydaumqqm][3]: NodeNotConnectedException[[node_s3][127.0.0.1:38229] Node not connected]}{[VSgyofNoS4mr27PoSm_LaQ][fcydaumqqm][4]: NodeNotConnectedException[[node_s3][127.0.0.1:38229] Node not connected]}{[VSgyofNoS4mr27PoSm_LaQ][fcydaumqqm][5]: NodeNotConnectedException[[node_s3][127.0.0.1:38229] Node not connected]}{[VSgyofNoS4mr27PoSm_LaQ][fcydaumqqm][6]: NodeNotConnectedException[[node_s3][127.0.0.1:38229] Node not connected]}{[VSgyofNoS4mr27PoSm_LaQ][fcydaumqqm][7]: NodeNotConnectedException[[node_s3][127.0.0.1:38229] Node not connected]}{[VSgyofNoS4mr27PoSm_LaQ][fcydaumqqm][8]: NodeNotConnectedException[[node_s3][127.0.0.1:38229] Node not connected]}{[VSgyofNoS4mr27PoSm_LaQ][fcydaumqqm][9]: NodeNotConnectedException[[node_s3][127.0.0.1:38229] Node not connected]}
            at __randomizedtesting.SeedInfo.seed([1A6AFBD847C6A586]:0)
            at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:636)
            at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:357)
            at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:669)
            at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:440)
            at org.elasticsearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$3(AbstractSearchAsyncAction.java:301)
            at org.elasticsearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:333)
            at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
            at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:33)
            at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732)
            at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            at java.lang.Thread.run(Thread.java:748)

            Caused by:
            NodeNotConnectedException[[node_s3][127.0.0.1:38229] Node not connected]
                at org.elasticsearch.transport.ClusterConnectionManager.getConnection(ClusterConnectionManager.java:178)
                at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:787)
                at org.elasticsearch.action.search.SearchTransportService.getConnection(SearchTransportService.java:375)
                at org.elasticsearch.action.search.TransportSearchAction.lambda$buildConnectionLookup$9(TransportSearchAction.java:698)
                at org.elasticsearch.action.search.AbstractSearchAsyncAction.getConnection(AbstractSearchAsyncAction.java:674)
                at org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction.executePhaseOnShard(SearchQueryThenFetchAsyncAction.java:71)
                at org.elasticsearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$4(AbstractSearchAsyncAction.java:273)
                at org.elasticsearch.action.search.AbstractSearchAsyncAction$PendingExecutions.tryRun(AbstractSearchAsyncAction.java:758)
                at org.elasticsearch.action.search.AbstractSearchAsyncAction.performPhaseOnShard(AbstractSearchAsyncAction.java:308)
                at org.elasticsearch.action.search.AbstractSearchAsyncAction.run(AbstractSearchAsyncAction.java:226)
                at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:397)
                at org.elasticsearch.action.search.AbstractSearchAsyncAction.start(AbstractSearchAsyncAction.java:184)
                at org.elasticsearch.action.search.TransportSearchAction$5.run(TransportSearchAction.java:779)
                at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:397)
                at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:391)
                at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:669)
                at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:440)
                at org.elasticsearch.action.search.AbstractSearchAsyncAction.access$000(AbstractSearchAsyncAction.java:62)
                at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:289)
                at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:48)
                at org.elasticsearch.transport.TransportService$6.handleException(TransportService.java:745)
                at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1290)
                at org.elasticsearch.transport.TransportService$9.run(TransportService.java:1152)
                at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673)
@dimitris-athanasiou dimitris-athanasiou added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI labels Feb 22, 2021
@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Feb 22, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@original-brownbear
Copy link
Member

This reproduces locally pretty easily after a few iterations.

The deeper issue is this tripped assertion killing the ndoe:

feb 22, 2021 11:33:31 AM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[elasticsearch[node_s3][generic][T#1],5,TGRP-SearchableSnapshotsIntegTests]
java.lang.AssertionError: StoredFieldsReader are only supposed to be consumed in the thread in which they have been acquired. But was acquired in Thread[elasticsearch[node_s3][generic][T#3],5,TGRP-SearchableSnapshotsIntegTests] and consumed in Thread[elasticsearch[node_s3][generic][T#1],5,TGRP-SearchableSnapshotsIntegTests].
	at __randomizedtesting.SeedInfo.seed([1A6AFBD847C6A586]:0)
	at org.apache.lucene.codecs.asserting.AssertingCodec.assertThread(AssertingCodec.java:40)
	at org.apache.lucene.codecs.asserting.AssertingStoredFieldsFormat$AssertingStoredFieldsReader.visitDocument(AssertingStoredFieldsFormat.java:76)
	at org.elasticsearch.index.engine.LuceneChangesSnapshot.readDocAsOp(LuceneChangesSnapshot.java:259)
	at org.elasticsearch.index.engine.LuceneChangesSnapshot.next(LuceneChangesSnapshot.java:120)
	at org.elasticsearch.indices.recovery.RecoverySourceHandler$OperationBatchSender.nextChunkRequest(RecoverySourceHandler.java:740)
	at org.elasticsearch.indices.recovery.RecoverySourceHandler$OperationBatchSender.nextChunkRequest(RecoverySourceHandler.java:705)
	at org.elasticsearch.indices.recovery.MultiChunkTransfer.getNextRequest(MultiChunkTransfer.java:151)
	at org.elasticsearch.indices.recovery.MultiChunkTransfer.handleItems(MultiChunkTransfer.java:102)
	at org.elasticsearch.indices.recovery.MultiChunkTransfer$1.write(MultiChunkTransfer.java:67)

dimitris-athanasiou added a commit that referenced this issue Feb 22, 2021
... in SearchableSnapshotsIntegTests.

Relates #69336
dimitris-athanasiou added a commit that referenced this issue Feb 22, 2021
... in SearchableSnapshotsIntegTests.

Relates #69336

Backport of #69337
@original-brownbear
Copy link
Member

This was/is caused by #68961, we are sharing a stored fields reader across threads now which trips an assertion. I'm looking into this

@original-brownbear
Copy link
Member

@nhat could you have a look here? It's pretty obvious that the way the multi chunk transfer works can lead to the stored fields reader being shared across threads but what the correct fix to this is isn't so obvious to me :)

@dimitris-athanasiou
Copy link
Contributor Author

@original-brownbear A few more tests in that suite seem to be failing like in https://gradle-enterprise.elastic.co/s/x4r4l3kmjrnci. They seem related. Do you think I should mute them all?

@original-brownbear
Copy link
Member

@dimitris-athanasiou I think the problem is that one test failing kills the rest of the suite because it's a shared test cluster, there's also another failure here that is unrelated that I'll open an issue for shortly.

@ywelsch
Copy link
Contributor

ywelsch commented Feb 22, 2021

I've muted all SearchableSnapshotsIntegTests on master and 7.x

@mark-vieira
Copy link
Contributor

I've cherry-picked the mute to the 7.12 branch as well since I'm seeing loads of failed build there as well.

@dakrone
Copy link
Member

dakrone commented Mar 31, 2021

This failed again for me today in a PR build: https://gradle-enterprise.elastic.co/s/afswubohb2bse

java.lang.AssertionError: [heampsxien][6], node[NtT4U7KMRz2pdpIjqHVGnQ], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=J7CiN0w_S9iPCczH4NbdZQ], unassigned_info[[reason=NEW_INDEX_RESTORED], at[2021-03-31T16:34:24.644Z], delayed=false, details[restore_source[QbeMFFxHQT/fkbisrxmgw]], allocation_status[no_attempt]]
Expected: a value equal to or greater than <23122L>
     but: <0L> was less than <23122L>
	at __randomizedtesting.SeedInfo.seed([36BF6811FFF6C8B7:4238A9C0C291E655]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
	at org.junit.Assert.assertThat(Assert.java:956)
	at org.elasticsearch.xpack.searchablesnapshots.SearchableSnapshotsIntegTests.testCreateAndRestorePartialSearchableSnapshot(SearchableSnapshotsIntegTests.java:570)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)

It did not reproduce for me however. The line is:

./gradlew ':x-pack:plugin:searchable-snapshots:internalClusterTest' --tests "org.elasticsearch.xpack.searchablesnapshots.SearchableSnapshotsIntegTests.testCreateAndRestorePartialSearchableSnapshot" -Dtests.seed=36BF6811FFF6C8B7 -Dtests.security.manager=true -Dtests.locale=sv-SE -Dtests.timezone=Europe/Sofia

@dakrone dakrone reopened this Mar 31, 2021
@mark-vieira
Copy link
Contributor

Thanks for reopening @dakrone. You beat me to it #71132.

FYI, as per my comment in the duplicate issue, I've muted this test in master and 7.x.

@ywelsch
Copy link
Contributor

ywelsch commented Apr 1, 2021

Note that this is a different failure cause, so let's open a fresh issue next time (as Mark did). I'm closing this one, and reopening Mark's one instead.

@ywelsch ywelsch closed this as completed Apr 1, 2021
@mark-vieira
Copy link
Contributor

Note that this is a different failure cause, so let's open a fresh issue next time

Good call. It wasn't obvious to me it was a new underlying cause but I think going forward it's best to just assume it is rather than reopen old issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test-failure Triaged test failures from CI
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants