[CI] SearchWithRandomExceptionsIT timeout #40435

Closed
tvernum opened this issue Mar 26, 2019 · 11 comments · Fixed by #43523
Labels
:Distributed Indexing/Engine - Anything around managing Lucene and the Translog in an open shard.
:Search/Search - Search-related issues that do not fall into other categories
>test-failure - Triaged test failures from CI

Comments

@tvernum
Contributor

tvernum commented Mar 26, 2019

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+g1gc/481/

Does not reproduce

./gradlew :server:integTest \
  -Dtests.seed=DC4EDCAF00008763 \
  -Dtests.class=org.elasticsearch.search.basic.SearchWithRandomExceptionsIT \
  -Dtests.security.manager=true \
  -Dtests.jvm.argline="-XX:-UseConcMarkSweepGC -XX:+UseG1GC" \
  -Dtests.locale=en-US \
  -Dtests.timezone=Etc/UTC \
  -Dcompiler.java=12 \
  -Druntime.java=8
02:38:57 HEARTBEAT J3 PID(64280@elasticsearch-ci-immutable-ubuntu-1404-1553564409042061385): 2019-03-26T02:38:57, stalled for 1192s at: SearchWithRandomExceptionsIT.testRandomExceptions
02:39:14   2> Mar 26, 2019 3:19:07 AM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
02:39:14 Suite: org.elasticsearch.search.basic.SearchWithRandomExceptionsIT
02:39:14   2> WARNING: Uncaught exception in thread: Thread[elasticsearch[node_s0][write][T#2],5,TGRP-SearchWithRandomExceptionsIT]
02:39:14   1> [2019-03-26T03:19:04,943][INFO ][o.e.s.b.SearchWithRandomExceptionsIT] [testRandomExceptions] before test
# ...
2> WARNING: Suite execution timed out: org.elasticsearch.search.basic.SearchWithRandomExceptionsIT

481-console.txt.gz

@tvernum added the :Search/Search and >test-failure labels Mar 26, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-search

@jimczi self-assigned this May 27, 2019
@davidkyle
Member

Another instance https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=ubuntu-16.04&&immutable/85/console

Log file: https://scans.gradle.com/s/g2vrt5bbrrjco/tests/kyv2y2z3r4v7m-xng72kg7hqste

Scanning the log file, it looks like the test never completes: the thread is blocked on an index request that never returns (see the sketch after the thread dump below).

"TEST-SearchWithRandomExceptionsIT.testRandomExceptions-seed#[FFFBCCE60327FD5C]" ID=2905 WAITING on org.elasticsearch.common.util.concurrent.BaseFuture$Sync@11de2293
	at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
	- waiting on org.elasticsearch.common.util.concurrent.BaseFuture$Sync@11de2293
	at [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
	at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
	at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
	at [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
	at app//org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:252)
	at app//org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:87)
	at app//org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:57)
	at app//org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:34)
	at app//org.elasticsearch.action.ActionRequestBuilder.get(ActionRequestBuilder.java:52)
	at app//org.elasticsearch.search.basic.SearchWithRandomExceptionsIT.testRandomExceptions(SearchWithRandomExceptionsIT.java:122)
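
For context, the stack shows the test thread parked in ActionFuture#actionGet. A minimal sketch of that blocking pattern, assuming the test issues index requests via ActionRequestBuilder#get (the index/type/id and source values here are assumptions, not the test's exact code):

// Sketch only: ActionRequestBuilder#get() waits on a BaseFuture that is completed when the
// write action invokes its listener; if that never happens, the calling thread parks forever.
IndexResponse response = client()
        .prepareIndex("test", "type", Integer.toString(i))
        .setSource("test", "some value " + i)
        .get(); // parks on BaseFuture$Sync until the listener completes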

@original-brownbear
Member

original-brownbear commented Jul 9, 2019

The reason this test blocks is an assertion tripping during a replica request:

Jul 10, 2019 2:38:49 AM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[elasticsearch[node_td3][write][T#1],5,TGRP-SearchWithRandomExceptionsIT]
java.lang.AssertionError: expected mapper parsing failures. got java.io.IOException: Forced top level Exception on [Terms]
	at __randomizedtesting.SeedInfo.seed([9232C9F794FC94FD]:0)
	at org.elasticsearch.action.support.replication.TransportWriteAction.syncOperationResultOrThrow(TransportWriteAction.java:78)
	at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:418)
	at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:393)
	at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:79)
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:526)
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:500)
	at org.elasticsearch.index.shard.IndexShard.lambda$innerAcquireReplicaOperationPermit$21(IndexShard.java:2795)
	at org.elasticsearch.action.ActionListener$3.onResponse(ActionListener.java:112)
	at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:269)
	at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:236)
	at org.elasticsearch.index.shard.IndexShard.lambda$acquireReplicaOperationPermit$19(IndexShard.java:2734)
	at org.elasticsearch.index.shard.IndexShard.innerAcquireReplicaOperationPermit(IndexShard.java:2834)
	at org.elasticsearch.index.shard.IndexShard.acquireReplicaOperationPermit(IndexShard.java:2733)
	at org.elasticsearch.action.support.replication.TransportReplicationAction.acquireReplicaOperationPermit(TransportReplicationAction.java:876)
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:585)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at org.elasticsearch.action.support.replication.TransportReplicationAction.handleReplicaRequest(TransportReplicationAction.java:485)
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
	at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:267)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:757)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:835)

That tripped assertion prevents a listener from being called, which eventually causes the whole test to time out after it has already failed.
I can reproduce this pretty easily locally by running the test in a loop.

A really good reproducer seed for me is -Dtests.seed=9232C9F794FC94FD (mostly reproduces in under 5 runs)
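
To make the failure mode concrete, here is a standalone illustration (plain JDK, not Elasticsearch code) of why a tripped assertion on the write thread shows up as a suite timeout rather than a test failure: the AssertionError escapes before the future/listener is completed, so the waiting thread parks until the suite-level timeout kills it.

import java.util.concurrent.CompletableFuture;

// Run with -ea (assertions enabled, as in the test framework) to observe the hang.
public class ListenerNeverCompleted {
    public static void main(String[] args) throws Exception {
        CompletableFuture<String> replicaWrite = new CompletableFuture<>();
        Thread writeThread = new Thread(() -> {
            boolean simulatedIOExceptionOnReplica = true; // stands in for the forced IOException
            // Mirrors the assertion seen in the trace; the AssertionError is thrown here,
            // so replicaWrite.complete(...) below is never reached.
            assert simulatedIOExceptionOnReplica == false
                    : "expected mapper parsing failures. got java.io.IOException";
            replicaWrite.complete("indexed");
        }, "write-thread");
        writeThread.start();
        // The "test thread" parks here, just like ActionFuture#actionGet in the thread dump above,
        // until an external suite-level timeout intervenes.
        System.out.println(replicaWrite.get());
    }
}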

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jul 9, 2019
* Other IOExceptions during the replica write are simulated in the test
(`SearchWithRandomExceptionsIT`) and seem possible, but will trip the assertion here -> removed it
* Closes elastic#40435
@original-brownbear
Member

#44137 would theoretically fix this, but I'm not sure if I'm missing some deeper meaning behind the assertion it removes.

@dnhatn
Member

dnhatn commented Jul 9, 2019

Armin helped me to reproduce this. I am self-assigning this as it might relate to #43523.

@dnhatn self-assigned this and unassigned jimczi Jul 9, 2019
@dnhatn added the :Distributed Indexing/Engine label Jul 9, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dnhatn
Member

dnhatn commented Jul 10, 2019

We hit an IOException while executing an index operation on a replica. We should bubble that exception up, but here we wrap it in a failure result instead. This should be fixed by #43523, where we strictly fail the engine if we hit any failure on non-primary operations (see the sketch after the stack trace below).

Jul 10, 2019 10:23:29 AM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[elasticsearch[node_td4][write][T#1],5,TGRP-SearchWithRandomExceptionsIT]
java.lang.AssertionError: unexpected failure
	at __randomizedtesting.SeedInfo.seed([9232C9F794FC94FD]:0)
	at org.elasticsearch.action.support.replication.TransportWriteAction.syncOperationResultOrThrow(TransportWriteAction.java:80)
	at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:418)
	at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:393)
	at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:79)
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:526)
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:500)
	at org.elasticsearch.index.shard.IndexShard.lambda$innerAcquireReplicaOperationPermit$21(IndexShard.java:2795)
	at org.elasticsearch.action.ActionListener$3.onResponse(ActionListener.java:112)
	at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:269)
	at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:236)
	at org.elasticsearch.index.shard.IndexShard.lambda$acquireReplicaOperationPermit$19(IndexShard.java:2734)
	at org.elasticsearch.index.shard.IndexShard.innerAcquireReplicaOperationPermit(IndexShard.java:2834)
	at org.elasticsearch.index.shard.IndexShard.acquireReplicaOperationPermit(IndexShard.java:2733)
	at org.elasticsearch.action.support.replication.TransportReplicationAction.acquireReplicaOperationPermit(TransportReplicationAction.java:876)
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:585)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at org.elasticsearch.action.support.replication.TransportReplicationAction.handleReplicaRequest(TransportReplicationAction.java:485)
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
	at org.elasticsearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:267)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:757)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:835)
Caused by: java.io.IOException: Forced top level Exception on [Terms]
	at org.elasticsearch.search.basic.SearchWithRandomExceptionsIT$RandomExceptionDirectoryReaderWrapper$ThrowingSubReaderWrapper.maybeThrow(SearchWithRandomExceptionsIT.java:223)
	at org.elasticsearch.test.engine.ThrowingLeafReaderWrapper.terms(ThrowingLeafReaderWrapper.java:88)
	at org.apache.lucene.index.LeafReader.docFreq(LeafReader.java:77)
	at org.apache.lucene.search.IndexSearcher.count(IndexSearcher.java:346)
	at org.elasticsearch.index.engine.InternalEngine.assertDocDoesNotExist(InternalEngine.java:1196)
	at org.elasticsearch.index.engine.InternalEngine.indexIntoLucene(InternalEngine.java:1053)
	at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:895)
	at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:791)
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:763)
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnReplica(IndexShard.java:727)
	at org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:433)
	at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:413)
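
A standalone sketch of the handling rule described above (hypothetical types and helper methods, not the actual #43523 change): on the primary a document-level failure may be reported back as a per-item failure result, but a replica must bubble the exception so the engine gets failed, because the operation already succeeded on the primary.

import java.io.IOException;

// Illustration only; the types and method names here are made up for this sketch.
class ReplicaFailureHandling {
    static final class IndexResult {
        final Exception failure; // null on success
        IndexResult(Exception failure) { this.failure = failure; }
    }

    static IndexResult applyIndexOperation(String doc, boolean onReplica) throws IOException {
        try {
            writeToLucene(doc);               // hypothetical low-level write
            return new IndexResult(null);     // success
        } catch (IOException e) {
            if (onReplica) {
                throw e;                      // bubble up -> the caller fails the engine/shard
            }
            return new IndexResult(e);        // primary: report a per-document failure
        }
    }

    static void writeToLucene(String doc) throws IOException {
        // simulated write; may throw (e.g. the test's "Forced top level Exception on [Terms]")
    }
}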

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jul 10, 2019
* This is failing quite often and we can reproduce it now so we don't
need additional test logging on CI
* Relates elastic#40435
pull bot pushed a commit to scher200/elasticsearch that referenced this issue Jul 10, 2019
* This is failing quite often and we can reproduce it now so we don't
need additional test logging on CI
* Relates elastic#40435
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jul 10, 2019
* This is failing quite often and we can reproduce it now so we don't
need additional test logging on CI
* Relates elastic#40435
original-brownbear added a commit that referenced this issue Jul 10, 2019
* This is failing quite often and we can reproduce it now so we don't
need additional test logging on CI
* Relates #40435
@original-brownbear
Member

I've muted this test in master and 7.x as it was failing quite often and complicating some experimentation around Debian build stability.

dnhatn added a commit that referenced this issue Jul 14, 2019
An indexing on a replica should never fail after it was successfully
indexed on a primary. Hence, we should fail an engine if we hit any
failure (document level or tragic failure) when processing an indexing
on a replica.

Relates #43228
Closes #40435
dnhatn added a commit that referenced this issue Jul 15, 2019
An indexing on a replica should never fail after it was successfully
indexed on a primary. Hence, we should fail an engine if we hit any
failure (document level or tragic failure) when processing an indexing
on a replica.

Relates #43228
Closes #40435
@tlrx
Member

tlrx commented Mar 25, 2020

@tlrx reopened this Mar 25, 2020
dnhatn added a commit that referenced this issue Apr 15, 2020
An indexing on a replica should never fail after it was successfully
indexed on a primary. Hence, we should fail an engine if we hit any
failure (document level or tragic failure) when processing an indexing
on a replica.

Relates #43228
Closes #40435
@dnhatn
Member

dnhatn commented Apr 15, 2020

@tlrx Thanks for the ping. I've backported #43523 to 6.8.

@dnhatn closed this as completed Apr 15, 2020
@tlrx
Member

tlrx commented Apr 15, 2020

Thanks @dnhatn !
