[CI] TranslogTests#testFatalIOExceptionsWhileWritingConcurrently times out #29509

Closed
javanna opened this issue Apr 13, 2018 · 6 comments · Fixed by #29520
Labels: >bug, :Distributed Indexing/Engine (Anything around managing Lucene and the Translog in an open shard), v6.3.0

Comments

javanna (Member) commented Apr 13, 2018

TranslogTests#testFatalIOExceptionsWhileWritingConcurrently fails with a suite timeout error. This has happened a couple of times a day over the last few days.

Example of a recent failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=centos/2300/console

I can't reproduce this failure with the seed, but it is rather easy to reproduce just by running this test a number of times (I tried once with iters set to 100 and it failed pretty quickly; a sketch of that invocation follows the reproduce lines below). That suggests a timing issue. I wonder if it may be caused by the recent changes in #29401, which fixed another failure in this same test method.

REPRODUCE WITH: ./gradlew :server:test \
  -Dtests.seed=9842CD3971C5E19D \
  -Dtests.class=org.elasticsearch.index.translog.TranslogTests \
  -Dtests.method="testFatalIOExceptionsWhileWritingConcurrently" \
  -Dtests.security.manager=true \
  -Dtests.locale=he-IL \
  -Dtests.timezone=America/North_Dakota/Beulah

REPRODUCE WITH: ./gradlew :server:test \
  -Dtests.seed=9842CD3971C5E19D \
  -Dtests.class=org.elasticsearch.index.translog.TranslogTests \
  -Dtests.security.manager=true \
  -Dtests.locale=he-IL \
  -Dtests.timezone=America/North_Dakota/Beulah
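
For reference, the iterated run mentioned above looks roughly like this (a sketch assuming the standard randomized-testing tests.iters property; not copied from CI output):

./gradlew :server:test \
  -Dtests.class=org.elasticsearch.index.translog.TranslogTests \
  -Dtests.method="testFatalIOExceptionsWhileWritingConcurrently" \
  -Dtests.iters=100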
@javanna javanna added the >test-failure Triaged test failures from CI label Apr 13, 2018
javanna (Member, Author) commented Apr 13, 2018

@jasontedor do you have a chance to have a look at this given that you touched this test a few days ago?

@colings86 colings86 added the :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. label Apr 13, 2018
elasticmachine (Collaborator) commented

Pinging @elastic/es-distributed

@jasontedor jasontedor added :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Apr 13, 2018

javanna added a commit that referenced this issue Apr 13, 2018:
"This test has been failing quite a few times with a suite timeout, opened #29509 for it."
javanna added a commit that referenced this issue Apr 13, 2018:
"This test has been failing quite a few times with a suite timeout, opened #29509 for it."
javanna (Member, Author) commented Apr 13, 2018

Test muted in master and 6.x branches.
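
For context, muting in the test suite typically looks like the following sketch (illustrative only, assuming the usual LuceneTestCase.AwaitsFix mechanism; not copied from the muting commits):

// Hypothetical excerpt from TranslogTests.java
import org.apache.lucene.util.LuceneTestCase.AwaitsFix;

@AwaitsFix(bugUrl = "https://github.com/elastic/elasticsearch/issues/29509")
public void testFatalIOExceptionsWhileWritingConcurrently() throws Exception {
    // test body unchanged; the annotation skips the test until the issue is resolved
}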

jasontedor (Member) commented

A translog thread can deadlock itself if it holds the read lock (in Translog#readOperation) and needs to close the translog on a tragic event. Closing tries to acquire the write lock; since the thread already holds a read lock, which cannot be upgraded to a write lock, it deadlocks waiting for itself to release the read lock.
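
For illustration, here is a minimal standalone sketch of that self-deadlock (a plain ReentrantReadWriteLock, not the actual Translog code; the method names only mirror the ones mentioned above):

import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadLockUpgradeDeadlock {
    private static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Stand-in for Translog#readOperation: work performed under the read lock.
    static void readOperation() {
        lock.readLock().lock();
        try {
            // A tragic I/O event occurs while the read lock is still held,
            // so the close path is invoked from this same thread.
            closeOnTragicEvent();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Stand-in for the close path: it needs the write lock, but ReentrantReadWriteLock
    // does not allow upgrading a held read lock, so this call never returns.
    static void closeOnTragicEvent() {
        lock.writeLock().lock(); // blocks forever: this thread still holds the read lock
        try {
            // release resources
        } finally {
            lock.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        readOperation(); // hangs, which at the suite level shows up as a timeout
    }
}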

@jasontedor jasontedor added >bug v6.3.0 and removed >test-failure Triaged test failures from CI labels Apr 15, 2018
jasontedor (Member) commented

I opened #29520.
