[CI] ShrinkIndexIT testShrinkThenSplitWithFailedNode failure #44736

jkakavas · 2019-07-23T08:33:45Z

Build scan: https://gradle.com/s/k5loufononht4
Console log: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=zulu12,nodes=general-purpose/98/console

Failure:

07:53:55   2> REPRODUCE WITH: ./gradlew :server:integTest --tests "org.elasticsearch.action.admin.indices.create.ShrinkIndexIT.testShrinkThenSplitWithFailedNode" -Dtests.seed=7F9B0926CAB77C2 -Dtests.security.manager=true -Dtests.locale=nus -Dtests.timezone=America/Indiana/Knox -Dcompiler.java=12 -Druntime.java=12
07:53:55   2> java.lang.AssertionError: ResizeResponse failed - not acked
07:53:55     Expected: <true>
07:53:55          but: was <false>
07:53:55         at __randomizedtesting.SeedInfo.seed([7F9B0926CAB77C2:DAC471C7C22737B0]:0)
07:53:55         at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
07:53:55         at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked(ElasticsearchAssertions.java:112)
07:53:55         at org.elasticsearch.test.hamcrest.ElasticsearchAssertions.assertAcked(ElasticsearchAssertions.java:100)
07:53:55         at org.elasticsearch.action.admin.indices.create.ShrinkIndexIT.testShrinkThenSplitWithFailedNode(ShrinkIndexIT.java:589)
07:53:55   1> [2019-07-22T23:53:46,590][INFO ][o.e.a.a.i.c.ShrinkIndexIT] [testCreateShrinkIndexToN] before test
07:53:55   1> [2019-07-22T23:53:46,590][INFO ][o.e.a.a.i.c.ShrinkIndexIT] [testCreateShrinkIndexToN] [ShrinkIndexIT#testCreateShrinkIndexToN]: setting up test
07:53:55   1> [2019-07-22T23:53:46,590][INFO ][o.e.t.InternalTestCluster] [testCreateShrinkIndexToN] adding voting config exclusions [node_s2] prior to restart/shutdown
07:53:55   1> [2019-07-22T23:53:46,620][INFO ][o.e.n.Node               ] [testCreateShrinkIndexToN] stopping ...
07:53:55   1> [2019-07-22T23:53:46,621][INFO ][o.e.c.c.Coordinator      ] [node_s2] master node [{node_s0}{0Vn4EYvgS7aBtp3STN22mQ}{0vkYnVenRCm120uqiHNWPg}{127.0.0.1}{127.0.0.1:42453}{dim}] failed, restarting discovery
07:53:55   1> org.elasticsearch.transport.NodeDisconnectedException: [node_s0][127.0.0.1:42453][disconnected] disconnected
07:53:55   1> [2019-07-22T23:53:46,623][INFO ][o.e.c.s.MasterService    ] [node_s0] node-left[{node_s2}{27VqX-qiQaqfZcYaMp-jUQ}{ELWQ7z4oRhazcYurKCwzxg}{127.0.0.1}{127.0.0.1:40475}{dim} disconnected], term: 1, version: 250, reason: removed {{node_s2}{27VqX-qiQaqfZcYaMp-jUQ}{ELWQ7z4oRhazcYurKCwzxg}{127.0.0.1}{127.0.0.1:40475}{dim},}
07:53:55   1> [2019-07-22T23:53:46,623][INFO ][o.e.n.Node               ] [testCreateShrinkIndexToN] stopped
07:53:55   1> [2019-07-22T23:53:46,623][INFO ][o.e.n.Node               ] [testCreateShrinkIndexToN] closing ...
07:53:55   1> [2019-07-22T23:53:46,625][INFO ][o.e.n.Node               ] [testCreateShrinkIndexToN] closed

Reproduction with:

./gradlew :server:integTest --tests "org.elasticsearch.action.admin.indices.create.ShrinkIndexIT.testShrinkThenSplitWithFailedNode" \
  -Dtests.seed=7F9B0926CAB77C2 \
  -Dtests.security.manager=true \
  -Dtests.locale=nus \
  -Dtests.timezone=America/Indiana/Knox \
  -Dcompiler.java=12 \
  -Druntime.java=12

This does not reproduce locally. Pinging @original-brownbear because of #44214.

elasticmachine · 2019-07-23T08:33:47Z

Pinging @elastic/es-distributed

…edNode (#44860)

The test ShrinkIndexIT.testShrinkThenSplitWithFailedNode sometimes fails because the resize operation is not acknowledged (see #44736).

This resize operation creates a new index "splitagain", which results in a cluster state update (TransportResizeAction uses MetaDataCreateIndexService.createIndex() to create the resized index). This cluster state update is expected to be acknowledged by all nodes (see IndexCreationTask.onAllNodesAcked()), but this does not always happen: the data node that the test stopped just before executing the resize operation might still be considered a "faulty" node by the FollowersChecker and not yet be removed from the cluster nodes. The cluster state is then acked by all nodes but one, which results in a non-acknowledged resize operation.

This commit adds an ensureStableCluster() check after stopping the node in the test. The goal is to ensure that the data node has been correctly removed from the cluster and that all remaining nodes are fully connected to each other before moving forward with the resize operation.

Closes #44736
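
To make the failure mode concrete, here is a toy Java model of the acknowledgement condition described in the commit message. It is not Elasticsearch code: it only assumes that an update counts as acked when every node listed in the published cluster state acks it. The class and method names (AckModel, allNodesAcked, removeStoppedNode) are hypothetical, the node names are borrowed from the log for illustration, and the real ack path through the master service and coordinator is considerably more involved.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Toy model (hypothetical, not Elasticsearch code) of why the resize
 * request was reported as "not acked".
 */
final class AckModel {

    /** An update is acked only if every node in the published state acked it. */
    static boolean allNodesAcked(Set<String> nodesInClusterState, Set<String> ackingNodes) {
        return ackingNodes.containsAll(nodesInClusterState);
    }

    /**
     * Hypothetical stand-in for the node-left update: returns the node set
     * after the stopped node has been removed from the cluster state.
     */
    static Set<String> removeStoppedNode(Set<String> nodesInClusterState, String stoppedNode) {
        Set<String> remaining = new HashSet<>(nodesInClusterState);
        remaining.remove(stoppedNode);
        return remaining;
    }

    public static void main(String[] args) {
        Set<String> clusterState = Set.of("node_s0", "node_s1", "node_s2");
        // node_s2 has just been stopped, so only the other two nodes can ack
        Set<String> liveNodes = Set.of("node_s0", "node_s1");

        // Resize published while node_s2 is still listed: one ack is missing
        System.out.println("before stable: acked=" + allNodesAcked(clusterState, liveNodes));

        // Resize published after node_s2 has been removed: all listed nodes ack
        Set<String> stable = removeStoppedNode(clusterState, "node_s2");
        System.out.println("after stable: acked=" + allNodesAcked(stable, liveNodes));
    }
}
```

Running main prints `before stable: acked=false` followed by `after stable: acked=true`. The second case corresponds to the ordering that the added ensureStableCluster() call is meant to enforce: the resize is only issued once the stopped node is gone from the cluster state.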

jkakavas added >test-failure :Distributed Indexing/Distributed labels Jul 23, 2019

original-brownbear assigned original-brownbear and tlrx and unassigned original-brownbear Jul 23, 2019

tlrx mentioned this issue Jul 25, 2019

Ensure cluster is stable in ShrinkIndexIT.testShrinkThenSplitWithFailedNode #44860

Merged

tlrx closed this as completed in #44860 Jul 26, 2019
