[CI] org.elasticsearch.discovery.DiscoveryDisruptionIT#testElectMasterWithLatestVersion failures #37539

Closed
jimczi opened this issue Jan 16, 2019 · 3 comments

jimczi commented Jan 16, 2019

org.elasticsearch.discovery.DiscoveryDisruptionIT#testElectMasterWithLatestVersion recently started failing on 6.x and master:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/1315/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+intake/1034/console

The test fails with the following error (cannot reproduce locally):

java.lang.AssertionError: node [node_t1] still has [{node_t2}{H4ka7FkMQWmsZMQ3_VN1gQ}{7zSO3xGvSqGmBJqDGJynjw}{127.0.0.1}{127.0.0.1:36399}] as master expected null, but was:<{node_t2}{H4ka7FkMQWmsZMQ3_VN1gQ}{7zSO3xGvSqGmBJqDGJynjw}{127.0.0.1}{127.0.0.1:36399}>
	at __randomizedtesting.SeedInfo.seed([BB146F5E215F29E0:5BAC000A598C5A6A]:0)
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotNull(Assert.java:755)
	at org.junit.Assert.assertNull(Assert.java:737)
	at org.elasticsearch.discovery.AbstractDisruptionTestCase.lambda$assertNoMaster$0(AbstractDisruptionTestCase.java:154)
	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:848)
	at org.elasticsearch.discovery.AbstractDisruptionTestCase.assertNoMaster(AbstractDisruptionTestCase.java:151)
	at org.elasticsearch.discovery.MasterDisruptionIT.testVerifyApiBlocksDuringPartition(MasterDisruptionIT.java:411)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
	at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
	at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.lang.Thread.run(Thread.java:748)
	Suppressed: java.lang.AssertionError: node [node_t1] still has [{node_t2}{H4ka7FkMQWmsZMQ3_VN1gQ}{7zSO3xGvSqGmBJqDGJynjw}{127.0.0.1}{127.0.0.1:36399}] as master expected null, but was:<{node_t2}{H4ka7FkMQWmsZMQ3_VN1gQ}{7zSO3xGvSqGmBJqDGJynjw}{127.0.0.1}{127.0.0.1:36399}>
		at org.junit.Assert.fail(Assert.java:88)
		at org.junit.Assert.failNotNull(Assert.java:755)
		at org.junit.Assert.assertNull(Assert.java:737)
		at org.elasticsearch.discovery.AbstractDisruptionTestCase.lambda$assertNoMaster$0(AbstractDisruptionTestCase.java:154)
		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:836)
@jimczi jimczi added >test-failure Triaged test failures from CI :Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. labels Jan 16, 2019
@elasticmachine

Pinging @elastic/es-distributed

alpar-t added a commit that referenced this issue Jan 22, 2019
alpar-t added a commit that referenced this issue Jan 22, 2019
@DaveCTurner DaveCTurner self-assigned this Feb 1, 2019
@andrershov andrershov assigned andrershov and unassigned DaveCTurner Feb 6, 2019

andrershov commented Feb 6, 2019

The title mentions DiscoveryDisruptionIT#testElectMasterWithLatestVersion, but the exception in the description comes from a different test, MasterDisruptionIT.testVerifyApiBlocksDuringPartition.
There is already an issue for MasterDisruptionIT: #37276.
Moreover, the build links no longer work.
According to Kibana, DiscoveryDisruptionIT#testElectMasterWithLatestVersion did fail several times; the latest failure with logs still available is https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+release-tests/362/consoleFull.
From these logs, I can see that the test failed when asserting that there was no master on node_t2 (which had previously been the master) after the network disruption was introduced for the second time.
The reason for the failure seems to be a stall on node_t2. Here are the logs from this node:

1> [2019-01-21T06:27:50,301][DEBUG][o.e.c.s.MasterService    ] [node_t2] processing [node-left[{node_t0}{g7ZxqRehSF6MkXW0XJMwqg}{xasWEDc2SXiHXTigXFKYnA}{127.0.0.1}{127.0.0.1:46373} disconnected]]: execute
1> [2019-01-21T06:27:50,301][DEBUG][o.e.c.r.a.AllocationService] [node_t2] [test][0] failing shard [test][0], node[g7ZxqRehSF6MkXW0XJMwqg], [P], s[STARTED], a[id=-sTvScv4SzyovUuISa2DsQ] with unassigned info ([reason=NODE_LEFT], at[2019-01-21T11:27:50.301Z], delayed=true, details[node_left[g7ZxqRehSF6MkXW0XJMwqg]], allocation_status[no_attempt])

16-second stall

1> [2019-01-21T06:28:06,712][DEBUG][o.e.c.r.a.a.BalancedShardsAllocator] [node_t2] skipping rebalance due to in-flight shard/store fetches
1> [2019-01-21T06:28:06,712][TRACE][o.e.c.s.MasterService    ] [node_t2] cluster state updated, source [node-left[{node_t0}{g7ZxqRehSF6MkXW0XJMwqg}{xasWEDc2SXiHXTigXFKYnA}{127.0.0.1}{127.0.0.1:46373} disconnected]]
1> cluster uuid: Hc9PTwFsRcm09xct1FfCJg
1> version: 9
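
The size of the gap can be checked from the two timestamps in the excerpt above. The following is only a minimal sketch: the class `LogStall` and its methods are hypothetical names, not part of Elasticsearch or its test framework, and the timestamp format is assumed from the log lines quoted here.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
public class LogStall {
    // Assumed format: a bracketed timestamp such as [2019-01-21T06:27:50,301].
    private static final Pattern TIMESTAMP =
        Pattern.compile("\\[(\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2},\\d{3})\\]");
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss,SSS");

    // Extracts the first bracketed timestamp found in a log line.
    static LocalDateTime timestampOf(String line) {
        Matcher m = TIMESTAMP.matcher(line);
        if (!m.find()) {
            throw new IllegalArgumentException("no timestamp in: " + line);
        }
        return LocalDateTime.parse(m.group(1), FORMAT);
    }

    // Time elapsed between two log lines.
    static Duration gap(String earlier, String later) {
        return Duration.between(timestampOf(earlier), timestampOf(later));
    }

    public static void main(String[] args) {
        // Line prefixes copied from the node_t2 excerpt; the rest of each line is omitted.
        String before = "1> [2019-01-21T06:27:50,301][DEBUG][o.e.c.r.a.AllocationService] [node_t2]";
        String after = "1> [2019-01-21T06:28:06,712][DEBUG][o.e.c.r.a.a.BalancedShardsAllocator] [node_t2]";
        System.out.println("Stall: " + gap(before, after).toMillis() + " ms"); // prints "Stall: 16411 ms"
    }
}
```

The result is about 16.4 seconds between two consecutive log lines of the same node-left task.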

Since this test was failing during a period of unstable infrastructure, I think we should unmute this method.


ywelsch commented Mar 8, 2019

For the reasons that @andrershov has outlined above, and because there have been no recent failures of this test (testElectMasterWithLatestVersion), I'm closing this issue.

@ywelsch ywelsch closed this as completed Mar 8, 2019
andrershov pushed a commit that referenced this issue Jul 31, 2019
…thLatestVersion (#38555)

See my comments for #37539 and #37685

(cherry picked from commit 038d4ab)
andrershov pushed a commit that referenced this issue Jul 31, 2019
…thLatestVersion (#38555)

See my comments for #37539 and #37685

(cherry picked from commit 038d4ab)