StableMasterDisruptionIT.testStaleMasterNotHijackingMajority failing in CI #43392


Closed
jbaiera opened this issue Jun 19, 2019 · 6 comments
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >test-failure Triaged test failures from CI

Comments

@jbaiera
Member

jbaiera commented Jun 19, 2019

In the last 6 days (since about June 13), there's been an uptick in failures pertaining to this test case.

The offending assertion looks like it is:

        // Wait for majority side to elect a new master
        assertBusy(() -> {
            for (final Map.Entry<String, List<Tuple<String, String>>> entry : masters.entrySet()) {
                final List<Tuple<String, String>> transitions = entry.getValue();
                assertTrue(entry.getKey() + ": " + transitions,
                    transitions.stream().anyMatch(transition -> transition.v2() != null));
            }
        });

Failing with:

org.elasticsearch.discovery.StableMasterDisruptionIT > testStaleMasterNotHijackingMajority FAILED
--
java.lang.AssertionError: node_t2: [Tuple [v1=node_t1, v2=null]]
at __randomizedtesting.SeedInfo.seed([C5E917ABBDEFB99F:9AEF5AA49C346693]:0)
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.elasticsearch.discovery.StableMasterDisruptionIT.lambda$testStaleMasterNotHijackingMajority$5(StableMasterDisruptionIT.java:226)
at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:874)
at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:848)
at org.elasticsearch.discovery.StableMasterDisruptionIT.testStaleMasterNotHijackingMajority(StableMasterDisruptionIT.java:223)
REPRODUCE WITH: ./gradlew :server:integTest --tests "org.elasticsearch.discovery.StableMasterDisruptionIT.testStaleMasterNotHijackingMajority" -Dtests.seed=C5E917ABBDEFB99F -Dtests.security.manager=true -Dtests.locale=en-GY -Dtests.timezone=SystemV/MST7MDT -Dcompiler.java=12 -Druntime.java=11
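For context on the failure message: the v1/v2 values in each Tuple are presumably the previous and newly-elected master node names recorded by a cluster state listener on each node of the majority side, so Tuple [v1=node_t1, v2=null] means node_t2 saw node_t1 drop out as master but never observed a replacement being elected. A rough sketch of how such a transitions map could be collected (illustrative only, not necessarily the exact test code):

    // Illustrative sketch only -- not necessarily the exact code in StableMasterDisruptionIT.
    // Runs inside an ESIntegTestCase, so internalCluster() and
    // org.elasticsearch.common.collect.Tuple are available; majoritySide is assumed
    // to hold the names of the two non-frozen nodes.
    final Map<String, List<Tuple<String, String>>> masters = Collections.synchronizedMap(new HashMap<>());
    for (final String node : majoritySide) {
        masters.put(node, new ArrayList<>());
        internalCluster().getInstance(ClusterService.class, node).addListener(event -> {
            final DiscoveryNode previousMaster = event.previousState().nodes().getMasterNode();
            final DiscoveryNode currentMaster = event.state().nodes().getMasterNode();
            if (Objects.equals(previousMaster, currentMaster) == false) {
                // v2 == null records "the master went away and no new master has been elected yet"
                masters.get(node).add(new Tuple<>(
                    previousMaster == null ? null : previousMaster.getName(),
                    currentMaster == null ? null : currentMaster.getName()));
            }
        });
    }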

Gradle Scan from 13th: https://scans.gradle.com/s/tfwcqwg266oam
Scan from today: https://scans.gradle.com/s/2edje7yces7zc

@jbaiera jbaiera added >test-failure Triaged test failures from CI :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Jun 19, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner DaveCTurner self-assigned this Jun 20, 2019
@ywelsch
Contributor

ywelsch commented Jun 20, 2019

The test uses LongGCDisruption, which suspends threads using Thread.suspend(). The nodes properly detect that node_t1 is unavailable, but they do not succeed in electing a new leader amongst themselves because they appear unable to communicate with each other. It looks like pausing node_t1 is blocking the event loop threads on the other nodes. @original-brownbear @tbrooks8 thoughts?
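To illustrate the hazard being described (plain-JDK sketch, not Elasticsearch code): Thread.suspend() freezes a thread at an arbitrary point, so if the frozen thread happens to hold a lock that threads of the other in-process nodes also need, those threads stall until the disruption resumes it. This is exactly why suspend/resume are deprecated. A minimal, self-contained example:

    import java.util.concurrent.CountDownLatch;

    // Demonstrates the Thread.suspend() hazard: a thread suspended while holding a
    // lock leaves every other thread that needs that lock blocked until resume().
    public class SuspendHazard {
        private static final Object SHARED_LOCK = new Object();
        private static volatile boolean done = false;

        public static void main(String[] args) throws Exception {
            final CountDownLatch lockHeld = new CountDownLatch(1);
            Thread victim = new Thread(() -> {
                synchronized (SHARED_LOCK) {
                    lockHeld.countDown();
                    while (done == false) {
                        // busy-wait inside the critical section
                    }
                }
            }, "victim");
            victim.start();
            lockHeld.await();
            victim.suspend();   // deprecated: freezes victim while it still owns SHARED_LOCK
            done = true;        // victim would now exit the critical section, but it is frozen

            Thread other = new Thread(() -> {
                synchronized (SHARED_LOCK) {
                    System.out.println("other acquired the lock");
                }
            }, "other");
            other.start();
            other.join(2_000);
            System.out.println("other state after 2s: " + other.getState()); // BLOCKED
            victim.resume();    // only after resuming does 'other' make progress
            other.join();
            victim.join();
        }
    }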

Relevant log lines:

1> [2019-06-19T12:43:15,274][INFO ][o.e.d.StableMasterDisruptionIT] [testStaleMasterNotHijackingMajority] --> freezing node [node_t1]
...
1> [2019-06-19T12:43:22,698][WARN ][o.e.t.n.MockNioTransport ] [node_t0] Potentially blocked execution on network thread [elasticsearch[node_t0][transport_worker][T#2]] [1400 milliseconds]: 
  1> [email protected]/sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
  1> [email protected]/sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:533)
  1> [email protected]/sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:285)
  1> app//org.elasticsearch.nio.ChannelFactory$RawChannelFactory$$Lambda$1656/0x0000000100731c40.run(Unknown Source)
  1> [email protected]/java.security.AccessController.doPrivileged(Native Method)
  1> app//org.elasticsearch.nio.ChannelFactory$RawChannelFactory.accept(ChannelFactory.java:223)
  1> app//org.elasticsearch.nio.ChannelFactory$RawChannelFactory.acceptNioChannel(ChannelFactory.java:180)
  1> app//org.elasticsearch.nio.ChannelFactory.acceptNioChannel(ChannelFactory.java:55)
  1> app//org.elasticsearch.nio.ServerChannelContext.acceptChannels(ServerChannelContext.java:47)
  1> app//org.elasticsearch.nio.EventHandler.acceptChannel(EventHandler.java:45)
  1> app//org.elasticsearch.transport.nio.TestEventHandler.acceptChannel(TestEventHandler.java:51)
  1> app//org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:227)
  1> app//org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:172)
  1> app//org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129)
  1> app//org.elasticsearch.nio.NioSelectorGroup$$Lambda$1589/0x00000001006f1840.run(Unknown Source)
  1> [email protected]/java.lang.Thread.run(Thread.java:834)
  1> [2019-06-19T12:43:22,947][WARN ][o.e.t.n.MockNioTransport ] [node_t2] Potentially blocked execution on network thread [elasticsearch[node_t2][transport_worker][T#2]] [1600 milliseconds]: 
  1> [email protected]/sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
  1> [email protected]/sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:533)
  1> [email protected]/sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:285)
  1> app//org.elasticsearch.nio.ChannelFactory$RawChannelFactory$$Lambda$1656/0x0000000100731c40.run(Unknown Source)
  1> [email protected]/java.security.AccessController.doPrivileged(Native Method)
  1> app//org.elasticsearch.nio.ChannelFactory$RawChannelFactory.accept(ChannelFactory.java:223)
  1> app//org.elasticsearch.nio.ChannelFactory$RawChannelFactory.acceptNioChannel(ChannelFactory.java:180)
  1> app//org.elasticsearch.nio.ChannelFactory.acceptNioChannel(ChannelFactory.java:55)
  1> app//org.elasticsearch.nio.ServerChannelContext.acceptChannels(ServerChannelContext.java:47)
  1> app//org.elasticsearch.nio.EventHandler.acceptChannel(EventHandler.java:45)
  1> app//org.elasticsearch.transport.nio.TestEventHandler.acceptChannel(TestEventHandler.java:51)
  1> app//org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:227)
  1> app//org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:172)
  1> app//org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129)
  1> app//org.elasticsearch.nio.NioSelectorGroup$$Lambda$1589/0x00000001006f1840.run(Unknown Source)
  1> [email protected]/java.lang.Thread.run(Thread.java:834)
1> [2019-06-19T12:43:23,190][DEBUG][o.e.d.FileBasedSeedHostsProvider] [node_t0] seed addresses: [127.0.0.1:37605, 127.0.0.1:37675]
  1> [2019-06-19T12:43:23,193][DEBUG][o.e.d.FileBasedSeedHostsProvider] [node_t2] seed addresses: [127.0.0.1:45615, 127.0.0.1:37605]
  1> [2019-06-19T12:43:24,190][DEBUG][o.e.d.FileBasedSeedHostsProvider] [node_t0] seed addresses: [127.0.0.1:37605, 127.0.0.1:37675]
  1> [2019-06-19T12:43:24,193][DEBUG][o.e.d.FileBasedSeedHostsProvider] [node_t2] seed addresses: [127.0.0.1:45615, 127.0.0.1:37605]
  1> [2019-06-19T12:43:24,699][WARN ][o.e.t.n.MockNioTransport ] [node_t0] Potentially blocked execution on network thread [elasticsearch[node_t0][transport_worker][T#2]] [3445 milliseconds]: 
  1> [email protected]/sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
  1> [email protected]/sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:533)
  1> [email protected]/sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:285)
  1> app//org.elasticsearch.nio.ChannelFactory$RawChannelFactory$$Lambda$1656/0x0000000100731c40.run(Unknown Source)
  1> [email protected]/java.security.AccessController.doPrivileged(Native Method)
  1> app//org.elasticsearch.nio.ChannelFactory$RawChannelFactory.accept(ChannelFactory.java:223)
  1> app//org.elasticsearch.nio.ChannelFactory$RawChannelFactory.acceptNioChannel(ChannelFactory.java:180)
  1> app//org.elasticsearch.nio.ChannelFactory.acceptNioChannel(ChannelFactory.java:55)
  1> app//org.elasticsearch.nio.ServerChannelContext.acceptChannels(ServerChannelContext.java:47)
  1> app//org.elasticsearch.nio.EventHandler.acceptChannel(EventHandler.java:45)
  1> app//org.elasticsearch.transport.nio.TestEventHandler.acceptChannel(TestEventHandler.java:51)
  1> app//org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:227)
  1> app//org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:172)
  1> app//org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129)
  1> app//org.elasticsearch.nio.NioSelectorGroup$$Lambda$1589/0x00000001006f1840.run(Unknown Source)
  1> [email protected]/java.lang.Thread.run(Thread.java:834)
  1> [2019-06-19T12:43:24,948][WARN ][o.e.t.n.MockNioTransport ] [node_t2] Potentially blocked execution on network thread [elasticsearch[node_t2][transport_worker][T#2]] [3601 milliseconds]: 
  1> [email protected]/sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
  1> [email protected]/sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:533)
  1> [email protected]/sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:285)
  1> app//org.elasticsearch.nio.ChannelFactory$RawChannelFactory$$Lambda$1656/0x0000000100731c40.run(Unknown Source)
  1> [email protected]/java.security.AccessController.doPrivileged(Native Method)
  1> app//org.elasticsearch.nio.ChannelFactory$RawChannelFactory.accept(ChannelFactory.java:223)
  1> app//org.elasticsearch.nio.ChannelFactory$RawChannelFactory.acceptNioChannel(ChannelFactory.java:180)
  1> app//org.elasticsearch.nio.ChannelFactory.acceptNioChannel(ChannelFactory.java:55)
  1> app//org.elasticsearch.nio.ServerChannelContext.acceptChannels(ServerChannelContext.java:47)
  1> app//org.elasticsearch.nio.EventHandler.acceptChannel(EventHandler.java:45)
  1> app//org.elasticsearch.transport.nio.TestEventHandler.acceptChannel(TestEventHandler.java:51)
  1> app//org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:227)
  1> app//org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:172)
  1> app//org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129)
  1> app//org.elasticsearch.nio.NioSelectorGroup$$Lambda$1589/0x00000001006f1840.run(Unknown Source)
  1> [email protected]/java.lang.Thread.run(Thread.java:834)
  1> [2019-06-19T12:43:25,190][DEBUG][o.e.d.FileBasedSeedHostsProvider] [node_t0] seed addresses: [127.0.0.1:37605, 127.0.0.1:37675]
  1> [2019-06-19T12:43:25,193][DEBUG][o.e.d.FileBasedSeedHostsProvider] [node_t2] seed addresses: [127.0.0.1:45615, 127.0.0.1:37605]

@DaveCTurner DaveCTurner removed their assignment Jun 20, 2019
@ywelsch
Contributor

ywelsch commented Jun 20, 2019

Relates #43387

@original-brownbear original-brownbear self-assigned this Jun 20, 2019
@original-brownbear
Member

original-brownbear commented Jun 20, 2019

@ywelsch yeah, it certainly looks like suspending the node's threads is most likely to blame here. It seems to me that's the only action we take that could affect multiple selectors at once. I'll look into this shortly; if I remember correctly, there were some issues with suspending threads and socket channels in JDK 8.
On second thought: I'm not 100% sure that's true actually ... if you look at the log in #43387 you can see the selector loop getting slower and slower until it finally blocks for at least 30s (then it gets interrupted by the test). That gradual slowdown could be a coincidence, but it looks suspicious as well. Regardless, on it :)

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jun 20, 2019
* Let's log the state of the thread to find out if it's dead-locked or just stuck after being suspended
* Relates elastic#43392
@original-brownbear
Member

I think the added logging of the stuck thread's state in #43424 should suffice to figure out if this is a result of the thread suspend call.
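For reference, a rough sketch of the kind of diagnostic being referred to (illustrative only, not the actual change in #43424): when the watchdog notices a slow network thread, logging Thread.getState() alongside the stack trace helps tell a thread blocked on a monitor (BLOCKED/WAITING) apart from one that was suspended mid-run, which typically still reports RUNNABLE.

    import org.apache.logging.log4j.Logger;

    // Hypothetical helper, not the actual #43424 change: dump a slow network thread's
    // state together with its stack trace so a suspended-but-RUNNABLE thread can be
    // distinguished from one that is blocked on a monitor or otherwise dead-locked.
    static void logStuckThread(Logger logger, Thread worker, long blockedMillis) {
        final StringBuilder stack = new StringBuilder();
        for (StackTraceElement element : worker.getStackTrace()) {
            stack.append("\n\t").append(element);
        }
        logger.warn("Potentially blocked execution on network thread [{}] [{} milliseconds], thread state [{}]:{}",
            worker.getName(), blockedMillis, worker.getState(), stack);
    }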

original-brownbear added a commit that referenced this issue Jun 20, 2019
* Let's log the state of the thread to find out if it's dead-locked or just stuck after being suspended
* Relates #43392
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jun 20, 2019
* Let's log the state of the thread to find out if it's dead-locked or just stuck after being suspended
* Relates elastic#43392
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jun 20, 2019
* Let's log the state of the thread to find out if it's dead-locked or just stuck after being suspended
* Relates elastic#43392
original-brownbear added a commit that referenced this issue Jun 20, 2019
* Let's log the state of the thread to find out if it's dead-locked or just stuck after being suspended
* Relates #43392
original-brownbear added a commit that referenced this issue Jun 20, 2019
* Let's log the state of the thread to find out if it's dead-locked or just stuck after being suspended
* Relates #43392
@ywelsch
Contributor

ywelsch commented Jun 24, 2019

Failure tracked in #43387

@ywelsch ywelsch closed this as completed Jun 24, 2019