[Zen2] Add safety phase to CoordinatorTests #34241

DaveCTurner · 2018-10-02T17:36:17Z

Today's CoordinatorTests have a limited amount of randomisation in how things
are scheduled. However, to be fully confident in Zen2's liveness we require the
system to stabilise after any permitted sequence of events. We can achieve
this by running the system in a much more random fashion for a while, with much
larger variation in when things are scheduled (simulating GC pressure and
network disruption) and then continuing to assert that the system stabilises as
we expect. When running randomly, we do not expect to make significant progress
and merely verify that no safety property is violated.

This change introduces the runRandomly() test method which implements this
idea. It also fixes a handful of liveness bugs that this first version of
runRandomly() exposed.
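To illustrate the two-phase idea described above, here is a minimal, self-contained sketch. It is a toy model and not the actual CoordinatorTests code: the class `RunRandomlySketch`, the voting rule, and the `dropChance` parameter are all assumptions made for illustration. The random phase only checks a safety property (at most one leader per term), and the stabilisation phase then checks that a leader emerges once disruption stops.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Toy model, not Zen2: each node votes at most once per term; a majority elects a leader. */
public class RunRandomlySketch {
    private final long[] lastVotedTerm;                              // per node: highest term voted in
    private final Map<Long, Integer> leaderByTerm = new HashMap<>(); // term -> elected node
    private final Random random;

    public RunRandomlySketch(int nodeCount, long seed) {
        this.lastVotedTerm = new long[nodeCount];
        this.random = new Random(seed); // seeded so that any failure can be replayed
    }

    /** One election attempt; each vote request is lost with probability dropChance. */
    public boolean runElection(int candidate, double dropChance) {
        long term = lastVotedTerm[candidate] + 1;
        int votes = 0;
        for (int voter = 0; voter < lastVotedTerm.length; voter++) {
            // The candidate always "hears" its own request; others may miss it.
            boolean delivered = voter == candidate || random.nextDouble() >= dropChance;
            if (delivered && lastVotedTerm[voter] < term) {
                lastVotedTerm[voter] = term;
                votes++;
            }
        }
        if (votes <= lastVotedTerm.length / 2) {
            return false; // no majority: a lack of progress, which the random phase tolerates
        }
        Integer previous = leaderByTerm.putIfAbsent(term, candidate);
        if (previous != null) { // safety property: at most one leader per term
            throw new AssertionError("two leaders in term " + term + ": " + previous + ", " + candidate);
        }
        return true;
    }

    /** Safety phase: heavy disruption, no progress expected, safety checked on every step. */
    public void runRandomly(int steps) {
        for (int i = 0; i < steps; i++) {
            runElection(random.nextInt(lastVotedTerm.length), random.nextDouble());
        }
    }

    /** Liveness phase: disruption stops, so a leader must emerge within a bounded number of attempts. */
    public long stabilise(int candidate, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            if (runElection(candidate, 0.0)) {
                return lastVotedTerm[candidate];
            }
        }
        throw new AssertionError("no leader after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        RunRandomlySketch cluster = new RunRandomlySketch(5, 42L);
        cluster.runRandomly(1000);
        long term = cluster.stabilise(0, 1001);
        System.out.println("node 0 leads in term " + term);
    }
}
```

In this toy model no term can exceed 1000 after 1000 random steps, so 1001 undisrupted attempts are always enough for node 0 to reach a term that every other node will vote in. The real tests assert much richer stabilisation conditions.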

elasticmachine · 2018-10-02T17:36:19Z

Pinging @elastic/es-distributed

DaveCTurner

This change is not perfect: for instance it fails on the following command line due to a lack of lag detection. I was hoping we could avoid lag detection for now, but this hope is misplaced. I'm not sure whether to avoid asserting the lack of lag or just to live with this for now.

./gradlew :server:test -Dtests.class=org.elasticsearch.cluster.coordination.CoordinatorTests -Dtests.jvm.argline=-Dhppc.bitmixer=DETERMINISTIC -Dtests.seed=F3285F709714F4:EBE3359BD09053D5 -Dtests.method=testNodesJoinAfterStableCluster

server/src/main/java/org/elasticsearch/common/util/concurrent/ListenableFuture.java

server/src/main/java/org/elasticsearch/discovery/PeerFinder.java

server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java

ywelsch

I've left a few comments. Looks great already!

server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java

server/src/main/java/org/elasticsearch/discovery/PeerFinder.java

server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java

DaveCTurner · 2018-10-03T14:09:14Z

I've addressed all the comments so this is worth another look @ywelsch.

ywelsch

LGTM

ywelsch · 2018-10-03T14:59:08Z

server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java

@@ -347,6 +347,9 @@ void addNodes(int newNodesCount) {

        void runRandomly() {

+            assert disconnectedNodes.isEmpty() : "may reconnect disconnected nodes, probably unexpected: " + disconnectedNodes;


maybe use hamcrest matchers? assertThat(disconnectedNodes, empty())?

Sure, I pushed a9f3e00.

DaveCTurner · 2018-10-03T21:22:00Z

I merged the latest good master into zen2 and from there into here, hoping this'll pass.

As of 29ad624 the tests now seem robust but not 100% solid. After 1700 iterations we found a failure with:

./gradlew :server:test -Dtests.class=org.elasticsearch.cluster.coordination.CoordinatorTests -Dtests.method=testUnresponsiveLeaderDetectedEventually -Dtests.jvm.argline="-Dhppc.bitmixer=DETERMINISTIC" -Dtests.seed=C38D71BEBD582472:607698454084054E

This fails when forming a 4-node cluster: one of the nodes ended up in a term higher than the rest, and since we lack term bumping there is no way for it to recover.
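As a rough illustration of what term bumping could mean here, the following hypothetical sketch computes the term a node would stand for election in after observing a peer with a higher term. The class and the `nextElectionTerm` helper are invented for this example and are not Elasticsearch code; the sketch assumes that winning an election in a term above the straggler's is what lets the cluster re-form.

```java
public class TermBumpSketch {
    /** Hypothetical helper: the term a node should use once it has seen maxTermSeen. */
    static long nextElectionTerm(long currentTerm, long maxTermSeen) {
        // If no peer is ahead, the current term is still usable and nothing changes.
        // Otherwise stand for election in a term strictly above everything observed.
        return maxTermSeen > currentTerm ? maxTermSeen + 1 : currentTerm;
    }

    public static void main(String[] args) {
        long clusterTerm = 3;   // assumed term shared by three of the four nodes
        long stragglerTerm = 5; // assumed term of the node that ran ahead
        System.out.println(nextElectionTerm(clusterTerm, stragglerTerm)); // prints 6
        System.out.println(nextElectionTerm(clusterTerm, 3));             // prints 3
    }
}
```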

DaveCTurner added 14 commits October 2, 2018 14:56

Introduce runRandomly

aa41f6e

Ignore publications from self if no longer leading

24b4a54

Must become follower after successfully processing publish request

29cab38

Fix log messages

289d39f

Must guard peer removal

a556784

Timeout RequestPeersRequests

91b0314

Reset port counter more frequently

19bd1ae

Remove failing assertion for now

4d717b9

Add comment describing why we reject a publication here

2b9883e

Re-qualify mode

c4ffe22

Revert

c66c4b0

No need for this guard without the affected assertion

a4d5c74

Add test that PeersRequest has a timeout

f777097

Line length

3a7fc1d

DaveCTurner added the >enhancement, v7.0.0, and :Distributed Coordination/Cluster Coordination labels Oct 2, 2018

DaveCTurner requested a review from ywelsch October 2, 2018 17:36

DaveCTurner commented Oct 2, 2018

DaveCTurner added 2 commits October 3, 2018 08:39

Merge branch 'zen2' into 2018-10-02-run-randomly

d1b2535

Rework stabilisation assertions

76c1d05

- use isConnectedPair rather than looking at disconnected/blackholed sets
- don't expect the follower to have a good state (no lag detection)
- check that the leader's state is exactly the nodes to which it is connected

ywelsch suggested changes Oct 3, 2018

DaveCTurner added 7 commits October 3, 2018 13:08

Term mismatch is rejected by CoordinatorState

0b180f0

Register setting

c14cf7a

Reduce default for discovery.request_peers_timeout to 3s

ff33d68

Execute directly, no need for mutex

e5bb9b3

Only runRandomly if no disruptions in place

af227e7

Only log changes to connected state

f8d5458

Revamp disruption-management logic

78757f9

DaveCTurner added 3 commits October 3, 2018 15:28

Fix description of publication

cf7047a

Fix order of requestId/action in request description

a470eca

Remove bogus assertion

dc2bddc

ywelsch approved these changes Oct 3, 2018

DaveCTurner added 8 commits October 3, 2018 16:17

No need for this method any more

257eb85

More state in assertion message

1d9a917

Stand down as master in rare condition

1a58b48

Revert "Stand down as master in rare condition"

b7a5328

This reverts commit 1a58b48.

Move assertion

34eb71b

Timeout join requests

98a24f4

Use assertThat for better error messages

a9f3e00

Merge branch 'zen2' into 2018-10-02-run-randomly

29ad624

ywelsch mentioned this pull request Oct 3, 2018

A new cluster coordination layer #32006

Closed

61 tasks

DaveCTurner merged commit c6b0f08 into elastic:zen2 Oct 4, 2018

DaveCTurner deleted the 2018-10-02-run-randomly branch October 4, 2018 06:40

DaveCTurner mentioned this pull request Oct 4, 2018

Zen2: Add Cluster State Applier #34257

Merged

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
