[Zen2] Add safety phase to CoordinatorTests #34241

Merged
merged 34 commits into from
Oct 4, 2018

DaveCTurner
Contributor

Today's CoordinatorTests have a limited amount of randomisation in how things
are scheduled. However, to be fully confident in Zen2's liveness we require the
system to stabilise after any permitted sequence of events. We can achieve
this by running the system in a much more random fashion for a while, with much
larger variation in when things are scheduled (simulating GC pressure and
network disruption) and then continuing to assert that the system stabilises as
we expect. When running randomly, we do not expect to make significant progress
and merely verify that no safety property is violated.

This change introduces the runRandomly() test method which implements this
idea. It also fixes a handful of liveness bugs that this first version of
runRandomly() exposed.
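
The two-phase idea described above can be sketched roughly as follows. This is a minimal toy model, not the code in this change: `ToyCluster`, `step`, `assertSafe`, `stabilise` and `isStable` are hypothetical names, and the "safety property" here (no follower is ever ahead of the leader) is a stand-in for the real Zen2 invariants. Only the name `runRandomly` is borrowed from the description.

```java
import java.util.Random;

// Hypothetical toy model: ToyCluster and its methods are illustrative names,
// not part of CoordinatorTests or Elasticsearch.
final class ToyCluster {
    private final Random random;
    private final int[] followerVersions;
    private int leaderVersion = 0;

    ToyCluster(int followerCount, long seed) {
        this.random = new Random(seed); // seeded so that a failure can be replayed
        this.followerVersions = new int[followerCount];
    }

    // One simulated step: the leader may publish a new version, and each follower
    // receives the latest version unless the simulated network drops the message.
    void step(double dropProbability) {
        if (random.nextBoolean()) {
            leaderVersion++;
        }
        for (int i = 0; i < followerVersions.length; i++) {
            if (random.nextDouble() >= dropProbability) {
                followerVersions[i] = leaderVersion;
            }
        }
    }

    // Safety property of the toy model: no follower is ever ahead of the leader.
    void assertSafe() {
        for (int version : followerVersions) {
            if (version > leaderVersion) {
                throw new AssertionError("follower ahead of leader: " + version + " > " + leaderVersion);
            }
        }
    }

    // Safety phase: heavy disruption; progress is not expected, only safety is checked.
    void runRandomly(int steps) {
        for (int i = 0; i < steps; i++) {
            step(0.9);
            assertSafe();
        }
    }

    // Stabilisation phase: disruption stops and the system is given time to settle.
    void stabilise(int steps) {
        for (int i = 0; i < steps; i++) {
            step(0.0);
            assertSafe();
        }
    }

    // Liveness check of the toy model: every follower has caught up with the leader.
    boolean isStable() {
        for (int version : followerVersions) {
            if (version != leaderVersion) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        ToyCluster cluster = new ToyCluster(3, 42L);
        cluster.runRandomly(1000); // safety only
        cluster.stabilise(10);     // then require convergence
        System.out.println("stable after disruption: " + cluster.isStable());
    }
}
```

The point of the split is that the random phase asserts only the safety property on every step, while the convergence check is made only after the disruption has stopped.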

@DaveCTurner DaveCTurner added >enhancement v7.0.0 :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Oct 2, 2018
@DaveCTurner DaveCTurner requested a review from ywelsch October 2, 2018 17:36
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

Contributor Author

@DaveCTurner DaveCTurner left a comment

This change is not perfect: for instance, it fails with the following command line because there is no lag detection. I was hoping we could avoid lag detection for now, but that hope was misplaced. I'm not sure whether to avoid asserting the lack of lag or just to live with this failure for now.

./gradlew :server:test -Dtests.class=org.elasticsearch.cluster.coordination.CoordinatorTests -Dtests.jvm.argline=-Dhppc.bitmixer=DETERMINISTIC -Dtests.seed=F3285F709714F4:EBE3359BD09053D5 -Dtests.method=testNodesJoinAfterStableCluster

- use isConnectedPair rather than looking at disconnected/blackholed sets
- don't expect the follower to have a good state (no lag detection)
- check that the leader's state is exactly the nodes to which it is connected
Contributor

@ywelsch ywelsch left a comment

I've left a few comments. Looks great already!

@DaveCTurner
Contributor Author

I've addressed all the comments so this is worth another look @ywelsch.

Contributor

@ywelsch ywelsch left a comment

LGTM

@@ -347,6 +347,9 @@ void addNodes(int newNodesCount) {

void runRandomly() {

assert disconnectedNodes.isEmpty() : "may reconnect disconnected nodes, probably unexpected: " + disconnectedNodes;
Contributor

maybe use hamcrest matchers? assertThat(disconnectedNodes, empty())?

Contributor Author

Sure, I pushed a9f3e00.

@ywelsch ywelsch mentioned this pull request Oct 3, 2018
61 tasks
@DaveCTurner
Contributor Author

I merged the latest good master into zen2 and from there into here, hoping this'll pass.

As of 29ad624 the tests now seem robust but not 100% solid. After 1700 iterations we found a failure with:

./gradlew :server:test -Dtests.class=org.elasticsearch.cluster.coordination.CoordinatorTests -Dtests.method=testUnresponsiveLeaderDetectedEventually -Dtests.jvm.argline="-Dhppc.bitmixer=DETERMINISTIC" -Dtests.seed=C38D71BEBD582472:607698454084054E

This fails when forming a 4-node cluster - one of the nodes managed to end up in a term higher than the rest of them, and we lack term bumping so there's no way for it to recover.
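
To illustrate why a node stuck in a higher term cannot recover without term bumping, here is a rough toy sketch. It is not the Zen2 implementation: `TermBumpingSketch`, `accepts` and `bumpTerm` are hypothetical names, and the acceptance rule (a node accepts a leader only if the leader's term is at least its own) is a simplifying assumption. A real protocol would also need a fresh election in the new term, which is omitted here.

```java
// Hypothetical toy model of "term bumping"; not code from Elasticsearch.
final class TermBumpingSketch {

    // Assumed rule: a node accepts a leader's message only if the leader's
    // term is at least as high as the node's own term.
    static boolean accepts(long nodeTerm, long leaderTerm) {
        return leaderTerm >= nodeTerm;
    }

    // Term bumping: if the leader has observed a term higher than its own,
    // it moves to a term strictly above the highest term it has seen.
    static long bumpTerm(long leaderTerm, long[] observedTerms) {
        long maxSeen = leaderTerm;
        for (long term : observedTerms) {
            maxSeen = Math.max(maxSeen, term);
        }
        return maxSeen > leaderTerm ? maxSeen + 1 : leaderTerm;
    }

    public static void main(String[] args) {
        long leaderTerm = 3;
        long[] nodeTerms = {3, 3, 3, 5}; // 4-node cluster; the last node has drifted ahead

        // Without bumping, the node in term 5 keeps rejecting the term-3 leader.
        System.out.println("accepted before bump: " + accepts(nodeTerms[3], leaderTerm));

        // With bumping, the leader moves past term 5 and the node can accept it again.
        long bumped = bumpTerm(leaderTerm, nodeTerms);
        System.out.println("bumped term: " + bumped);
        System.out.println("accepted after bump: " + accepts(nodeTerms[3], bumped));
    }
}
```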

@DaveCTurner DaveCTurner merged commit c6b0f08 into elastic:zen2 Oct 4, 2018
@DaveCTurner DaveCTurner deleted the 2018-10-02-run-randomly branch October 4, 2018 06:40
Labels
:Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement v7.0.0-beta1