Integrate LeaderChecker with Coordinator #34049

DaveCTurner · 2018-09-25T13:51:09Z

This change ensures that follower nodes periodically check that their leader is
healthy, and that they elect a new leader if not.
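As a rough illustration of the behaviour described above, here is a minimal sketch of a follower-side check that triggers an election after repeated failures. It is not the actual LeaderChecker: the class name `LeaderHealthSketch`, the callback names, and the consecutive-failure threshold are all assumptions made for this example, and the scheduler that would call `runCheck()` once per interval is left out.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch only: none of these names are Elasticsearch APIs.
final class LeaderHealthSketch {
    private final BooleanSupplier leaderResponds; // one leader check; true means the leader answered
    private final Runnable onLeaderFailed;        // e.g. give up following and start an election
    private final int maxConsecutiveFailures;     // assumed threshold, not taken from the PR
    private int consecutiveFailures = 0;
    private boolean failed = false;

    LeaderHealthSketch(BooleanSupplier leaderResponds, Runnable onLeaderFailed, int maxConsecutiveFailures) {
        this.leaderResponds = leaderResponds;
        this.onLeaderFailed = onLeaderFailed;
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    // Intended to be called once per check interval by some scheduler (not shown).
    void runCheck() {
        if (failed) {
            return; // the failure has already been reported once
        }
        if (leaderResponds.getAsBoolean()) {
            consecutiveFailures = 0; // a healthy response resets the count
            return;
        }
        consecutiveFailures++;
        if (consecutiveFailures >= maxConsecutiveFailures) {
            failed = true;
            onLeaderFailed.run();
        }
    }
}
```

Counting consecutive failures rather than reacting to the first one is a design choice of this sketch, intended to tolerate a single dropped check.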

elasticmachine · 2018-09-25T13:51:12Z

Pinging @elastic/es-distributed

DaveCTurner · 2018-09-25T13:51:31Z

test/framework/src/main/java/org/elasticsearch/test/disruption/DisruptableMockTransport.java

@@ -94,7 +95,12 @@ public String toString() {
            public void run() {
                switch (getConnectionStatus(getLocalNode(), destination)) {
                    case BLACK_HOLE:
-                        logger.trace("dropping {}", requestDescription);
+                        if (action.equals(HANDSHAKE_ACTION_NAME)) {


This makes me unhappy, but I was unable to come up with a better idea. Thoughts?

ywelsch · 2018-09-25T16:16:54Z

server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java

+        if (leaderCheckScheduler != null) {
+            leaderCheckScheduler.close();
+        }
+        leaderCheckScheduler = leaderChecker.startLeaderChecker(leaderNode);


I'm confused. Does this mean we restart the leader checker on every incoming publication? We call becomeFollower on every incoming publication.

It did mean a new leader checker on each publication. I pushed be15266 to stop doing this. However, I can't think of a good way to reliably assert that we don't do this: both ways have pretty much the right liveness properties, and my other two ideas are:

- check we don't send another leader check immediately after each publication (not robust)
- check for reference equality of the leaderCheckScheduler object before/after a second publication (don't fancy exposing this).

> However I can't think of a good way to reliably assert that we don't do this

I don't have any good ideas here. I'll keep thinking about this. It should not block this PR though.
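To make the discussion above concrete, here is a minimal sketch of a guard that avoids restarting the leader checker when the node is already following the same leader. It is not the code from be15266: only the names `becomeFollower`, `leaderCheckScheduler` and `startLeaderChecker` appear in the diff; `FollowerSketch`, `CheckScheduler`, `lastKnownLeader`, `startedCheckers`, and the use of a String to identify the leader are assumptions for illustration.

```java
// Hypothetical stand-ins for the real classes discussed in this thread.
final class FollowerSketch {
    // Stand-in for whatever startLeaderChecker returns; close() stops further checks.
    static final class CheckScheduler {
        final String leaderNode;
        boolean closed = false;
        CheckScheduler(String leaderNode) { this.leaderNode = leaderNode; }
        void close() { closed = true; }
    }

    int startedCheckers = 0;                     // demonstration counter only
    private String lastKnownLeader;              // assumption: leader identified by a String
    private CheckScheduler leaderCheckScheduler;

    private CheckScheduler startLeaderChecker(String leaderNode) {
        startedCheckers++;
        return new CheckScheduler(leaderNode);
    }

    // Called on every incoming publication, as noted in the review comment.
    void becomeFollower(String leaderNode) {
        if (leaderCheckScheduler != null && leaderNode.equals(lastKnownLeader)) {
            return; // already checking the right leader, so keep the existing scheduler
        }
        if (leaderCheckScheduler != null) {
            leaderCheckScheduler.close();
        }
        lastKnownLeader = leaderNode;
        leaderCheckScheduler = startLeaderChecker(leaderNode);
    }

    CheckScheduler currentScheduler() { return leaderCheckScheduler; }
}
```

In this sketch the reference-equality idea from the thread amounts to calling `becomeFollower` twice with the same leader and comparing `currentScheduler()` before and after; as noted above, doing so requires exposing the scheduler, which is the drawback already raised.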

server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java

ywelsch · 2018-09-25T16:48:29Z

test/framework/src/main/java/org/elasticsearch/test/disruption/DisruptableMockTransport.java

I've pushed a7a76c0 which refactors the class a bit to make it more extensible. Let me know if you like that better.

DaveCTurner · 2018-09-26T08:51:25Z

Thanks for a7a76c0 (for some reason I can't reply to this comment inline). I moved the log statement in 98a8236 but otherwise LGTM.

ywelsch

LGTM

Integrate LeaderChecker with Coordinator

93cbcb4

DaveCTurner added the >enhancement, v7.0.0, and :Distributed Coordination/Cluster Coordination labels Sep 25, 2018

DaveCTurner requested a review from ywelsch September 25, 2018 13:51

DaveCTurner commented Sep 25, 2018

ywelsch mentioned this pull request Sep 25, 2018

A new cluster coordination layer #32006

Closed

61 tasks

DaveCTurner and others added 3 commits September 25, 2018 16:22

Fix serialization of TestResponse

5264468

Fix NodeJoinTests

24af1ce

Refactor DisruptableMockTransport to allow better extension

a7a76c0

ywelsch reviewed Sep 25, 2018

DaveCTurner added 5 commits September 26, 2018 09:24

Helper method

509e242

Assert

3de5905

Rename partitioned->blackholed

857013f

Move log message

98a8236

Do not restart leader checker if already following the right leader

be15266

ywelsch approved these changes Sep 26, 2018

DaveCTurner merged commit d995fc8 into elastic:zen2 Sep 26, 2018

DaveCTurner deleted the 2018-09-25-integrate-leader-checker branch September 26, 2018 11:18

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
