Get test working #1

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Merged

DaveCTurner merged 6 commits into DaveCTurner:2024/05/22/versioned-master-node-requests from nicktindall:2024/05/22/versioned-master-node-requests

Jul 2, 2024

nicktindall commented Jun 18, 2024 •

edited

Loading

Initial work on getting the test working

This test now appears to reliably reproduce the issue that was fixed. I verified it by removing the request.masterTerm check from TransportMasterNodeAction.

Proposal/Query

I know I've missed something, but would it not be less fragile and maybe easier to reason about if instead of forwarding MasterNodeRequests on to the master recursively, we recorded how many hops we are from the coordinating node and throw a StaleRoutingStateException if we get to hop 1 and the receiver not the master? (i.e. if we were forwarded a MasterNodeRequest despite us not being the master).

The StaleRoutingStateException could include the routed-to node's current term so that the forwarding node could delay until it reached that term before attempting to forward again (or just apply some reasonable delay if the receiver is lagging).

With this approach we could put the coordinating node solely in charge of retrying the request when it fails due to a StaleRoutingStateException which I think would fix the duplicate tasks issue resulting from a deeply re-routed request (which I think can still occur when a request is legitimately chasing a master around the cluster?).

nicktindall commented

View reviewed changes

...nalClusterTest/java/org/elasticsearch/action/support/master/TransportMasterNodeActionIT.java

    
                                      assertTrue("rerouted message received exactly once", reroutedMessageReceived.compareAndSet(false, true));

                                      handler.messageReceived(request, channel, task);

                                  }

                              );

Author

nicktindall Jun 18, 2024 •

edited

Loading

I changed the test to use TransportClusterHealthAction as the redirected request instead of ClusterStateAction because InternalTestCluster#getMasterName() (used by masterClient etc.) triggers a ClusterStateAction, which was making the received-at-most-once assertions invalid.


          Get test working

ce14d2e

nicktindall force-pushed the 2024/05/22/versioned-master-node-requests branch from c5e4b9b to ce14d2e Compare

June 18, 2024 05:35

nicktindall commented

View reviewed changes

...nalClusterTest/java/org/elasticsearch/action/support/master/TransportMasterNodeActionIT.java Show resolved Hide resolved

nicktindall added 2 commits

June 19, 2024 09:52


          Remove unnecessary boolean

5d076eb


          Implement proper cleanup

a8d514d

nicktindall commented

View reviewed changes

...nalClusterTest/java/org/elasticsearch/action/support/master/TransportMasterNodeActionIT.java

                       return CollectionUtils.appendToCopy(super.nodePlugins(), MockTransportService.TestPlugin.class);
                   }
-                  @TestLogging(reason = "wip", value = "org.elasticsearch.transport.TransportService.tracer:TRACE")
+                  @TestLogging(reason = "wip", value = "org.elasticsearch.action.support.master.TransportMasterNodeAction:DEBUG")

Author

nicktindall Jun 19, 2024

Left this in so you can see

[2024-06-19T07:58:56,657][DEBUG][o.e.a.s.m.TransportMasterNodeAction] [node_s5] request routed to master in term [2] but local term is [1], waiting for local term bump

nicktindall commented

View reviewed changes

...nalClusterTest/java/org/elasticsearch/action/support/master/TransportMasterNodeActionIT.java Outdated Show resolved Hide resolved

nicktindall added 3 commits

June 19, 2024 10:32


          Localise cleanup definitions (add more cleanup)

40710e8


          Factor out state applier blocking

f80d012


          More tidy

nicktindall commented

View reviewed changes

...nalClusterTest/java/org/elasticsearch/action/support/master/TransportMasterNodeActionIT.java

+                      } finally {
+                          newMasterClusterService.removeApplier(clusterFormationMonitor);
+                      }
+                  }

Author

nicktindall Jun 21, 2024

This feels a bit hacky, but I struggled to see how we could

Block the applier on newMaster to prevent it clearing the current master (in becomeCandidate)
Allow newMaster to send a joinRequest to itself via the cluster applier (in JoinHelper)
Block the applier on newMaster to prevent it applying the cluster state after it wins an election

I think (although with all the asynchrony, I'm not sure) that the above things happen in the above order, so I don't know how we could block 1 and 3 whilst allowing 2 to go through - and if we could, would the extra machinery just make the test harder to read and more brittle?

That means we can't rely on our own join, and because the old master sometimes(?) accepts its failed update, we can't rely on its join, so we can't reliably get a quorum in a 3 node cluster. If we bump the cluster size up to 5 nodes we seem to succeed every time.

nicktindall commented

View reviewed changes

...nalClusterTest/java/org/elasticsearch/action/support/master/TransportMasterNodeActionIT.java

-                          safeAwait(electionWonLatch);
+                          // Wait until the old master has acknowledged the new master's election
+                          safeAwait(previousMasterKnowsNewMasterIsElectedLatch);
+                          logger.info("New master is elected");

Author

nicktindall Jun 21, 2024

because we're preventing cluster state updates being applied on newMaster, we watch the previous master's cluster state to determine when we've won the election. We need to wait for the previous master to apply the term bump anyway before we send the to-be-redirected request so this serves multiple purposes.

DaveCTurner merged commit fc88ff4 into DaveCTurner:2024/05/22/versioned-master-node-requests

Owner

DaveCTurner commented Jul 2, 2024

I'm merging this into the PR branch so I can iterate on it a bit more

DaveCTurner pushed a commit that referenced this pull request


          Revert "[test] Dynamically pick up the upper bound snapshot index ver…

7feb4d5

…sion (#1…" (elastic#115827)

This reverts commit 32dee6a.

DaveCTurner pushed a commit that referenced this pull request


          Mute org.elasticsearch.gradle.internal.InternalDistributionBwcSetupPl…

b36b8fc

…uginFuncTest builds distribution from branches via archives extractedAssemble [bwcDistVersion: 8.3.0, bwcProject: staged, expectedAssembleTaskName: extractedAssemble, #1] elastic#119870

DaveCTurner pushed a commit that referenced this pull request


          Mute org.elasticsearch.gradle.internal.InternalDistributionBwcSetupPl…

63558cb

…uginFuncTest builds distribution from branches via archives extractedAssemble [bwcDistVersion: 8.3.0, bwcProject: staged, expectedAssembleTaskName: extractedAssemble, #1] elastic#119870

DaveCTurner pushed a commit that referenced this pull request


          Mute org.elasticsearch.gradle.internal.InternalDistributionBwcSetupPl…

52a7652

…uginFuncTest builds distribution from branches via archives extractedAssemble [bwcDistVersion: 8.3.0, bwcProject: staged, expectedAssembleTaskName: extractedAssemble, #1] elastic#119870

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet