
Get test working #1


Conversation


@nicktindall nicktindall commented Jun 18, 2024

Initial work on getting the test working

This test now appears to reliably reproduce the issue that was fixed. I verified it by removing the request.masterTerm check from TransportMasterNodeAction.

Proposal/Query

I know I've missed something, but would it not be less fragile, and maybe easier to reason about, if instead of forwarding MasterNodeRequests on to the master recursively, we recorded how many hops we are from the coordinating node and threw a StaleRoutingStateException if we got to hop 1 and the receiver is not the master (i.e. if we were forwarded a MasterNodeRequest despite not being the master)?

The StaleRoutingStateException could include the routed-to node's current term so that the forwarding node could delay until it reached that term before attempting to forward again (or just apply some reasonable delay if the receiver is lagging).

With this approach we could put the coordinating node solely in charge of retrying a request that fails with a StaleRoutingStateException. I think that would fix the duplicate-tasks issue caused by deeply re-routed requests (which I think can still occur when a request is legitimately chasing a master around the cluster?).
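To make that concrete, here is a rough sketch of the shape I'm imagining. None of it exists today: the exception type, hopCount() on the request, and the handleForwarded hook are all invented for illustration.

import org.elasticsearch.ElasticsearchException;
import org.elasticsearch.action.support.master.MasterNodeRequest;
import org.elasticsearch.cluster.ClusterState;

// Sketch only: StaleRoutingStateException, request.hopCount() and handleForwarded(...) are
// invented for illustration and are not existing Elasticsearch APIs.
public class StaleRoutingStateException extends ElasticsearchException {

    private final long receiverTerm; // term observed on the node that refused the forwarded request

    public StaleRoutingStateException(long receiverTerm) {
        super("forwarded master-node request received by a non-master node in term [" + receiverTerm + "]");
        this.receiverTerm = receiverTerm;
    }

    public long receiverTerm() {
        return receiverTerm;
    }

    // Roughly what the receiving (hop-1) node would do:
    static void handleForwarded(MasterNodeRequest<?> request, ClusterState localState) {
        if (request.hopCount() >= 1 && localState.nodes().isLocalNodeElectedMaster() == false) {
            // We were forwarded the request but are not the master: fail back to the coordinating
            // node, which alone retries, rather than re-routing the request onwards ourselves.
            throw new StaleRoutingStateException(localState.metadata().coordinationMetadata().term());
        }
        // ...otherwise execute on the local (master) node as usual...
    }
}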

assertTrue("rerouted message received exactly once", reroutedMessageReceived.compareAndSet(false, true));
handler.messageReceived(request, channel, task);
}
);

@nicktindall nicktindall Jun 18, 2024

I changed the test to use TransportClusterHealthAction as the redirected request instead of ClusterStateAction, because InternalTestCluster#getMasterName() (used by masterClient etc.) triggers a ClusterStateAction of its own, which was invalidating the received-at-most-once assertions.

@nicktindall nicktindall force-pushed the 2024/05/22/versioned-master-node-requests branch from c5e4b9b to ce14d2e on June 18, 2024 05:35
@@ -36,39 +39,49 @@ protected Collection<Class<? extends Plugin>> nodePlugins() {
return CollectionUtils.appendToCopy(super.nodePlugins(), MockTransportService.TestPlugin.class);
}

@TestLogging(reason = "wip", value = "org.elasticsearch.transport.TransportService.tracer:TRACE")
@TestLogging(reason = "wip", value = "org.elasticsearch.action.support.master.TransportMasterNodeAction:DEBUG")

Left this in so you can see:

[2024-06-19T07:58:56,657][DEBUG][o.e.a.s.m.TransportMasterNodeAction] [node_s5] request routed to master in term [2] but local term is [1], waiting for local term bump

} finally {
newMasterClusterService.removeApplier(clusterFormationMonitor);
}
}

This feels a bit hacky, but I struggled to see how we could:

  1. Block the applier on newMaster to prevent it clearing the current master (in becomeCandidate)
  2. Allow newMaster to send a joinRequest to itself via the cluster applier (in JoinHelper)
  3. Block the applier on newMaster to prevent it applying the cluster state after it wins an election

I think (although with all the asynchrony, I'm not sure) that these things happen in the order above, so I don't see how we could block 1 and 3 while allowing 2 to go through. And even if we could, would the extra machinery just make the test harder to read and more brittle?

That means we can't rely on our own join, and because the old master sometimes(?) accepts its failed update we can't rely on its join either, so we can't reliably get a quorum in a three-node cluster. If we bump the cluster size up to five nodes we seem to succeed every time.
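For clarity, the sort of applier-blocking I mean is roughly the sketch below; the variable names are made up and it assumes the test holds newMaster's cluster service. Registering an applier like this stalls every apply on newMaster, which is exactly why it achieves 1 and 3 but also blocks the self-join in 2.

import java.util.concurrent.CountDownLatch;
import org.elasticsearch.cluster.ClusterStateApplier;

// inside the test method, roughly:
final CountDownLatch releaseApplier = new CountDownLatch(1);
// Parks the applier thread, so no cluster state is applied on newMaster until the test allows it.
final ClusterStateApplier blockingApplier = event -> safeAwait(releaseApplier);
newMasterClusterService.addStateApplier(blockingApplier);
// ... drive the election scenario ...
releaseApplier.countDown(); // let cluster state application resume
newMasterClusterService.removeApplier(blockingApplier);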

safeAwait(electionWonLatch);
// Wait until the old master has acknowledged the new master's election
safeAwait(previousMasterKnowsNewMasterIsElectedLatch);
logger.info("New master is elected");

Because we're preventing cluster state updates from being applied on newMaster, we watch the previous master's cluster state to determine when the election has been won. We need to wait for the previous master to apply the term bump anyway before we send the to-be-redirected request, so this serves multiple purposes.
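Concretely, the election-won check is along these lines (a sketch only; electionWonLatch, newMasterNodeId and previousMasterClusterService stand in for whatever names the test actually uses):

import java.util.concurrent.CountDownLatch;
import org.elasticsearch.cluster.ClusterStateListener;

final CountDownLatch electionWonLatch = new CountDownLatch(1);
// Watch the previous master's applied state: once it names newMaster as the elected master,
// the election has gone through and the previous master has applied the term bump.
final ClusterStateListener electionWonListener = event -> {
    if (newMasterNodeId.equals(event.state().nodes().getMasterNodeId())) {
        electionWonLatch.countDown();
    }
};
previousMasterClusterService.addListener(electionWonListener);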

@DaveCTurner DaveCTurner merged commit fc88ff4 into DaveCTurner:2024/05/22/versioned-master-node-requests Jul 2, 2024
@DaveCTurner (Owner)

I'm merging this into the PR branch so I can iterate on it a bit more

DaveCTurner pushed a commit that referenced this pull request Oct 29, 2024
DaveCTurner pushed a commit that referenced this pull request Jan 13, 2025
…uginFuncTest builds distribution from branches via archives extractedAssemble [bwcDistVersion: 8.3.0, bwcProject: staged, expectedAssembleTaskName: extractedAssemble, #1] elastic#119870
DaveCTurner pushed a commit that referenced this pull request Mar 14, 2025
…uginFuncTest builds distribution from branches via archives extractedAssemble [bwcDistVersion: 8.3.0, bwcProject: staged, expectedAssembleTaskName: extractedAssemble, #1] elastic#119870
DaveCTurner pushed a commit that referenced this pull request Mar 17, 2025
…uginFuncTest builds distribution from branches via archives extractedAssemble [bwcDistVersion: 8.3.0, bwcProject: staged, expectedAssembleTaskName: extractedAssemble, #1] elastic#119870