-
Notifications
You must be signed in to change notification settings - Fork 0
Get test working #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Get test working #1
Conversation
assertTrue("rerouted message received exactly once", reroutedMessageReceived.compareAndSet(false, true)); | ||
handler.messageReceived(request, channel, task); | ||
} | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the test to use TransportClusterHealthAction
as the redirected request instead of ClusterStateAction
because InternalTestCluster#getMasterName()
(used by masterClient
etc.) triggers a ClusterStateAction
, which was making the received-at-most-once assertions invalid.
c5e4b9b
to
ce14d2e
Compare
...nalClusterTest/java/org/elasticsearch/action/support/master/TransportMasterNodeActionIT.java
Show resolved
Hide resolved
@@ -36,39 +39,49 @@ protected Collection<Class<? extends Plugin>> nodePlugins() { | |||
return CollectionUtils.appendToCopy(super.nodePlugins(), MockTransportService.TestPlugin.class); | |||
} | |||
|
|||
@TestLogging(reason = "wip", value = "org.elasticsearch.transport.TransportService.tracer:TRACE") | |||
@TestLogging(reason = "wip", value = "org.elasticsearch.action.support.master.TransportMasterNodeAction:DEBUG") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left this in so you can see
[2024-06-19T07:58:56,657][DEBUG][o.e.a.s.m.TransportMasterNodeAction] [node_s5] request routed to master in term [2] but local term is [1], waiting for local term bump
...nalClusterTest/java/org/elasticsearch/action/support/master/TransportMasterNodeActionIT.java
Outdated
Show resolved
Hide resolved
} finally { | ||
newMasterClusterService.removeApplier(clusterFormationMonitor); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels a bit hacky, but I struggled to see how we could
- Block the applier on newMaster to prevent it clearing the current master (in
becomeCandidate
) - Allow newMaster to send a joinRequest to itself via the cluster applier (in
JoinHelper
) - Block the applier on newMaster to prevent it applying the cluster state after it wins an election
I think (although with all the asynchrony, I'm not sure) that the above things happen in the above order, so I don't know how we could block 1 and 3 whilst allowing 2 to go through - and if we could, would the extra machinery just make the test harder to read and more brittle?
That means we can't rely on our own join, and because the old master sometimes(?) accepts its failed update, we can't rely on its join, so we can't reliably get a quorum in a 3 node cluster. If we bump the cluster size up to 5 nodes we seem to succeed every time.
safeAwait(electionWonLatch); | ||
// Wait until the old master has acknowledged the new master's election | ||
safeAwait(previousMasterKnowsNewMasterIsElectedLatch); | ||
logger.info("New master is elected"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because we're preventing cluster state updates being applied on newMaster
, we watch the previous master's cluster state to determine when we've won the election. We need to wait for the previous master to apply the term bump anyway before we send the to-be-redirected request so this serves multiple purposes.
fc88ff4
into
DaveCTurner:2024/05/22/versioned-master-node-requests
I'm merging this into the PR branch so I can iterate on it a bit more |
…sion (#1…" (elastic#115827) This reverts commit 32dee6a.
…uginFuncTest builds distribution from branches via archives extractedAssemble [bwcDistVersion: 8.3.0, bwcProject: staged, expectedAssembleTaskName: extractedAssemble, #1] elastic#119870
…uginFuncTest builds distribution from branches via archives extractedAssemble [bwcDistVersion: 8.3.0, bwcProject: staged, expectedAssembleTaskName: extractedAssemble, #1] elastic#119870
…uginFuncTest builds distribution from branches via archives extractedAssemble [bwcDistVersion: 8.3.0, bwcProject: staged, expectedAssembleTaskName: extractedAssemble, #1] elastic#119870
Initial work on getting the test working
This test now appears to reliably reproduce the issue that was fixed. I verified it by removing the
request.masterTerm
check fromTransportMasterNodeAction
.Proposal/Query
I know I've missed something, but would it not be less fragile and maybe easier to reason about if instead of forwarding
MasterNodeRequest
s on to the master recursively, we recorded how many hops we are from the coordinating node and throw aStaleRoutingStateException
if we get to hop 1 and the receiver not the master? (i.e. if we were forwarded aMasterNodeRequest
despite us not being the master).The
StaleRoutingStateException
could include the routed-to node's current term so that the forwarding node could delay until it reached that term before attempting to forward again (or just apply some reasonable delay if the receiver is lagging).With this approach we could put the coordinating node solely in charge of retrying the request when it fails due to a
StaleRoutingStateException
which I think would fix the duplicate tasks issue resulting from a deeply re-routed request (which I think can still occur when a request is legitimately chasing a master around the cluster?).