-
Notifications
You must be signed in to change notification settings - Fork 25.2k
[Zen2] Add safety phase to CoordinatorTests #34241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Zen2] Add safety phase to CoordinatorTests #34241
Conversation
Pinging @elastic/es-distributed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is not perfect: for instance it fails on the following command line due to a lack of lag detection. I was hoping we could avoid lag detection for now, but this hope is misplaced. I'm not sure whether to avoid asserting the lack of lag or just to live with this for now.
./gradlew :server:test -Dtests.class=org.elasticsearch.cluster.coordination.CoordinatorTests -Dtests.jvm.argline=-Dhppc.bitmixer=DETERMINISTIC -Dtests.seed=F3285F709714F4:EBE3359BD09053D5 -Dtests.method=testNodesJoinAfterStableCluster
server/src/main/java/org/elasticsearch/common/util/concurrent/ListenableFuture.java
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java
Show resolved
Hide resolved
- use isConnectedPair rather than looking at disconnected/blackholed sets - don't expect the follower to have a good state (no lag detection) - check that the leader's state is exactly the nodes to which it is connected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left a few comments. Looks great already!
server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/cluster/coordination/Coordinator.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/discovery/PeerFinder.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/cluster/coordination/CoordinatorTests.java
Show resolved
Hide resolved
I've addressed all the comments so this is worth another look @ywelsch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -347,6 +347,9 @@ void addNodes(int newNodesCount) { | |||
|
|||
void runRandomly() { | |||
|
|||
assert disconnectedNodes.isEmpty() : "may reconnect disconnected nodes, probably unexpected: " + disconnectedNodes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe use hamcrest matchers? assertThat(disconnectedNodes, empty())
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I pushed a9f3e00.
This reverts commit 1a58b48.
I merged the latest good As of 29ad624 the tests now seem robust but not 100% solid. After 1700 iterations we found a failure with:
This fails when forming a 4-node cluster - one of the nodes managed to end up in a term higher than the rest of them, and we lack term bumping so there's no way for it to recover. |
Today's CoordinatorTests have a limited amount of randomisation in how things
are scheduled. However, to be fully confident in Zen2's liveness we require the
system to stabilise after any permitted sequence of events. We can achieve
this by running the system in a much more random fashion for a while, with much
larger variation in when things are scheduled (simulating GC pressure and
network disruption) and then continuing to assert that the system stabilises as
we expect. When running randomly, we do not expect to make significant progress
and merely verify that no safety property is violated.
This change introduces the runRandomly() test method which implements this
idea. It also fixes a handful of liveness bugs that this first version of
runRandomly() exposed.