Ensure MockRepository is Unblocked on Node Close #62711
`RepositoriesService#doClose` was never called, which led to mock repositories not unblocking until the `ThreadPool` interrupted all threads. As a result, stopping a node that is blocked on a mock repository operation wastes `10s` in each test that does it (which is quite a few, as it turns out).
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
if (blockExecution() && waitAfterUnblock > 0) {
final boolean wasBlocked = blockExecution();
if (wasBlocked && lifecycle.stoppedOrClosed()) {
    throw new IOException("already closed");
We didn't throw here before but then again we only got here when all the threads were already interrupted -> I figured throwing here keeps things nice and deterministic.
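The unblock-on-close behavior discussed above can be sketched in isolation. This is a minimal illustration, not the actual `MockRepository` code: the class `BlockingRepoSketch` and its methods `readBlob`, `close`, and `demo` are hypothetical names, and plain Java monitors stand in for the real component lifecycle. It shows a blocked operation being released by `close()` and then failing deterministically with `IOException("already closed")` instead of waiting for a thread interrupt.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for a blocking mock repository; NOT the real MockRepository.
final class BlockingRepoSketch {
    // Signals that a caller has reached the blocked state (used only by the demo).
    private final CountDownLatch waiting = new CountDownLatch(1);
    private boolean blocked = true;  // guarded by "this"
    private boolean closed = false;  // guarded by "this"

    /** Waits while the repository is blocked; returns true if the caller had to wait. */
    private synchronized boolean blockExecution() {
        boolean wasBlocked = false;
        try {
            while (blocked) {
                wasBlocked = true;
                waiting.countDown();
                wait(); // releases the monitor until close() calls notifyAll()
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return wasBlocked;
    }

    private synchronized boolean isClosed() {
        return closed;
    }

    /** A repository operation that a test may block. */
    void readBlob() throws IOException {
        final boolean wasBlocked = blockExecution();
        if (wasBlocked && isClosed()) {
            // Fail deterministically rather than continuing on a closed repository.
            throw new IOException("already closed");
        }
    }

    /** Called on close: mark the repository closed and release every waiting thread. */
    synchronized void close() {
        closed = true;
        blocked = false;
        notifyAll();
    }

    /** Blocks a worker on the repository, closes it, and returns the worker's failure message. */
    static String demo() {
        final BlockingRepoSketch repo = new BlockingRepoSketch();
        final String[] failure = new String[1];
        final Thread worker = new Thread(() -> {
            try {
                repo.readBlob();
            } catch (IOException e) {
                failure[0] = e.getMessage();
            }
        });
        worker.setDaemon(true);
        worker.start();
        try {
            repo.waiting.await(); // the worker is now inside blockExecution()
            repo.close();         // unblocks immediately, no interrupt needed
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return failure[0];
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "already closed"
    }
}
```

In this sketch the worker is released as soon as `close()` runs, so nothing depends on a later interrupt of the worker thread.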
logger.info("--> stopping master node");
internalCluster().stopCurrentMasterNode();

ensureStableCluster(nodeCount - 1);
We need to actually wait here for the cluster change to be fully registered; otherwise we just randomly pick a node that hasn't yet seen the cleanup in progress in the cluster state (CS) and fail on the leaked running cleanup.
Obviously, there's a bit of a risk with this change in general: it might lead to more failures that need a check like this added, because those tests implicitly relied on the 10s wait when closing a blocked node. IMO it's worth it given the savings of almost 10s per affected test (and it's quite a few).
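To illustrate the difference between an explicit wait and an implicit delay, here is a small generic sketch. It is not the Elasticsearch test framework's `ensureStableCluster`; the class `AwaitSketch` and the helper `awaitCondition` are hypothetical names, and the node counter in `main` is only a stand-in for a cluster-state check. The idea is that the test polls for the state it needs, bounded by a timeout, rather than relying on a shutdown that happens to take long enough.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical helper; a stand-in for an explicit "wait until the cluster has changed" check.
final class AwaitSketch {

    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true if the condition was observed, false on timeout or interrupt.
     */
    static boolean awaitCondition(BooleanSupplier condition, long timeoutMillis) {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out without seeing the expected state
            }
            try {
                Thread.sleep(10); // short pause between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Simulated node count: one "node" leaves on another thread.
        final AtomicInteger nodes = new AtomicInteger(3);
        new Thread(nodes::decrementAndGet).start();
        // Proceed only once the remaining nodes have been observed.
        System.out.println(awaitCondition(() -> nodes.get() == 2, 5_000));
    }
}
```

With a check like this, the test's correctness no longer depends on how long stopping the node takes.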
We can maybe run this PR on CI multiple times before merging, just to catch the most failing tests if any. But I agree with you, it's better to not have tests relying on the implicit 10s.
We can maybe run this PR on CI multiple times before merging, just to catch the most failing tests if any
Ran it all night on my local CI :D only shook out an endless series of #62713 for now :) I'm more worried about some low-frequency timing issues (from request retries) but now that we're aware of it, it should be easy to track those down if they actually occur :)
Jenkins run elasticsearch-ci/packaging-sample-windows
Nice find @original-brownbear ! I'm surprised we never noticed this before.
We (both you and I were involved) did in #48020 but for whatever reason mixed up