Skip to content

Ensure MockRepository is Unblocked on Node Close #62711

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

original-brownbear
Copy link
Member

RepositoriesService#doClose was never called which lead to
mock repositories not unblocking until the ThreadPool interrupts
all threads. Thus stopping a node that is blocked on a mock repository operation wastes 10s
in each test that does it (which is quite a few as it turns out).

`RepositoriesService#doClose` was never called which lead to
mock repositories not unblocking until the `ThreadPool` interrupts
all threads. Thus stopping a node that is blocked on a mock repository operation wastes `10s`
in each test that does it (which is quite a few as it turns out).
@original-brownbear original-brownbear added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 v7.10.0 labels Sep 21, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Sep 21, 2020
if (blockExecution() && waitAfterUnblock > 0) {
final boolean wasBlocked = blockExecution();
if (wasBlocked && lifecycle.stoppedOrClosed()) {
throw new IOException("already closed");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We didn't throw here before but then again we only got here when all the threads were already interrupted -> I figured throwing here keeps things nice and deterministic.

@original-brownbear original-brownbear marked this pull request as ready for review September 21, 2020 17:18
logger.info("--> stopping master node");
internalCluster().stopCurrentMasterNode();

ensureStableCluster(nodeCount - 1);
Copy link
Member Author

@original-brownbear original-brownbear Sep 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to actually wait here for the cluster change to be fully registerd, otherwise we just randomly pick a node that hasn't yet seen the cleanup in progress in the CS and fail on the leaked running cleanup.
Obviously, there's a bit of a risk with this change in general and it might lead to more failures that need a check like this added now because they implicitly relied on the 10s wait when closing a blocked node but IMO it's worth it given the almost 10s per affected test (and it's quite a few) savings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can maybe run this PR on CI multiple times before merging, just to catch the most failing tests if any. But I agree with you, it's better to not have test relying on the implicit 10s.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can maybe run this PR on CI multiple times before merging, just to catch the most failing tests if any

Ran it all night on my local CI :D only shook out an endless series of #62713 for now :) I'm more worried about some low-frequency timing issues (from request retries) but now that we're aware of it, it should be easy to track those down if they actually occur :)

@original-brownbear
Copy link
Member Author

Jenkins run elasticsearch-ci/packaging-sample-windows

1 similar comment
@original-brownbear
Copy link
Member Author

Jenkins run elasticsearch-ci/packaging-sample-windows

Copy link
Member

@tlrx tlrx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice find @original-brownbear ! I'm surprised we never noticed this before.

logger.info("--> stopping master node");
internalCluster().stopCurrentMasterNode();

ensureStableCluster(nodeCount - 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can maybe run this PR on CI multiple times before merging, just to catch the most failing tests if any. But I agree with you, it's better to not have test relying on the implicit 10s.

@original-brownbear
Copy link
Member Author

Nice find @original-brownbear ! I'm surprised we never noticed this before.

We (both you and I were involved) did in #48020 but for whatever reason mixed up stop and close there so that fix never fully worked out (I can't for the life of me figure out why I was able to reproduce this back then but then messed up the fix in this subtle way ... sorry about that, now it should be all good though :)).

@original-brownbear original-brownbear merged commit 86ba0b2 into elastic:master Sep 22, 2020
@original-brownbear original-brownbear deleted the faster-mock-repo-close branch September 22, 2020 08:04
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Sep 22, 2020
`RepositoriesService#doClose` was never called which lead to
mock repositories not unblocking until the `ThreadPool` interrupts
all threads. Thus stopping a node that is blocked on a mock repository operation wastes `10s`
in each test that does it (which is quite a few as it turns out).
original-brownbear added a commit that referenced this pull request Sep 22, 2020
`RepositoriesService#doClose` was never called which lead to
mock repositories not unblocking until the `ThreadPool` interrupts
all threads. Thus stopping a node that is blocked on a mock repository operation wastes `10s`
in each test that does it (which is quite a few as it turns out).
@original-brownbear original-brownbear restored the faster-mock-repo-close branch December 6, 2020 19:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test Issues or PRs that are addressing/adding tests v7.10.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants