Ensure MockRepository is Unblocked on Node Close #62711

original-brownbear · 2020-09-21T15:28:40Z

`RepositoriesService#doClose` was never called, which led to mock repositories not unblocking until the `ThreadPool` interrupts all threads. Thus, stopping a node that is blocked on a mock repository operation wastes `10s` in each test that does it (which is quite a few, as it turns out).

elasticmachine · 2020-09-21T15:28:42Z

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

original-brownbear · 2020-09-21T15:29:27Z

test/framework/src/main/java/org/elasticsearch/snapshots/mockstore/MockRepository.java

-                if (blockExecution() && waitAfterUnblock > 0) {
+                final boolean wasBlocked = blockExecution();
+                if (wasBlocked && lifecycle.stoppedOrClosed()) {
+                    throw new IOException("already closed");


We didn't throw here before, but then again we only got here when all the threads were already interrupted, so I figured throwing here keeps things nice and deterministic.
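
The behavior described in this comment can be illustrated with a minimal, self-contained sketch. This is not the actual `MockRepository` code: the class `BlockingMockSketch`, the method `readBlob`, the `entered` latch, and the `demo` helper are all hypothetical names chosen for illustration, and a plain `closed` flag stands in for `lifecycle.stoppedOrClosed()`. The sketch only shows the idea that closing releases blocked callers immediately and that a caller released this way fails with an `IOException` instead of continuing.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for a blocking test repository; NOT the real MockRepository.
class BlockingMockSketch {
    private final Object lock = new Object();
    // Lets the demo know that a caller has reached the blocking point.
    private final CountDownLatch entered = new CountDownLatch(1);
    private boolean blocked = true;  // guarded by lock
    private boolean closed = false;  // guarded by lock; stands in for lifecycle.stoppedOrClosed()

    // Waits while the mock is blocked; returns true if the caller had to wait.
    private boolean blockExecution() throws InterruptedException {
        boolean waited = false;
        synchronized (lock) {
            while (blocked) {
                waited = true;
                entered.countDown();
                lock.wait(); // releases the lock until notifyAll() is called
            }
        }
        return waited;
    }

    // A mock operation that fails deterministically if it was released by close().
    void readBlob() throws IOException, InterruptedException {
        final boolean wasBlocked = blockExecution();
        synchronized (lock) {
            if (wasBlocked && closed) {
                throw new IOException("already closed");
            }
        }
    }

    // Closing unblocks all waiting callers right away instead of waiting for an interrupt.
    void close() {
        synchronized (lock) {
            closed = true;
            blocked = false;
            lock.notifyAll();
        }
    }

    // Blocks one worker thread, closes the mock, and reports how the worker finished.
    static String demo() {
        BlockingMockSketch repo = new BlockingMockSketch();
        String[] outcome = new String[1];
        Thread worker = new Thread(() -> {
            try {
                repo.readBlob();
                outcome[0] = "completed";
            } catch (IOException e) {
                outcome[0] = e.getMessage();
            } catch (InterruptedException e) {
                outcome[0] = "interrupted";
            }
        });
        worker.start();
        try {
            repo.entered.await(); // the worker is now inside the blocking loop
            repo.close();         // cannot take the lock until the worker is in wait()
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return outcome[0];
    }
}
```

In this sketch, `BlockingMockSketch.demo()` returns `"already closed"`: the worker is released by `close()` rather than by an interrupt, and it fails immediately instead of carrying on with the operation.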

original-brownbear · 2020-09-21T17:29:58Z

...lClusterTest/java/org/elasticsearch/repositories/blobstore/BlobStoreRepositoryCleanupIT.java

        logger.info("-->  stopping master node");
        internalCluster().stopCurrentMasterNode();

+        ensureStableCluster(nodeCount - 1);


We need to actually wait here for the cluster change to be fully registered; otherwise we just randomly pick a node that hasn't yet seen the cleanup in progress in the CS and fail on the leaked running cleanup.
Obviously, there's a bit of a risk with this change in general: it might lead to more failures in tests that now need a check like this added, because they implicitly relied on the 10s wait when closing a blocked node. IMO it's worth it given the savings of almost 10s per affected test (and it's quite a few).

tlrx · 2020-09-22T07:37:56Z

We can maybe run this PR on CI multiple times before merging, just to catch the most failing tests if any. But I agree with you, it's better to not have tests relying on the implicit 10s.

original-brownbear

We can maybe run this PR on CI multiple times before merging, just to catch the most failing tests if any

Ran it all night on my local CI :D It only shook out an endless series of #62713 for now :) I'm more worried about some low-frequency timing issues (from request retries), but now that we're aware of it, it should be easy to track those down if they actually occur :)
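
The general point of this thread, waiting explicitly for a state change instead of relying on an incidental delay, can be sketched in plain Java. This is not the Elasticsearch test framework: `WaitForConditionSketch`, `waitUntil`, and `demo` are hypothetical names, and the `AtomicInteger` merely stands in for the number of nodes a test observes after one node stops.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical helper illustrating an explicit wait; not part of any real test framework.
class WaitForConditionSketch {

    // Polls the condition until it holds or the timeout elapses; returns whether it held.
    static boolean waitUntil(BooleanSupplier condition, long timeoutMillis) {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out without the condition becoming true
            }
            try {
                Thread.sleep(10); // short pause between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    // Simulates a node leaving shortly after the test starts, then waits for the new count.
    static boolean demo() {
        final AtomicInteger visibleNodes = new AtomicInteger(3);
        Thread nodeStopper = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            visibleNodes.set(2); // the remaining nodes now see one node fewer
        });
        nodeStopper.start();
        // The test proceeds as soon as the expected state is visible.
        return waitUntil(() -> visibleNodes.get() == 2, 5_000);
    }
}
```

The design choice is the same one argued for above: the test continues as soon as the expected state is visible, so it neither wastes a fixed delay nor depends on that delay being long enough.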

original-brownbear · 2020-09-21T17:41:43Z

Jenkins run elasticsearch-ci/packaging-sample-windows

original-brownbear · 2020-09-21T18:02:22Z

Jenkins run elasticsearch-ci/packaging-sample-windows

tlrx

Nice find @original-brownbear ! I'm surprised we never noticed this before.

original-brownbear · 2020-09-22T08:03:40Z

Nice find @original-brownbear ! I'm surprised we never noticed this before.

We (both you and I were involved) did in #48020 but for whatever reason mixed up stop and close there so that fix never fully worked out (I can't for the life of me figure out why I was able to reproduce this back then but then messed up the fix in this subtle way ... sorry about that, now it should be all good though :)).

original-brownbear added the >test, :Distributed Coordination/Snapshot/Restore, v8.0.0, and v7.10.0 labels Sep 21, 2020

elasticmachine added the Team:Distributed (Obsolete) label Sep 21, 2020

original-brownbear marked this pull request as draft September 21, 2020 16:05

original-brownbear added WIP and removed v7.10.0 v8.0.0 labels Sep 21, 2020

original-brownbear added 2 commits September 21, 2020 18:09

Merge remote-tracking branch 'elastic/master' into faster-mock-repo-c…

93d3b74

…lose

fix

79578ac

original-brownbear marked this pull request as ready for review September 21, 2020 17:18

original-brownbear added v7.10.0 v8.0.0 and removed WIP labels Sep 21, 2020

original-brownbear requested a review from tlrx September 21, 2020 19:16

tlrx approved these changes Sep 22, 2020

original-brownbear merged commit 86ba0b2 into elastic:master Sep 22, 2020

original-brownbear deleted the faster-mock-repo-close branch September 22, 2020 08:04

original-brownbear mentioned this pull request Sep 22, 2020

Ensure MockRepository is Unblocked on Node Close (#62711) #62748

Merged

original-brownbear mentioned this pull request Oct 2, 2020

Unblock blocked repositories after test execution #61703

Closed

original-brownbear restored the faster-mock-repo-close branch December 6, 2020 19:01

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
