[CI] Rolling upgrade test failure - failed to obtain in-memory shard lock #48395

Closed
romseygeek opened this issue Oct 23, 2019 · 6 comments · Fixed by #48525
Labels
:Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. >test-failure Triaged test failures from CI

Comments

@romseygeek
Contributor

All tests in TokenBackwardsCompatibilityIT failed when upgrading from 7.4.1 to the latest 7.x, with 'there are still tasks running' failures at cleanup time. Digging into the logs, the root cause seems to be that one of the upgraded master nodes failed to start due to a 'failed to obtain in-memory shard lock' error.

Build scan is here: https://gradle-enterprise.elastic.co/s/qycq6tmil2t3w/console-log?task=:x-pack:qa:rolling-upgrade:v7.4.1%23upgradedClusterTest#L2698

This has happened before, a few days ago, also in TokenBackwardsCompatibilityIT, upgrading from 6.8.4: https://gradle-enterprise.elastic.co/s/46trrv4mlxcle

@romseygeek romseygeek added >test-failure Triaged test failures from CI :Distributed Indexing/Distributed A catch all label for anything in the Distributed Indexing Area. Please avoid if you can. :Security/Security Security issues without another label labels Oct 23, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-security (:Security/Security)

@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Distributed)

@jtibshirani
Contributor

I just observed a failed intake build with similar symptoms (a 'failed to obtain in-memory shard lock' error, along with a build-up of cluster state tasks). Build scan: https://gradle-enterprise.elastic.co/s/f4vu4xgxiwkcs/.

@ywelsch ywelsch removed the :Security/Security Security issues without another label label Oct 24, 2019
@ywelsch
Contributor

ywelsch commented Oct 24, 2019

It looks like something is DDoSing the master with persistent task updates.

@ywelsch
Contributor

ywelsch commented Oct 24, 2019

I've opened #48483 to get more details on the tasks that are accumulating.

@ywelsch
Contributor

ywelsch commented Oct 25, 2019

Looking closer at this, it seems to be a recurrence of #39982.

ywelsch added a commit that referenced this issue Oct 29, 2019
I think the problem is that the master is trying to relocate the "upgraded_scroll" shard back to
the node on which it was previously allocated, but to which it can't be allocated now because the
shard lock is held by an in-progress scroll. As the master keeps retrying (indefinitely, because
max_retries does not apply to relocations), it blocks any other lower-prioritized task from
completing, which leads to the rolling upgrade tests failing (see #48395).

Closes #48395
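
The starvation described in the commit message can be illustrated with a toy model. This is only a sketch under stated assumptions, not Elasticsearch's actual task queue or allocation code: `run_master`, the two priority levels, the task names, and the `budget` parameter are all hypothetical. It shows how a higher-priority task that always fails and is immediately re-queued can keep a lower-priority task from ever running, and how a retry cap lets the queue drain.

```python
import heapq
import itertools

# Hypothetical priority levels: the lower number is served first.
URGENT, NORMAL = 0, 1


def run_master(max_retries, budget=20):
    """Process a toy task queue for at most `budget` steps.

    Returns the names of the tasks in the order they were executed.
    `max_retries=None` models "no retry limit applies".
    """
    seq = itertools.count()  # tie-breaker: FIFO order within one priority
    queue = []
    heapq.heappush(queue, (URGENT, next(seq), "relocate-shard", 0))
    heapq.heappush(queue, (NORMAL, next(seq), "persistent-task-update", 0))

    executed = []
    for _step in range(budget):
        if not queue:
            break
        priority, _order, name, attempts = heapq.heappop(queue)
        executed.append(name)
        if name == "relocate-shard":
            # Assumption: the shard lock is still held, so this attempt fails.
            attempts += 1
            if max_retries is None or attempts < max_retries:
                # Re-queue the failed relocation at the same (higher) priority.
                heapq.heappush(queue, (priority, next(seq), name, attempts))
    return executed


if __name__ == "__main__":
    print(run_master(max_retries=None))  # only relocation attempts run
    print(run_master(max_retries=5))     # the lower-priority task runs at the end
```

In this model, the unbounded run spends its whole step budget on relocation attempts, while the capped run gives up on the relocation and then reaches the lower-priority task.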
4 participants