[CI] Rolling upgrade test failure - failed to obtain in-memory shard lock #48395

romseygeek · 2019-10-23T13:21:10Z

All tests in TokenBackwardsCompatibilityIT failed when upgrading from 7.4.1 to the latest 7.x, with 'there are still tasks running' failures at cleanup time. Digging into the logs, it seems that the root cause is that one of the upgraded master nodes failed to start due to a 'failed to obtain in-memory shard lock' error.

Build scan is here: https://gradle-enterprise.elastic.co/s/qycq6tmil2t3w/console-log?task=:x-pack:qa:rolling-upgrade:v7.4.1%23upgradedClusterTest#L2698

This has happened before, a few days ago, also in TokenBackwardsCompatibilityIT, upgrading from 6.8.4: https://gradle-enterprise.elastic.co/s/46trrv4mlxcle

elasticmachine · 2019-10-23T13:21:11Z

Pinging @elastic/es-security (:Security/Security)

elasticmachine · 2019-10-23T13:21:12Z

Pinging @elastic/es-distributed (:Distributed/Distributed)

jtibshirani · 2019-10-24T03:09:27Z

I just observed a failed intake build with similar symptoms (a 'failed to obtain in-memory shard lock' error, along with a build-up of cluster state tasks). Build scan: https://gradle-enterprise.elastic.co/s/f4vu4xgxiwkcs/.

ywelsch · 2019-10-24T16:13:09Z

It looks like something is DDoSing the master with persistent task updates.

ywelsch · 2019-10-24T16:24:15Z

I've opened #48483 to get more details on the tasks that are accumulating.

Relates #48395

ywelsch · 2019-10-25T12:31:35Z

Looking closer at this, it seems to be a recurrence of #39982.

I think the problem is that the master is trying to relocate the "upgraded_scroll" shard back to the node on which it was previously allocated, but the shard can't be allocated there now because the shard lock is still held by an in-progress scroll. The master keeps retrying indefinitely (max_retries does not apply to relocations), which blocks any lower-priority task from completing and leads to the rolling upgrade tests failing (see #48395). Closes #48395
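To make the starvation effect described above concrete, here is a toy model, not Elasticsearch code. It assumes a single-threaded priority queue in which a failed high-priority task is immediately re-queued at the same priority. The task names, the `simulate` function, and the `retry_limit` parameter are all hypothetical and exist only for illustration.

```python
import heapq
from itertools import count

HIGH, NORMAL = 0, 1  # lower number is served first (hypothetical priority levels)


def simulate(max_steps, retry_limit=None):
    """Run a toy single-threaded priority queue and return the names of the
    tasks executed, in order.

    The high-priority relocation always fails (standing in for the shard lock
    being held by a scroll) and is re-queued. With retry_limit=None it is
    re-queued forever, which mimics retries that are not capped.
    """
    seq = count()  # tie-breaker so equal priorities run in FIFO order
    queue = []
    heapq.heappush(queue, (HIGH, next(seq), "relocate-shard", 0))
    heapq.heappush(queue, (NORMAL, next(seq), "persistent-task-update", 0))

    executed = []
    for _step in range(max_steps):
        if not queue:
            break
        priority, _order, name, attempts = heapq.heappop(queue)
        executed.append(name)
        if name == "relocate-shard":
            # The attempt fails; retry unless a cap has been reached.
            attempts += 1
            if retry_limit is None or attempts < retry_limit:
                heapq.heappush(queue, (priority, next(seq), name, attempts))
    return executed


if __name__ == "__main__":
    unbounded = simulate(max_steps=50)
    print("persistent-task-update" in unbounded)

    bounded = simulate(max_steps=50, retry_limit=5)
    print(bounded[-1])
```

In this model the lower-priority task never runs while the relocation is retried without a cap, and it runs as soon as the retries stop. The real master task queue is more involved, so this is only a sketch of the blocking behaviour.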

romseygeek added the >test-failure, :Distributed Indexing/Distributed, and :Security/Security labels Oct 23, 2019

ywelsch removed the :Security/Security label Oct 24, 2019

ywelsch mentioned this issue Oct 24, 2019

Show task ID in source of persistent task state update #48483

Merged

ywelsch self-assigned this Oct 24, 2019

ywelsch added a commit that referenced this issue Oct 25, 2019

Show task ID in source of persistent task state update (#48483)

4698322

Relates #48395

ywelsch added a commit that referenced this issue Oct 25, 2019

Show task ID in source of persistent task state update (#48483)

486794f

Relates #48395

ywelsch added a commit that referenced this issue Oct 25, 2019

Show task ID in source of persistent task state update (#48483)

aefebb2

Relates #48395

ywelsch mentioned this issue Oct 25, 2019

Fix upgraded_scroll test #48525

Merged

ywelsch closed this as completed in #48525 Oct 29, 2019
