
Fix upgraded_scroll test #48525


Merged
ywelsch merged 1 commit into elastic:master on Oct 29, 2019

Conversation

ywelsch
Contributor

@ywelsch ywelsch commented Oct 25, 2019

I think the problem is that the master is trying to relocate the "upgraded_scroll" shard back to the node on which it was previously allocated, but to which it cannot be allocated now because the shard lock is still held by an in-progress scroll. As the master keeps retrying (and does so indefinitely, because max_retries does not apply to relocations), it blocks any other lower-priority task from completing, which leads to the rolling upgrade tests failing (see #48395). Evidence:

[2019-10-23T11:59:42,872][INFO ][o.e.c.m.MetaDataCreateIndexService] [v7.4.1-2] [upgraded_scroll] creating index, cause [api], templates [template], shards [5]/[1], mappings []

[2019-10-23T11:59:43,280][INFO ][o.e.c.r.a.AllocationService] [v7.4.1-2] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[upgraded_scroll][4]]]).

[2019-10-23T12:00:31,294][WARN ][o.e.i.c.IndicesClusterStateService] [v7.4.1-1] [upgraded_scroll][0] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
	at org.elasticsearch.index.IndexService.createShard(IndexService.java:446) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:658) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:165) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:610) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:586) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:266) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$5(ClusterApplierService.java:517) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at java.lang.Iterable.forEach(Iterable.java:75) [?:1.8.0_221]
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:514) [elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:485) [elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432) [elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.ClusterApplierService.access$100(ClusterApplierService.java:73) [elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:176) [elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_221]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_221]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [upgraded_scroll][0]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]
	at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:769) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:684) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.index.IndexService.createShard(IndexService.java:366) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	... 18 more

... shard failures every 5 seconds due to new relocation attempts, ongoing for a very long time, with tasks piling up.
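For illustration only (this is not the change made in this PR), here is a minimal diagnostic sketch of the mechanism described above: it asks the master why the shard copy cannot be assigned, then releases the in-memory shard lock by clearing the open scroll that holds it. The cluster address, index/shard coordinates, and scroll id are placeholders, and the plain java.net.http client is used only to keep the example self-contained.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical diagnostic sketch, not the fix applied in this PR.
public class ReleaseScrollShardLock {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String es = "http://localhost:9200"; // assumed cluster address

        // Ask the master why [upgraded_scroll][0] cannot be assigned to the target node;
        // the response should contain the ShardLockObtainFailedException seen in the log above.
        HttpRequest explain = HttpRequest.newBuilder(URI.create(es + "/_cluster/allocation/explain"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"index\":\"upgraded_scroll\",\"shard\":0,\"primary\":true}"))
                .build();
        System.out.println(client.send(explain, HttpResponse.BodyHandlers.ofString()).body());

        // Clear the scroll that pins the old shard copy; this releases the shard lock
        // so that a subsequent relocation attempt can create the shard on the node.
        String scrollId = "SCROLL_ID_PLACEHOLDER"; // placeholder, not a real scroll id
        HttpRequest clearScroll = HttpRequest.newBuilder(URI.create(es + "/_search/scroll"))
                .header("Content-Type", "application/json")
                .method("DELETE", HttpRequest.BodyPublishers.ofString(
                        "{\"scroll_id\":[\"" + scrollId + "\"]}"))
                .build();
        System.out.println(client.send(clearScroll, HttpResponse.BodyHandlers.ofString()).body());
    }
}

Clearing the scroll removes the holder of the lock, so the next relocation attempt (which, per the log above, happens roughly every 5 seconds) can finally succeed.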

Closes #48395

@ywelsch ywelsch added the >test, :Distributed Coordination/Allocation, v8.0.0, v7.5.0, v7.6.0, v7.4.2 labels Oct 25, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Allocation)

@ywelsch
Contributor Author

ywelsch commented Oct 28, 2019

@elasticmachine run elasticsearch-ci/packaging-sample-matrix

Member

@dnhatn dnhatn left a comment


LGTM

@ywelsch ywelsch merged commit e3cc248 into elastic:master Oct 29, 2019
ywelsch added a commit that referenced this pull request Oct 29, 2019
ywelsch added a commit that referenced this pull request Oct 29, 2019
ywelsch added a commit that referenced this pull request Oct 29, 2019
Successfully merging this pull request may close these issues.

[CI] Rolling upgrade test failure - failed to obtain in-memory shard lock