
Fix upgraded_scroll test #48525


Merged
ywelsch merged 1 commit into elastic:master on Oct 29, 2019

Conversation

ywelsch
Contributor

@ywelsch ywelsch commented Oct 25, 2019

I think the problem is that the master is trying to relocate the "upgraded_scroll" shard back to the node on which it was previously allocated, but to which it cannot be allocated now because the shard lock is still held by an in-progress scroll. As the master keeps retrying (and does so indefinitely, because max_retries does not apply to relocations), it blocks any other lower-priority task from completing, which leads to the rolling upgrade tests failing (see #48395). Evidence:

[2019-10-23T11:59:42,872][INFO ][o.e.c.m.MetaDataCreateIndexService] [v7.4.1-2] [upgraded_scroll] creating index, cause [api], templates [template], shards [5]/[1], mappings []

[2019-10-23T11:59:43,280][INFO ][o.e.c.r.a.AllocationService] [v7.4.1-2] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[upgraded_scroll][4]]]).

[2019-10-23T12:00:31,294][WARN ][o.e.i.c.IndicesClusterStateService] [v7.4.1-1] [upgraded_scroll][0] marking and sending shard failed due to [failed to create shard]
java.io.IOException: failed to obtain in-memory shard lock
	at org.elasticsearch.index.IndexService.createShard(IndexService.java:446) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:658) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:165) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:610) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:586) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:266) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$5(ClusterApplierService.java:517) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at java.lang.Iterable.forEach(Iterable.java:75) [?:1.8.0_221]
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:514) [elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:485) [elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:432) [elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.ClusterApplierService.access$100(ClusterApplierService.java:73) [elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:176) [elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_221]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_221]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221]
Caused by: org.elasticsearch.env.ShardLockObtainFailedException: [upgraded_scroll][0]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]
	at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:769) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:684) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	at org.elasticsearch.index.IndexService.createShard(IndexService.java:366) ~[elasticsearch-7.6.0-SNAPSHOT.jar:7.6.0-SNAPSHOT]
	... 18 more

... shard failures every 5 seconds due to new relocation attempts, ongoing for a very long time, with tasks piling up.
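For illustration only (this is not the change made in this PR), here is a minimal diagnostic sketch of the mechanism described above: it asks the master why the shard copy cannot be assigned, then releases the in-memory shard lock by clearing the open scroll that holds it. The cluster address, index/shard coordinates, and scroll id are placeholders, and the plain java.net.http client is used only to keep the example self-contained.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical diagnostic sketch, not the fix applied in this PR.
public class ReleaseScrollShardLock {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String es = "http://localhost:9200"; // assumed cluster address

        // Ask the master why [upgraded_scroll][0] cannot be assigned to the target node;
        // the response should contain the ShardLockObtainFailedException seen in the log above.
        HttpRequest explain = HttpRequest.newBuilder(URI.create(es + "/_cluster/allocation/explain"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"index\":\"upgraded_scroll\",\"shard\":0,\"primary\":true}"))
                .build();
        System.out.println(client.send(explain, HttpResponse.BodyHandlers.ofString()).body());

        // Clear the scroll that pins the old shard copy; this releases the shard lock
        // so that a subsequent relocation attempt can create the shard on the node.
        String scrollId = "SCROLL_ID_PLACEHOLDER"; // placeholder, not a real scroll id
        HttpRequest clearScroll = HttpRequest.newBuilder(URI.create(es + "/_search/scroll"))
                .header("Content-Type", "application/json")
                .method("DELETE", HttpRequest.BodyPublishers.ofString(
                        "{\"scroll_id\":[\"" + scrollId + "\"]}"))
                .build();
        System.out.println(client.send(clearScroll, HttpResponse.BodyHandlers.ofString()).body());
    }
}

Clearing the scroll removes the holder of the lock, so the next relocation attempt (which, per the log above, happens roughly every 5 seconds) can finally succeed.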

Closes #48395

@ywelsch ywelsch added the >test, :Distributed Coordination/Allocation, v8.0.0, v7.5.0, v7.6.0, v7.4.2 labels Oct 25, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Allocation)

@ywelsch
Contributor Author

ywelsch commented Oct 28, 2019

@elasticmachine run elasticsearch-ci/packaging-sample-matrix

Member

@dnhatn dnhatn left a comment


LGTM

@ywelsch ywelsch merged commit e3cc248 into elastic:master Oct 29, 2019
ywelsch added a commit that referenced this pull request Oct 29, 2019
ywelsch added a commit that referenced this pull request Oct 29, 2019
ywelsch added a commit that referenced this pull request Oct 29, 2019
Successfully merging this pull request may close these issues.

[CI] Rolling upgrade test failure - failed to obtain in-memory shard lock