index rollover running with "NORMAL" priority #50778

shwetathareja · 2020-01-09T02:05:06Z

Elasticsearch version (bin/elasticsearch --version): ES master branch

Description of the problem including expected versus actual behavior:
The index rollover cluster state update now runs with "NORMAL" priority after this PR (#50388), which made rollover execute in a single cluster state update (thanks for fixing it). Before this change, the two steps, 1) create index and 2) alias switch, both ran with "URGENT" priority. Running at "NORMAL" priority can delay the rollover task and could again cause a single index to grow huge if the master is busy with other higher-priority tasks such as shard-started, shard-failed, update snapshot state, etc.

clusterService.submitStateUpdateTask("rollover_index source [" + sourceIndexName + "] to target ["
                            + rolloverIndexName + "]", new ClusterStateUpdateTask() {

Fix: Update the priority for rollover task to "URGENT"
#50388


elasticmachine · 2020-01-09T02:59:58Z

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

DaveCTurner · 2020-01-09T15:57:47Z

I do not think this should be an URGENT task. There's something very wrong with the cluster if the master does not get around to the NORMAL-priority tasks in a reasonably short amount of time, and the solution to that is definitely not to promote things to URGENT.

Note that we split up the historically-expensive shard-started task into a cheap URGENT part and a more expensive NORMAL part in #44433, and batched the more expensive parts together to avoid duplicate work. If we are still seeing URGENT-level tasks running unduly slowly then I would like more details so we can continue in this direction.

gwbrown · 2020-01-09T17:57:50Z

We have seen an issue in larger clusters where it often takes longer than the 30s default to process a rollover, which in currently-released versions of ES will cause ILM to stop for an index until a user intervenes. This is definitely a problem as you say, because it can lead to indices growing very large.

However, we're already taking steps to address this problem in another way. In #50388 we've made Rollover a single cluster state update (rather than several in sequence), which enables us to implement automatic retries (see #48183) for rollover, and should help alleviate this problem without having to adjust the priority of the task.

shwetathareja · 2020-01-10T07:57:53Z

Thanks @DaveCTurner and @gwbrown for your response.

> Note that we split up the historically-expensive shard-started task into a cheap URGENT part and a more expensive NORMAL part in #44433, and batched the more expensive parts together to avoid duplicate work. If we are still seeing URGENT-level tasks running unduly slowly then I would like more details so we can continue in this direction.

As you mentioned, reroute is an "expensive" NORMAL task; based on insertion order, it can still delay the rollover task, which will time out after the default 30 seconds of waiting in the queue. With "NORMAL" priority, rollover competes not only with URGENT/HIGH tasks but with other "NORMAL" priority tasks as well. A rollover that is not performed in time could cause a single index to keep growing under a high ingestion rate, with further side effects in the cluster. Individual create-index and alias-switch tasks run with "URGENT" priority, so why should rollover have a lower priority than that (considering all are customer-initiated actions)? I would like to understand the criteria by which priority is decided for various tasks.
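To make the queueing concern concrete, here is a minimal sketch of a queue ordered first by priority and then by insertion order. It is a hypothetical model, not Elasticsearch code: the class `PendingTaskQueueSketch`, its methods, and the priorities assigned to the example tasks are all assumptions chosen for illustration; only the priority names come from this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of a priority-ordered task queue (NOT Elasticsearch code).
class PendingTaskQueueSketch {

    // Declared from most to least urgent, so the enum's natural order is the run order.
    enum Priority { IMMEDIATE, URGENT, HIGH, NORMAL }

    static final class Task implements Comparable<Task> {
        final String source;
        final Priority priority;
        final long insertionOrder;

        Task(String source, Priority priority, long insertionOrder) {
            this.source = source;
            this.priority = priority;
            this.insertionOrder = insertionOrder;
        }

        // Higher priority first; within the same priority, earlier submissions first.
        @Override
        public int compareTo(Task other) {
            int byPriority = priority.compareTo(other.priority);
            return byPriority != 0 ? byPriority : Long.compare(insertionOrder, other.insertionOrder);
        }
    }

    private final PriorityBlockingQueue<Task> queue = new PriorityBlockingQueue<>();
    private final AtomicLong insertionCounter = new AtomicLong();

    void submit(String source, Priority priority) {
        queue.add(new Task(source, priority, insertionCounter.getAndIncrement()));
    }

    // Removes every queued task and returns them in the order they would run.
    List<String> drain() {
        List<String> executed = new ArrayList<>();
        Task task;
        while ((task = queue.poll()) != null) {
            executed.add(task.priority + " " + task.source);
        }
        return executed;
    }

    public static void main(String[] args) {
        PendingTaskQueueSketch tasks = new PendingTaskQueueSketch();
        // Example priorities are assumed for illustration only.
        tasks.submit("reroute", Priority.NORMAL);
        tasks.submit("rollover_index", Priority.NORMAL);
        tasks.submit("shard-failed", Priority.HIGH);
        tasks.submit("shard-started", Priority.URGENT);
        tasks.drain().forEach(System.out::println);
    }
}
```

In this model the NORMAL rollover runs last: it waits behind every URGENT and HIGH task regardless of when they arrived, and behind the NORMAL reroute that was submitted before it.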

> However, we're already taking steps to address this problem in another way. In #50388 we've made Rollover a single cluster state update (rather than several in sequence), which enables us to implement automatic retries (see #48183) for rollover, and should help alleviate this problem without having to adjust the priority of the task.

Yes, thanks for making rollover a single cluster state update. This is really helpful.

dakrone · 2020-01-23T15:34:21Z

We discussed this today and decided that we'd rather not increase the priority of this task. Instead, we have other ways to address it:

The rollover is now a single cluster state update instead of multiple, so it adds fewer things to the cluster state update queue
The rollover action is now retryable, so even if it does time out, it can be retried at the next stage (instead of staying in the ERROR step)
We are making the master node timeout configurable in the event it needs to be increased
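As a rough illustration of the retry-plus-timeout approach listed above, the following sketch retries a single rollover attempt whenever it times out waiting on the master. It is not ILM's implementation: `RetryableRolloverSketch`, `RolloverAttempt`, and `runWithRetries` are hypothetical names, and the attempt limit is an assumption added to keep the example bounded.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of a retryable, single-step rollover (NOT the ILM implementation).
class RetryableRolloverSketch {

    // One attempt to apply the rollover as a single cluster state update.
    // Throws TimeoutException if the task waited in the queue longer than masterTimeout.
    interface RolloverAttempt {
        void run(Duration masterTimeout) throws TimeoutException;
    }

    // Returns the number of attempts used, or -1 if every attempt timed out.
    static int runWithRetries(RolloverAttempt attempt, Duration masterTimeout, int maxAttempts) {
        for (int attemptNumber = 1; attemptNumber <= maxAttempts; attemptNumber++) {
            try {
                attempt.run(masterTimeout);
                return attemptNumber;
            } catch (TimeoutException e) {
                // A timeout leaves the rollover eligible for another try
                // instead of stopping until a user intervenes.
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated master that is too busy for the first two attempts.
        RolloverAttempt busyMaster = timeout -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new TimeoutException("timed out after " + timeout.getSeconds() + "s");
            }
        };
        int used = runWithRetries(busyMaster, Duration.ofSeconds(30), 5);
        System.out.println("rollover succeeded on attempt " + used);
    }
}
```

Because the rollover is one step, a timed-out attempt leaves nothing half-applied in this model, which is what makes a plain retry safe; raising `masterTimeout` is the other knob.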

As David said, we should solve the underlying issue rather than increasing the priority, as once more things move to URGENT, fewer things actually end up "being" urgently executed.

With that I'm going to close this issue. Thanks!

dakrone added :Data Management/ILM+SLM Index and Snapshot lifecycle management team-discuss labels Jan 9, 2020

dakrone closed this as completed Jan 23, 2020

DaveCTurner mentioned this issue Feb 3, 2020

Making delete, close and update-settings index IMMEDIATE in pending tasks #51781

Closed
