index rollover running with "NORMAL" priority #50778

Closed
shwetathareja opened this issue Jan 9, 2020 · 5 comments
Labels
:Data Management/ILM+SLM Index and Snapshot lifecycle management team-discuss

Comments

@shwetathareja

Elasticsearch version (bin/elasticsearch --version): ES master branch

Description of the problem including expected versus actual behavior:
The index rollover cluster state update has run with "NORMAL" priority since the PR that made rollover execute in a single cluster state update (#50388; thanks for fixing it). Before this change, the two steps, 1) create index and 2) alias switch, both ran with "URGENT" priority. At "NORMAL" priority the rollover task can be delayed, which could again cause a single index to grow huge if the master is busy with higher-priority tasks such as shard-started, shard-failed, or update snapshot state.

clusterService.submitStateUpdateTask("rollover_index source [" + sourceIndexName + "] to target ["
                            + rolloverIndexName + "]", new ClusterStateUpdateTask() {

Fix: Update the priority for rollover task to "URGENT"
#50388

@dakrone dakrone added :Data Management/ILM+SLM Index and Snapshot lifecycle management team-discuss labels Jan 9, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/ILM+SLM)

@DaveCTurner
Contributor

DaveCTurner commented Jan 9, 2020

I do not think this should be an URGENT task. There's something very wrong with the cluster if the master does not get around to the NORMAL-priority tasks in a reasonably short amount of time, and the solution to that is definitely not to promote things to URGENT.

Note that we split up the historically-expensive shard-started task into a cheap URGENT part and a more expensive NORMAL part in #44433, and batched the more expensive parts together to avoid duplicate work. If we are still seeing URGENT-level tasks running unduly slowly then I would like more details so we can continue in this direction.

@gwbrown
Contributor

gwbrown commented Jan 9, 2020

We have seen an issue in larger clusters where it often takes longer than the 30s default to process a rollover, which in currently-released versions of ES will cause ILM to stop for an index until a user intervenes. This is definitely a problem as you say, because it can lead to indices growing very large.

However, we're already taking steps to address this problem in another way. In #50388 we've made Rollover a single cluster state update (rather than several in sequence), which enables us to implement automatic retries (see #48183) for rollover, and should help alleviate this problem without having to adjust the priority of the task.

@shwetathareja
Author

shwetathareja commented Jan 10, 2020

Thanks @DaveCTurner and @gwbrown for your response.

> Note that we split up the historically-expensive shard-started task into a cheap URGENT part and a more expensive NORMAL part in #44433, and batched the more expensive parts together to avoid duplicate work. If we are still seeing URGENT-level tasks running unduly slowly then I would like more details so we can continue in this direction.

As you mentioned, reroute is an "expensive" NORMAL task. Depending on insertion order, it can still delay the rollover task, which will time out after the default 30 seconds of waiting in the queue. With "NORMAL" priority, rollover competes not only with URGENT/HIGH tasks but with other "NORMAL" priority tasks as well. A rollover that is not performed in time could cause a single index to keep growing under a high ingestion rate, and can have further side effects in the cluster. The individual create-index and alias-switch tasks run with "URGENT" priority, so why should rollover have a lower priority than that (considering all are customer-initiated actions)? I would like to understand the criteria by which priority is decided for the various tasks.
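To make the ordering concern concrete, here is a minimal, self-contained sketch of a queue that orders pending tasks by priority first and by insertion order second. It is not Elasticsearch's actual executor: the `PendingTaskOrder`, `Task`, and `drainOrder` names are hypothetical, and only the priority names are taken from this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.concurrent.atomic.AtomicLong;

public class PendingTaskOrder {
    // Hypothetical priority levels, highest first; the names mirror those used in this thread.
    enum Priority { URGENT, HIGH, NORMAL }

    // A queued task, ordered by priority and then by insertion sequence.
    static final class Task implements Comparable<Task> {
        private static final AtomicLong SEQ = new AtomicLong();
        final String source;
        final Priority priority;
        final long seq = SEQ.getAndIncrement();

        Task(String source, Priority priority) {
            this.source = source;
            this.priority = priority;
        }

        @Override
        public int compareTo(Task other) {
            // Enum order puts URGENT before NORMAL; ties fall back to insertion order.
            int byPriority = priority.compareTo(other.priority);
            return byPriority != 0 ? byPriority : Long.compare(seq, other.seq);
        }
    }

    // Returns the task sources in the order a single-threaded executor would run them.
    static List<String> drainOrder(List<Task> submitted) {
        PriorityQueue<Task> queue = new PriorityQueue<>(submitted);
        List<String> order = new ArrayList<>();
        Task next;
        while ((next = queue.poll()) != null) {
            order.add(next.source);
        }
        return order;
    }

    public static void main(String[] args) {
        List<Task> submitted = List.of(
            new Task("reroute", Priority.NORMAL),        // submitted first
            new Task("rollover_index", Priority.NORMAL), // queued behind reroute
            new Task("shard-started", Priority.URGENT)); // submitted last, runs first
        System.out.println(drainOrder(submitted));
    }
}
```

In this sketch the NORMAL rollover task runs after both the later-submitted URGENT task and the earlier-submitted NORMAL reroute, which is the waiting-in-queue scenario described above.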

> However, we're already taking steps to address this problem in another way. In #50388 we've made Rollover a single cluster state update (rather than several in sequence), which enables us to implement automatic retries (see #48183) for rollover, and should help alleviate this problem without having to adjust the priority of the task.

Yes, thanks for making rollover a single cluster state update. This is really helpful.

@dakrone
Member

dakrone commented Jan 23, 2020

We discussed this today and decided that we'd rather not increase the priority of this task. Instead, we have other ways to address it:

  • The rollover is now a single cluster state update instead of multiple, so it adds fewer things to the cluster state update queue
  • The rollover action is now retryable, so even if it does time out, it can be retried at the next stage (instead of staying in the ERROR step)
  • We are making the master node timeout configurable in the event it needs to be increased
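As a rough illustration of how the retry and the configurable timeout in the list above interact, the following sketch simulates a rollover attempt that times out when its queue wait exceeds the timeout and is simply tried again on the next run. This is not ILM's implementation: `RetryableRollover`, `Outcome`, `simulated`, and `runUntilRolledOver` are hypothetical names, and the queue waits are made-up inputs.

```java
import java.time.Duration;
import java.util.function.Function;

public class RetryableRollover {
    enum Outcome { ROLLED_OVER, TIMED_OUT }

    // Hypothetical stand-in for the master: the n-th attempt waits waits[n] seconds in the queue
    // and succeeds only if that wait fits within the given timeout.
    static Function<Duration, Outcome> simulated(long... waits) {
        int[] call = {0};
        return timeout -> {
            int i = Math.min(call[0]++, waits.length - 1);
            return waits[i] <= timeout.getSeconds() ? Outcome.ROLLED_OVER : Outcome.TIMED_OUT;
        };
    }

    // Retries the attempt up to maxAttempts times; returns the attempt number that
    // succeeded, or -1 if every attempt timed out.
    static int runUntilRolledOver(Function<Duration, Outcome> attempt,
                                  Duration masterTimeout, int maxAttempts) {
        for (int n = 1; n <= maxAttempts; n++) {
            if (attempt.apply(masterTimeout) == Outcome.ROLLED_OVER) {
                return n;
            }
            // A timeout is not terminal here: the step stays eligible for the next run.
        }
        return -1;
    }

    public static void main(String[] args) {
        // Queue waits of 70s, 40s, then 20s as the master catches up.
        System.out.println(runUntilRolledOver(simulated(70, 40, 20), Duration.ofSeconds(30), 3));
        System.out.println(runUntilRolledOver(simulated(70, 40, 20), Duration.ofSeconds(60), 3));
    }
}
```

With the simulated waits, a 30-second timeout succeeds on the third attempt and a 60-second timeout on the second, so either mitigation gets the rollover through without changing its priority.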

As David said, we should solve the underlying issue rather than increasing the priority, as once more things move to URGENT, fewer things actually end up "being" urgently executed.

With that I'm going to close this issue. Thanks!
