Simplify ILM Policy solution for managing lifecycle of rollup indices #70334

talevy · 2021-03-11T19:48:29Z

Currently, if a user wants to manage the lifecycle of a rollup index created by an ILM policy (within a data stream, for example), we require that the rollup index have its own separate policy. Maintaining policies is not easy, and this requirement makes it even more difficult. For this reason, we should re-explore solutions to this problem.

# Example ILM Policy with Rollup Action and new policy: "my-rollup-ilm-policy"

PUT _ilm/policy/my-policy
{
  "policy": {
    "phases": {
      "cold": {
        "actions": {
          "rollup": {
            "config": {
              "groups": {
                "date_histogram": {
                  "field": "@timestamp",
                  "calendar_interval": "1y"
                }
              },
              "metrics": [
                {
                  "field": "my-numeric-field",
                  "metrics": [
                    "avg"
                  ]
                }
              ]
            },
            "rollup_policy": "my-rollup-ilm-policy"
          }
        }
      }
    }
  }
}

One solution is to limit the actions that work for rollup indices. For example, maybe a rollup index must share the same lifecycle/policy as the original index for all phases except the delete phase. This would allow rollup indices to be deleted at a later time.

The policy for this may look like:

PUT _ilm/policy/my-policy
{
  "policy": {
    "phases": {
      "cold": {
        "actions": {
          "rollup": {
            "config": {
              "groups": {
                "date_histogram": {
                  "field": "@timestamp",
                  "calendar_interval": "1y"
                }
              },
              "metrics": [
                {
                  "field": "my-numeric-field",
                  "metrics": [
                    "avg"
                  ]
                }
              ]
            }
          }
        }
      },
      "delete": {
        "after": "1d",
        "after_rollup": "1y",
        "actions": { "delete": {} }
      }
    }
  }
}

Potentially, a separate after parameter (after_rollup above) could apply only to the rollup index created by this policy. This would manage both the original and the rollup index with minimal modification and no additional policy to maintain.

This is just one option; we may come up with others.


elasticmachine · 2021-03-11T19:48:32Z

Pinging @elastic/es-analytics-geo (Team:Analytics)

cjcenizal · 2021-03-11T19:54:53Z

CC @jloleysens @jethr0null

dakrone · 2021-03-11T21:07:50Z

Something it would be good to clarify is whether it is valid for a rollup index to "go back to a prior life".

What I mean by this is, let's assume that a user has an ILM policy that contains a rollup action in the cold phase. So the index looks like:

<index>
  |
  |
 hot
  |
  |
 warm
  |
  |
 cold
  |\
  | \
  |  \
  |   \
  |    \
  |     \
  |      \
  |       \
<index> <rollup>
  |        |
  |        |
frozen  frozen
  |        |
  |        |
delete  delete

Where the rollup is created in the cold phase, and then goes through all subsequent phases in the ILM lifecycle.

Is this what we want? Do we ever envision a need for the rollup index to start its lifecycle over? Does it ever need to go back through the hot, warm, and part of the cold phases to execute actions?

If the answer is "yes", then we may want to investigate solutions where the rollup index has a separate lifecycle from the original index entirely (separate policy, whatever that ends up looking like).
If the answer is "no", then maybe we can investigate a solution where certain actions like delete can be specified that affect only the rollup index (one idea could be a new phase delete_rollup that could specify its own min_age but only apply to the split rollup index).
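To make the "no" branch concrete, here is a rough sketch of what such a policy might look like. The delete_rollup phase is purely hypothetical (it is the idea floated above, not an existing ILM phase), and the min_age values are made up for illustration:

```console
# Hypothetical sketch: "delete_rollup" is NOT an existing ILM phase.
# It illustrates a phase that would apply only to the rollup index.
PUT _ilm/policy/my-policy
{
  "policy": {
    "phases": {
      "cold": {
        "actions": {
          "rollup": {
            "config": {
              "groups": {
                "date_histogram": {
                  "field": "@timestamp",
                  "calendar_interval": "1y"
                }
              },
              "metrics": [
                { "field": "my-numeric-field", "metrics": ["avg"] }
              ]
            }
          }
        }
      },
      "delete": {
        "min_age": "1d",
        "actions": { "delete": {} }
      },
      "delete_rollup": {
        "min_age": "365d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

Under this assumption the original index would follow the regular delete phase, while the rollup index would skip it and only be removed by delete_rollup.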

The current implementation doesn't actually look like the graph above, because from my reading (please correct me if I'm wrong!) we don't copy over the lifecycle execution state, so the rollup index will start over in the new/init/init ILM step and go back through the entire policy if the same policy were specified. If by chance a user specifies the same policy as the parent index, then we hit a bunch of assertions and trip with something like:

java.lang.IllegalStateException: expected index [rollup-.ds-myindex-2021.03.11-000001-tiu0jc32twi6ilco0b-rra] with policy [my-policy] to have current step consistent with provided step key ({"phase":"hot","action":"rollover","name":"update-rollover-lifecycle-date"}) but it was {"phase":"hot","action":"rollover","name":"ERROR"}
	at org.elasticsearch.xpack.ilm.IndexLifecycleRunner.maybeRunAsyncAction(IndexLifecycleRunner.java:286) ~[?:?]
	at org.elasticsearch.xpack.ilm.ExecuteStepsUpdateTask.clusterStateProcessed(ExecuteStepsUpdateTask.java:195) ~[?:?]
	at org.elasticsearch.cluster.service.MasterService$SafeClusterStateTaskListener.clusterStateProcessed(MasterService.java:518) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.MasterService$TaskOutputs.lambda$processedDifferentClusterState$1(MasterService.java:405) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
	at org.elasticsearch.cluster.service.MasterService$TaskOutputs.processedDifferentClusterState(MasterService.java:405) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.MasterService.onPublicationSuccess(MasterService.java:265) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:257) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:234) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:140) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:139) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:177) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:669) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:241) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:204) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]

This is definitely not user-friendly, but it is a highly likely accidental configuration for a user who expects behavior like the (crude) ASCII art above.

elasticmachine · 2021-03-11T21:10:33Z

Pinging @elastic/es-core-features (Team:Core/Features)

jloleysens · 2021-03-15T13:44:29Z

@talevy thanks for writing this up!

I totally agree with Lee's points above regarding default behavior. I just wanted to add my very outsider-level, UI-centric perspective on the default behavior of rollups:

It would be great to get some help thinking through how users configure policies in the UI, so that we can create a flow that covers the "90% of cases" case.

For instance, rollup could primarily be presented as a way to reduce disk usage by creating a lower-resolution version of the original index. In this case I would probably want the rollup index to be deleted at a different time to the original index. By default, I would expect this data to live on the same tier/phase as the action that created it (Lee's graph).

Alternatively, rollup could primarily be presented as a cache layer over the original data to speed up queries. As a user: I am probably OK with the idea that, by default, the cache layer will be deleted along with the original index and lives on the same phases/tiers as the action that created it (Lee's graph). (I'm not sure how popular this will be and whether doing something like this is possible in the current API).

It would be helpful to be as clear as possible on what we want the focus to be as this can lead to different default behavior and config (in the UI at least). It seems as though we want the "reduce disk usage" story to have primary emphasis?

Looking at the proposed API changes, the accompanying UI might look like:

Some open questions for me with this approach:

- How does this interact with the ability to define a different ILM policy in the rollup action configuration -- or is that being removed?
  1. If not: does one override the other? Can this setting be thought of as a fallback/clean up step to delete all rollup data created by the policy?
- How does this interact with defining multiple rollup actions in a policy (i.e. in hot and cold)? Is it fine to have the same delete for indices created by rollup in hot and rollup in cold?
- Will supporting multiple rollups per rollup action be problematic with this API? Do I want the same deletion date for all of my rollup data?

csoulios · 2021-03-16T14:21:57Z

@dakrone thanks for your comments

Rollup indices are mostly created to cover the following two use cases:

To save space by deleting the original data after some time and keeping only the rollup data around.
To act as a cache for some frequently run aggregations that would take longer to compute from the original data. This also allows original indices to transition to searchable snapshots.

This means that rollup indices are orders of magnitude smaller than the original indices and also they should be fast to query. Therefore, it does not make much sense to have rollup indices in the cold phase. (Docs say Cold: The index is no longer being updated and is seldom queried. The information still needs to be searchable, but it’s okay if those queries are slower.)

Also, rollups are immutable. Once a rollup index is created, it is never modified. So it does not make sense to have them in the hot phase either. (Docs also say Hot: The index is actively being updated and queried.)

So to your question:

> Where the rollup is created in the cold phase, and then goes through all subsequent phases in the ILM lifecycle.
>
> Is this what we want? Do we ever envision a need for the rollup index to start its lifecycle over? Does it ever need to go back through the hot, warm, and part of the cold phases to execute actions?

I would say yes. When a rollup is created in the cold phase, it should first go to the warm phase and maybe transition to the rest of the phases and finally be deleted.

Another question that is worth discussing is "What actions are rollup indices eligible for?". So far we have identified the delete action. Are there any other actions? I know that a "rollup of rollups" feature has been frequently requested (cc @giladgal) but I am not sure if other actions are applicable (such as shrink, allocate or even searchable-snapshot). The only thing we are sure about is that rollup and original indices do not share the same actions or even the same phase transitions.

Finally, there is one last point we should consider: In a data stream we can potentially have multiple rollup configurations. Different rollups can be created to cover different intervals, timezones, bucket groupings or metrics. Should we delete all rollup indices together? Probably not. Should we create a separate ILM policy per rollup configuration? That would be a configuration nightmare.

What we are looking for is to strike the right balance between:

Absolute simplicity: we only allow deletion of rollups at a later point in time.
Absolute flexibility: each rollup index inside a data stream can have its own lifecycle.

giladgal · 2021-03-17T08:37:14Z

I agree with everything @csoulios wrote. Some users will collect a lot of rollup data and keep it for a long time, and they will want to keep older rollup indices on more economic data tiers, so we need to allow a rolled up index to travel through different tiers according to the user’s definitions. As @jloleysens wrote, if we need to prioritize between the two usage patterns -- rollup & delete the original vs. rollup as a caching layer -- then we would probably favor the rollup & delete the original use case, but we aim to support both, and both must be supported through querying the data stream. We also do want to support rollup of rollup.

Could the compromise between simplicity and flexibility be to create a default rollup ILM as part of the original index ILM, and to allow those who choose not to use the default to move to creating a separate ILM policy per rolled up index? The main question is if we can do that and still maintain the rollup index and the original index in the same data stream (for querying purposes) although they are managed by separate ILM policies.

dakrone · 2021-03-17T14:37:49Z

Thanks for clarifying the use case @csoulios, it's very helpful! Okay, it sounds like we need the ability to have a separate lifecycle for the rollup index, and we also want to be able to separate the tier/phase for the rollup versus the original index.

> We also do want to support rollup of rollup.

I'm not familiar with all the technical implementation details, but will a rollup of a rollup require any separate handling outside of how we plan to treat a rollup index? I am guessing we can treat it just like a regular rollup, but maybe I'm mistaken.

> Could the compromise between simplicity and flexibility be to create a default rollup ILM as part of the original index ILM, and to allow those who choose not to use the default to move to creating a separate ILM policy per rolled up index?

I was thinking we might want to ship a default policy for rollup indices (something like, "keep it in the warm phase forever", very basic) that rollups could use by default, with the option to specify a different ILM policy, similar to what we have today.
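As a rough idea of what such a shipped default could look like, here is a minimal sketch. The policy name default-rollup-policy and the priority value are assumptions made for illustration; nothing like this ships today:

```console
# Hypothetical policy name. A minimal "stay in warm" policy:
# there is no delete phase, so indices using it are never removed by ILM.
PUT _ilm/policy/default-rollup-policy
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "0ms",
        "actions": {
          "set_priority": { "priority": 50 }
        }
      }
    }
  }
}
```

A rollup action that omits rollup_policy could then fall back to this policy, while users who need more control keep setting rollup_policy explicitly as in the first example in this issue.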

> The main question is if we can do that and still maintain the rollup index and the original index in the same data stream (for querying purposes) although they are managed by separate ILM policies.

Yep, they can absolutely be in the same data stream with different policies, that's totally okay.
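For illustration, the policy is tracked per index through the index.lifecycle.name setting, so one index can point at a different policy than its siblings. The index and policy names below are placeholders, not real resources:

```console
# Placeholder index and policy names, for illustration only.
PUT rollup-.ds-myindex-2021.03.11-000001/_settings
{
  "index": {
    "lifecycle": {
      "name": "my-rollup-ilm-policy"
    }
  }
}
```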

dakrone · 2021-03-17T14:54:12Z

Alternatively, if we wanted to keep a single policy, maybe we could allow the definition of a policy for only rollup indices inline in the rollup configuration itself. We'd then look at the index metadata and execute the contained policy instead of the parent policy for an index deemed a "rollup".

This might get a bit complicated though, if multiple rollup actions are specified (ie, one in hot, one in cold).
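A rough sketch of what the inline variant might look like follows. The nested "policy" key inside the rollup action is hypothetical and does not exist in the current API; the phase contents are made up for illustration:

```console
# Hypothetical sketch: the inline "policy" key inside the rollup action
# does NOT exist today. It would apply only to the index the action creates.
PUT _ilm/policy/my-policy
{
  "policy": {
    "phases": {
      "cold": {
        "actions": {
          "rollup": {
            "config": {
              "groups": {
                "date_histogram": {
                  "field": "@timestamp",
                  "calendar_interval": "1y"
                }
              },
              "metrics": [
                { "field": "my-numeric-field", "metrics": ["avg"] }
              ]
            },
            "policy": {
              "phases": {
                "delete": {
                  "min_age": "365d",
                  "actions": { "delete": {} }
                }
              }
            }
          }
        }
      },
      "delete": {
        "min_age": "1d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

With a rollup action in both hot and cold, each action would carry its own inline policy, which is where the complexity mentioned above comes from.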

csoulios · 2021-03-26T13:52:10Z

In the past few weeks we discussed the integration of the new rollup functionalities that the rollup group is working on with several teams. We want to integrate rollups in data streams and in the Index Lifecycle Management as a native feature for metrics.

However, we found it hard to achieve a consensus on the best approach for the integration. The design that we made is clear and solves all the needs that were raised by rollup v1 users, but it leaves a lot of room for use cases that we are not sure we want to expose.

As a result of these discussions, we agreed to focus on the following two identified problems:

Downsampling for time series.
Dimensionality reduction for time series.

We think that reducing the number of problems we want to solve will help to gather consensus more quickly.

In view of the above decisions, I am closing this ticket and we will revisit ILM + rollup integration for the two specific use cases soon. Thanks everyone for the participation and extremely useful feedback.

talevy added the :StorageEngine/Rollup and team-discuss labels Mar 11, 2021

elasticmachine added the Team:Analytics label Mar 11, 2021

dakrone added the :Data Management/ILM+SLM label Mar 11, 2021

elasticmachine added the Team:Data Management label Mar 11, 2021

talevy changed the title ~~Find single ILM Policy solution for managing lifecycle of rollup indices~~ Simplify ILM Policy solution for managing lifecycle of rollup indices Mar 17, 2021

talevy mentioned this issue Mar 22, 2021

Refactor rollups meta (AKA Rollup V2) #42720

Closed

21 tasks

csoulios closed this as completed Mar 26, 2021
