Simplify ILM Policy solution for managing lifecycle of rollup indices #70334
Pinging @elastic/es-analytics-geo (Team:Analytics) |
Something it would be good to clarify is whether it is valid for a rollup index to "go back to a prior life". What I mean by this is, let's assume that a user has an ILM policy that contains a
Where the rollup is created in the cold phase, and then goes through all subsequent phases in the ILM lifecycle. Is this what we want? Do we ever envision a need for the rollup index to start its lifecycle over? Does it ever need to go back through the hot, warm, and part of the cold phases to execute actions?
The current implementation doesn't actually look like the graph above, because from my reading (please correct me if I'm wrong!) we don't copy over the lifecycle execution state, so the rollup index will start over in the
java.lang.IllegalStateException: expected index [rollup-.ds-myindex-2021.03.11-000001-tiu0jc32twi6ilco0b-rra] with policy [my-policy] to have current step consistent with provided step key ({"phase":"hot","action":"rollover","name":"update-rollover-lifecycle-date"}) but it was {"phase":"hot","action":"rollover","name":"ERROR"}
at org.elasticsearch.xpack.ilm.IndexLifecycleRunner.maybeRunAsyncAction(IndexLifecycleRunner.java:286) ~[?:?]
at org.elasticsearch.xpack.ilm.ExecuteStepsUpdateTask.clusterStateProcessed(ExecuteStepsUpdateTask.java:195) ~[?:?]
at org.elasticsearch.cluster.service.MasterService$SafeClusterStateTaskListener.clusterStateProcessed(MasterService.java:518) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.cluster.service.MasterService$TaskOutputs.lambda$processedDifferentClusterState$1(MasterService.java:405) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
at org.elasticsearch.cluster.service.MasterService$TaskOutputs.processedDifferentClusterState(MasterService.java:405) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.cluster.service.MasterService.onPublicationSuccess(MasterService.java:265) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:257) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:234) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:140) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:139) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:177) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:669) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:241) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:204) [elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
at java.lang.Thread.run(Thread.java:832) [?:?]
This is definitely not user-friendly, yet it is a highly likely accidental configuration for a user who expects behavior like the (crude) ASCII art above. |
Pinging @elastic/es-core-features (Team:Core/Features) |
@talevy thanks for writing this up! I totally agree with Lee's points above regarding default behavior. I just wanted to add my very outsider-level, UI-centric perspective on the default behavior of rollups.
It would be great to get some help thinking through how users can configure policies in the UI to create a flow that covers the "90% of cases" case.
For instance, rollup could primarily be presented as a way to reduce disk usage by creating a lower-resolution version of the original index. In this case I would probably want the rollup index to be deleted at a different time from the original index. By default, I would expect this data to live on the same tier/phase as the action that created it (Lee's graph).
Alternatively, rollup could primarily be presented as a cache layer over the original data to speed up queries. As a user, I am probably OK with the idea that, by default, the cache layer will be deleted along with the original index and lives on the same phases/tiers as the action that created it (Lee's graph). (I'm not sure how popular this will be or whether doing something like this is possible in the current API.)
It would be helpful to be as clear as possible on what we want the focus to be, as this can lead to different default behavior and config (in the UI at least). It seems as though we want the "reduce disk usage" story to have primary emphasis?
Looking at the proposed API changes, the accompanying UI might look like:
Some open questions for me with this approach:
|
@dakrone thanks for your comments.
Rollup indices are mostly created to cover the following two use cases:
This means that rollup indices are orders of magnitude smaller than the original indices and also they should be fast to query. Therefore, it does not make much sense to have rollup indices in the cold phase. (Docs say
Also, rollups are immutable. Once a rollup index is created, it is never modified. So it does not make sense to have them in the hot phase either. (Docs also say
So to your question:
I would say yes. When a rollup is created in the cold phase, it should first go to the warm phase and maybe transition to the rest of the phases and finally be deleted.
Another question that is worth discussing is "What actions are rollup indices eligible for?". So far we have identified the
Finally, there is one last point we should consider: In a data stream we can potentially have multiple rollup configurations. Different rollups can be created to cover different intervals, timezones, bucket groupings or metrics. Should we delete all rollup indices together? Probably not. Should we create a separate ILM policy per rollup configuration? That would be a configuration nightmare. What we are looking for is to strike the right balance between:
|
I agree with everything @csoulios wrote. Some users will collect a lot of rollup data and keep it for a long time, and they will want to keep older rollup indices on more economical data tiers, so we need to allow a rolled-up index to travel through different tiers according to the user's definitions. As @jloleysens wrote, if we need to prioritize between the two usage patterns (rollup & delete the original vs. rollup as a caching layer), then we would probably favor the rollup & delete the original use case, but we aim to support both, and both must be supported through querying the data stream. We also do want to support rollup of rollup. Could the compromise between simplicity and flexibility be to create a default rollup ILM policy as part of the original index's ILM policy, and to allow those who choose not to use the default to create a separate ILM policy per rolled-up index? The main question is whether we can do that and still keep the rollup index and the original index in the same data stream (for querying purposes) even though they are managed by separate ILM policies. |
Thanks for clarifying the use case @csoulios, it's very helpful! Okay, it sounds like we need the ability to have a separate lifecycle for the rollup index, and we also want to be able to separate the tier/phase for the rollup versus the original index.
I'm not familiar with all the technical implementation details, but will a rollup of a rollup require any separate handling beyond how we plan to treat a rollup index? I am guessing we can treat it just like a regular rollup, but maybe I'm mistaken.
I was thinking we might want to ship a default policy for rollup indices (something like, "keep it in the warm phase forever", very basic) that rollups could use by default, with the option to specify a different ILM policy, similar to what we have today.
Yep, they can absolutely be in the same data stream with different policies, that's totally okay. |
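To make the "default policy for rollup indices" idea above concrete, here is a minimal sketch. It assumes a hypothetical default policy name (`default-rollup-policy`) and a hypothetical helper `resolve_rollup_policy`; neither exists in Elasticsearch. The policy body uses only the standard ILM shape (`phases`, `min_age`, `actions`), with a single warm phase and no actions, so nothing would move the index out of warm.

```python
import json
from typing import Optional

# Hypothetical name: the thread only proposes "a default policy for rollup indices".
DEFAULT_ROLLUP_POLICY_NAME = "default-rollup-policy"


def default_rollup_policy() -> dict:
    """Build an ILM policy body with a single warm phase and no actions.

    Since no later phase is defined, an index managed by this policy
    would simply remain in the warm phase.
    """
    return {
        "policy": {
            "phases": {
                "warm": {"min_age": "0ms", "actions": {}},
            }
        }
    }


def resolve_rollup_policy(configured: Optional[str]) -> str:
    """Return the user-specified policy name, or fall back to the default."""
    return configured or DEFAULT_ROLLUP_POLICY_NAME


if __name__ == "__main__":
    print(json.dumps(default_rollup_policy(), indent=2))
    print(resolve_rollup_policy(None))
```

The fallback keeps today's option of naming a different policy while removing the need to create one in the common case.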
Alternatively, if we wanted to keep a single policy, maybe we could allow the definition of a policy for only rollup indices inline in the rollup configuration itself. We'd then look at the index metadata and execute the contained policy instead of the parent policy for an index deemed a "rollup". This might get a bit complicated, though, if multiple rollup actions are specified (i.e., one in hot, one in cold). |
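The inline-policy idea above could be sketched roughly as follows. Everything here is hypothetical and only illustrates the lookup: the `rollup_source_phase` metadata key, the `inline_policy` field on the rollup action, and the `effective_policy` function are assumptions, not existing ILM features. Recording which phase created the rollup is one possible way to disambiguate when several rollup actions are defined.

```python
def effective_policy(index_metadata: dict, parent_policy: dict) -> dict:
    """Pick the policy to execute for an index.

    Assumptions (hypothetical, not part of ILM today):
    - a rollup index records the phase whose rollup action created it
      under the metadata key "rollup_source_phase";
    - that rollup action may embed a policy under "inline_policy".
    Any index without both pieces falls back to the parent policy.
    """
    source_phase = index_metadata.get("rollup_source_phase")
    if source_phase is None:
        # Not a rollup index: execute the parent policy as usual.
        return parent_policy

    phase = parent_policy["phases"].get(source_phase, {})
    rollup_action = phase.get("actions", {}).get("rollup")
    if rollup_action is None or "inline_policy" not in rollup_action:
        return parent_policy
    return rollup_action["inline_policy"]


if __name__ == "__main__":
    parent = {
        "phases": {
            "hot": {"actions": {"rollup": {"inline_policy": {"phases": {"warm": {}}}}}},
            "cold": {"actions": {"rollup": {"inline_policy": {"phases": {"delete": {}}}}}},
        }
    }
    print(effective_policy({"rollup_source_phase": "cold"}, parent))
```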
In the past few weeks we discussed the integration of the new rollup functionalities that the rollup group is working on with several teams. We want to integrate rollups in data streams and in the Index Lifecycle Management as a native feature for metrics. However, we found it hard to achieve a consensus on the best approach for the integration. The design that we made is clear and solves all the needs that were raised by rollup v1 users, but it leaves a lot of room for use cases that we are not sure we want to expose. As a result of these discussions, we agreed to focus on the following two identified problems:
We think that reducing the number of problems we want to solve will help to gather consensus more quickly. In view of the above decisions, I am closing this ticket and we will revisit ILM + rollup integration for the two specific use cases soon. Thanks everyone for the participation and extremely useful feedback. |
Currently, if a user wants to manage the lifecycle of a rollup index created within an ILM policy (within a data stream, for example), we require that this index have a separate policy. Maintaining policies is not easy, and this makes it even more difficult. For this reason, we should re-explore solutions to this problem.
One solution is to limit the actions that work for rollup indices. For example, maybe a rollup index must share the same lifecycle/policy as the original index for all phases except the delete phase. This would allow rollup indices to be deleted at a later time. The policy for this may look like:
Potentially, a separate after parameter just for the rollup index created by this policy could be used, thus managing both the original and the rollup index with minimal modification and no additional policy to maintain. This is just one option; we may come up with others.
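As a rough illustration of the single-policy option described above, the sketch below builds a policy in which the delete phase carries a separate age threshold for rollup indices. The `rollup_after` key and the `delete_age` helper are hypothetical names chosen for this example, and the body of the `rollup` action is left empty because its configuration is not specified here.

```python
import json


def build_policy(original_delete_after: str, rollup_delete_after: str) -> dict:
    """Build one policy shared by the original index and its rollup.

    "rollup_after" is a hypothetical parameter: the age at which the
    rollup index created by this policy enters the delete phase.
    """
    return {
        "policy": {
            "phases": {
                "hot": {"actions": {"rollover": {"max_age": "1d"}}},
                # Rollup action body intentionally left empty (not specified here).
                "cold": {"min_age": "7d", "actions": {"rollup": {}}},
                "delete": {
                    "min_age": original_delete_after,
                    "rollup_after": rollup_delete_after,  # hypothetical
                    "actions": {"delete": {}},
                },
            }
        }
    }


def delete_age(policy: dict, is_rollup: bool) -> str:
    """Return the age threshold that applies to the given kind of index."""
    delete_phase = policy["policy"]["phases"]["delete"]
    if is_rollup:
        # Fall back to the shared threshold when no rollup-specific one is set.
        return delete_phase.get("rollup_after", delete_phase["min_age"])
    return delete_phase["min_age"]


if __name__ == "__main__":
    policy = build_policy("30d", "365d")
    print(json.dumps(policy, indent=2))
    print(delete_age(policy, is_rollup=False), delete_age(policy, is_rollup=True))
```

With these example values the original index would be deleted after 30 days and the rollup index after 365 days, with both governed by the same policy.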