
ILM: Make all the shrink action steps retryable #70107


Merged
merged 34 commits into elastic:master on Mar 18, 2021

Conversation

andreidan (Contributor) commented Mar 8, 2021

This aims at making the shrink action retryable. Every step is
retryable, but in order to provide an experience where ILM tries
to achieve a successful shrink even when the target node goes
missing permanently or the shrunk index cannot recover, this also
introduces a retryable shrink cycle within the shrink action.

The shrink action will generate a unique index name that'll be the
shrunk index name. The generated index name is stored in the lifecycle
state.

If the shrink action ends up waiting for the source shards to
colocate or for the shrunk index to recover for more than the configured
`LIFECYCLE_STEP_WAIT_TIME_THRESHOLD` setting, it will move back
to clean up the attempted (and failed) shrunk index and will retry
generating a new index name and attempting to shrink the source
to the newly generated index name.

Relates to #48183

andreidan added the :Data Management/ILM+SLM (Index and Snapshot lifecycle management) label Mar 8, 2021
andreidan (Contributor Author):

@elasticmachine update branch

elasticmachine and others added 2 commits March 9, 2021 02:50
We were setting the nextStepKey regardless of whether the underlying step
execution was successful. We were also asserting that we have a
`nextStepKey` even if the step failed (and moved into the `ERROR` step).

I don't think this is correct, as from the `ERROR` step the only place
ILM can move is back to the `failedStep` (if it was retryable).
andreidan marked this pull request as ready for review March 9, 2021 09:06
elasticmachine added the Team:Data Management (Meta label for data/management team) label Mar 9, 2021
elasticmachine (Collaborator):

Pinging @elastic/es-core-features (Team:Core/Features)

andreidan requested a review from dakrone March 9, 2021 09:07
// This setting configures how long ILM should wait, measured from step_time, for a condition to be met. After the threshold wait time has
// elapsed, ILM will likely stop waiting and go to the next step.
// Also see {@link org.elasticsearch.xpack.core.ilm.ClusterStateWaitUntilThresholdStep}
public static final Setting<TimeValue> LIFECYCLE_STEP_WAIT_TIME_THRESHOLD_SETTING =
Contributor Author:

Would we want to document this?

Member:

I think so, even if only in the shrink action documentation.

andreidan (Contributor Author):

@elasticmachine update branch

dakrone (Member) left a comment:

Thanks for working on this Andrei! I left some comments for an initial review.

// wonderful thing)
TimeValue retryThreshold = LifecycleSettings.LIFECYCLE_STEP_WAIT_TIME_THRESHOLD_SETTING.get(idxMeta.getSettings());
LifecycleExecutionState lifecycleState = fromIndexMetadata(idxMeta);
if (waitedMoreThanThresholdLevel(retryThreshold, lifecycleState, Clock.systemUTC())) {
Member:

I'm slightly concerned that using Clock.systemUTC() here means that we'll be subject to drift as the time on the underlying machine changes, but to fix that we'd have to use System.nanoTime() which would be a little strange since it's not necessarily "real" time.

(not really something we should change, I'm just voicing my concern)

andreidan (Contributor Author), Mar 16, 2021:

Oh, you're absolutely right, wall clocks are not trustworthy. I don't think nanoTime would be the answer either, as that API's guarantee is that readings are monotonically increasing, not that they're accurate.

I don't think this is too bad, because if we get a reading where the time drift is hours off we'll just retry the shrink cycle. So nothing truly bad happens, except a wasteful (and presumably eventually successful) shrink operation, with the note that we'll at least delete the abandoned shrunk index.
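
For reference, the check under discussion boils down to a wall-clock comparison along these lines (a sketch only; the method and parameter names here are assumptions, and the real waitedMoreThanThresholdLevel may differ):

import java.time.Clock;

// Sketch of the wall-clock threshold check: compare the time elapsed since the
// recorded step_time against the configured threshold.
static boolean waitedMoreThanThreshold(long stepTimeEpochMillis, long thresholdMillis, Clock clock) {
    long elapsedMillis = clock.millis() - stepTimeEpochMillis;
    return elapsedMillis > thresholdMillis;
}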

@@ -38,7 +38,7 @@

public static final String NAME = "generate-snapshot-name";

private static final Logger logger = LogManager.getLogger(CreateSnapshotStep.class);
private static final Logger logger = LogManager.getLogger(GenerateSnapshotNameStep.class);
Member:

Good catch :)

String generatedIndexName = generateValidIndexName(prefix, index.getName());
ActionRequestValidationException validationException = validateGeneratedIndexName(generatedIndexName, clusterState);
if (validationException != null) {
logger.warn("unable to generate a valid shrink index name as part of policy [{}] for index [{}] due to [{}]",
Member:

Do we want the term "shrink" in here? I was thinking we may want to end up re-using this for other purposes.

Contributor Author:

Ah yes, good catch

Comment on lines 140 to 141
static String generateValidIndexName(String prefix, String indexName) {
return prefix + indexName + "-" + generateValidIndexSuffix(() -> UUIDs.randomBase64UUID().toLowerCase(Locale.ROOT));
Member:

I don't think we want a suffix on the index, it should be

<prefix>-<uuid>-<indexName>

The reason is that if we add any suffixes, then we break origination date parsing for anyone that has index.lifecycle.parse_origination_date set for their indices.

I also think that we don't need a full UUID, which can be quite long! Imagine if we end up using this logic for all our prefixing steps and we'd end up with:

partial-jrx_xoiprjuncgvs5vmfmq-restored-krx_xoiprjuncgvs5vmfmq-shrink-lrx_xoiprjuncgvs5vmfmq-rollup-msy_xoiprjuncgvs5vmfmq-.ds-myindex-2021.03.11-000001

Which is super long! We even run the risk of hitting the maximum index name length! We don't expect this UUID to clash that frequently, perhaps we can use a subset of the UUID, like 4 characters or so, as the UUID? Especially because if we do have a clash, then we'll ERROR and retry again with a different UUID next time.
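
A rough sketch of the shape being suggested (a hypothetical helper, not the merged code; the short 4-character segment follows the suggestion above):

import java.util.Locale;
import java.util.UUID;

// Hypothetical sketch: the prefix (e.g. "shrink-"), then a short random segment, then
// the unmodified source index name, so suffix-based origination-date parsing keeps working.
static String generatePrefixedIndexName(String prefix, String indexName) {
    String shortRandom = UUID.randomUUID().toString().toLowerCase(Locale.ROOT).substring(0, 4);
    return prefix + shortRandom + "-" + indexName;
}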

SHRUNKEN_INDEX_PREFIX);
ShrunkShardsAllocatedStep allocated = new ShrunkShardsAllocatedStep(enoughShardsKey, copyMetadataKey, SHRUNKEN_INDEX_PREFIX);
ClusterStateWaitUntilThresholdStep checkShrinkReadyStep = new ClusterStateWaitUntilThresholdStep(
new CheckShrinkReadyStep(allocationRoutedKey, shrinkKey), cleanupShrinkIndexKey);
Member:

I think this only needs to rewind back to setSingleNodeKey instead of cleanupShrinkIndexKey, right? This is the step before the shrunken index has actually been created.

We only want to rewind and pick a new node if we can't allocate within the time frame, no need to delete anything.

@@ -177,14 +183,20 @@ public boolean isSafeAction() {
CheckNotDataStreamWriteIndexStep checkNotWriteIndexStep = new CheckNotDataStreamWriteIndexStep(checkNotWriteIndex,
waitForNoFollowerStepKey);
WaitForNoFollowersStep waitForNoFollowersStep = new WaitForNoFollowersStep(waitForNoFollowerStepKey, readOnlyKey, client);
UpdateSettingsStep readOnlyStep = new UpdateSettingsStep(readOnlyKey, setSingleNodeKey, client, readOnlySettings);
UpdateSettingsStep readOnlyStep = new UpdateSettingsStep(readOnlyKey, cleanupShrinkIndexKey, client, readOnlySettings);
CleanupShrinkIndexStep cleanupShrinkIndexStep = new CleanupShrinkIndexStep(cleanupShrinkIndexKey, generateShrinkIndexNameKey,
Member:

Can you comment the steps in this method so someone that comes in can follow the execution flow? I think it'd be helpful for others reading the code.

Comment on lines 56 to 62
LifecycleExecutionState lifecycleState = LifecycleExecutionState.fromIndexMetadata(indexMetadata);
String shrunkenIndexName = lifecycleState.getShrinkIndexName();
if (shrunkenIndexName == null) {
// this is for BWC reasons, for policies that are in the middle of executing the shrink action when the update to generated
// names happens
shrunkenIndexName = shrunkIndexPrefix + index.getName();
}
Member:

It looks like we duplicate this in enough places that maybe we want to put it in a static method somewhere, passing in the prefix and the IndexMetadata. What do you think?
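
Such a shared helper might look roughly like this (a sketch; its name and final home are assumptions, and the fallback mirrors the snippet above):

// Resolve the shrunken index name from the lifecycle execution state, falling back to
// the legacy "<prefix><source-index>" name for policies that were mid-shrink before
// generated names existed (BWC).
static String getShrinkIndexName(String shrunkIndexPrefix, IndexMetadata indexMetadata) {
    LifecycleExecutionState lifecycleState = LifecycleExecutionState.fromIndexMetadata(indexMetadata);
    String shrunkenIndexName = lifecycleState.getShrinkIndexName();
    if (shrunkenIndexName == null) {
        shrunkenIndexName = shrunkIndexPrefix + indexMetadata.getIndex().getName();
    }
    return shrunkenIndexName;
}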

Comment on lines 654 to 655
// Sometimes throw in an extraneous unfollow just to check it doesn't break anything
if (randomBoolean()) {
Member:

Why was this removed?

andreidan (Contributor Author), Mar 15, 2021:

Ah, I suspected this could cause some flakiness, but I believe it might've been due to dev issues while I was working through this feature (saving the execution state, namely the shrink index name being present in the execution state when we reach the cold phase). I'll add this back. Thanks!

// elapsed, ILM will likely stop waiting and go to the next step.
// Also see {@link org.elasticsearch.xpack.core.ilm.ClusterStateWaitUntilThresholdStep}
public static final Setting<TimeValue> LIFECYCLE_STEP_WAIT_TIME_THRESHOLD_SETTING =
Setting.positiveTimeSetting(LIFECYCLE_STEP_WAIT_TIME_THRESHOLD, TimeValue.timeValueHours(12), Setting.Property.Dynamic,
Member:

Should we enforce a minimum setting? It would be bad to be spinning forever because someone put 1 second as their wait time.

Contributor Author:

Good point. Shall we set it to one hour minimum? (this will make it impossible to integration test - https://github.com/elastic/elasticsearch/pull/70107/files/db8a13789e40a0b7f64263a0c88a257b21eb206c#diff-00728663bc72dd25e9832acf55fd6c4dd690a34d3c9ef6e5408464d81ff1f785R309 - however, maybe the unit tests we have around the ClusterStateWaitUntilThresholdStep are enough)

Contributor Author:

I managed to integration test the rewind even with a 1h minimum value in 73e363d
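
For reference, a floor can be expressed directly on the setting, roughly like this (a sketch assuming the Setting#timeSetting overload that accepts a minimum value; the exact property flags are illustrative):

// Same setting with a 1h minimum, so a value like "1s" is rejected up front.
public static final Setting<TimeValue> LIFECYCLE_STEP_WAIT_TIME_THRESHOLD_SETTING =
    Setting.timeSetting(LIFECYCLE_STEP_WAIT_TIME_THRESHOLD, TimeValue.timeValueHours(12),
        TimeValue.timeValueHours(1), Setting.Property.Dynamic, Setting.Property.IndexScope);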

andreidan (Contributor Author):

@elasticmachine update branch

andreidan (Contributor Author):

@jrodewig thanks so much for restructuring and rewording the docs

andreidan (Contributor Author):

if manually moving an index that is past a shrink step back to any point before the shrink, the cleanup step actually ends up deleting the index itself

@dakrone ah interesting. So this happened because the alias/identity of the source index was swapped to the shrink index (and the source index was deleted in the process). I pushed 4b091e6 to skip the cleanup index step if the managed index is a shrunk index and the source index does not exist anymore
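
A rough sketch of the guard described above (hypothetical shape; the actual change in 4b091e6 may differ in detail):

// If the managed index is itself a shrunk index (it records a resize source) and that
// source index no longer exists in the cluster, there is nothing to clean up, so the
// cleanup step should be a no-op rather than deleting the managed index.
String resizeSourceName = indexMetadata.getSettings().get(IndexMetadata.INDEX_RESIZE_SOURCE_NAME_KEY);
boolean nothingToCleanUp = resizeSourceName != null
    && clusterState.metadata().hasIndex(resizeSourceName) == false;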

andreidan (Contributor Author):

@elasticmachine update branch

andreidan requested a review from dakrone March 17, 2021 12:29
andreidan (Contributor Author):

@elasticmachine update branch

dakrone (Member) left a comment:

LGTM, thanks for iterating on this Andrei!

andreidan merged commit 9831084 into elastic:master Mar 18, 2021
andreidan added a commit to andreidan/elasticsearch that referenced this pull request Mar 18, 2021
This aims at making the shrink action retryable. Every step is
retryable, but in order to provide an experience where ILM tries
to achieve a successful shrink even when the target node goes
missing permanently or the shrunk index cannot recover, this also
introduces a retryable shrink cycle within the shrink action.

The shrink action will generate a unique index name that'll be the
shrunk index name. The generated index name is stored in the lifecycle
state.

If the shrink action ends up waiting for the source shards to
colocate or for the shrunk index to recover for more than the configured
`LIFECYCLE_STEP_WAIT_TIME_THRESHOLD` setting, it will move back
to clean up the attempted (and failed) shrunk index and will retry
generating a new index name and attempting to shrink the source
to the newly generated index name.

(cherry picked from commit 9831084)
Signed-off-by: Andrei Dan <[email protected]>
andreidan added a commit to andreidan/elasticsearch that referenced this pull request Mar 18, 2021
andreidan added a commit that referenced this pull request Mar 18, 2021
Labels
:Data Management/ILM+SLM (Index and Snapshot lifecycle management), >feature, Team:Data Management (Meta label for data/management team), v7.13.0, v8.0.0-alpha1