ILM Make the check-rollover-ready step retryable #48256

Merged
merged 32 commits into from
Oct 31, 2019
3fb77aa
ILM open/close steps are noop if idx is open/close
andreidan Oct 18, 2019
f8f44fd
ILM Make the `check-rollover-ready` step retryable
andreidan Oct 18, 2019
10368db
Drop unused imports
andreidan Oct 18, 2019
016d1c9
ILM add step retries information to explain api
andreidan Oct 18, 2019
01a8bc1
Add versioning protection.
andreidan Oct 28, 2019
1fd1ebc
Merge branch 'master' into ilm-retry-failed-step
elasticmachine Oct 28, 2019
b0256dd
Guard the serialisation changes against gte 8.0.0
andreidan Oct 28, 2019
6ea24c5
Rename isTransitiveError to isAutoRetryableError
andreidan Oct 29, 2019
4d6a488
Fix the ILM explain test
andreidan Oct 29, 2019
ee80993
IndexLifecycleExplainResponseTest: adjust the random retry values
andreidan Oct 29, 2019
5ba959d
Drop lifecycle poll interval configuration in IT
andreidan Oct 29, 2019
79d52c3
Change max retry setting default to -1
andreidan Oct 29, 2019
3f2e68c
Log the index name too
andreidan Oct 29, 2019
ed8b35e
On state change don't attempt to retry failed step
andreidan Oct 29, 2019
7bd179c
Drop moveClusterStateToRetryFailedStep as it was just an overload
andreidan Oct 29, 2019
2dc04ad
Test validateTransition
andreidan Oct 29, 2019
4019c99
Revert "ILM open/close steps are noop if idx is open/close"
andreidan Oct 29, 2019
e34bd81
Update isRetryable javadoc
andreidan Oct 29, 2019
18734b9
Merge branch 'master' into ilm-retry-failed-step
elasticmachine Oct 29, 2019
ad1417c
Remove the LIFECYCLE_MAX_FAILED_STEP_RETRIES_COUNT setting.
andreidan Oct 30, 2019
398e82e
Drop unused field
andreidan Oct 30, 2019
092d256
Guard against a possible null failed step
andreidan Oct 30, 2019
90cfb01
Throw IndexNotFoundException instead of IllegalArgumentException
andreidan Oct 30, 2019
bb27c6f
Rename to moveClusterStateToPreviouslyFailedStep
andreidan Oct 30, 2019
8cfe423
Don't use randomBool as we're asserting only the managed case
andreidan Oct 30, 2019
968ef38
Test the ilm/explain retry count output in separate test
andreidan Oct 30, 2019
998eaa9
Fix test to expect IndexNotFoundException
andreidan Oct 30, 2019
6a4e3eb
Update retry api test to expect index not found exception
andreidan Oct 30, 2019
ed687fb
Revert "Update retry api test to expect index not found exception"
andreidan Oct 30, 2019
434fa64
Revert "Fix test to expect IndexNotFoundException"
andreidan Oct 30, 2019
a3e9a89
Revert "Throw IndexNotFoundException instead of IllegalArgumentExcept…
andreidan Oct 30, 2019
52565d6
Merge branch 'master' into ilm-retry-failed-step
elasticmachine Oct 31, 2019
23 changes: 16 additions & 7 deletions docs/reference/ilm/apis/explain.asciidoc
@@ -239,8 +239,11 @@ information for the step that's being performed on the index.

If the index is in the ERROR step, something went wrong while executing a
step in the policy and you will need to take action for the index to proceed
to the next step. To help you diagnose the problem, the explain response shows
the step that failed and the step info provides information about the error.
to the next step. Some steps are safe to automatically be retried in certain
circumstances. To help you diagnose the problem, the explain response shows
the step that failed, the step info which provides information about the error,
and information about the retry attempts executed for the failed step if it's
the case.

[source,console-result]
--------------------------------------------------
@@ -262,10 +265,12 @@ the step that failed and the step info provides information about the error.
"step": "ERROR",
"step_time_millis": 1538475653317,
"step_time": "2018-10-15T13:45:22.577Z",
"failed_step": "attempt-rollover", <1>
"step_info": { <2>
"type": "resource_already_exists_exception",
"reason": "index [test-000057/H7lF9n36Rzqa-KfKcnGQMg] already exists",
"failed_step": "check-rollover-ready", <1>
"is_auto_retryable_error": true, <2>
"failed_step_retry_count": 1, <3>
"step_info": { <4>
"type": "cluster_block_exception",
"reason": "index [test-000057/H7lF9n36Rzqa-KfKcnGQMg] blocked by: [FORBIDDEN/5/index read-only (api)];",
"index_uuid": "H7lF9n36Rzqa-KfKcnGQMg",
"index": "test-000057"
},
@@ -290,4 +295,8 @@ the step that failed and the step info provides information about the error.
// TESTRESPONSE[skip:not possible to get the cluster into this state in a docs test]

<1> The step that caused the error
<2> What went wrong
<2> Indicates if retrying the failed step can overcome the error. If this
is true, ILM will retry the failed step automatically.
<3> Shows the number of attempted automatic retries to execute the failed
step.
<4> What went wrong
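
As a rough illustration of how a client might act on the two new explain fields, here is a minimal sketch. It assumes the per-index part of the explain response has already been parsed into a `Map`; the `ExplainTriage` class and its `Verdict` enum are hypothetical names used only for this example and are not part of this PR or of Elasticsearch.

```java
import java.util.Map;

// Hypothetical helper (not part of Elasticsearch): interprets the per-index
// fields of an ILM explain response that was already parsed into a Map.
final class ExplainTriage {

    enum Verdict { HEALTHY, WAIT_FOR_AUTO_RETRY, NEEDS_MANUAL_ACTION }

    static Verdict triage(Map<String, Object> indexExplain) {
        // An index that is not in the ERROR step needs no triage.
        if (!"ERROR".equals(indexExplain.get("step"))) {
            return Verdict.HEALTHY;
        }
        // The field is optional in the response; this sketch treats a
        // missing value the same as "not auto-retryable".
        boolean autoRetryable = Boolean.TRUE.equals(indexExplain.get("is_auto_retryable_error"));
        return autoRetryable ? Verdict.WAIT_FOR_AUTO_RETRY : Verdict.NEEDS_MANUAL_ACTION;
    }

    public static void main(String[] args) {
        // Values mirror the documentation example above.
        Map<String, Object> explain = Map.of(
            "step", "ERROR",
            "failed_step", "check-rollover-ready",
            "is_auto_retryable_error", true,
            "failed_step_retry_count", 1);
        System.out.println(triage(explain));
    }
}
```

Treating an absent `is_auto_retryable_error` as "needs manual action" is a conservative choice made for the sketch, since the field is only emitted when it is non-null.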
@@ -6,6 +6,7 @@

package org.elasticsearch.xpack.core.ilm;

import org.elasticsearch.Version;
import org.elasticsearch.common.ParseField;
import org.elasticsearch.common.Strings;
import org.elasticsearch.common.bytes.BytesReference;
@@ -34,6 +35,8 @@ public class IndexLifecycleExplainResponse implements ToXContentObject, Writeabl
private static final ParseField ACTION_FIELD = new ParseField("action");
private static final ParseField STEP_FIELD = new ParseField("step");
private static final ParseField FAILED_STEP_FIELD = new ParseField("failed_step");
private static final ParseField IS_AUTO_RETRYABLE_ERROR_FIELD = new ParseField("is_auto_retryable_error");
private static final ParseField FAILED_STEP_RETRY_COUNT_FIELD = new ParseField("failed_step_retry_count");
private static final ParseField PHASE_TIME_MILLIS_FIELD = new ParseField("phase_time_millis");
private static final ParseField PHASE_TIME_FIELD = new ParseField("phase_time");
private static final ParseField ACTION_TIME_MILLIS_FIELD = new ParseField("action_time_millis");
@@ -55,6 +58,8 @@ public class IndexLifecycleExplainResponse implements ToXContentObject, Writeabl
(String) a[5],
(String) a[6],
(String) a[7],
(Boolean) a[14],
(Integer) a[15],
(Long) (a[8]),
(Long) (a[9]),
(Long) (a[10]),
@@ -82,6 +87,8 @@ public class IndexLifecycleExplainResponse implements ToXContentObject, Writeabl
PARSER.declareObject(ConstructingObjectParser.optionalConstructorArg(), (p, c) -> PhaseExecutionInfo.parse(p, ""),
PHASE_EXECUTION_INFO);
PARSER.declareString(ConstructingObjectParser.optionalConstructorArg(), AGE_FIELD);
PARSER.declareBoolean(ConstructingObjectParser.optionalConstructorArg(), IS_AUTO_RETRYABLE_ERROR_FIELD);
PARSER.declareInt(ConstructingObjectParser.optionalConstructorArg(), FAILED_STEP_RETRY_COUNT_FIELD);
}

private final String index;
@@ -97,21 +104,25 @@ public class IndexLifecycleExplainResponse implements ToXContentObject, Writeabl
private final boolean managedByILM;
private final BytesReference stepInfo;
private final PhaseExecutionInfo phaseExecutionInfo;
private final Boolean isAutoRetryableError;
private final Integer failedStepRetryCount;

public static IndexLifecycleExplainResponse newManagedIndexResponse(String index, String policyName, Long lifecycleDate,
String phase, String action, String step, String failedStep, Long phaseTime, Long actionTime, Long stepTime,
BytesReference stepInfo, PhaseExecutionInfo phaseExecutionInfo) {
return new IndexLifecycleExplainResponse(index, true, policyName, lifecycleDate, phase, action, step, failedStep, phaseTime,
actionTime, stepTime, stepInfo, phaseExecutionInfo);
String phase, String action, String step, String failedStep, Boolean isAutoRetryableError, Integer failedStepRetryCount,
Long phaseTime, Long actionTime, Long stepTime, BytesReference stepInfo, PhaseExecutionInfo phaseExecutionInfo) {
return new IndexLifecycleExplainResponse(index, true, policyName, lifecycleDate, phase, action, step, failedStep,
isAutoRetryableError, failedStepRetryCount, phaseTime, actionTime, stepTime, stepInfo, phaseExecutionInfo);
}

public static IndexLifecycleExplainResponse newUnmanagedIndexResponse(String index) {
return new IndexLifecycleExplainResponse(index, false, null, null, null, null, null, null, null, null, null, null, null);
return new IndexLifecycleExplainResponse(index, false, null, null, null, null, null, null, null, null, null, null, null, null,
null);
}

private IndexLifecycleExplainResponse(String index, boolean managedByILM, String policyName, Long lifecycleDate,
String phase, String action, String step, String failedStep, Long phaseTime, Long actionTime,
Long stepTime, BytesReference stepInfo, PhaseExecutionInfo phaseExecutionInfo) {
String phase, String action, String step, String failedStep, Boolean isAutoRetryableError,
Integer failedStepRetryCount, Long phaseTime, Long actionTime, Long stepTime,
BytesReference stepInfo, PhaseExecutionInfo phaseExecutionInfo) {
if (managedByILM) {
if (policyName == null) {
throw new IllegalArgumentException("[" + POLICY_NAME_FIELD.getPreferredName() + "] cannot be null for managed index");
@@ -142,6 +153,8 @@ private IndexLifecycleExplainResponse(String index, boolean managedByILM, String
this.actionTime = actionTime;
this.stepTime = stepTime;
this.failedStep = failedStep;
this.isAutoRetryableError = isAutoRetryableError;
this.failedStepRetryCount = failedStepRetryCount;
this.stepInfo = stepInfo;
this.phaseExecutionInfo = phaseExecutionInfo;
}
@@ -161,13 +174,22 @@ public IndexLifecycleExplainResponse(StreamInput in) throws IOException {
stepTime = in.readOptionalLong();
stepInfo = in.readOptionalBytesReference();
phaseExecutionInfo = in.readOptionalWriteable(PhaseExecutionInfo::new);
if (in.getVersion().onOrAfter(Version.V_8_0_0)) {
isAutoRetryableError = in.readOptionalBoolean();
failedStepRetryCount = in.readOptionalVInt();
} else {
isAutoRetryableError = null;
failedStepRetryCount = null;
}
} else {
policyName = null;
lifecycleDate = null;
phase = null;
action = null;
step = null;
failedStep = null;
isAutoRetryableError = null;
failedStepRetryCount = null;
phaseTime = null;
actionTime = null;
stepTime = null;
@@ -192,6 +214,10 @@ public void writeTo(StreamOutput out) throws IOException {
out.writeOptionalLong(stepTime);
out.writeOptionalBytesReference(stepInfo);
out.writeOptionalWriteable(phaseExecutionInfo);
if (out.getVersion().onOrAfter(Version.V_8_0_0)) {
out.writeOptionalBoolean(isAutoRetryableError);
out.writeOptionalVInt(failedStepRetryCount);
}
}
}
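
The version guard in the two hunks above keeps mixed-version clusters compatible: the new optional fields go on the wire only when the peer is known to understand them, and the reader falls back to `null` otherwise. Below is a minimal, self-contained sketch of that pattern using plain `java.io` streams. The integer version constant, the `VersionGatedWireSketch` class and its `Decoded` holder are assumptions made for illustration; they stand in for, and are not, the `StreamInput`/`StreamOutput` API used in the PR.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the version-gated serialization shown in the diff.
final class VersionGatedWireSketch {

    // Assumed integer wire version at which the two new fields were introduced.
    static final int FIRST_VERSION_WITH_RETRY_FIELDS = 8;

    static final class Decoded {
        final long stepTime;
        final Boolean isAutoRetryableError;
        final Integer failedStepRetryCount;

        Decoded(long stepTime, Boolean isAutoRetryableError, Integer failedStepRetryCount) {
            this.stepTime = stepTime;
            this.isAutoRetryableError = isAutoRetryableError;
            this.failedStepRetryCount = failedStepRetryCount;
        }
    }

    static byte[] write(int peerVersion, long stepTime, Boolean retryable, Integer retryCount) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeLong(stepTime);
            // Only send the new fields to peers that know how to read them.
            if (peerVersion >= FIRST_VERSION_WITH_RETRY_FIELDS) {
                // A presence byte precedes each value so null survives the round trip.
                out.writeBoolean(retryable != null);
                if (retryable != null) {
                    out.writeBoolean(retryable);
                }
                out.writeBoolean(retryCount != null);
                if (retryCount != null) {
                    out.writeInt(retryCount);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static Decoded read(int peerVersion, byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            long stepTime = in.readLong();
            Boolean retryable = null;
            Integer retryCount = null;
            // Mirror the writer: older peers never sent these fields.
            if (peerVersion >= FIRST_VERSION_WITH_RETRY_FIELDS) {
                if (in.readBoolean()) {
                    retryable = in.readBoolean();
                }
                if (in.readBoolean()) {
                    retryCount = in.readInt();
                }
            }
            return new Decoded(stepTime, retryable, retryCount);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Decoded fromNewPeer = read(8, write(8, 1538475653317L, true, 1));
        Decoded fromOldPeer = read(7, write(7, 1538475653317L, true, 1));
        System.out.println("new peer retry count: " + fromNewPeer.failedStepRetryCount);
        System.out.println("old peer retry count: " + fromOldPeer.failedStepRetryCount);
    }
}
```

The reader and writer must apply the same version check in the same position in the stream; if only one side is guarded, every field after it is misread.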

@@ -247,6 +273,14 @@ public PhaseExecutionInfo getPhaseExecutionInfo() {
return phaseExecutionInfo;
}

public Boolean isAutoRetryableError() {
return isAutoRetryableError;
}

public Integer getFailedStepRetryCount() {
return failedStepRetryCount;
}

public TimeValue getAge() {
if (lifecycleDate == null) {
return TimeValue.MINUS_ONE;
@@ -287,6 +321,12 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws
if (Strings.hasLength(failedStep)) {
builder.field(FAILED_STEP_FIELD.getPreferredName(), failedStep);
}
if (isAutoRetryableError != null) {
builder.field(IS_AUTO_RETRYABLE_ERROR_FIELD.getPreferredName(), isAutoRetryableError);
}
if (failedStepRetryCount != null) {
builder.field(FAILED_STEP_RETRY_COUNT_FIELD.getPreferredName(), failedStepRetryCount);
}
if (stepInfo != null && stepInfo.length() > 0) {
builder.rawField(STEP_INFO_FIELD.getPreferredName(), stepInfo.streamInput(), XContentType.JSON);
}
@@ -300,8 +340,8 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws

@Override
public int hashCode() {
return Objects.hash(index, managedByILM, policyName, lifecycleDate, phase, action, step, failedStep, phaseTime, actionTime,
stepTime, stepInfo, phaseExecutionInfo);
return Objects.hash(index, managedByILM, policyName, lifecycleDate, phase, action, step, failedStep, isAutoRetryableError,
failedStepRetryCount, phaseTime, actionTime, stepTime, stepInfo, phaseExecutionInfo);
}

@Override
@@ -321,6 +361,8 @@ public boolean equals(Object obj) {
Objects.equals(action, other.action) &&
Objects.equals(step, other.step) &&
Objects.equals(failedStep, other.failedStep) &&
Objects.equals(isAutoRetryableError, other.isAutoRetryableError) &&
Objects.equals(failedStepRetryCount, other.failedStepRetryCount) &&
Objects.equals(phaseTime, other.phaseTime) &&
Objects.equals(actionTime, other.actionTime) &&
Objects.equals(stepTime, other.stepTime) &&
@@ -30,27 +30,33 @@ public class LifecycleExecutionState {
private static final String ACTION_TIME = "action_time";
private static final String STEP_TIME = "step_time";
private static final String FAILED_STEP = "failed_step";
private static final String IS_AUTO_RETRYABLE_ERROR = "is_auto_retryable_error";
private static final String FAILED_STEP_RETRY_COUNT = "failed_step_retry_count";
private static final String STEP_INFO = "step_info";
private static final String PHASE_DEFINITION = "phase_definition";

private final String phase;
private final String action;
private final String step;
private final String failedStep;
private final Boolean isAutoRetryableError;
private final Integer failedStepRetryCount;
private final String stepInfo;
private final String phaseDefinition;
private final Long lifecycleDate;
private final Long phaseTime;
private final Long actionTime;
private final Long stepTime;

private LifecycleExecutionState(String phase, String action, String step, String failedStep,
String stepInfo, String phaseDefinition, Long lifecycleDate,
private LifecycleExecutionState(String phase, String action, String step, String failedStep, Boolean isAutoRetryableError,
Integer failedStepRetryCount, String stepInfo, String phaseDefinition, Long lifecycleDate,
Long phaseTime, Long actionTime, Long stepTime) {
this.phase = phase;
this.action = action;
this.step = step;
this.failedStep = failedStep;
this.isAutoRetryableError = isAutoRetryableError;
this.failedStepRetryCount = failedStepRetryCount;
this.stepInfo = stepInfo;
this.phaseDefinition = phaseDefinition;
this.lifecycleDate = lifecycleDate;
@@ -82,6 +88,8 @@ public static Builder builder(LifecycleExecutionState state) {
.setAction(state.action)
.setStep(state.step)
.setFailedStep(state.failedStep)
.setIsAutoRetryableError(state.isAutoRetryableError)
.setFailedStepRetryCount(state.failedStepRetryCount)
.setStepInfo(state.stepInfo)
.setPhaseDefinition(state.phaseDefinition)
.setIndexCreationDate(state.lifecycleDate)
@@ -104,6 +112,12 @@ static LifecycleExecutionState fromCustomMetadata(Map<String, String> customData
if (customData.containsKey(FAILED_STEP)) {
builder.setFailedStep(customData.get(FAILED_STEP));
}
if (customData.containsKey(IS_AUTO_RETRYABLE_ERROR)) {
builder.setIsAutoRetryableError(Boolean.parseBoolean(customData.get(IS_AUTO_RETRYABLE_ERROR)));
}
if (customData.containsKey(FAILED_STEP_RETRY_COUNT)) {
builder.setFailedStepRetryCount(Integer.parseInt(customData.get(FAILED_STEP_RETRY_COUNT)));
}
if (customData.containsKey(STEP_INFO)) {
builder.setStepInfo(customData.get(STEP_INFO));
}
@@ -164,6 +178,12 @@ public Map<String, String> asMap() {
if (failedStep != null) {
result.put(FAILED_STEP, failedStep);
}
if (isAutoRetryableError != null) {
result.put(IS_AUTO_RETRYABLE_ERROR, String.valueOf(isAutoRetryableError));
}
if (failedStepRetryCount != null) {
result.put(FAILED_STEP_RETRY_COUNT, String.valueOf(failedStepRetryCount));
}
if (stepInfo != null) {
result.put(STEP_INFO, stepInfo);
}
@@ -201,6 +221,14 @@ public String getFailedStep() {
return failedStep;
}

public Boolean isAutoRetryableError() {
return isAutoRetryableError;
}

public Integer getFailedStepRetryCount() {
return failedStepRetryCount;
}

public String getStepInfo() {
return stepInfo;
}
@@ -230,22 +258,24 @@ public boolean equals(Object o) {
if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
LifecycleExecutionState that = (LifecycleExecutionState) o;
return Objects.equals(getLifecycleDate(),that.getLifecycleDate()) &&
return Objects.equals(getLifecycleDate(), that.getLifecycleDate()) &&
Objects.equals(getPhaseTime(), that.getPhaseTime()) &&
Objects.equals(getActionTime(), that.getActionTime()) &&
Objects.equals(getStepTime(), that.getStepTime()) &&
Objects.equals(getPhase(), that.getPhase()) &&
Objects.equals(getAction(), that.getAction()) &&
Objects.equals(getStep(), that.getStep()) &&
Objects.equals(getFailedStep(), that.getFailedStep()) &&
Objects.equals(isAutoRetryableError(), that.isAutoRetryableError()) &&
Objects.equals(getFailedStepRetryCount(), that.getFailedStepRetryCount()) &&
Objects.equals(getStepInfo(), that.getStepInfo()) &&
Objects.equals(getPhaseDefinition(), that.getPhaseDefinition());
}

@Override
public int hashCode() {
return Objects.hash(getPhase(), getAction(), getStep(), getFailedStep(), getStepInfo(), getPhaseDefinition(),
getLifecycleDate(), getPhaseTime(), getActionTime(), getStepTime());
return Objects.hash(getPhase(), getAction(), getStep(), getFailedStep(), isAutoRetryableError(), getFailedStepRetryCount(),
getStepInfo(), getPhaseDefinition(), getLifecycleDate(), getPhaseTime(), getActionTime(), getStepTime());
}

public static class Builder {
@@ -259,6 +289,8 @@ public static class Builder {
private Long phaseTime;
private Long actionTime;
private Long stepTime;
private Boolean isAutoRetryableError;
private Integer failedStepRetryCount;

public Builder setPhase(String phase) {
this.phase = phase;
@@ -310,9 +342,19 @@ public Builder setStepTime(Long stepTime) {
return this;
}

public Builder setIsAutoRetryableError(Boolean isAutoRetryableError) {
this.isAutoRetryableError = isAutoRetryableError;
return this;
}

public Builder setFailedStepRetryCount(Integer failedStepRetryCount) {
this.failedStepRetryCount = failedStepRetryCount;
return this;
}

public LifecycleExecutionState build() {
return new LifecycleExecutionState(phase, action, step, failedStep, stepInfo, phaseDefinition, indexCreationDate,
phaseTime, actionTime, stepTime);
return new LifecycleExecutionState(phase, action, step, failedStep, isAutoRetryableError, failedStepRetryCount, stepInfo,
phaseDefinition, indexCreationDate, phaseTime, actionTime, stepTime);
}
}
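
`LifecycleExecutionState` is stored as a `Map<String, String>` of custom index metadata, so the new `Boolean` and `Integer` fields are written as strings and parsed back on read, with absent keys mapping to `null`. The following stripped-down sketch shows just that round trip for the two new fields. The class name `ExecutionStateMapSketch` and the `fromMap` method are hypothetical and only mirror the shape of `asMap` and `fromCustomMetadata` in the diff.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, reduced model of the string-map round trip in the diff.
final class ExecutionStateMapSketch {

    static final String IS_AUTO_RETRYABLE_ERROR = "is_auto_retryable_error";
    static final String FAILED_STEP_RETRY_COUNT = "failed_step_retry_count";

    final Boolean isAutoRetryableError;
    final Integer failedStepRetryCount;

    ExecutionStateMapSketch(Boolean isAutoRetryableError, Integer failedStepRetryCount) {
        this.isAutoRetryableError = isAutoRetryableError;
        this.failedStepRetryCount = failedStepRetryCount;
    }

    // Null fields are omitted entirely rather than stored as the string "null".
    Map<String, String> asMap() {
        Map<String, String> result = new HashMap<>();
        if (isAutoRetryableError != null) {
            result.put(IS_AUTO_RETRYABLE_ERROR, String.valueOf(isAutoRetryableError));
        }
        if (failedStepRetryCount != null) {
            result.put(FAILED_STEP_RETRY_COUNT, String.valueOf(failedStepRetryCount));
        }
        return result;
    }

    // Absent keys leave the corresponding field null, which keeps metadata
    // written before the fields existed readable.
    static ExecutionStateMapSketch fromMap(Map<String, String> customData) {
        Boolean retryable = null;
        Integer retryCount = null;
        if (customData.containsKey(IS_AUTO_RETRYABLE_ERROR)) {
            retryable = Boolean.parseBoolean(customData.get(IS_AUTO_RETRYABLE_ERROR));
        }
        if (customData.containsKey(FAILED_STEP_RETRY_COUNT)) {
            retryCount = Integer.parseInt(customData.get(FAILED_STEP_RETRY_COUNT));
        }
        return new ExecutionStateMapSketch(retryable, retryCount);
    }

    public static void main(String[] args) {
        ExecutionStateMapSketch original = new ExecutionStateMapSketch(true, 2);
        ExecutionStateMapSketch restored = fromMap(original.asMap());
        System.out.println(restored.isAutoRetryableError + " " + restored.failedStepRetryCount);
    }
}
```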

@@ -36,6 +36,7 @@ public class LifecycleSettings {
public static final Setting<Boolean> LIFECYCLE_PARSE_ORIGINATION_DATE_SETTING = Setting.boolSetting(LIFECYCLE_PARSE_ORIGINATION_DATE,
false, Setting.Property.Dynamic, Setting.Property.IndexScope);


public static final Setting<Boolean> SLM_HISTORY_INDEX_ENABLED_SETTING = Setting.boolSetting(SLM_HISTORY_INDEX_ENABLED, true,
Setting.Property.NodeScope);
public static final Setting<String> SLM_RETENTION_SCHEDULE_SETTING = Setting.simpleString(SLM_RETENTION_SCHEDULE,
@@ -38,6 +38,13 @@ public StepKey getNextStepKey() {
return nextStepKey;
}

/**
* Indicates if the step can be automatically retried when it encounters an execution error.
*/
public boolean isRetryable() {
return false;
}

@Override
public int hashCode() {
return Objects.hash(key, nextStepKey);
@@ -42,6 +42,11 @@ public WaitForRolloverReadyStep(StepKey key, StepKey nextStepKey, Client client,
this.maxDocs = maxDocs;
}

@Override
public boolean isRetryable() {
return true;
}

@Override
public void evaluateCondition(IndexMetaData indexMetaData, Listener listener) {
String rolloverAlias = RolloverAction.LIFECYCLE_ROLLOVER_ALIAS_SETTING.get(indexMetaData.getSettings());
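
The `isRetryable()` hook follows an opt-in pattern: the base step answers `false`, and only steps that are safe to re-run override it. The sketch below models that pattern in isolation. Everything in it is hypothetical: the nested `Step` and `RolloverReadyCheck` classes are simplified stand-ins for the real ones, and `nextRetryCount` is an assumed piece of bookkeeping, since the code that consumes `isRetryable()` is not part of the hunks shown here.

```java
// Hypothetical, simplified model of the opt-in retry hook in the diff.
final class RetryableStepSketch {

    static class Step {
        private final String name;

        Step(String name) {
            this.name = name;
        }

        String getName() {
            return name;
        }

        // Default: a failed step stays failed until someone intervenes.
        boolean isRetryable() {
            return false;
        }
    }

    static class RolloverReadyCheck extends Step {
        RolloverReadyCheck() {
            super("check-rollover-ready");
        }

        // Evaluating a condition has no side effects, so re-running it is safe.
        @Override
        boolean isRetryable() {
            return true;
        }
    }

    // Assumed bookkeeping: the first automatic retry yields 1, later ones increment.
    static int nextRetryCount(Step failedStep, Integer previousCount) {
        if (!failedStep.isRetryable()) {
            throw new IllegalStateException("step [" + failedStep.getName() + "] is not retryable");
        }
        return previousCount == null ? 1 : previousCount + 1;
    }

    public static void main(String[] args) {
        Step check = new RolloverReadyCheck();
        System.out.println(check.getName() + " retryable: " + check.isRetryable());
        System.out.println("retry count after first retry: " + nextRetryCount(check, null));
    }
}
```

Keeping the default at `false` means any new step type is non-retryable until its author decides it is idempotent, which is the safer failure mode.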