Skip to content

[ML] Set df-analytics task state to failed when appropriate #43880

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

dimitris-athanasiou
Copy link
Contributor

This introduces a failed state to which the data frame analytics
persistent task is set to when something unexpected fails. It could
be the process crashing, the results processor hitting some error,
etc. The failure message is then captured and set on the task state.
From there, it becomes available via the _stats API as failure_reason.

The df-analytics stop API now has a force boolean parameter. This allows
the user to call it for a failed task in order to reset it to stopped after
we have ensured the failure has been communicated to the user.

This commit also adds the analytics version in the persistent task
params as this allows us to prevent tasks to run on unsuitable nodes in
the future.

This introduces a `failed` state to which the data frame analytics
persistent task is set to when something unexpected fails. It could
be the process crashing, the results processor hitting some error,
etc. The failure message is then captured and set on the task state.
From there, it becomes available via the _stats API as `failure_reason`.

The df-analytics stop API now has a `force` boolean parameter. This allows
the user to call it for a failed task in order to reset it to `stopped` after
we have ensured the failure has been communicated to the user.

This commit also adds the analytics version in the persistent task
params as this allows us to prevent tasks to run on unsuitable nodes in
the future.
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

@@ -3112,6 +3112,10 @@ public void testStopDataFrameAnalytics() throws Exception {
StopDataFrameAnalyticsRequest request = new StopDataFrameAnalyticsRequest("my-analytics-config"); // <1>
// end::stop-data-frame-analytics-request

//tag::stop-data-frame-analytics-request-force
request.setForce(false); // <2>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is in a different doc tag callout, I think the doc build will fail. You may need to move

request.setForce(false); // <2>

Up in side the // tag::stop-data-frame-analytics-request tag

@@ -202,6 +206,9 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws
public XContentBuilder toUnwrappedXContent(XContentBuilder builder) throws IOException {
builder.field(DataFrameAnalyticsConfig.ID.getPreferredName(), id);
builder.field("state", state.toString());
if (failureReason != null) {
builder.field("failure_reason", failureReason);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it may be prudent to use a more generic reason as the field name. If the state: failed, then we know it failed. If we ever decide to populate reason with other information to indicate upgrading, stopping, etc. it would be good to be able to use the same field.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But in the short term while we only populate the field when the job is failed we might get questions about why reason is empty most of the time.

For the question of why it's not assigned to a node during an upgrade this will already be available in assignment_explanation.

I guess the question is whether to combine assignment_explanation and failure_reason into a single reason field. I'm happy to keep them separate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was to avoid having a state object. And as the stats object already has assignment_explanation, reason seems to be confusing. If we need to add more things in the future I think we'll have to rethink the object structure a bit.

allowNoMatch = in.readBoolean();
force = in.readBoolean();
expandedIds = new HashSet<>(Arrays.asList(in.readStringArray()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use Set.of in master, though it can make backporting a pain :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why I steered away from it. I think we'll have the time to replace those as we move on :-)

@@ -178,13 +179,18 @@ void gatherStatsForStoppedTasks(List<String> expandedIds, GetDataFrameAnalyticsS
PersistentTasksCustomMetaData tasks = clusterState.getMetaData().custom(PersistentTasksCustomMetaData.TYPE);
PersistentTasksCustomMetaData.PersistentTask<?> analyticsTask = MlTasks.getDataFrameAnalyticsTask(concreteAnalyticsId, tasks);
DataFrameAnalyticsState analyticsState = MlTasks.getDataFrameAnalyticsState(concreteAnalyticsId, tasks);
String failureReason = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not always get the reason and return it (if the analyticsTask != null) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention was to make it clear that we only use reason here for the failed state. I realise this is not airtight here but I'd rather we deal with it if we need to.

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dimitris-athanasiou
Copy link
Contributor Author

retest this

@dimitris-athanasiou
Copy link
Contributor Author

@elasticmachine update branch

@dimitris-athanasiou
Copy link
Contributor Author

@elasticmachine update branch

@benwtrent
Copy link
Member

Jenkins retest this please

@dimitris-athanasiou dimitris-athanasiou merged commit d6f36a8 into elastic:master Jul 3, 2019
@dimitris-athanasiou dimitris-athanasiou deleted the set-df-analytics-task-state-to-failed-when-appropriate branch July 3, 2019 07:59
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Jul 3, 2019
…stic#43880)

This introduces a `failed` state to which the data frame analytics
persistent task is set to when something unexpected fails. It could
be the process crashing, the results processor hitting some error,
etc. The failure message is then captured and set on the task state.
From there, it becomes available via the _stats API as `failure_reason`.

The df-analytics stop API now has a `force` boolean parameter. This allows
the user to call it for a failed task in order to reset it to `stopped` after
we have ensured the failure has been communicated to the user.

This commit also adds the analytics version in the persistent task
params as this allows us to prevent tasks to run on unsuitable nodes in
the future.
dimitris-athanasiou added a commit that referenced this pull request Jul 3, 2019
) (#43906)

This introduces a `failed` state to which the data frame analytics
persistent task is set to when something unexpected fails. It could
be the process crashing, the results processor hitting some error,
etc. The failure message is then captured and set on the task state.
From there, it becomes available via the _stats API as `failure_reason`.

The df-analytics stop API now has a `force` boolean parameter. This allows
the user to call it for a failed task in order to reset it to `stopped` after
we have ensured the failure has been communicated to the user.

This commit also adds the analytics version in the persistent task
params as this allows us to prevent tasks to run on unsuitable nodes in
the future.
@droberts195 droberts195 removed the :ml/Transform Transform label Jul 8, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants