[ML] Set df-analytics task state to failed when appropriate #43880

dimitris-athanasiou · 2019-07-02T14:25:25Z

This introduces a failed state that the data frame analytics
persistent task is set to when something unexpected fails, such as
the process crashing or the results processor hitting an error.
The failure message is captured and set on the task state.
From there, it becomes available via the _stats API as failure_reason.

The df-analytics stop API now has a force boolean parameter. This allows
the user to call it for a failed task in order to reset it to stopped after
we have ensured the failure has been communicated to the user.

This commit also adds the analytics version to the persistent task
params, which will allow us to prevent tasks from running on unsuitable
nodes in the future.
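
The state handling described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `AnalyticsState` and `AnalyticsTaskSketch` are hypothetical stand-ins, and rejecting a non-forced stop on a failed task is an assumption inferred from the description of `force`.

```java
import java.util.Objects;

// Hypothetical stand-in for the task state; not the Elasticsearch class.
enum AnalyticsState { STARTED, STOPPED, FAILED }

// Hypothetical model of a task that can fail and be force-stopped.
final class AnalyticsTaskSketch {
    private AnalyticsState state = AnalyticsState.STARTED;
    private String failureReason; // only set while state == FAILED

    AnalyticsState state() { return state; }

    String failureReason() { return failureReason; }

    // Record an unexpected failure and keep the message for later reporting.
    void markFailed(String reason) {
        this.failureReason = Objects.requireNonNull(reason);
        this.state = AnalyticsState.FAILED;
    }

    // Assumption: a plain stop is rejected for a failed task,
    // while a forced stop resets it to STOPPED and clears the reason.
    void stop(boolean force) {
        if (state == AnalyticsState.FAILED && !force) {
            throw new IllegalStateException(
                "task failed [" + failureReason + "]; use force to stop it");
        }
        state = AnalyticsState.STOPPED;
        failureReason = null;
    }

    public static void main(String[] args) {
        AnalyticsTaskSketch task = new AnalyticsTaskSketch();
        task.markFailed("process crashed");
        System.out.println(task.state() + ": " + task.failureReason());
        task.stop(true);
        System.out.println(task.state());
    }
}
```

Keeping the failure message on the task until a forced stop is what gives the user a chance to read it before the task is reset.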


elasticmachine · 2019-07-02T14:25:27Z

Pinging @elastic/ml-core

x-pack/plugin/src/test/resources/rest-api-spec/api/ml.stop_data_frame_analytics.json

...e/src/main/java/org/elasticsearch/xpack/core/ml/action/GetDataFrameAnalyticsStatsAction.java

benwtrent · 2019-07-02T15:26:37Z

...high-level/src/test/java/org/elasticsearch/client/documentation/MlClientDocumentationIT.java

@@ -3112,6 +3112,10 @@ public void testStopDataFrameAnalytics() throws Exception {
            StopDataFrameAnalyticsRequest request = new StopDataFrameAnalyticsRequest("my-analytics-config"); // <1>
            // end::stop-data-frame-analytics-request

+            //tag::stop-data-frame-analytics-request-force
+            request.setForce(false); // <2>


Since this is in a different doc tag callout, I think the doc build will fail. You may need to move

request.setForce(false); // <2>

up inside the `// tag::stop-data-frame-analytics-request` tag.

benwtrent · 2019-07-02T15:28:28Z

...e/src/main/java/org/elasticsearch/xpack/core/ml/action/GetDataFrameAnalyticsStatsAction.java

@@ -202,6 +206,9 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws
            public XContentBuilder toUnwrappedXContent(XContentBuilder builder) throws IOException {
                builder.field(DataFrameAnalyticsConfig.ID.getPreferredName(), id);
                builder.field("state", state.toString());
+                if (failureReason != null) {
+                    builder.field("failure_reason", failureReason);


I think it may be prudent to use a more generic `reason` as the field name. If `state` is `failed`, then we already know it failed. If we ever decide to populate `reason` with other information to indicate upgrading, stopping, etc., it would be good to be able to use the same field.

But in the short term, while we only populate the field when the job has failed, we might get questions about why `reason` is empty most of the time.

As for why a job is not assigned to a node during an upgrade, that will already be available in `assignment_explanation`.

I guess the question is whether to combine `assignment_explanation` and `failure_reason` into a single `reason` field. I'm happy to keep them separate.

My thinking was to avoid having a state object. And as the stats object already has `assignment_explanation`, a bare `reason` seems confusing. If we need to add more things in the future, I think we'll have to rethink the object structure a bit.
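
To make the outcome of this thread concrete, here is a rough sketch of a stats object that keeps `failure_reason` and `assignment_explanation` as separate optional fields. `StatsSketch` and the `id` key are assumptions for illustration; the real code writes through `XContentBuilder` as shown in the diff above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; it models the response shape with a plain map.
final class StatsSketch {
    // Builds the stats fields in insertion order; optional fields are omitted when null.
    static Map<String, Object> toStats(String id, String state,
                                       String failureReason, String assignmentExplanation) {
        Map<String, Object> stats = new LinkedHashMap<>();
        stats.put("id", id); // assumed field name
        stats.put("state", state);
        if (failureReason != null) {
            stats.put("failure_reason", failureReason);
        }
        if (assignmentExplanation != null) {
            stats.put("assignment_explanation", assignmentExplanation);
        }
        return stats;
    }

    public static void main(String[] args) {
        System.out.println(toStats("my-analytics-config", "failed", "process crashed", null));
        System.out.println(toStats("my-analytics-config", "stopped", null, null));
    }
}
```

Because each field is only emitted when it has a value, a stopped or running job carries no empty `failure_reason`, which sidesteps the "why is it empty" question raised above.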

benwtrent · 2019-07-02T15:30:20Z

.../core/src/main/java/org/elasticsearch/xpack/core/ml/action/StopDataFrameAnalyticsAction.java

            allowNoMatch = in.readBoolean();
+            force = in.readBoolean();
+            expandedIds = new HashSet<>(Arrays.asList(in.readStringArray()));


You can use Set.of in master, though it can make backporting a pain :)

That's why I steered away from it. I think we'll have the time to replace those as we move on :-)
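
For context on the trade-off, the two ways of building the set differ in more than Java version. The sketch below uses a hypothetical `ExpandedIdsSketch` class and made-up IDs; it is not code from this PR.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical comparison of the two construction styles.
final class ExpandedIdsSketch {
    // Compiles on Java 8: mutable, and silently drops duplicates.
    static Set<String> viaHashSet(String[] ids) {
        return new HashSet<>(Arrays.asList(ids));
    }

    // Requires Java 9+: immutable, and throws IllegalArgumentException on duplicates.
    static Set<String> viaSetOf(String[] ids) {
        return Set.of(ids);
    }

    public static void main(String[] args) {
        String[] ids = {"job-a", "job-b"};
        // Both sets hold the same elements, so they compare equal.
        System.out.println(viaHashSet(ids).equals(viaSetOf(ids)));
    }
}
```

Besides the language-level requirement, `Set.of` rejects duplicate and null elements, so swapping it in later is only safe where the incoming array is known to be free of both.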

benwtrent · 2019-07-02T15:35:46Z

...c/main/java/org/elasticsearch/xpack/ml/action/TransportGetDataFrameAnalyticsStatsAction.java

@@ -178,13 +179,18 @@ void gatherStatsForStoppedTasks(List<String> expandedIds, GetDataFrameAnalyticsS
        PersistentTasksCustomMetaData tasks = clusterState.getMetaData().custom(PersistentTasksCustomMetaData.TYPE);
        PersistentTasksCustomMetaData.PersistentTask<?> analyticsTask = MlTasks.getDataFrameAnalyticsTask(concreteAnalyticsId, tasks);
        DataFrameAnalyticsState analyticsState = MlTasks.getDataFrameAnalyticsState(concreteAnalyticsId, tasks);
+        String failureReason = null;


Why not always get the reason and return it (if `analyticsTask != null`)?

The intention was to make it clear that we only use reason here for the failed state. I realise this is not airtight here but I'd rather we deal with it if we need to.

droberts195

LGTM

dimitris-athanasiou · 2019-07-02T17:01:53Z

retest this

dimitris-athanasiou · 2019-07-02T17:02:50Z

@elasticmachine update branch


dimitris-athanasiou · 2019-07-02T22:31:52Z

@elasticmachine update branch


benwtrent · 2019-07-03T01:57:33Z

Jenkins retest this please



dimitris-athanasiou added >non-issue :ml Machine learning v8.0.0 :ml/Transform Transform v7.3.0 labels Jul 2, 2019

dimitris-athanasiou requested a review from droberts195 July 2, 2019 14:26

droberts195 reviewed Jul 2, 2019


Address first review comments

1bca37c

benwtrent reviewed Jul 2, 2019


droberts195 approved these changes Jul 2, 2019


Fix HLRC doc issue with reference to force

2ad2d8b

Merge branch 'master' into set-df-analytics-task-state-to-failed-when…

43cb1d3

…-appropriate

benwtrent approved these changes Jul 2, 2019


Merge branch 'master' into set-df-analytics-task-state-to-failed-when…

4cec4c0

…-appropriate

dimitris-athanasiou merged commit d6f36a8 into elastic:master Jul 3, 2019

dimitris-athanasiou deleted the set-df-analytics-task-state-to-failed-when-appropriate branch July 3, 2019 07:59

droberts195 removed the :ml/Transform Transform label Jul 8, 2019

droberts195 mentioned this pull request Aug 19, 2020

[ML] Ensure data frame analytics jobs don't run on a node that's too new #61325

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
