-
Notifications
You must be signed in to change notification settings - Fork 25.2k
[ML] Set df-analytics task state to failed when appropriate #43880
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Set df-analytics task state to failed when appropriate #43880
Conversation
This introduces a `failed` state to which the data frame analytics persistent task is set to when something unexpected fails. It could be the process crashing, the results processor hitting some error, etc. The failure message is then captured and set on the task state. From there, it becomes available via the _stats API as `failure_reason`. The df-analytics stop API now has a `force` boolean parameter. This allows the user to call it for a failed task in order to reset it to `stopped` after we have ensured the failure has been communicated to the user. This commit also adds the analytics version in the persistent task params as this allows us to prevent tasks to run on unsuitable nodes in the future.
Pinging @elastic/ml-core |
x-pack/plugin/src/test/resources/rest-api-spec/api/ml.stop_data_frame_analytics.json
Outdated
Show resolved
Hide resolved
...e/src/main/java/org/elasticsearch/xpack/core/ml/action/GetDataFrameAnalyticsStatsAction.java
Show resolved
Hide resolved
@@ -3112,6 +3112,10 @@ public void testStopDataFrameAnalytics() throws Exception { | |||
StopDataFrameAnalyticsRequest request = new StopDataFrameAnalyticsRequest("my-analytics-config"); // <1> | |||
// end::stop-data-frame-analytics-request | |||
|
|||
//tag::stop-data-frame-analytics-request-force | |||
request.setForce(false); // <2> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is in a different doc tag callout, I think the doc build will fail. You may need to move
request.setForce(false); // <2>
Up in side the // tag::stop-data-frame-analytics-request
tag
@@ -202,6 +206,9 @@ public XContentBuilder toXContent(XContentBuilder builder, Params params) throws | |||
public XContentBuilder toUnwrappedXContent(XContentBuilder builder) throws IOException { | |||
builder.field(DataFrameAnalyticsConfig.ID.getPreferredName(), id); | |||
builder.field("state", state.toString()); | |||
if (failureReason != null) { | |||
builder.field("failure_reason", failureReason); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it may be prudent to use a more generic reason
as the field name. If the state: failed
, then we know it failed. If we ever decide to populate reason
with other information to indicate upgrading
, stopping
, etc. it would be good to be able to use the same field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But in the short term while we only populate the field when the job is failed
we might get questions about why reason
is empty most of the time.
For the question of why it's not assigned to a node during an upgrade this will already be available in assignment_explanation
.
I guess the question is whether to combine assignment_explanation
and failure_reason
into a single reason
field. I'm happy to keep them separate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking was to avoid having a state
object. And as the stats object already has assignment_explanation
, reason
seems to be confusing. If we need to add more things in the future I think we'll have to rethink the object structure a bit.
allowNoMatch = in.readBoolean(); | ||
force = in.readBoolean(); | ||
expandedIds = new HashSet<>(Arrays.asList(in.readStringArray())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use Set.of
in master, though it can make backporting a pain :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's why I steered away from it. I think we'll have the time to replace those as we move on :-)
@@ -178,13 +179,18 @@ void gatherStatsForStoppedTasks(List<String> expandedIds, GetDataFrameAnalyticsS | |||
PersistentTasksCustomMetaData tasks = clusterState.getMetaData().custom(PersistentTasksCustomMetaData.TYPE); | |||
PersistentTasksCustomMetaData.PersistentTask<?> analyticsTask = MlTasks.getDataFrameAnalyticsTask(concreteAnalyticsId, tasks); | |||
DataFrameAnalyticsState analyticsState = MlTasks.getDataFrameAnalyticsState(concreteAnalyticsId, tasks); | |||
String failureReason = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not always get the reason and return it (if the analyticsTask != null) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intention was to make it clear that we only use reason
here for the failed state. I realise this is not airtight here but I'd rather we deal with it if we need to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
retest this |
@elasticmachine update branch |
@elasticmachine update branch |
Jenkins retest this please |
…stic#43880) This introduces a `failed` state to which the data frame analytics persistent task is set to when something unexpected fails. It could be the process crashing, the results processor hitting some error, etc. The failure message is then captured and set on the task state. From there, it becomes available via the _stats API as `failure_reason`. The df-analytics stop API now has a `force` boolean parameter. This allows the user to call it for a failed task in order to reset it to `stopped` after we have ensured the failure has been communicated to the user. This commit also adds the analytics version in the persistent task params as this allows us to prevent tasks to run on unsuitable nodes in the future.
) (#43906) This introduces a `failed` state to which the data frame analytics persistent task is set to when something unexpected fails. It could be the process crashing, the results processor hitting some error, etc. The failure message is then captured and set on the task state. From there, it becomes available via the _stats API as `failure_reason`. The df-analytics stop API now has a `force` boolean parameter. This allows the user to call it for a failed task in order to reset it to `stopped` after we have ensured the failure has been communicated to the user. This commit also adds the analytics version in the persistent task params as this allows us to prevent tasks to run on unsuitable nodes in the future.
This introduces a
failed
state to which the data frame analyticspersistent task is set to when something unexpected fails. It could
be the process crashing, the results processor hitting some error,
etc. The failure message is then captured and set on the task state.
From there, it becomes available via the _stats API as
failure_reason
.The df-analytics stop API now has a
force
boolean parameter. This allowsthe user to call it for a failed task in order to reset it to
stopped
afterwe have ensured the failure has been communicated to the user.
This commit also adds the analytics version in the persistent task
params as this allows us to prevent tasks to run on unsuitable nodes in
the future.