[ML] Anomaly detection job reset can get stuck with no way to unblock #105928

When an anomaly detection job is reset, we start a local (non-persistent) task to perform the reset operations.

This task can disappear before it completes. One well-known cause is the death of the node it is running on.

Looking at the code, you'd think that a way to unblock the reset would be to call the reset endpoint again for the same job. It appears that `resetIfJobIsStillBlockedOnReset` will be called from here: https://github.com/elastic/elasticsearch/blob/f77c16b78bab5e9f3ee24974aeba408e254e8c94/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportResetJobAction.java#L127

However, this doesn't happen if the original task disappeared without a trace (for example, because its node died) rather than failing with an error. In that case, the second reset call returns an error like this:

```
{
  "error": {
    "root_cause": [
      {
        "type": "resource_not_found_exception",
        "reason": "task [U325otijTPOipQrqz-0SRQ:22784192] isn't running and hasn't stored its results"
      }
    ],
    "type": "resource_not_found_exception",
    "reason": "task [U325otijTPOipQrqz-0SRQ:22784192] isn't running and hasn't stored its results"
  },
  "status": 404
}
```

This error comes from the failure handler of the get task call and cascades through the listeners from here: https://github.com/elastic/elasticsearch/blob/f77c16b78bab5e9f3ee24974aeba408e254e8c94/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportResetJobAction.java#L164
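To make the failure path concrete, here is a minimal, self-contained sketch of the behaviour described above. It is not the code in `TransportResetJobAction`: `Listener`, `NotFoundException`, `getTask`, `secondReset` and `Outcome` are all hypothetical stand-ins, and the only thing it tries to show is that when the unblock step lives solely in the success handler, a "task not found" failure goes straight back to the caller.

```java
// Hypothetical stand-ins only; none of these types are the real Elasticsearch classes.
class ResetCascadeSketch {

    /** Minimal two-callback listener, loosely modelled on an async action listener. */
    interface Listener<T> {
        void onResponse(T value);
        void onFailure(Exception e);
    }

    /** Stand-in for the exception type reported in the 404 response. */
    static class NotFoundException extends RuntimeException {
        NotFoundException(String message) {
            super(message);
        }
    }

    /** Records whether the unblock step ran and which failure the caller saw. */
    static class Outcome {
        boolean unblockCalled;
        Exception failure;
    }

    /** Simulates looking up the earlier reset task after it vanished with its node. */
    static void getTask(String taskId, Listener<String> listener) {
        listener.onFailure(new NotFoundException(
            "task [" + taskId + "] isn't running and hasn't stored its results"));
    }

    /** Second reset call: the unblock step is reachable from the success path only. */
    static Outcome secondReset(String taskId) {
        Outcome outcome = new Outcome();
        getTask(taskId, new Listener<String>() {
            @Override
            public void onResponse(String taskResult) {
                // Where resetIfJobIsStillBlockedOnReset would be invoked.
                outcome.unblockCalled = true;
            }

            @Override
            public void onFailure(Exception e) {
                // Forwarded unchanged, so the caller just sees the "not found" error.
                outcome.failure = e;
            }
        });
        return outcome;
    }

    public static void main(String[] args) {
        Outcome outcome = secondReset("node:1");
        System.out.println("unblock called: " + outcome.unblockCalled);
        System.out.println("failure: " + outcome.failure.getMessage());
    }
}
```

In this sketch the job stays blocked no matter how many times `secondReset` is called, which matches the symptom reported here.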

Another problem is that if a reset stalls because a node died, the job stays blocked until a user retries the reset, which is not very friendly. Job deletions can have the same problem, and we retry those automatically as part of our nightly maintenance task.

Therefore, we should make two changes:

1. Change the reset code so that `resetIfJobIsStillBlockedOnReset` is called not just in the success handler but also in the failure handler of a second or subsequent reset attempt when `ResourceNotFoundException` is the cause of the failure.
2. Extend the code in the ML daily maintenance task to check for resets that have no corresponding tasks and retry them in the same way that deletes without corresponding tasks are retried.
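The two changes could look roughly like the sketch below. This is only an illustration under assumptions, not a proposed patch: `NotFoundException` stands in for `ResourceNotFoundException`, the `isRetry` flag stands in for however the real code detects that a job is already blocked on reset, and `findStalledResets` assumes the maintenance task can obtain a map of blocked job IDs to their recorded reset task IDs plus the set of currently running task IDs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-ins only; none of these types are the real Elasticsearch classes.
class ResetUnblockSketch {

    /** Stand-in for ResourceNotFoundException. */
    static class NotFoundException extends RuntimeException {
        NotFoundException(String message) {
            super(message);
        }
    }

    /** Walks the cause chain looking for a NotFoundException. */
    static boolean causedByNotFound(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (cur instanceof NotFoundException) {
                return true;
            }
        }
        return false;
    }

    /**
     * Change 1: on a retry, a "task not found" failure means the earlier reset task
     * vanished, so the caller should go on to unblock the job instead of failing.
     */
    static boolean shouldUnblockOnFailure(boolean isRetry, Exception failure) {
        return isRetry && causedByNotFound(failure);
    }

    /**
     * Change 2: return the IDs of jobs that are blocked on reset but whose recorded
     * reset task is no longer running. blockedJobs maps job ID to reset task ID.
     */
    static List<String> findStalledResets(Map<String, String> blockedJobs, Set<String> runningTaskIds) {
        List<String> stalled = new ArrayList<>();
        for (Map.Entry<String, String> entry : blockedJobs.entrySet()) {
            if (!runningTaskIds.contains(entry.getValue())) {
                stalled.add(entry.getKey());
            }
        }
        return stalled;
    }

    public static void main(String[] args) {
        Exception wrapped = new RuntimeException("get task failed", new NotFoundException("task gone"));
        System.out.println("unblock on retry: " + shouldUnblockOnFailure(true, wrapped));

        Map<String, String> blocked = new TreeMap<>();
        blocked.put("job-a", "node1:10");
        blocked.put("job-b", "node2:20");
        Set<String> running = new HashSet<>(Arrays.asList("node2:20"));
        System.out.println("stalled resets: " + findStalledResets(blocked, running));
    }
}
```

Keeping the "is this failure a missing task?" check separate from the retry decision means any other kind of failure still propagates to the caller as it does today.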
