[ML] Anomaly detection job reset can get stuck with no way to unblock #105928

Closed
@droberts195

When an anomaly detection job is reset, we start a local (non-persistent) task to perform the reset operations.

It is possible for this task to disappear before it completes. One well-known way this can happen is if the node it is running on dies while it is running.

Looking at the code, you'd think that a way to unblock the reset would be to call the reset endpoint again for the same job. It appears that resetIfJobIsStillBlockedOnReset will be called from here:

ActionListener.wrap(r -> resetIfJobIsStillBlockedOnReset(task, request, listener), listener::onFailure)

However, this doesn't happen if the original task disappeared without a trace (for example, because its node died) rather than failing with an error. What actually happens in this case is that the second reset call returns an error like this:

{
  "error": {
    "root_cause": [
      {
        "type": "resource_not_found_exception",
        "reason": "task [U325otijTPOipQrqz-0SRQ:22784192] isn't running and hasn't stored its results"
      }
    ],
    "type": "resource_not_found_exception",
    "reason": "task [U325otijTPOipQrqz-0SRQ:22784192] isn't running and hasn't stored its results"
  },
  "status": 404
}

This is coming from the error handler of the get task call cascading through the listeners from here:

Another problem is that if a reset ends up stalled because of a node dying, it's not very friendly to wait until a user retries it. Job deletions can have the same problem, and we retry those automatically as part of our nightly maintenance task.

Therefore, we should make two changes:

  1. Change the reset code so that we call resetIfJobIsStillBlockedOnReset not just in the success handler but also in the failure handler of a second or subsequent retry if ResourceNotFoundException is the cause of the failure.
  2. Extend the code in the ML daily maintenance task to check for resets that have no corresponding tasks and retry them in the same way that deletes without corresponding tasks are retried.
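The two changes above could be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual Elasticsearch code: `Listener`, `TaskNotFoundException`, `afterGetTask` and `findOrphanedResets` are hypothetical stand-ins (for `ActionListener`, `ResourceNotFoundException`, the listener passed to the get task call, and the maintenance check respectively). It also assumes that a job blocked on reset records the ID of the task doing the reset, and that the listener is only built on a second or subsequent reset call, when the job is already blocked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical stand-in for ResourceNotFoundException.
class TaskNotFoundException extends RuntimeException {
    TaskNotFoundException(String message) {
        super(message);
    }
}

// Hypothetical stand-in for ActionListener.
interface Listener<T> {
    void onResponse(T response);

    void onFailure(Exception e);
}

final class ResetRetrySketch {

    // Walks the cause chain looking for the "task not found" failure.
    static boolean causedByMissingTask(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof TaskNotFoundException) {
                return true;
            }
        }
        return false;
    }

    // Change 1: listener for the "get the existing reset task" call.
    // The reset is re-run on success AND when the old task has vanished;
    // any other failure is still reported to the caller.
    static Listener<Object> afterGetTask(Runnable resetIfStillBlocked,
                                         Consumer<Exception> onOtherFailure) {
        return new Listener<Object>() {
            @Override
            public void onResponse(Object response) {
                resetIfStillBlocked.run();
            }

            @Override
            public void onFailure(Exception e) {
                if (causedByMissingTask(e)) {
                    resetIfStillBlocked.run();
                } else {
                    onOtherFailure.accept(e);
                }
            }
        };
    }

    // Change 2: maintenance-style check. Given the jobs blocked on reset
    // (job ID -> ID of the task that was resetting it) and the IDs of the
    // tasks that are still running, return the jobs whose reset should be
    // retried because no corresponding task exists.
    static List<String> findOrphanedResets(Map<String, String> blockedJobToTaskId,
                                           Set<String> liveTaskIds) {
        List<String> orphaned = new ArrayList<>();
        for (Map.Entry<String, String> entry : blockedJobToTaskId.entrySet()) {
            if (!liveTaskIds.contains(entry.getValue())) {
                orphaned.add(entry.getKey());
            }
        }
        return orphaned;
    }
}
```

Checking the whole cause chain rather than only the top-level exception is a design choice in this sketch: the issue only says that ResourceNotFoundException should be "the cause of the failure", so the real code may need a narrower or broader test.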

Labels

:ml (Machine learning), >bug, Team:ML (Meta label for the ML team)
