Skip to content

[ML] Reset anomaly detection job API #73908

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

dimitris-athanasiou
Copy link
Contributor

@dimitris-athanasiou dimitris-athanasiou commented Jun 8, 2021

Adds a new API that allows a user to reset
an anomaly detection job.

To use the API do:

POST _ml/anomaly_detectors/<job_id>/_reset

The API removes all data associated to the job.
In particular, it deletes model state, results and stats.

However, job notifications and user annotations are not removed.

Also, the API can be called asynchronously by setting the parameter
wait_for_completion to false (defaults to true). When run
that way the API returns the task id for further monitoring.

In order to prevent the job from opening while it is resetting,
a new job field has been added called block_reason. This can
take a value from ["delete", "reset", "revert"] as all these
operations should block the job from opening. The delete action
has already been blocking the job by setting the deleting field.
However, in order not to introduce different booleans for each
action, block_reason should be the way to do this onwards.

Finally, this commit also sets the block_reason to revert when
the revert snapshot API is called as a job should not be opened
while it is reverted to a different model snapshot.

@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Jun 8, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

The comment about revert during revert not being safe is probably a nightmare to do anything about, so maybe ignore it in this PR given that we didn't prevent it in the past and nobody apparently noticed.


* Requires the `manage_ml` cluster privilege. This privilege is included in the
`machine_learning_admin` built-in role.
* Before you can reset a job, you must close it. See <<ml-close-job>>.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth adding a tip here that force close would be the appropriate way to close the job, as there's no point waiting for a graceful close only to immediately delete all the state and results.

Comment on lines 29 to 33
"force":{
"type":"boolean",
"description":"True if the job should be forcefully reset",
"default":false
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't mentioned in the docs. If it's for internal use only then it shouldn't be in the spec either, because having it in the spec will mean the clients expose it.

But I think there might be a case for renaming the internal parameter to leave force available for eventual public exposure in a future version.

The current meaning of force is that it's internal and allows you to reset a job while it's open. But in the future I think it will ease frustration if we have a public force that allows you to reset an open job and starts off by force-closing it (like force-delete does). So that implies that our secret internal argument for working with open jobs should not be called force.

@@ -857,6 +887,7 @@ public Builder setResultsIndexName(String resultsIndexName) {

public Builder setDeleting(boolean deleting) {
this.deleting = deleting;
this.blockReason = BlockReason.DELETE;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
this.blockReason = BlockReason.DELETE;
if (deleting) {
blockReason = BlockReason.DELETE;
} else {
if (blockReason == BlockReason.DELETE) {
blockReason = null;
}
}

@@ -865,6 +896,23 @@ public Builder setAllowLazyOpen(boolean allowLazyOpen) {
return this;
}

private Builder setBlockReason(String blockReason) {
if (blockReason == null) {
this.blockReason = null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
this.blockReason = null;
this.blockReason = null;
this.deleting = false;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if we can do a deleting = false here for BWC purposes (what if we parse a null blockReason but deleting=true from XContent???). This method is only used from the XContent parser.

However, we NEVER write out a null value to the xcontent body, so I don't think this method is ever called anyways.

Copy link
Member

@benwtrent benwtrent Jun 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see now that we are indeed writing null directly and by passing the job's ToXContent values. So, my concern here could occur.

Comment on lines 910 to 912
if (this.blockReason == BlockReason.DELETE) {
this.deleting = true;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (this.blockReason == BlockReason.DELETE) {
this.deleting = true;
}
if (this.blockReason == BlockReason.DELETE) {
this.deleting = true;
} else {
this.deleting = false;
}

));
return;
}
if (job.getBlockReason() != null && job.getBlockReason() != BlockReason.REVERT) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the way revert works I don't think you can safely revert to a different snapshot while an existing revert request for another snapshot is in progress.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I don't think this should continue if the block reason has been set at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I'll remove the revert clause.

@sethmlarson sethmlarson added the Team:Clients Meta label for clients team label Jun 8, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/clients-team (Team:Clients)

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the org.elasticsearch.xpack.security.operator.Constants updated, so I bet that is why the build failed.

@@ -136,6 +137,7 @@
parser.declareString(Builder::setResultsIndexName, RESULTS_INDEX_NAME);
parser.declareBoolean(Builder::setDeleting, DELETING);
parser.declareBoolean(Builder::setAllowLazyOpen, ALLOW_LAZY_OPEN);
parser.declareStringOrNull(Builder::setBlockReason, BLOCK_REASON);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't required. When writing the XContent body, block_reason isn't written when its null. So, it should never be a null value when parsing from XContent

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we update the job document to clear block_reason we do set block_reason to null explicitly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dimitris-athanasiou true, but is null ever written to the xcontent builder and thus written to the doc? From what I can tell, it is not. It is simply not written to the doc at all.

There is a big difference between block_reason: null and block_reason not being in the doc at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is. See JobConfigProvider.updateJobBlockReason.

Copy link
Member

@benwtrent benwtrent Jun 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we are writing a map directly and updating the doc.

Why are we explicitly saying null for the value instead of removing the field?

We now have two things that indicate that it isn't blocked, the field not existing or the field being explicitly null.

We set it to null, then a user calls the _update API, and then it gets unset (since we call ToXContent). It just seems like a weird side-effect that could cause unknown issues.

Comment on lines 910 to 912
if (this.blockReason == BlockReason.DELETE) {
this.deleting = true;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May I suggest

Suggested change
if (this.blockReason == BlockReason.DELETE) {
this.deleting = true;
}
this.deleting = (this.blockReason == BlockReason.DELETE);

@@ -865,6 +896,23 @@ public Builder setAllowLazyOpen(boolean allowLazyOpen) {
return this;
}

private Builder setBlockReason(String blockReason) {
if (blockReason == null) {
this.blockReason = null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if we can do a deleting = false here for BWC purposes (what if we parse a null blockReason but deleting=true from XContent???). This method is only used from the XContent parser.

However, we NEVER write out a null value to the xcontent body, so I don't think this method is ever called anyways.

));
return;
}
if (job.getBlockReason() != null && job.getBlockReason() != BlockReason.REVERT) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I don't think this should continue if the block reason has been set at all.

Comment on lines +51 to +53
request.setShouldStoreResult(true);
Task task = client.executeLocally(ResetJobAction.INSTANCE, request, nullTaskListener());
return channel -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't 100% understand this.

Does this create a "local task" on this coordinating node that waits around until the master action is complete?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct.

Comment on lines +72 to +75
When `wait_for_completion` is set to `false`, the response contains the id
of the job reset task:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be good to indicate that to check on the status of the task, you shouldn't call the same API again and instead should use the tasks API.

@dimitris-athanasiou dimitris-athanasiou force-pushed the anomaly-detection-reset-job-api branch from 3475e9a to 4fc4efe Compare June 9, 2021 13:50
Copy link
Contributor

@szabosteve szabosteve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs LGTM. Thanks for adding them. I left two minor comments, please take or leave them!

Comment on lines 21 to 23
* Before you can reset a job, you must close it. You can set `force` to `true`
when closing the job in order to skip waiting for the job to finalize as it
is about to be reset. See <<ml-close-job>>.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Before you can reset a job, you must close it. You can set `force` to `true`
when closing the job in order to skip waiting for the job to finalize as it
is about to be reset. See <<ml-close-job>>.
* Before you can reset a job, you must close it. You can set `force` to `true`
when closing the job to avoid waiting for the job to complete. See
<<ml-close-job>>.

"task": "oTUltX4IQMOUUVeiohTt8A:39"
}
----
// TESTRESPONSE[s/"task": "oTUltX4IQMOUUVeiohTt8A:39"/"task": $body.task/]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// TESTRESPONSE[s/"task": "oTUltX4IQMOUUVeiohTt8A:39"/"task": $body.task/]
// TESTRESPONSE[s/"task": "oTUltX4IQMOUUVeiohTt8A:39"/"task": $body.task/]
If you want to check the status of the reset task, use the <<tasks>> by referencing
the task ID.

Adds a new API that allows a user to reset
an anomaly detection job.

To use the API do:

```
POST _ml/anomaly_detectors/<job_id>_reset
```

The API removes all data associated to the job.
In particular, it deletes model state, results and stats.

However, job notifications and user annotations are not removed.

Also, the API can be called asynchronously by setting the parameter
`wait_for_completion` to `false` (defaults to `true`). When run
that way the API returns the task id for further monitoring.

In order to prevent the job from opening while it is resetting,
a new job field has been added called `block_reason`. This can
take a value from ["delete", "reset", "revert"] as all these
operations should block the job from opening. The delete action
has already been blocking the job by setting the `deleting` field.
However, in order not to introduce different booleans for each
action, `block_reason` should be the way to do this onwards.

Finally, this commit also sets the `block_reason` to `revert` when
the revert snapshot API is called as a job should not be opened
while it is reverted to a different model snapshot.
@dimitris-athanasiou dimitris-athanasiou force-pushed the anomaly-detection-reset-job-api branch from 5a23e75 to 5b8509d Compare June 10, 2021 14:11
@dimitris-athanasiou
Copy link
Contributor Author

I have reworked this in order to include the task id. The new job field now looks like:

...
"blocked": {
  "reason": "reset",
  "task_id": "abc:123"
}
...

I kept the update of the blocked field internal. Revert now checks if there is a revert task running and if not allows to run even if the job is blocked due to revert. This means the user can just call revert again to fix a job stuck with revert (in case the node failed half-way, etc.)

Could you please take another look?

Comment on lines 363 to 367
ListTasksRequest listTasksRequest = new ListTasksRequest();
listTasksRequest.setActions(ResetJobAction.NAME);
listTasksRequest.setDescriptions(MlTasks.JOB_TASK_ID_PREFIX + jobId);
listTasksRequest.setDetailed(true);
executeAsyncWithOrigin(client, ML_ORIGIN, ListTasksAction.INSTANCE, listTasksRequest, listTasksListener);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have the task_id in the blocked object, is this still necessary?

));
}

private void waitExistingResetTaskToComplete(TaskInfo existingTask, ResetJobAction.Request request,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this and the above list method can be adjusted to look at the current blocked.task_id and if the task doesn't exist, start another one and update the doc.

@droberts195
Copy link
Contributor

One more thing I just remembered is that even though we have a shiny new auto-generated HLRC the manually created HLRC needs to be complete and supported until 7.last. So at some point (not necessarily in this PR), please can you add this new API to the HLRC?

@dimitris-athanasiou
Copy link
Contributor Author

@droberts195

One more thing I just remembered is that even though we have a shiny new auto-generated HLRC the manually created HLRC needs to be complete and supported until 7.last. So at some point (not necessarily in this PR), please can you add this new API to the HLRC?

Yes, I was planning to do so in a separate PR that will go only on the 7.x branch after the backport.

@dimitris-athanasiou
Copy link
Contributor Author

run elasticsearch-ci/bwc

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good.

It may be good to include the task_id in the error messages or something to indicate how they can check when the thing is over.

Comment on lines 86 to 88
if (expectedInstance.equals(newInstance) == false) {
int foo = 3;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temporary debug to be removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oups! Well spotted!

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, if you could remove that piece of temporary debug before merging

CancelTasksRequest cancelTasksRequest = new CancelTasksRequest();
cancelTasksRequest.setReason("deleting job");
cancelTasksRequest.setActions(ResetJobAction.NAME);
cancelTasksRequest.setTaskId(job.getBlocked().getTaskId());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that if Blocked.Reason.RESET then job.getBlocked().getTaskId() should never be null. But, if it is ever null, this would blow up.

Same goes for any other tasks request. If we make a tasks request with a null taskId, this will go boom.

I didn't see anything in this PR that struck me as the only time the taskID could be null is if the Blocked.Reason.DELETE and we don't ever check the task ID there.

Just wanted to make sure we all knew this :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly.

@dimitris-athanasiou
Copy link
Contributor Author

@elasticmachine update branch

@dimitris-athanasiou dimitris-athanasiou merged commit dc61a72 into elastic:master Jun 14, 2021
@dimitris-athanasiou dimitris-athanasiou deleted the anomaly-detection-reset-job-api branch June 14, 2021 15:56
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Jun 15, 2021
dimitris-athanasiou added a commit that referenced this pull request Jun 15, 2021
Adds a new API that allows a user to reset
an anomaly detection job.

To use the API do:

```
POST _ml/anomaly_detectors/<job_id>_reset
```

The API removes all data associated to the job.
In particular, it deletes model state, results and stats.

However, job notifications and user annotations are not removed.

Also, the API can be called asynchronously by setting the parameter
`wait_for_completion` to `false` (defaults to `true`). When run
that way the API returns the task id for further monitoring.

In order to prevent the job from opening while it is resetting,
a new job field has been added called `blocked`. It is an object
that contains a `reason` and the `task_id`. `reason` can take
a value from ["delete", "reset", "revert"] as all these
operations should block the job from opening. The `task_id` is also
included in order to allow tracking the task if necessary.

Finally, this commit also sets the `blocked` field when
the revert snapshot API is called as a job should not be opened
while it is reverted to a different model snapshot.

Backport of #73908
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :ml Machine learning Team:Clients Meta label for clients team Team:ML Meta label for the ML team v7.14.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants