[ML] Reset anomaly detection job API #73908

dimitris-athanasiou · 2021-06-08T15:11:42Z

Adds a new API that allows a user to reset
an anomaly detection job.

To use the API do:

POST _ml/anomaly_detectors/<job_id>/_reset

The API removes all data associated with the job.
In particular, it deletes model state, results and stats.

However, job notifications and user annotations are not removed.

Also, the API can be called asynchronously by setting the parameter
wait_for_completion to false (defaults to true). When run
that way the API returns the task id for further monitoring.
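
For illustration, here is a minimal sketch of how a caller might build the request line for both the synchronous and asynchronous forms. The class and method names are hypothetical and not part of this PR; the sketch assumes the job ID needs no URL escaping and only builds the path, it does not send anything.

```java
// Hypothetical helper for illustration only; it is not part of Elasticsearch.
final class ResetJobEndpointSketch {

    // Returns the path (and query string, if any) for a reset call.
    // Assumes jobId contains only characters that are safe in a URL path.
    static String resetPath(String jobId, boolean waitForCompletion) {
        String path = "/_ml/anomaly_detectors/" + jobId + "/_reset";
        // wait_for_completion defaults to true, so it is only sent when false.
        return waitForCompletion ? path : path + "?wait_for_completion=false";
    }

    public static void main(String[] args) {
        System.out.println("POST " + resetPath("my-job", true));
        System.out.println("POST " + resetPath("my-job", false));
    }
}
```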

In order to prevent the job from opening while it is resetting,
a new job field has been added called block_reason. This can
take a value from ["delete", "reset", "revert"] as all these
operations should block the job from opening. The delete action
has already been blocking the job by setting the deleting field.
However, to avoid introducing a separate boolean for each
action, block_reason should be the way to do this from now on.

Finally, this commit also sets the block_reason to revert when
the revert snapshot API is called, as a job should not be opened
while it is being reverted to a different model snapshot.
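
As a rough sketch of the idea described above, the single field replaces one boolean per operation, and the open check reduces to "is any reason set". The type names below are stand-ins chosen for this example, and the real Job class carries far more state.

```java
import java.util.Locale;

// Illustrative stand-in for the check that keeps a blocked job from opening.
final class OpenGuardSketch {

    // A job may open only when no blocking operation is recorded.
    static boolean canOpen(BlockReasonSketch blockReason) {
        return blockReason == null;
    }

    public static void main(String[] args) {
        System.out.println(canOpen(null));
        System.out.println(canOpen(BlockReasonSketch.fromString("reset")));
    }
}

// The three values listed in the description: delete, reset, revert.
enum BlockReasonSketch {
    DELETE, RESET, REVERT;

    // Parses the lowercase form, e.g. "reset".
    static BlockReasonSketch fromString(String value) {
        return valueOf(value.toUpperCase(Locale.ROOT));
    }

    @Override
    public String toString() {
        return name().toLowerCase(Locale.ROOT);
    }
}
```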

elasticmachine · 2021-06-08T15:11:46Z

Pinging @elastic/ml-core (Team:ML)

droberts195

Looks good.

The comment about revert during revert not being safe is probably a nightmare to do anything about, so maybe ignore it in this PR given that we didn't prevent it in the past and nobody apparently noticed.

droberts195 · 2021-06-08T15:31:19Z

docs/reference/ml/anomaly-detection/apis/reset-job.asciidoc

+
+* Requires the `manage_ml` cluster privilege. This privilege is included in the 
+`machine_learning_admin` built-in role.
+* Before you can reset a job, you must close it. See <<ml-close-job>>.


It might be worth adding a tip here that force close would be the appropriate way to close the job, as there's no point waiting for a graceful close only to immediately delete all the state and results.

droberts195 · 2021-06-08T15:33:13Z

rest-api-spec/src/main/resources/rest-api-spec/api/ml.reset_job.json

+      "force":{
+        "type":"boolean",
+        "description":"True if the job should be forcefully reset",
+        "default":false
+      },


This isn't mentioned in the docs. If it's for internal use only then it shouldn't be in the spec either, because having it in the spec will mean the clients expose it.

But I think there might be a case for renaming the internal parameter to leave force available for eventual public exposure in a future version.

The current meaning of force is that it's internal and allows you to reset a job while it's open. But in the future I think it will ease frustration if we have a public force that allows you to reset an open job and starts off by force-closing it (like force-delete does). So that implies that our secret internal argument for working with open jobs should not be called force.

droberts195 · 2021-06-08T15:37:05Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ml/job/config/Job.java

@@ -857,6 +887,7 @@ public Builder setResultsIndexName(String resultsIndexName) {

        public Builder setDeleting(boolean deleting) {
            this.deleting = deleting;
+            this.blockReason = BlockReason.DELETE;


Suggested change

-            this.blockReason = BlockReason.DELETE;
+            if (deleting) {
+                blockReason = BlockReason.DELETE;
+            } else {
+                if (blockReason == BlockReason.DELETE) {
+                    blockReason = null;
+                }
+            }

droberts195 · 2021-06-08T15:38:20Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ml/job/config/Job.java

@@ -865,6 +896,23 @@ public Builder setAllowLazyOpen(boolean allowLazyOpen) {
            return this;
        }

+        private Builder setBlockReason(String blockReason) {
+            if (blockReason == null) {
+                this.blockReason = null;


Suggested change

-                this.blockReason = null;
+                this.blockReason = null;
+                this.deleting = false;

I don't know if we can do a deleting = false here for BWC purposes (what if we parse a null blockReason but deleting=true from XContent???). This method is only used from the XContent parser.

However, we NEVER write out a null value to the xcontent body, so I don't think this method is ever called anyways.

I see now that we are indeed writing null directly and bypassing the job's ToXContent values. So, my concern here could occur.

droberts195 · 2021-06-08T15:38:54Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ml/job/config/Job.java

+            if (this.blockReason == BlockReason.DELETE) {
+                this.deleting = true;
+            }


Suggested change

-            if (this.blockReason == BlockReason.DELETE) {
-                this.deleting = true;
-            }
+            if (this.blockReason == BlockReason.DELETE) {
+                this.deleting = true;
+            } else {
+                this.deleting = false;
+            }

droberts195 · 2021-06-08T15:53:43Z

...n/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportRevertModelSnapshotAction.java

+                            ));
+                            return;
+                        }
+                        if (job.getBlockReason() != null && job.getBlockReason() != BlockReason.REVERT) {


Given the way revert works I don't think you can safely revert to a different snapshot while an existing revert request for another snapshot is in progress.

Agreed, I don't think this should continue if the block reason has been set at all.

Makes sense, I'll remove the revert clause.

elasticmachine · 2021-06-08T18:34:55Z

Pinging @elastic/clients-team (Team:Clients)

benwtrent

I don't see the org.elasticsearch.xpack.security.operator.Constants updated, so I bet that is why the build failed.

benwtrent · 2021-06-08T19:07:01Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ml/job/config/Job.java

@@ -136,6 +137,7 @@
        parser.declareString(Builder::setResultsIndexName, RESULTS_INDEX_NAME);
        parser.declareBoolean(Builder::setDeleting, DELETING);
        parser.declareBoolean(Builder::setAllowLazyOpen, ALLOW_LAZY_OPEN);
+        parser.declareStringOrNull(Builder::setBlockReason, BLOCK_REASON);


This isn't required. When writing the XContent body, block_reason isn't written when it's null. So, it should never be a null value when parsing from XContent.

When we update the job document to clear block_reason we do set block_reason to null explicitly.

@dimitris-athanasiou true, but is null ever written to the xcontent builder and thus written to the doc? From what I can tell, it is not. It is simply not written to the doc at all.

There is a big difference between block_reason: null and block_reason not being in the doc at all.

It is. See JobConfigProvider.updateJobBlockReason.

Ah, we are writing a map directly and updating the doc.

Why are we explicitly saying null for the value instead of removing the field?

We now have two things that indicate that it isn't blocked, the field not existing or the field being explicitly null.

We set it to null, then a user calls the _update API, and then it gets unset (since we call ToXContent). It just seems like a weird side-effect that could cause unknown issues.
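
To make the distinction concrete, here is a small sketch using a plain Java map as a stand-in for the job document. The helper name is hypothetical; the point is only that both shapes read as "not blocked" while the documents themselves differ.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: a Map stands in for the stored job document.
final class NullVersusAbsentSketch {

    // Treats both an explicit null and a missing field as "not blocked".
    static boolean isBlocked(Map<String, Object> jobDoc) {
        return jobDoc.get("block_reason") != null;
    }

    public static void main(String[] args) {
        Map<String, Object> explicitNull = new HashMap<>();
        explicitNull.put("block_reason", null);
        Map<String, Object> absent = new HashMap<>();

        // Neither document is blocked...
        System.out.println(isBlocked(explicitNull) + " " + isBlocked(absent));
        // ...but only one of them actually contains the field.
        System.out.println(explicitNull.containsKey("block_reason") + " " + absent.containsKey("block_reason"));
    }
}
```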

benwtrent · 2021-06-08T19:09:09Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ml/job/config/Job.java

+            if (this.blockReason == BlockReason.DELETE) {
+                this.deleting = true;
+            }


May I suggest

Suggested change

-            if (this.blockReason == BlockReason.DELETE) {
-                this.deleting = true;
-            }
+            this.deleting = (this.blockReason == BlockReason.DELETE);
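
Putting the suggestions on the two setters side by side, a simplified builder that keeps the legacy deleting flag and the new block reason in sync might look like the sketch below. It is a cut-down stand-in rather than the actual Job.Builder, and it takes the enum directly instead of a string.

```java
// Simplified stand-in for Job.Builder; only the two related fields are kept.
final class JobBuilderSketch {
    enum Reason { DELETE, RESET, REVERT }

    private boolean deleting;
    private Reason blockReason;

    // deleting=true implies DELETE; deleting=false clears the reason
    // only if it was DELETE, so RESET and REVERT are left untouched.
    JobBuilderSketch setDeleting(boolean deleting) {
        this.deleting = deleting;
        if (deleting) {
            blockReason = Reason.DELETE;
        } else if (blockReason == Reason.DELETE) {
            blockReason = null;
        }
        return this;
    }

    // The flag is derived from the reason, covering null and non-DELETE values.
    JobBuilderSketch setBlockReason(Reason reason) {
        this.blockReason = reason;
        this.deleting = (reason == Reason.DELETE);
        return this;
    }

    boolean isDeleting() { return deleting; }

    Reason getBlockReason() { return blockReason; }

    public static void main(String[] args) {
        JobBuilderSketch builder = new JobBuilderSketch().setBlockReason(Reason.DELETE);
        System.out.println(builder.isDeleting());
        builder.setBlockReason(null);
        System.out.println(builder.isDeleting());
    }
}
```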

benwtrent · 2021-06-08T19:11:01Z

x-pack/plugin/core/src/main/java/org/elasticsearch/xpack/core/ml/job/config/Job.java

@@ -865,6 +896,23 @@ public Builder setAllowLazyOpen(boolean allowLazyOpen) {
            return this;
        }

+        private Builder setBlockReason(String blockReason) {
+            if (blockReason == null) {
+                this.blockReason = null;


I don't know if we can do a deleting = false here for BWC purposes (what if we parse a null blockReason but deleting=true from XContent???). This method is only used from the XContent parser.

However, we NEVER write out a null value to the xcontent body, so I don't think this method is ever called anyways.

x-pack/plugin/core/src/test/java/org/elasticsearch/xpack/core/ml/job/config/JobTests.java

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportResetJobAction.java

benwtrent · 2021-06-08T19:31:38Z

...n/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportRevertModelSnapshotAction.java

+                            ));
+                            return;
+                        }
+                        if (job.getBlockReason() != null && job.getBlockReason() != BlockReason.REVERT) {


Agreed, I don't think this should continue if the block reason has been set at all.

benwtrent · 2021-06-08T19:41:17Z

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/rest/job/RestResetJobAction.java

+            request.setShouldStoreResult(true);
+            Task task = client.executeLocally(ResetJobAction.INSTANCE, request, nullTaskListener());
+            return channel -> {


I don't 100% understand this.

Does this create a "local task" on this coordinating node that waits around until the master action is complete?

benwtrent · 2021-06-08T19:45:02Z

docs/reference/ml/anomaly-detection/apis/reset-job.asciidoc

+When `wait_for_completion` is set to `false`, the response contains the id
+of the job reset task:


it might be good to indicate that to check on the status of the task, you shouldn't call the same API again and instead should use the tasks API.
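
As a sketch of what that follow-up could look like on the caller's side: the task value in the response has the form node ID, colon, task number, and the task management API is addressed as GET _tasks/<task_id>. The class below is hypothetical and only splits the ID and builds the path.

```java
// Hypothetical helper: splits a task ID such as "oTUltX4IQMOUUVeiohTt8A:39".
final class TaskIdSketch {
    final String nodeId;
    final long id;

    private TaskIdSketch(String nodeId, long id) {
        this.nodeId = nodeId;
        this.id = id;
    }

    // Parses "<node_id>:<task_number>" and rejects anything else.
    static TaskIdSketch parse(String taskId) {
        int sep = taskId.lastIndexOf(':');
        if (sep <= 0 || sep == taskId.length() - 1) {
            throw new IllegalArgumentException("malformed task id [" + taskId + "]");
        }
        return new TaskIdSketch(taskId.substring(0, sep), Long.parseLong(taskId.substring(sep + 1)));
    }

    // Path to poll for the task's status, rather than calling _reset again.
    String tasksApiPath() {
        return "/_tasks/" + nodeId + ":" + id;
    }

    public static void main(String[] args) {
        System.out.println("GET " + parse("oTUltX4IQMOUUVeiohTt8A:39").tasksApiPath());
    }
}
```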

szabosteve

Docs LGTM. Thanks for adding them. I left two minor comments, please take or leave them!

szabosteve · 2021-06-09T14:50:53Z

docs/reference/ml/anomaly-detection/apis/reset-job.asciidoc

+* Before you can reset a job, you must close it. You can set `force` to `true`
+when closing the job in order to skip waiting for the job to finalize as it
+is about to be reset. See <<ml-close-job>>.


Suggested change

-* Before you can reset a job, you must close it. You can set `force` to `true`
-when closing the job in order to skip waiting for the job to finalize as it
-is about to be reset. See <<ml-close-job>>.
+* Before you can reset a job, you must close it. You can set `force` to `true`
+when closing the job to avoid waiting for the job to complete. See
+<<ml-close-job>>.

szabosteve · 2021-06-09T14:58:31Z

docs/reference/ml/anomaly-detection/apis/reset-job.asciidoc

+  "task": "oTUltX4IQMOUUVeiohTt8A:39"
+}
+----
+// TESTRESPONSE[s/"task": "oTUltX4IQMOUUVeiohTt8A:39"/"task": $body.task/]


Suggested change

-// TESTRESPONSE[s/"task": "oTUltX4IQMOUUVeiohTt8A:39"/"task": $body.task/]
+// TESTRESPONSE[s/"task": "oTUltX4IQMOUUVeiohTt8A:39"/"task": $body.task/]
+If you want to check the status of the reset task, use the <<tasks>> by referencing
+the task ID.

dimitris-athanasiou · 2021-06-10T14:16:03Z

I have reworked this in order to include the task id. The new job field now looks like:

...
"blocked": {
  "reason": "reset",
  "task_id": "abc:123"
}
...

I kept the update of the blocked field internal. Revert now checks whether a revert task is running and, if not, allows the revert to run even if the job is blocked due to revert. This means the user can just call revert again to fix a job stuck with revert (in case the node failed half-way, etc.)
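
A rough sketch of that revert rule, under the assumption that the set of currently running task IDs is already known to the caller. All names here are made up for the example, and refusing the revert for the other block reasons is also an assumption of this sketch rather than something stated above.

```java
import java.util.Set;

// Illustration of the retry rule for revert; not the actual transport action.
final class RevertGuardSketch {
    enum Reason { DELETE, RESET, REVERT }

    // Mirrors the shape of the new "blocked" object: a reason plus a task ID.
    static final class Blocked {
        final Reason reason;
        final String taskId;

        Blocked(Reason reason, String taskId) {
            this.reason = reason;
            this.taskId = taskId;
        }
    }

    // blocked == null means the job is not blocked at all.
    static boolean mayStartRevert(Blocked blocked, Set<String> runningTaskIds) {
        if (blocked == null) {
            return true;
        }
        if (blocked.reason != Reason.REVERT) {
            return false; // assumption: delete and reset always win
        }
        // Blocked by an earlier revert: retry only if that task is gone.
        return blocked.taskId == null || !runningTaskIds.contains(blocked.taskId);
    }

    public static void main(String[] args) {
        Blocked stuck = new Blocked(Reason.REVERT, "abc:123");
        System.out.println(mayStartRevert(stuck, Set.of()));
        System.out.println(mayStartRevert(stuck, Set.of("abc:123")));
    }
}
```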

Could you please take another look?

benwtrent · 2021-06-10T14:30:25Z

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportDeleteJobAction.java

+        ListTasksRequest listTasksRequest = new ListTasksRequest();
+        listTasksRequest.setActions(ResetJobAction.NAME);
+        listTasksRequest.setDescriptions(MlTasks.JOB_TASK_ID_PREFIX + jobId);
+        listTasksRequest.setDetailed(true);
+        executeAsyncWithOrigin(client, ML_ORIGIN, ListTasksAction.INSTANCE, listTasksRequest, listTasksListener);


Now that we have the task_id in the blocked object, is this still necessary?

benwtrent · 2021-06-10T14:41:25Z

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportResetJobAction.java

+        ));
+    }
+
+    private void waitExistingResetTaskToComplete(TaskInfo existingTask, ResetJobAction.Request request,


I think this and the above list method can be adjusted to look at the current blocked.task_id and if the task doesn't exist, start another one and update the doc.

droberts195 · 2021-06-10T15:20:17Z

One more thing I just remembered is that even though we have a shiny new auto-generated HLRC the manually created HLRC needs to be complete and supported until 7.last. So at some point (not necessarily in this PR), please can you add this new API to the HLRC?

dimitris-athanasiou · 2021-06-10T15:34:28Z

@droberts195

One more thing I just remembered is that even though we have a shiny new auto-generated HLRC the manually created HLRC needs to be complete and supported until 7.last. So at some point (not necessarily in this PR), please can you add this new API to the HLRC?

Yes, I was planning to do so in a separate PR that will go only on the 7.x branch after the backport.

dimitris-athanasiou · 2021-06-13T15:32:21Z

run elasticsearch-ci/bwc

benwtrent

I think this looks good.

It may be good to include the task_id in the error messages or something to indicate how they can check when the thing is over.

droberts195 · 2021-06-14T13:28:25Z

test/framework/src/main/java/org/elasticsearch/test/AbstractWireTestCase.java

+        if (expectedInstance.equals(newInstance) == false) {
+            int foo = 3;
+        }


Temporary debug to be removed?

Oops! Well spotted!

droberts195

LGTM, if you could remove that piece of temporary debug before merging

benwtrent · 2021-06-14T13:52:00Z

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportDeleteJobAction.java

+                    CancelTasksRequest cancelTasksRequest = new CancelTasksRequest();
+                    cancelTasksRequest.setReason("deleting job");
+                    cancelTasksRequest.setActions(ResetJobAction.NAME);
+                    cancelTasksRequest.setTaskId(job.getBlocked().getTaskId());


I know that if the reason is Blocked.Reason.RESET then job.getBlocked().getTaskId() should never be null. But if it ever is null, this would blow up.

Same goes for any other tasks request. If we make a tasks request with a null taskId, this will go boom.

I didn't see anything in this PR that struck me as a problem: the only time the task ID could be null is when the reason is Blocked.Reason.DELETE, and we never check the task ID there.

Just wanted to make sure we all knew this :)

Yes, exactly.
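
If a defensive check were ever wanted here, one possible shape is sketched below. It is not what the PR does; the names are hypothetical, and it simply refuses to produce a task ID to cancel unless the block reason is reset and the ID is present.

```java
import java.util.Optional;

// Hypothetical guard: decide whether there is a reset task that can be cancelled.
final class CancelResetGuardSketch {
    enum Reason { DELETE, RESET, REVERT }

    // Returns the task ID to cancel, or empty when there is nothing safe to cancel.
    static Optional<String> resetTaskToCancel(Reason reason, String taskId) {
        if (reason != Reason.RESET) {
            return Optional.empty();
        }
        // A null task ID should not happen for RESET, but is tolerated here.
        return Optional.ofNullable(taskId);
    }

    public static void main(String[] args) {
        System.out.println(resetTaskToCancel(Reason.RESET, "abc:123"));
        System.out.println(resetTaskToCancel(Reason.DELETE, null));
    }
}
```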

dimitris-athanasiou · 2021-06-14T14:55:11Z

@elasticmachine update branch

Relates elastic#73908

Adds a new API that allows a user to reset an anomaly detection job.

To use the API do:

```
POST _ml/anomaly_detectors/<job_id>/_reset
```

The API removes all data associated with the job. In particular, it deletes model state, results and stats. However, job notifications and user annotations are not removed.

Also, the API can be called asynchronously by setting the parameter `wait_for_completion` to `false` (defaults to `true`). When run that way the API returns the task id for further monitoring.

In order to prevent the job from opening while it is resetting, a new job field has been added called `blocked`. It is an object that contains a `reason` and the `task_id`. `reason` can take a value from ["delete", "reset", "revert"] as all these operations should block the job from opening. The `task_id` is also included in order to allow tracking the task if necessary.

Finally, this commit also sets the `blocked` field when the revert snapshot API is called, as a job should not be opened while it is being reverted to a different model snapshot.

Backport of #73908

Relates #73908

dimitris-athanasiou added >enhancement :ml Machine learning v8.0.0 v7.14.0 labels Jun 8, 2021

dimitris-athanasiou requested a review from droberts195 June 8, 2021 15:11

elasticmachine added the Team:ML Meta label for the ML team label Jun 8, 2021

dimitris-athanasiou requested a review from szabosteve June 8, 2021 15:11

droberts195 reviewed Jun 8, 2021

sethmlarson added the Team:Clients Meta label for clients team label Jun 8, 2021

benwtrent reviewed Jun 8, 2021

dimitris-athanasiou force-pushed the anomaly-detection-reset-job-api branch from 3475e9a to 4fc4efe Compare June 9, 2021 13:50

szabosteve approved these changes Jun 9, 2021

dimitris-athanasiou added 10 commits June 10, 2021 16:33

Fix formatting

b3b69c9

Include reset api doc in index

04b8395

Execute task requests with ml origin

44c527f

Address review comments

6bc16e3

Fix moved package

db06fa1

Prevent reset during upgrade mode

3e623e5

Fix TimeValue package in ResetJobIT

b7419ca

Add reset action to operator list

dfc6538

Store the blocking task_id along with the reason

5b8509d

dimitris-athanasiou force-pushed the anomaly-detection-reset-job-api branch from 5a23e75 to 5b8509d Compare June 10, 2021 14:11

Address Istvan's docs comments

221b21f

Fix task reference in docs

2c9e0d8

benwtrent reviewed Jun 10, 2021

dimitris-athanasiou added 3 commits June 11, 2021 13:02

No need to list tasks

d2e6563

Fix tests

d5afc5d

Fix more tests

a447ff7

benwtrent approved these changes Jun 14, 2021

droberts195 reviewed Jun 14, 2021

droberts195 approved these changes Jun 14, 2021

benwtrent reviewed Jun 14, 2021

Remove temporary debug code

04f5719

Merge branch 'master' into anomaly-detection-reset-job-api

0b1110f

dimitris-athanasiou merged commit dc61a72 into elastic:master Jun 14, 2021

dimitris-athanasiou deleted the anomaly-detection-reset-job-api branch June 14, 2021 15:56

dimitris-athanasiou mentioned this pull request Jun 14, 2021

[7.x][ML] Reset anomaly detection job API (#73908) #74093

Merged

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Jun 15, 2021

[ML] Adjust BWC versions after backporting reset API

8d9371c

Relates elastic#73908

dimitris-athanasiou mentioned this pull request Jun 15, 2021

[ML] Adjust BWC versions after backporting reset API #74107

Merged

dimitris-athanasiou added a commit that referenced this pull request Jun 15, 2021

[ML] Adjust BWC versions after backporting reset API (#74107)

6bc0916

Relates #73908

This was referenced Jun 15, 2021

[CI] MlDistributedFailureIT testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown failing #74101

Closed

[CI] XPackRestIT test {p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled} failing #74141

Closed

This was referenced Jun 18, 2021

[ML] Better handling of data counts on model snapshot reversion #65414

Open

[ML] Add reset action for anomaly detection jobs elastic/kibana#102661

Closed

jgowdyelastic mentioned this pull request Jul 19, 2021

ML job reset endpoint elastic/elasticsearch-specification#494

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

jgowdyelastic mentioned this pull request Aug 12, 2021

[ML] Adding reset anomaly detection jobs link to jobs list elastic/kibana#108039

Merged

lcawl mentioned this pull request Oct 27, 2021

[DOCS] Adds reset jobs API to anomaly detection tutorial elastic/stack-docs#1859

Merged
