-
Notifications
You must be signed in to change notification settings - Fork 25.2k
[ML] Reset anomaly detection job API #73908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Reset anomaly detection job API #73908
Conversation
Pinging @elastic/ml-core (Team:ML) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good.
The comment about revert during revert not being safe is probably a nightmare to do anything about, so maybe ignore it in this PR given that we didn't prevent it in the past and nobody apparently noticed.
|
||
* Requires the `manage_ml` cluster privilege. This privilege is included in the | ||
`machine_learning_admin` built-in role. | ||
* Before you can reset a job, you must close it. See <<ml-close-job>>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth adding a tip here that force
close would be the appropriate way to close the job, as there's no point waiting for a graceful close only to immediately delete all the state and results.
"force":{ | ||
"type":"boolean", | ||
"description":"True if the job should be forcefully reset", | ||
"default":false | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't mentioned in the docs. If it's for internal use only then it shouldn't be in the spec either, because having it in the spec will mean the clients expose it.
But I think there might be a case for renaming the internal parameter to leave force
available for eventual public exposure in a future version.
The current meaning of force
is that it's internal and allows you to reset a job while it's open. But in the future I think it will ease frustration if we have a public force
that allows you to reset an open job and starts off by force-closing it (like force-delete does). So that implies that our secret internal argument for working with open jobs should not be called force
.
@@ -857,6 +887,7 @@ public Builder setResultsIndexName(String resultsIndexName) { | |||
|
|||
public Builder setDeleting(boolean deleting) { | |||
this.deleting = deleting; | |||
this.blockReason = BlockReason.DELETE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this.blockReason = BlockReason.DELETE; | |
if (deleting) { | |
blockReason = BlockReason.DELETE; | |
} else { | |
if (blockReason == BlockReason.DELETE) { | |
blockReason = null; | |
} | |
} |
@@ -865,6 +896,23 @@ public Builder setAllowLazyOpen(boolean allowLazyOpen) { | |||
return this; | |||
} | |||
|
|||
private Builder setBlockReason(String blockReason) { | |||
if (blockReason == null) { | |||
this.blockReason = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this.blockReason = null; | |
this.blockReason = null; | |
this.deleting = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if we can do a deleting = false
here for BWC purposes (what if we parse a null
blockReason but deleting=true
from XContent???). This method is only used from the XContent parser.
However, we NEVER write out a null
value to the xcontent body, so I don't think this method is ever called anyways.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see now that we are indeed writing null
directly and by passing the job's ToXContent values. So, my concern here could occur.
if (this.blockReason == BlockReason.DELETE) { | ||
this.deleting = true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (this.blockReason == BlockReason.DELETE) { | |
this.deleting = true; | |
} | |
if (this.blockReason == BlockReason.DELETE) { | |
this.deleting = true; | |
} else { | |
this.deleting = false; | |
} |
)); | ||
return; | ||
} | ||
if (job.getBlockReason() != null && job.getBlockReason() != BlockReason.REVERT) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the way revert works I don't think you can safely revert to a different snapshot while an existing revert request for another snapshot is in progress.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I don't think this should continue if the block reason has been set at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I'll remove the revert
clause.
Pinging @elastic/clients-team (Team:Clients) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the org.elasticsearch.xpack.security.operator.Constants
updated, so I bet that is why the build failed.
@@ -136,6 +137,7 @@ | |||
parser.declareString(Builder::setResultsIndexName, RESULTS_INDEX_NAME); | |||
parser.declareBoolean(Builder::setDeleting, DELETING); | |||
parser.declareBoolean(Builder::setAllowLazyOpen, ALLOW_LAZY_OPEN); | |||
parser.declareStringOrNull(Builder::setBlockReason, BLOCK_REASON); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't required. When writing the XContent body, block_reason
isn't written when its null
. So, it should never be a null
value when parsing from XContent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we update the job document to clear block_reason
we do set block_reason
to null
explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dimitris-athanasiou true, but is null
ever written to the xcontent builder and thus written to the doc? From what I can tell, it is not. It is simply not written to the doc at all.
There is a big difference between block_reason: null
and block_reason
not being in the doc at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is. See JobConfigProvider.updateJobBlockReason
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, we are writing a map directly and updating the doc.
Why are we explicitly saying null
for the value instead of removing the field?
We now have two things that indicate that it isn't blocked, the field not existing or the field being explicitly null.
We set it to null
, then a user calls the _update
API, and then it gets unset (since we call ToXContent
). It just seems like a weird side-effect that could cause unknown issues.
if (this.blockReason == BlockReason.DELETE) { | ||
this.deleting = true; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May I suggest
if (this.blockReason == BlockReason.DELETE) { | |
this.deleting = true; | |
} | |
this.deleting = (this.blockReason == BlockReason.DELETE); |
@@ -865,6 +896,23 @@ public Builder setAllowLazyOpen(boolean allowLazyOpen) { | |||
return this; | |||
} | |||
|
|||
private Builder setBlockReason(String blockReason) { | |||
if (blockReason == null) { | |||
this.blockReason = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if we can do a deleting = false
here for BWC purposes (what if we parse a null
blockReason but deleting=true
from XContent???). This method is only used from the XContent parser.
However, we NEVER write out a null
value to the xcontent body, so I don't think this method is ever called anyways.
x-pack/plugin/core/src/test/java/org/elasticsearch/xpack/core/ml/job/config/JobTests.java
Show resolved
Hide resolved
x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportResetJobAction.java
Outdated
Show resolved
Hide resolved
)); | ||
return; | ||
} | ||
if (job.getBlockReason() != null && job.getBlockReason() != BlockReason.REVERT) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I don't think this should continue if the block reason has been set at all.
request.setShouldStoreResult(true); | ||
Task task = client.executeLocally(ResetJobAction.INSTANCE, request, nullTaskListener()); | ||
return channel -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't 100% understand this.
Does this create a "local task" on this coordinating node that waits around until the master action is complete?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct.
When `wait_for_completion` is set to `false`, the response contains the id | ||
of the job reset task: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might be good to indicate that to check on the status of the task, you shouldn't call the same API again and instead should use the tasks
API.
3475e9a
to
4fc4efe
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs LGTM. Thanks for adding them. I left two minor comments, please take or leave them!
* Before you can reset a job, you must close it. You can set `force` to `true` | ||
when closing the job in order to skip waiting for the job to finalize as it | ||
is about to be reset. See <<ml-close-job>>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Before you can reset a job, you must close it. You can set `force` to `true` | |
when closing the job in order to skip waiting for the job to finalize as it | |
is about to be reset. See <<ml-close-job>>. | |
* Before you can reset a job, you must close it. You can set `force` to `true` | |
when closing the job to avoid waiting for the job to complete. See | |
<<ml-close-job>>. |
"task": "oTUltX4IQMOUUVeiohTt8A:39" | ||
} | ||
---- | ||
// TESTRESPONSE[s/"task": "oTUltX4IQMOUUVeiohTt8A:39"/"task": $body.task/] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// TESTRESPONSE[s/"task": "oTUltX4IQMOUUVeiohTt8A:39"/"task": $body.task/] | |
// TESTRESPONSE[s/"task": "oTUltX4IQMOUUVeiohTt8A:39"/"task": $body.task/] | |
If you want to check the status of the reset task, use the <<tasks>> by referencing | |
the task ID. |
Adds a new API that allows a user to reset an anomaly detection job. To use the API do: ``` POST _ml/anomaly_detectors/<job_id>_reset ``` The API removes all data associated to the job. In particular, it deletes model state, results and stats. However, job notifications and user annotations are not removed. Also, the API can be called asynchronously by setting the parameter `wait_for_completion` to `false` (defaults to `true`). When run that way the API returns the task id for further monitoring. In order to prevent the job from opening while it is resetting, a new job field has been added called `block_reason`. This can take a value from ["delete", "reset", "revert"] as all these operations should block the job from opening. The delete action has already been blocking the job by setting the `deleting` field. However, in order not to introduce different booleans for each action, `block_reason` should be the way to do this onwards. Finally, this commit also sets the `block_reason` to `revert` when the revert snapshot API is called as a job should not be opened while it is reverted to a different model snapshot.
5a23e75
to
5b8509d
Compare
I have reworked this in order to include the task id. The new job field now looks like:
I kept the update of the Could you please take another look? |
ListTasksRequest listTasksRequest = new ListTasksRequest(); | ||
listTasksRequest.setActions(ResetJobAction.NAME); | ||
listTasksRequest.setDescriptions(MlTasks.JOB_TASK_ID_PREFIX + jobId); | ||
listTasksRequest.setDetailed(true); | ||
executeAsyncWithOrigin(client, ML_ORIGIN, ListTasksAction.INSTANCE, listTasksRequest, listTasksListener); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we have the task_id
in the blocked
object, is this still necessary?
)); | ||
} | ||
|
||
private void waitExistingResetTaskToComplete(TaskInfo existingTask, ResetJobAction.Request request, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this and the above list method can be adjusted to look at the current blocked.task_id
and if the task doesn't exist, start another one and update the doc.
One more thing I just remembered is that even though we have a shiny new auto-generated HLRC the manually created HLRC needs to be complete and supported until 7.last. So at some point (not necessarily in this PR), please can you add this new API to the HLRC? |
Yes, I was planning to do so in a separate PR that will go only on the |
run elasticsearch-ci/bwc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good.
It may be good to include the task_id in the error messages or something to indicate how they can check when the thing is over.
if (expectedInstance.equals(newInstance) == false) { | ||
int foo = 3; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Temporary debug to be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oups! Well spotted!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, if you could remove that piece of temporary debug before merging
CancelTasksRequest cancelTasksRequest = new CancelTasksRequest(); | ||
cancelTasksRequest.setReason("deleting job"); | ||
cancelTasksRequest.setActions(ResetJobAction.NAME); | ||
cancelTasksRequest.setTaskId(job.getBlocked().getTaskId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that if Blocked.Reason.RESET
then job.getBlocked().getTaskId()
should never be null. But, if it is ever null, this would blow up.
Same goes for any other tasks request. If we make a tasks request with a null
taskId, this will go boom.
I didn't see anything in this PR that struck me as the only time the taskID could be null is if the Blocked.Reason.DELETE
and we don't ever check the task ID there.
Just wanted to make sure we all knew this :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, exactly.
@elasticmachine update branch |
Adds a new API that allows a user to reset an anomaly detection job. To use the API do: ``` POST _ml/anomaly_detectors/<job_id>_reset ``` The API removes all data associated to the job. In particular, it deletes model state, results and stats. However, job notifications and user annotations are not removed. Also, the API can be called asynchronously by setting the parameter `wait_for_completion` to `false` (defaults to `true`). When run that way the API returns the task id for further monitoring. In order to prevent the job from opening while it is resetting, a new job field has been added called `blocked`. It is an object that contains a `reason` and the `task_id`. `reason` can take a value from ["delete", "reset", "revert"] as all these operations should block the job from opening. The `task_id` is also included in order to allow tracking the task if necessary. Finally, this commit also sets the `blocked` field when the revert snapshot API is called as a job should not be opened while it is reverted to a different model snapshot. Backport of #73908
Adds a new API that allows a user to reset
an anomaly detection job.
To use the API do:
The API removes all data associated to the job.
In particular, it deletes model state, results and stats.
However, job notifications and user annotations are not removed.
Also, the API can be called asynchronously by setting the parameter
wait_for_completion
tofalse
(defaults totrue
). When runthat way the API returns the task id for further monitoring.
In order to prevent the job from opening while it is resetting,
a new job field has been added called
block_reason
. This cantake a value from ["delete", "reset", "revert"] as all these
operations should block the job from opening. The delete action
has already been blocking the job by setting the
deleting
field.However, in order not to introduce different booleans for each
action,
block_reason
should be the way to do this onwards.Finally, this commit also sets the
block_reason
torevert
whenthe revert snapshot API is called as a job should not be opened
while it is reverted to a different model snapshot.