[ML] Improve cleanup for model snapshot upgrades #81831
Conversation
If a model snapshot upgrade persistent task is cancelled then we now kill any associated C++ process. Previously the C++ process could hang indefinitely. Additionally, ML feature reset now cancels any in-progress model snapshot upgrades before cleaning up job data, and deleting an anomaly detection job cancels any in-progress model snapshot upgrades associated with that job before cleaning up the job's data. Fixes elastic#81578
Pinging @elastic/ml-core (Team:ML)
Other than my comments, this looks good to go.
PARSER.declareString((request, jobId) -> request.jobId = jobId, Job.ID);
PARSER.declareString((request, snapshotId) -> request.snapshotId = snapshotId, SNAPSHOT_ID);
Suggested change:
- PARSER.declareString((request, jobId) -> request.jobId = jobId, Job.ID);
- PARSER.declareString((request, snapshotId) -> request.snapshotId = snapshotId, SNAPSHOT_ID);
+ PARSER.declareString(Request::setJobId, Job.ID);
+ PARSER.declareString(Request::setSnapshotId, SNAPSHOT_ID);
This seems OK as we are not allowing null in the parser declarations.
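For illustration, a minimal sketch of the plain setters that the suggested method references (Request::setJobId, Request::setSnapshotId) would bind to. This is not the PR's actual class; the Elasticsearch supertypes, parsing and wire serialization are deliberately left out.

// Sketch only: the setters that Request::setJobId / Request::setSnapshotId resolve to.
// Elasticsearch-specific supertypes, parsing and serialization are omitted.
public class Request {

    private String jobId;
    private String snapshotId;

    public void setJobId(String jobId) {
        this.jobId = jobId;
    }

    public void setSnapshotId(String snapshotId) {
        this.snapshotId = snapshotId;
    }

    public String getJobId() {
        return jobId;
    }

    public String getSnapshotId() {
        return snapshotId;
    }
}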
public class CancelJobModelSnapshotUpgradeAction extends ActionType<CancelJobModelSnapshotUpgradeAction.Response> {

    public static final CancelJobModelSnapshotUpgradeAction INSTANCE = new CancelJobModelSnapshotUpgradeAction();
    public static final String NAME = "cluster:admin/xpack/ml/job/model_snapshots/upgrade/cancel";
Suggested change:
- public static final String NAME = "cluster:admin/xpack/ml/job/model_snapshots/upgrade/cancel";
+ public static final String NAME = "cluster:internal/xpack/ml/job/model_snapshots/upgrade/cancel";
Since there is no REST layer for this, it seems prudent to make it internal.
I intentionally made this admin instead of internal. Although there's no REST layer today, it's possible that we might need to add one in the future, say if we ever want the UI to be able to clean up a failed upgrade.
The reason we've used internal instead of admin in the past is to make it hard for users of the transport client to call the API, even though it's easy for them to write the code to call it. Since there's nothing inherently wrong with calling this endpoint if you want to clean up failed upgrades, I don't think that's a problem in this case. (Plus it's only in 7.17 that the transport client will even exist - after that, lack of a REST layer will make this impossible to call.) I will add a comment to say this.
Part of the reason I didn't add a REST layer in this PR was because it's being added so late in the 7.x development cycle and it's not strictly essential at the moment.
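For illustration only, the promised code comment might look something like the following next to the action name constant; the wording is hypothetical, and only the NAME value comes from the PR.

// Hypothetical wording (not the PR's actual comment): this action is deliberately
// "admin" rather than "internal". There is no REST endpoint today, but calling it
// externally is harmless, and a REST layer (e.g. so the UI can clean up a failed
// upgrade) may be added in a future release.
public static final String NAME = "cluster:admin/xpack/ml/job/model_snapshots/upgrade/cancel";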
internal conveys a message even if just a convention. If this was to become a public API with a REST interface we would create a new admin wrapper action that calls this one and have this action internal. We have many internal APIs called this way.
I don't consider this a blocker on the PR, but there is a path for making internal APIs public. If you think it likely there will be a REST API then keep it admin, otherwise it makes more sense to be internal.
I think it's 50/50 whether this will be made public in the future.
> If this was to become a public API with a REST interface we would create a new admin wrapper action that calls this one and have this action internal.
This seems like needless baggage to me. The wrapper action would literally call the internal action and do nothing else. It would just be extra boilerplate code.
> internal conveys a message even if just a convention.
It's only the ML team that has ever done this. All the internal actions of core Elasticsearch start with either admin or monitor. (For example, look at the persistent tasks actions.) The original reason why the ML team did this was to make transport client users think twice before calling an API that was intended to be internal. Of course this only had an effect if security was enabled, but it was in the days of X-Pack as a separate plugin, where ML users were likely to be using security but most users weren't. It was also in the days before the transport client was deprecated, so we were more worried about transport client users accidentally automating things they shouldn't.
But as I said in the comment I added to the code, in this case we wouldn't really care if a transport client user did call this action. It doesn't do something low level that will mess up the cluster if called at the wrong time. It's an action that could be external; it just isn't at the moment.
I agree that as the transport client is not used directly any more we don't really gain much by calling the action internal.
It's good we had this discussion then, as internal is a convention we have been using for a long time. If we choose to stop doing this, we have already established the usage of naming internal actions internal, e.g. InternalInferModelAction; perhaps we can switch to that convention for actions that are purely internal.
The key difference with this action is that it doesn't need to be internal. It should be possible to make it external in the future with as little extra effort as possible. Whereas InternalInferModelAction is specifically designed to be internal, and it would cause problems if people ever started calling it externally. An action that's more similar to this one is DeleteExpiredDataAction, where its primary purpose is to be called by the daily maintenance task, but it doesn't hurt to call it externally. DeleteExpiredDataAction does have a corresponding REST endpoint, but if we'd wanted to avoid yet another endpoint showing up in our list of documented endpoints initially, we could have deferred adding the REST endpoint to a release after the action was added.
> It's only the ML team that has ever done this. All the internal actions of core Elasticsearch start with either admin or monitor.

> I agree that as the transport client is not used directly any more we don't really gain much by calling the action internal.
I was replying to these points, stating that the original reason for using internal no longer applies and suggesting it can be discontinued with the demise of the transport client.
> I was replying to these points, stating that the original reason for using internal no longer applies and suggesting it can be discontinued with the demise of the transport client.
Yes, sure, for new actions we can discontinue it. But for the existing ones that contain internal, changing it is a non-trivial piece of work, so we might as well leave them as-is.
@@ -152,6 +152,7 @@
  "cluster:admin/xpack/ml/job/model_snapshots/revert",
  "cluster:admin/xpack/ml/job/model_snapshots/update",
  "cluster:admin/xpack/ml/job/model_snapshots/upgrade",
+ "cluster:admin/xpack/ml/job/model_snapshots/upgrade/cancel",
"cluster:admin/xpack/ml/job/model_snapshots/upgrade/cancel", | |
"cluster:internal/xpack/ml/job/model_snapshots/upgrade/cancel", |
        super(NAME, Response::new);
    }

    public static class Request extends BaseTasksRequest<Request> implements ToXContentObject {
The request is a BaseTasksRequest yet the transport action is a HandledTransportAction, not a TransportTasksAction.
Suggested change:
- public static class Request extends BaseTasksRequest<Request> implements ToXContentObject {
+ public static class Request extends ActionRequest<Request> implements ToXContentObject {
// cannot be as thorough.
List<PersistentTasksCustomMetadata.PersistentTask<?>> upgradeTasksToCancel = MlTasks.snapshotUpgradeTasks(tasksInProgress)
    .stream()
    .filter(t -> jobIds.contains(((SnapshotUpgradeTaskParams) t.getParams()).getJobId()))
Why not use a SimpleIdsMatcher for job IDs and save a call to expandJobs?
expandJobs handles allow_no_match. It's impossible in this action for allow_no_match to work with the snapshot IDs, because the upgrade tasks are ephemeral and stopping a stopped thing is not an error. But allow_no_match can work with the job IDs.
This also ties in with #81831 (comment) because if we were only ever going to use this action internally then it wouldn't be so important to support allow_no_match at all. But I can imagine in the future we might want to call this action from an enhanced UI. We'll need to add a REST endpoint at that time and it's nice if the machinery of allow_no_match is available. The way allow_no_match is handled for jobs in this action is the same as in the recently added model snapshot upgrades stats action.
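To make the intended semantics concrete, here is a small self-contained toy (not the PR's code; the class, method and field names are invented) showing allow_no_match being applied to the job expression while cancelling zero matching upgrade tasks is still treated as success.

import java.util.*;
import java.util.stream.*;

// Toy model of the behaviour described above: allow_no_match applies to the job
// expression, while cancelling zero upgrade tasks is never an error. All names here
// are illustrative stand-ins, not the actual Elasticsearch ML classes.
public class CancelUpgradeSemanticsSketch {

    record UpgradeTask(String jobId, String snapshotId) {}

    // Expand a comma-separated, wildcard-capable job expression against known job IDs.
    static Set<String> expandJobIds(String expression, Set<String> knownJobs, boolean allowNoMatch) {
        Set<String> matched = new TreeSet<>();
        for (String token : expression.split(",")) {
            String regex = token.trim().replace("*", ".*");
            knownJobs.stream().filter(j -> j.matches(regex)).forEach(matched::add);
        }
        if (matched.isEmpty() && allowNoMatch == false) {
            throw new IllegalArgumentException("No known job matches [" + expression + "]");
        }
        return matched;
    }

    public static void main(String[] args) {
        Set<String> knownJobs = Set.of("foo", "foo-2", "bar");
        List<UpgradeTask> runningUpgrades = List.of(new UpgradeTask("foo", "1"));

        // allow_no_match is enforced against the job IDs...
        Set<String> jobIds = expandJobIds("foo*", knownJobs, false);

        // ...but an empty result here is fine: cancelling nothing still leaves the
        // caller in the state they asked for (no upgrade running).
        List<UpgradeTask> toCancel = runningUpgrades.stream()
            .filter(t -> jobIds.contains(t.jobId()))
            .collect(Collectors.toList());

        System.out.println("Would cancel: " + toCancel);
    }
}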
There is a difference from the GET stats API, which enforces that the snapshot IDs are also matched.
I think what I've done is as sensible as it can be given the way the persistent tasks appear and disappear.
Consider these scenarios:
Scenario 1:
- User asks for stats for job foo, upgrade of snapshot 1
- Job foo doesn't exist
- We return not found
- This isn't controversial
Scenario 2:
- User asks for stats for job foo, upgrade of snapshot 1
- Job foo exists but snapshot 1 isn't being upgraded
- We return not found
- This one is potentially controversial, as the snapshot upgrade might only just have finished, but since it's not running we cannot return any stats. The only things we could do to make this better would be (1) persist stats for snapshot upgrades that have finished or (2) make upgrading a snapshot into a multi-step process where you first open the upgrade, then later close it (like you do for jobs). Both of these things bring considerable extra baggage to the upgrade process.
Scenario 3:
- User asks for stats for job foo, upgrade of snapshot 1
- Job foo exists and snapshot 1 is being upgraded
- We return the stats
- This isn't controversial
Scenario 4:
- User asks to cancel upgrade of snapshot 1 for job foo
- Job foo doesn't exist
- We return not found
- This isn't controversial
Scenario 5:
- User asks to cancel upgrade of snapshot 1 for job foo
- Job foo exists but snapshot 1 isn't being upgraded
- We do nothing and return success
- This one is potentially controversial, but the end state is what the user wanted, and the snapshot upgrade might have just finished in between the user listing the upgrades in progress and asking to cancel this one. It is also consistent with the fact that closing a closed job is not an error and stopping a stopped datafeed is not an error.
Scenario 6:
- User asks to cancel upgrade of snapshot 1 for job foo
- Job foo exists and snapshot 1 is being upgraded
- We cancel the upgrade and return success
- This isn't controversial
I don't think there's a better way.
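As a companion to the list above, a minimal self-contained sketch of the cancel-side decision table in scenarios 4-6. The names and signatures are invented for illustration, not the real Elasticsearch API, and per-job snapshot tracking is flattened into a single set for brevity.

import java.util.Set;

// Toy sketch of the cancel-side decision table from scenarios 4-6 above.
// Names and signatures are illustrative, not the real Elasticsearch API.
public class CancelDecisionSketch {

    enum Outcome { NOT_FOUND, SUCCESS_NOOP, SUCCESS_CANCELLED }

    static Outcome cancel(String jobId, String snapshotId, Set<String> existingJobs, Set<String> upgradingSnapshots) {
        if (existingJobs.contains(jobId) == false) {
            return Outcome.NOT_FOUND;        // scenario 4: unknown job
        }
        if (upgradingSnapshots.contains(snapshotId) == false) {
            return Outcome.SUCCESS_NOOP;     // scenario 5: nothing to cancel, not an error
        }
        // Scenario 6: this is where the persistent task would be cancelled
        // (and the associated C++ process killed).
        return Outcome.SUCCESS_CANCELLED;
    }

    public static void main(String[] args) {
        Set<String> jobs = Set.of("foo");
        Set<String> upgrading = Set.of("1");
        System.out.println(cancel("missing", "1", jobs, upgrading)); // NOT_FOUND
        System.out.println(cancel("foo", "2", jobs, upgrading));     // SUCCESS_NOOP
        System.out.println(cancel("foo", "1", jobs, upgrading));     // SUCCESS_CANCELLED
    }
}

The idempotent no-op case is the same design choice the existing close-job and stop-datafeed APIs make: the caller asked for a state, and the action reports success once that state holds.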
* [7.17] [ML] Improve cleanup for model snapshot upgrades (backport of #81831)
* Formatting