
[ML] Improve cleanup for model snapshot upgrades #81831


Merged

Conversation

droberts195
Contributor

If a model snapshot upgrade persistent task is cancelled then
we now kill any associated C++ process. Previously the C++ process
could hang indefinitely.

Additionally, ML feature reset now cancels any in-progress model
snapshot upgrades before cleaning up job data, and deleting an
anomaly detection job cancels any in-progress model snapshot
upgrades associated with that job before cleaning up the job's
data.

Fixes #81578
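As a rough illustration of the cleanup behaviour described above, here is a minimal sketch (not the actual Elasticsearch implementation); NativeProcessHandle and SnapshotUpgradeTaskRunner are hypothetical stand-ins for the real process and task classes:

```java
// Hypothetical sketch: kill the native (C++) process when the persistent task
// backing a model snapshot upgrade is cancelled, instead of letting it hang.
interface NativeProcessHandle {
    void kill();
}

final class SnapshotUpgradeTaskRunner {
    private volatile NativeProcessHandle process; // set when the C++ process starts

    // Called when the persistent task is cancelled (names are illustrative).
    void onTaskCancelled(String reason) {
        NativeProcessHandle p = process;
        if (p != null) {
            p.kill(); // previously the process could be left running indefinitely
        }
    }
}
```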

droberts195 marked this pull request as ready for review December 17, 2021 19:30
elasticmachine added the Team:ML (Meta label for the ML team) label Dec 17, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

Member

@benwtrent left a comment


Other than my comments, this looks good to go.

Comment on lines 51 to 52
PARSER.declareString((request, jobId) -> request.jobId = jobId, Job.ID);
PARSER.declareString((request, snapshotId) -> request.snapshotId = snapshotId, SNAPSHOT_ID);
Member


Suggested change
PARSER.declareString((request, jobId) -> request.jobId = jobId, Job.ID);
PARSER.declareString((request, snapshotId) -> request.snapshotId = snapshotId, SNAPSHOT_ID);
PARSER.declareString(Request::setJobId, Job.ID);
PARSER.declareString(Request::setSnapshotId, SNAPSHOT_ID);

This seems OK as we are not allowing null in the parser declarations
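For context, here is a stripped-down sketch of the two declaration styles being compared; MiniParser is a hypothetical stand-in for the real ObjectParser, and the field names are illustrative:

```java
import java.util.function.BiConsumer;

// Hypothetical stand-in for the real parser class, just to show the shape of
// the declarations: a method reference setter versus an inline lambda.
final class MiniParser<T> {
    void declareString(BiConsumer<T, String> setter, String fieldName) {
        // real parsing omitted; only the declaration style matters here
    }
}

final class Request {
    private String jobId;
    private String snapshotId;

    void setJobId(String jobId) { this.jobId = jobId; }
    void setSnapshotId(String snapshotId) { this.snapshotId = snapshotId; }

    static void declareFields(MiniParser<Request> parser) {
        // Suggested style: method references on the setters
        parser.declareString(Request::setJobId, "job_id");
        parser.declareString(Request::setSnapshotId, "snapshot_id");
        // Original style: lambdas assigning the fields directly, e.g.
        // parser.declareString((request, id) -> request.jobId = id, "job_id");
    }
}
```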

public class CancelJobModelSnapshotUpgradeAction extends ActionType<CancelJobModelSnapshotUpgradeAction.Response> {

public static final CancelJobModelSnapshotUpgradeAction INSTANCE = new CancelJobModelSnapshotUpgradeAction();
public static final String NAME = "cluster:admin/xpack/ml/job/model_snapshots/upgrade/cancel";
Member


Suggested change
public static final String NAME = "cluster:admin/xpack/ml/job/model_snapshots/upgrade/cancel";
public static final String NAME = "cluster:internal/xpack/ml/job/model_snapshots/upgrade/cancel";

Since there is no REST layer for this, it seems prudent to make it internal.

Contributor Author


I intentionally made this admin instead of internal. Although there's no REST layer today, it's possible that we might need to add one in the future, say if we ever want the UI to be able to clean up a failed upgrade.

The reason we've used internal instead of admin in the past is to make it hard for users of the transport client to call the API, even though it's easy for them to write the code to call it. Since there's nothing inherently wrong with calling this endpoint if you want to clean up failed upgrades, I don't think that's a problem in this case. (Plus it's only in 7.17 that the transport client will even exist - after that, the lack of a REST layer will make this impossible to call.) I will add a comment to say this.

Part of the reason I didn't add a REST layer in this PR was because it's being added so late in the 7.x development cycle and it's not strictly essential at the moment.

Member


internal conveys a message even if just a convention. If this was to become a public API with a REST interface we would create a new admin wrapper action that calls this one and have this action internal. We have many internal APIs called this way.

I don't consider this a blocker on the PR, but there is a path for making internal APIs public. If you think it's likely there will be a REST API then keep it admin; otherwise it makes more sense to be internal.

Contributor Author


I think it's 50/50 whether this will be made public in the future.

If this was to become a public API with a REST interface we would create a new admin wrapper action that calls this one and have this action internal.

This seems like needless baggage to me. The wrapper action would literally call the internal action and do nothing else. It would just be extra boilerplate code.

internal conveys a message even if just a convention.

It's only the ML team that has ever done this. All the internal actions of core Elasticsearch start with either admin or monitor. (For example, look at the persistent tasks actions.) The original reason why the ML team did this was to make transport client users think twice before calling an API that was intended to be internal. Of course this only had an effect if security was enabled, but that was in the days of X-Pack as a separate plugin, when ML users were likely to be using security even though most users weren't. It was also before the transport client was deprecated, so we were more worried about transport client users accidentally automating things they shouldn't.

But as I said in the comment I added to the code, in this case we wouldn't really care if a transport client user did call this action. It doesn't do something low level that will mess up the cluster if called at the wrong time. It's an action that could be external; it just isn't at the moment.

Contributor


I agree that as the transport client is not used directly any more we don't really gain much by calling the action internal.

Member


It's good we had this discussion then, as internal is a convention we have been using for a long time. If we choose to stop doing this, we have already established the practice of naming internal actions Internal*, e.g. InternalInferModelAction, so perhaps we can switch to that convention for actions that are purely internal.

Contributor Author


The key difference with this action is that it doesn't need to be internal. It should be possible to make it external in the future with as little extra effort as possible. Whereas InternalInferModelAction is specifically designed to be internal, and it would cause problems if people ever started calling it externally. An action that's more similar to this one is DeleteExpiredDataAction, where its primary purpose is to be called by the daily maintenance task, but it doesn't hurt to call it externally. DeleteExpiredDataAction does have a corresponding REST endpoint, but if we'd wanted to avoid adding yet another endpoint to our list of documented endpoints initially, we could have deferred adding the REST endpoint to a release after the action was added.

Member


It's only the ML team that has ever done this. All the internal actions of core Elasticsearch start with either admin or monitor.

I agree that as the transport client is not used directly any more we don't really gain much by calling the action internal.

I was replying to these points, stating that the original reason for using internal no longer applies and suggesting that it can be discontinued with the demise of the transport client.

Contributor Author


I was replying to these points, stating that the original reason for using internal no longer applies and suggesting that it can be discontinued with the demise of the transport client.

Yes, sure, for new actions we can discontinue it. But for the existing ones that contain internal, changing it is a non-trivial piece of work, so we might as well leave them as-is.

@@ -152,6 +152,7 @@
"cluster:admin/xpack/ml/job/model_snapshots/revert",
"cluster:admin/xpack/ml/job/model_snapshots/update",
"cluster:admin/xpack/ml/job/model_snapshots/upgrade",
"cluster:admin/xpack/ml/job/model_snapshots/upgrade/cancel",
Member


Suggested change
"cluster:admin/xpack/ml/job/model_snapshots/upgrade/cancel",
"cluster:internal/xpack/ml/job/model_snapshots/upgrade/cancel",

super(NAME, Response::new);
}

public static class Request extends BaseTasksRequest<Request> implements ToXContentObject {
Member


The request is a BaseTasksRequest yet the transport action is a HandledTransportAction, not a TransportTasksAction.

Suggested change
public static class Request extends BaseTasksRequest<Request> implements ToXContentObject {
public static class Request extends ActionRequest<Request> implements ToXContentObject {
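To illustrate the reviewer's point, a minimal sketch of what a plain ActionRequest-based request could look like (class and field names here are hypothetical, not the PR's actual code), assuming the Elasticsearch server library is on the classpath:

```java
import org.elasticsearch.action.ActionRequest;
import org.elasticsearch.action.ActionRequestValidationException;

// Hypothetical request for a HandledTransportAction: a plain ActionRequest
// rather than BaseTasksRequest, which is the base class aimed at actions
// dispatched through a TransportTasksAction.
public class CancelUpgradeRequest extends ActionRequest {
    private String jobId;
    private String snapshotId;

    public void setJobId(String jobId) { this.jobId = jobId; }
    public void setSnapshotId(String snapshotId) { this.snapshotId = snapshotId; }

    @Override
    public ActionRequestValidationException validate() {
        return null; // validation intentionally omitted in this sketch
    }
}
```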

// cannot be as thorough.
List<PersistentTasksCustomMetadata.PersistentTask<?>> upgradeTasksToCancel = MlTasks.snapshotUpgradeTasks(tasksInProgress)
.stream()
.filter(t -> jobIds.contains(((SnapshotUpgradeTaskParams) t.getParams()).getJobId()))
Member


Why not use a SimpleIdsMatcher for job IDs and save a call to expandJobs?

Contributor Author


expandJobs handles allow_no_match. It's impossible in this action for allow_no_match to work with the snapshot IDs, because the upgrade tasks are ephemeral and stopping a stopped thing is not an error. But allow_no_match can work with the job IDs.

This also ties in with #81831 (comment) because if we were only ever going to use this action internally then it wouldn't be so important to support allow_no_match at all. But I can imagine in the future we might want to call this action from an enhanced UI. We'll need to add a REST endpoint at that time and it's nice if the machinery of allow_no_match is available. The way allow_no_match is handled for jobs in this action is the same as in the recently added model snapshot upgrades stats action.
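As a simplified illustration of the filtering described above (the types here are hypothetical stand-ins, and the wildcard expansion that enforces allow_no_match is assumed to have already produced the set of job IDs):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical simplified stand-ins for the persistent task and its params.
record UpgradeTaskParams(String jobId, String snapshotId) {}
record UpgradeTask(String taskId, UpgradeTaskParams params) {}

final class UpgradeTaskFilter {
    // Keep only the in-progress snapshot upgrade tasks whose job ID is in the
    // set produced by expanding the request's job ID expression.
    static List<UpgradeTask> tasksToCancel(List<UpgradeTask> upgradeTasks, Set<String> expandedJobIds) {
        return upgradeTasks.stream()
            .filter(t -> expandedJobIds.contains(t.params().jobId()))
            .collect(Collectors.toList());
    }
}
```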

Contributor Author


I think what I've done is as sensible as it can be given the way the persistent tasks appear and disappear.

Consider these scenarios:

Scenario 1

  • User asks for stats for job foo, upgrade of snapshot 1
  • Job foo doesn't exist
  • We return not found
  • This isn't controversial

Scenario 2

  • User asks for stats for job foo, upgrade of snapshot 1
  • Job foo exists but snapshot 1 isn't being upgraded
  • We return not found
  • This one is potentially controversial, as the snapshot upgrade might only just have finished, but since it's not running we cannot return any stats. The only things we could do to make this better would be (1) persist stats for snapshot upgrades that have finished or (2) make upgrading a snapshot into a multi-step process where you first open the upgrade, then later close it (like you do for jobs). Both of these things bring considerable extra baggage to the upgrade process.

Scenario 3

  • User asks for stats for job foo, upgrade of snapshot 1
  • Job foo exists and snapshot 1 is being upgraded
  • We return the stats
  • This isn't controversial

Scenario 4

  • User asks to cancel upgrade of snapshot 1 for job foo
  • Job foo doesn't exist
  • We return not found
  • This isn't controversial

Scenario 5

  • User asks to cancel upgrade of snapshot 1 for job foo
  • Job foo exists but snapshot 1 isn't being upgraded
  • We do nothing and return success
  • This one is potentially controversial, but the end state is what the user wanted, and the snapshot upgrade might have just finished in between the user listing the upgrades in progress and asking to cancel this one. This is also consistent with the fact that closing a closed job is not an error and stopping a stopped datafeed is not an error.

Scenario 6

  • User asks to cancel upgrade of snapshot 1 for job foo
  • Job foo exists and snapshot 1 is being upgraded
  • We cancel the upgrade and return success
  • This isn't controversial

I don't think there's a better way.
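A minimal sketch of the decision logic for the cancel case described in scenarios 4-6 (the names here are illustrative, not the actual transport action code):

```java
// Hypothetical outcome of a cancel request: "not found" when the job does not
// exist, an idempotent success when the named snapshot upgrade is not running,
// and a real cancellation otherwise.
enum Outcome { NOT_FOUND, ALREADY_DONE, CANCELLED }

final class CancelDecision {
    static Outcome decide(boolean jobExists, boolean upgradeRunning, Runnable cancelUpgrade) {
        if (jobExists == false) {
            return Outcome.NOT_FOUND;      // scenario 4
        }
        if (upgradeRunning == false) {
            return Outcome.ALREADY_DONE;   // scenario 5: the end state already matches the request
        }
        cancelUpgrade.run();               // scenario 6: cancel the running upgrade
        return Outcome.CANCELLED;
    }
}
```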

droberts195 merged commit 555c5b4 into elastic:master Dec 22, 2021
droberts195 deleted the improve_model_snapshot_upgrade_cleanup branch December 22, 2021 11:03
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Dec 22, 2021

Backport of elastic#81831

elasticsearchmachine pushed a commit that referenced this pull request Dec 22, 2021

* [7.17] [ML] Improve cleanup for model snapshot upgrades

Backport of #81831

* Formatting

droberts195 added a commit that referenced this pull request Dec 22, 2021

Backport of #81831
Labels
>bug, :ml Machine learning, Team:ML Meta label for the ML team, v7.17.0, v8.0.0-rc1, v8.1.0
Development

Successfully merging this pull request may close these issues.

[ML] Model snapshot upgrade processes need better cleanup/cancellation functionality
6 participants