
[ML] Improve cleanup for model snapshot upgrades #81831


Merged

Conversation

droberts195
Contributor

If a model snapshot upgrade persistent task is cancelled then
we now kill any associated C++ process. Previously the C++ process
could hang indefinitely.

Additionally, ML feature reset now cancels any in-progress model
snapshot upgrades before cleaning up job data, and deleting an
anomaly detection job cancels any in-progress model snapshot
upgrades associated with that job before cleaning up the job's
data.

Fixes #81578
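As a rough illustration of the cleanup behaviour described above, here is a minimal sketch (not the actual Elasticsearch implementation); NativeProcessHandle and SnapshotUpgradeTaskRunner are hypothetical stand-ins for the real process and task classes:

```java
// Hypothetical sketch: kill the native (C++) process when the persistent task
// backing a model snapshot upgrade is cancelled, instead of letting it hang.
interface NativeProcessHandle {
    void kill();
}

final class SnapshotUpgradeTaskRunner {
    private volatile NativeProcessHandle process; // set when the C++ process starts

    // Called when the persistent task is cancelled (names are illustrative).
    void onTaskCancelled(String reason) {
        NativeProcessHandle p = process;
        if (p != null) {
            p.kill(); // previously the process could be left running indefinitely
        }
    }
}
```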

droberts195 marked this pull request as ready for review December 17, 2021 19:30
elasticmachine added the Team:ML (Meta label for the ML team) label Dec 17, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

Member

@benwtrent left a comment


Other than my comments, this looks good to go.

Comment on lines 51 to 52
PARSER.declareString((request, jobId) -> request.jobId = jobId, Job.ID);
PARSER.declareString((request, snapshotId) -> request.snapshotId = snapshotId, SNAPSHOT_ID);
Member


Suggested change
PARSER.declareString((request, jobId) -> request.jobId = jobId, Job.ID);
PARSER.declareString((request, snapshotId) -> request.snapshotId = snapshotId, SNAPSHOT_ID);
PARSER.declareString(Request::setJobId, Job.ID);
PARSER.declareString(Request::setSnapshotId, SNAPSHOT_ID);

This seems OK as we are not allowing null in the parser declarations
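For context, here is a stripped-down sketch of the two declaration styles being compared; MiniParser is a hypothetical stand-in for the real ObjectParser, and the field names are illustrative:

```java
import java.util.function.BiConsumer;

// Hypothetical stand-in for the real parser class, just to show the shape of
// the declarations: a method reference setter versus an inline lambda.
final class MiniParser<T> {
    void declareString(BiConsumer<T, String> setter, String fieldName) {
        // real parsing omitted; only the declaration style matters here
    }
}

final class Request {
    private String jobId;
    private String snapshotId;

    void setJobId(String jobId) { this.jobId = jobId; }
    void setSnapshotId(String snapshotId) { this.snapshotId = snapshotId; }

    static void declareFields(MiniParser<Request> parser) {
        // Suggested style: method references on the setters
        parser.declareString(Request::setJobId, "job_id");
        parser.declareString(Request::setSnapshotId, "snapshot_id");
        // Original style: lambdas assigning the fields directly, e.g.
        // parser.declareString((request, id) -> request.jobId = id, "job_id");
    }
}
```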

public class CancelJobModelSnapshotUpgradeAction extends ActionType<CancelJobModelSnapshotUpgradeAction.Response> {

public static final CancelJobModelSnapshotUpgradeAction INSTANCE = new CancelJobModelSnapshotUpgradeAction();
public static final String NAME = "cluster:admin/xpack/ml/job/model_snapshots/upgrade/cancel";
Member


Suggested change
public static final String NAME = "cluster:admin/xpack/ml/job/model_snapshots/upgrade/cancel";
public static final String NAME = "cluster:internal/xpack/ml/job/model_snapshots/upgrade/cancel";

Since there is no REST layer for this, it seems prudent to make it internal.

Contributor Author


I intentionally made this admin instead of internal. Although there's no REST layer today, it's possible that we might need to add one in the future, say if we ever want the UI to be able to clean up a failed upgrade.

The reason we've used internal instead of admin in the past is to make it hard for users of the transport client to call the API, even though it's easy for them to write the code to call it. Since there's nothing inherently wrong with calling this endpoint if you want to clean up failed upgrades, I don't think that's a problem in this case. (Plus it's only in 7.17 that the transport client will even exist - after that, the lack of a REST layer will make this impossible to call.) I will add a comment to say this.

Part of the reason I didn't add a REST layer in this PR was because it's being added so late in the 7.x development cycle and it's not strictly essential at the moment.

Member


internal conveys a message even if just a convention. If this was to become a public API with a REST interface we would create a new admin wrapper action that calls this one and have this action internal. We have many internal APIs called this way.

I don't consider this a blocker on the PR, but there is a path for making internal APIs public. If you think it's likely there will be a REST API then keep it admin; otherwise it makes more sense to be internal.

Contributor Author


I think it's 50/50 whether this will be made public in the future.

If this was to become a public API with a REST interface we would create a new admin wrapper action that calls this one and have this action internal.

This seems like needless baggage to me. The wrapper action would literally call the internal action and do nothing else. It would just be extra boilerplate code.

internal conveys a message even if just a convention.

It's only the ML team that has ever done this. All the internal actions of core Elasticsearch start with either admin or monitor. (For example, look at the persistent tasks actions.) The original reason why the ML team did this was to make transport client users think twice before calling an API that was intended to be internal. Of course this only had an effect if security was enabled, but that was in the days of X-Pack as a separate plugin, when ML users were likely to be using security even though most users weren't. It was also before the transport client was deprecated, so we were more worried about transport client users accidentally automating things they shouldn't.

But as I said in the comment I added to the code, in this case we wouldn't really care if a transport client user did call this action. It doesn't do something low level that will mess up the cluster if called at the wrong time. It's an action that could be external; it just isn't at the moment.

Contributor


I agree that as the transport client is not used directly any more we don't really gain much by calling the action internal.

Member


It's good we had this discussion then, as internal is a convention we have been using for a long time. If we choose to stop doing this, we have already established the practice of naming internal actions Internal*, e.g. InternalInferModelAction, so perhaps we can switch to that convention for actions that are purely internal.

Contributor Author


The key difference with this action is that it doesn't need to be internal. It should be possible to make it external in the future with as little extra effort as possible. Whereas InternalInferModelAction is specifically designed to be internal, and it would cause problems if people ever started calling it externally. An action that's more similar to this one is DeleteExpiredDataAction, where its primary purpose is to be called by the daily maintenance task, but it doesn't hurt to call it externally. DeleteExpiredDataAction does have a corresponding REST endpoint, but if we'd wanted to avoid adding yet another endpoint to our list of documented endpoints initially, we could have deferred adding the REST endpoint to a release after the action was added.

Member


It's only the ML team that has ever done this. All the internal actions of core Elasticsearch start with either admin or monitor.

I agree that as the transport client is not used directly any more we don't really gain much by calling the action internal.

I was replying to these points, stating that the original reason for using internal no longer applies and suggesting that it can be discontinued with the demise of the transport client.

Contributor Author


I was replying to these points, stating that the original reason for using internal no longer applies and suggesting that it can be discontinued with the demise of the transport client.

Yes, sure, for new actions we can discontinue it. But for the existing ones that contain internal, changing it is a non-trivial piece of work, so we might as well leave them as-is.

@@ -152,6 +152,7 @@
"cluster:admin/xpack/ml/job/model_snapshots/revert",
"cluster:admin/xpack/ml/job/model_snapshots/update",
"cluster:admin/xpack/ml/job/model_snapshots/upgrade",
"cluster:admin/xpack/ml/job/model_snapshots/upgrade/cancel",
Member


Suggested change
"cluster:admin/xpack/ml/job/model_snapshots/upgrade/cancel",
"cluster:internal/xpack/ml/job/model_snapshots/upgrade/cancel",

super(NAME, Response::new);
}

public static class Request extends BaseTasksRequest<Request> implements ToXContentObject {
Member


The request is a BaseTasksRequest yet the transport action is a HandledTransportAction, not a TransportTasksAction.

Suggested change
public static class Request extends BaseTasksRequest<Request> implements ToXContentObject {
public static class Request extends ActionRequest<Request> implements ToXContentObject {
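To illustrate the reviewer's point, a minimal sketch of what a plain ActionRequest-based request could look like (class and field names here are hypothetical, not the PR's actual code), assuming the Elasticsearch server library is on the classpath:

```java
import org.elasticsearch.action.ActionRequest;
import org.elasticsearch.action.ActionRequestValidationException;

// Hypothetical request for a HandledTransportAction: a plain ActionRequest
// rather than BaseTasksRequest, which is the base class aimed at actions
// dispatched through a TransportTasksAction.
public class CancelUpgradeRequest extends ActionRequest {
    private String jobId;
    private String snapshotId;

    public void setJobId(String jobId) { this.jobId = jobId; }
    public void setSnapshotId(String snapshotId) { this.snapshotId = snapshotId; }

    @Override
    public ActionRequestValidationException validate() {
        return null; // validation intentionally omitted in this sketch
    }
}
```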

// cannot be as thorough.
List<PersistentTasksCustomMetadata.PersistentTask<?>> upgradeTasksToCancel = MlTasks.snapshotUpgradeTasks(tasksInProgress)
.stream()
.filter(t -> jobIds.contains(((SnapshotUpgradeTaskParams) t.getParams()).getJobId()))
Member


Why not use a SimpleIdsMatcher for job IDs and save a call to expandJobs?

Contributor Author


expandJobs handles allow_no_match. It's impossible in this action for allow_no_match to work with the snapshot IDs, because the upgrade tasks are ephemeral and stopping a stopped thing is not an error. But allow_no_match can work with the job IDs.

This also ties in with #81831 (comment) because if we were only ever going to use this action internally then it wouldn't be so important to support allow_no_match at all. But I can imagine in the future we might want to call this action from an enhanced UI. We'll need to add a REST endpoint at that time and it's nice if the machinery of allow_no_match is available. The way allow_no_match is handled for jobs in this action is the same as in the recently added model snapshot upgrades stats action.
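As a simplified illustration of the filtering described above (the types here are hypothetical stand-ins, and the wildcard expansion that enforces allow_no_match is assumed to have already produced the set of job IDs):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical simplified stand-ins for the persistent task and its params.
record UpgradeTaskParams(String jobId, String snapshotId) {}
record UpgradeTask(String taskId, UpgradeTaskParams params) {}

final class UpgradeTaskFilter {
    // Keep only the in-progress snapshot upgrade tasks whose job ID is in the
    // set produced by expanding the request's job ID expression.
    static List<UpgradeTask> tasksToCancel(List<UpgradeTask> upgradeTasks, Set<String> expandedJobIds) {
        return upgradeTasks.stream()
            .filter(t -> expandedJobIds.contains(t.params().jobId()))
            .collect(Collectors.toList());
    }
}
```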

Contributor Author


I think what I've done is as sensible as it can be given the way the persistent tasks appear and disappear.

Consider these scenarios:

Scenario 1

  • User asks for stats for job foo, upgrade of snapshot 1
  • Job foo doesn't exist
  • We return not found
  • This isn't controversial

Scenario 2

  • User asks for stats for job foo, upgrade of snapshot 1
  • Job foo exists but snapshot 1 isn't being upgraded
  • We return not found
  • This one is potentially controversial, as the snapshot upgrade might only just have finished, but since it's not running we cannot return any stats. The only things we could do to make this better would be (1) persist stats for snapshot upgrades that have finished or (2) make upgrading a snapshot into a multi-step process where you first open the upgrade, then later close it (like you do for jobs). Both of these things bring considerable extra baggage to the upgrade process.

Scenario 3

  • User asks for stats for job foo, upgrade of snapshot 1
  • Job foo exists and snapshot 1 is being upgraded
  • We return the stats
  • This isn't controversial

Scenario 4

  • User asks to cancel upgrade of snapshot 1 for job foo
  • Job foo doesn't exist
  • We return not found
  • This isn't controversial

Scenario 5

  • User asks to cancel upgrade of snapshot 1 for job foo
  • Job foo exists but snapshot 1 isn't being upgraded
  • We do nothing and return success
  • This one is potentially controversial, but the end state is what the user wanted, and the snapshot upgrade might have just finished in between the user listing the upgrades in progress and asking to cancel this one. This is also consistent with the fact that closing a closed job is not an error and stopping a stopped datafeed is not an error.

Scenario 6

  • User asks to cancel upgrade of snapshot 1 for job foo
  • Job foo exists and snapshot 1 is being upgraded
  • We cancel the upgrade and return success
  • This isn't controversial

I don't think there's a better way.
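A minimal sketch of the decision logic for the cancel case described in scenarios 4-6 (the names here are illustrative, not the actual transport action code):

```java
// Hypothetical outcome of a cancel request: "not found" when the job does not
// exist, an idempotent success when the named snapshot upgrade is not running,
// and a real cancellation otherwise.
enum Outcome { NOT_FOUND, ALREADY_DONE, CANCELLED }

final class CancelDecision {
    static Outcome decide(boolean jobExists, boolean upgradeRunning, Runnable cancelUpgrade) {
        if (jobExists == false) {
            return Outcome.NOT_FOUND;      // scenario 4
        }
        if (upgradeRunning == false) {
            return Outcome.ALREADY_DONE;   // scenario 5: the end state already matches the request
        }
        cancelUpgrade.run();               // scenario 6: cancel the running upgrade
        return Outcome.CANCELLED;
    }
}
```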

droberts195 merged commit 555c5b4 into elastic:master Dec 22, 2021
droberts195 deleted the improve_model_snapshot_upgrade_cleanup branch December 22, 2021 11:03
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Dec 22, 2021

Backport of elastic#81831

elasticsearchmachine pushed a commit that referenced this pull request Dec 22, 2021

* [7.17] [ML] Improve cleanup for model snapshot upgrades

Backport of #81831

* Formatting

droberts195 added a commit that referenced this pull request Dec 22, 2021

Backport of #81831
Labels
>bug, :ml Machine learning, Team:ML Meta label for the ML team, v7.17.0, v8.0.0-rc1, v8.1.0
Development

Successfully merging this pull request may close these issues.

[ML] Model snapshot upgrade processes need better cleanup/cancellation functionality
6 participants