[ML] Wait for model process to stop in stop deployment #83644

Merged · 3 commits · Feb 8, 2022

Conversation

davidkyle (Member)

When stopping a deployment, the tasks API was used to wait for the model task to finish, but the action used in the request did not match the model task's action, so the request returned success without waiting. The code then continued and deleted the model allocation. The node running the task would notice the deleted allocation and, if the task had not yet stopped, stop it again, resulting in a double stop and an error being logged because that is unexpected. The bug is minor and only manifests in the log files; the stop deployment request still succeeds.

The fix is to add a listener to the task stop API that responds only after the task is stopped.
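A minimal sketch of that listener-based stop, assuming the pattern shown in the diff further down; the enclosing stopAndWait method and the import paths for the x-pack classes are illustrative, not part of this PR's actual change.

```java
import org.elasticsearch.action.ActionListener;
// Assumed package paths for the x-pack classes referenced in the diff below.
import org.elasticsearch.xpack.core.ml.action.StopTrainedModelDeploymentAction;
import org.elasticsearch.xpack.ml.inference.deployment.TrainedModelDeploymentTask;

class StopDeploymentSketch {
    // Respond to the stop request only after the task reports it has actually
    // stopped, instead of responding immediately and letting the allocation
    // deletion race with the still-running task.
    void stopAndWait(TrainedModelDeploymentTask task, ActionListener<StopTrainedModelDeploymentAction.Response> listener) {
        task.stop(
            "undeploy_trained_model (api)",
            ActionListener.wrap(
                r -> listener.onResponse(new StopTrainedModelDeploymentAction.Response(true)), // task has stopped
                listener::onFailure                                                            // propagate stop errors
            )
        );
    }
}
```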

elasticmachine added the Team:ML (Meta label for the ML team) label on Feb 8, 2022
elasticmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

elasticsearchmachine (Collaborator)

Hi @davidkyle, I've created a changelog YAML for you.

benwtrent self-requested a review on February 8, 2022, 12:19
@@ -80,15 +82,11 @@ public TaskParams getParams() {
return params;
}

public void stop(String reason) {
logger.debug("[{}] Stopping due to reason [{}]", getModelId(), reason);
licensedFeature.stopTracking(licenseState, "model-" + params.getModelId());
Member

I think this still needs to be called. If you are concerned, maybe wrap the listener and call this on response/failure?
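One possible shape of that wrapping, as a rough sketch; whether stopTracking should be called here at all is what the rest of this thread settles, and the names come from the diff above.

```java
// Rough sketch of "wrap the listener and call this on response/failure":
// ActionListener.runBefore runs the callback before delegating to either
// onResponse or onFailure, so license tracking stops on both paths.
ActionListener<Void> wrapped = ActionListener.runBefore(
    listener,
    () -> licensedFeature.stopTracking(licenseState, "model-" + params.getModelId())
);
```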

davidkyle (Member Author)

Yeah, I'm not happy with this pattern. The task asks the node service to stop, then the node service calls back to TrainedModelDeploymentTask::markAsStopped from stopDeploymentAsync. This means markAsStopped is a public method, which doesn't make much sense as part of the public API.

The problem is that there are a few ways TrainedModelAllocationNodeService::stopDeploymentAsync can be called, such as the service noticing the deployment has been deleted or the task being cancelled.

I deleted these lines because markAsStopped (formerly stopWithoutNotification) was being called anyway and the same work occurs there.

Member

I deleted these lines because markAsStopped (formerly stopWithoutNotification) was being called anyway and the same work occurs there.

Gotcha, we just need to be careful and make sure that stopTracking is being called.

listener.onResponse(new StopTrainedModelDeploymentAction.Response(true));
task.stop(
"undeploy_trained_model (api)",
ActionListener.wrap(r -> listener.onResponse(new StopTrainedModelDeploymentAction.Response(true)), listener::onFailure)
Member

This is much cleaner and the execution path is easier to read.

.prepareListTasks(nodesOfConcern.toArray(String[]::new))
.setDetailed(true)
.setWaitForCompletion(true)
.setActions(modelId)
Member

I think this bug was added in: #81259

The actions used to contain the model id. Regardless, the new stopping path is much cleaner.
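For illustration only, a sketch of how that tasks-API wait is meant to be filtered: setActions matches the running task's action name, so passing just the model id matches nothing and the wait returns immediately. MODEL_TASK_ACTION and waitListener below are placeholders, and this PR sidesteps the issue by waiting on the task's own stop listener instead.

```java
// Illustrative only: the filter must be the deployment task's action name,
// not the model id, for setWaitForCompletion(true) to have anything to wait on.
client.admin()
    .cluster()
    .prepareListTasks(nodesOfConcern.toArray(String[]::new))
    .setDetailed(true)
    .setWaitForCompletion(true)
    .setActions(MODEL_TASK_ACTION)   // placeholder for the real task action constant
    .execute(waitListener);          // waitListener: an assumed ActionListener<ListTasksResponse>
```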

davidkyle changed the title from "[ML] Wait for model process to be stop in stop deployment" to "[ML] Wait for model process to stop in stop deployment" on Feb 8, 2022
elasticsearchmachine (Collaborator)

💚 Backport successful

Branch: 8.1

weizijun added a commit to weizijun/elasticsearch that referenced this pull request Feb 9, 2022
* upstream/master: (166 commits)
  Bind host all instead of just _site_ when needed (elastic#83145)
  [DOCS] Fix min/max agg snippets for histograms (elastic#83695)
  [DOCS] Add deprecation notice for system indices (elastic#83688)
  Cache ILM policy name on IndexMetadata (elastic#83603)
  [DOCS] Fix 8.0 breaking changes sort order (elastic#83685)
  [ML] fix random sampling background query consistency (elastic#83676)
  Move internal APIs into their own namespace '_internal'
  Runtime fields core-with-mapped tests support tsdb (elastic#83577)
  Optimize calculating the presence of a quorum (elastic#83638)
  Use switch expressions in EnableAllocationDecider and NodeShutdownAllocationDecider (elastic#83641)
  Note libffi error message in tmpdir docs (elastic#83662)
  Fix TransportDesiredNodesActionsIT batch tests (elastic#83659)
  [DOCS] Remove unused upgrade doc files (elastic#83617)
  [ML] Wait for model process to stop in stop deployment (elastic#83644)
  [ML] Fix submit after shutdown in process worker service (elastic#83645)
  Remove req/resp classes associated with HLRC (elastic#83599)
  Introduce index.version.compatibility setting (elastic#83264)
  Rename InternalTestCluster#getMasterNodeInstance (elastic#83407)
  Mute TimeSeriesIndexSearcherTests testCollectInOrderAcrossSegments (elastic#83648)
  Add rollover add max_primary_shard_docs condition (elastic#80981)
  ...

# Conflicts:
#	x-pack/plugin/rollup/build.gradle
#	x-pack/plugin/rollup/src/test/java/org/elasticsearch/xpack/rollup/v2/RollupActionSingleNodeTests.java
Labels
>bug · :ml Machine learning · Team:ML (Meta label for the ML team) · v8.1.0 · v8.2.0
4 participants