
[ML] Adding new trained model allocation service #75778


Merged

Conversation

benwtrent (Member)

Adds a new service for trained model allocation to nodes.

Initially, this only supports PyTorch models and simply allocates them to
nodes with ML roles.

The design is fairly simple:

  • A master node service allows new allocations to be created, updated, and deleted in cluster state.
  • A node service listens for cluster state updates that reference the local node and any models allocated to it, and updates its local deployments accordingly.

This type of service splits the difference between the logic of shard allocation and persistent tasks; neither fully addressed the need here.
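
For orientation, here is a minimal sketch of the node-side half (hypothetical class name, not the PR's actual code): a ClusterStateListener registered with the ClusterService that reconciles local deployments against the allocation metadata referencing the local node.

import org.elasticsearch.cluster.ClusterChangedEvent;
import org.elasticsearch.cluster.ClusterStateListener;

// Illustration only: the real TrainedModelAllocationNodeService also manages tasks and
// reports routing state back to the master, but the registration/reaction shape is this.
class LocalNodeAllocationListener implements ClusterStateListener {

    @Override
    public void clusterChanged(ClusterChangedEvent event) {
        String localNodeId = event.state().nodes().getLocalNodeId();
        // 1. Read the trained model allocation metadata from event.state().
        // 2. Find routes that reference localNodeId.
        // 3. Start models that should be allocated here but are not loaded yet,
        //    and stop models that are loaded but no longer routed to this node.
    }
}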

@elasticmachine added the Team:ML (Meta label for the ML team) label on Jul 28, 2021
@elasticmachine (Collaborator):

Pinging @elastic/ml-core (Team:ML)

Comment on lines +64 to +67
"\"transient\" : {\n" +
" \"logger.org.elasticsearch.xpack.ml.inference.allocation\" : \"TRACE\",\n" +
" \"logger.org.elasticsearch.xpack.ml.inference.deployment\" : \"TRACE\"\n" +
" }" +
benwtrent (Member Author):

This is nice and I think it should be kept for now, as it has uncovered some weird conditions while running.

Comment on lines +57 to +66
this.trainedModelAllocationClusterService = trainedModelAllocationClusterService;
// Here we create our singleton for the node service
clusterService.addListener(
new TrainedModelAllocationNodeService(
trainedModelAllocationService,
clusterService,
deploymentManager,
transportService.getTaskManager(),
threadPool
)
benwtrent (Member Author):

This is pretty much what persistent tasks do. It's OK because transport classes are singletons and are created on every node where the plugin is loaded.

Comment on lines +67 to +70
// TODO Do better routing for inference calls
int nodeIndex = Randomness.get().nextInt(randomRunningNode.length);
request.setNodes(randomRunningNode[nodeIndex]);
super.doExecute(task, request, listener);
benwtrent (Member Author):

Because we still use tasks and the TrainedModelDeploymentTask object, all we need to do is set the node we care about and the internal routing does the rest (filtering to the correct node, finding the right task, allowing us to infer against it).

Contributor:

Nice!

Member:

++ I like how simple this has turned out

Comment on lines +264 to +265
if (RoutingState.FAILED.equals(nodeIdAndState.getValue().getState())) {
nodeFailuresAndReasons.put(nodeIdAndState.getKey(), nodeIdAndState.getValue().getReason());
benwtrent (Member Author):

If any fail, they all fail. This can easily be changed later; we may want to support "partial allocations".

Contributor:

It seems to me that it would be beneficial to our users to allow flexibility here and support partial allocations. But I'm happy to deal with this after this PR is merged. Let's raise an issue though so we don't forget.

Comment on lines +379 to +380
// TODO: Do we want to remove from the modelIdToTask map? This would cause it to be reloaded by state updates on INITIALIZING
modelIdToTask.remove(task.getModelId());
benwtrent (Member Author):

If the node state in the cluster state somehow got set back to INITIALIZING (which currently can only happen by recreating the allocation or deleting the route), we would start the model again, which may be exactly what we want.

Comment on lines +71 to +82
client.execute(UpdateTrainedModelAllocationStateAction.INSTANCE, request, ActionListener.wrap(listener::onResponse, failure -> {
if (isMasterChannelException(failure)) {
logger.info(
"[{}] master channel exception will retry on new master node for allocation state update [{}]",
request.getModelId(),
request.getRoutingState().getState()
);
waitForNewMasterAndRetry(observer, UpdateTrainedModelAllocationStateAction.INSTANCE, request, listener, changePredicate);
return;
}
listener.onFailure(failure);
}));
benwtrent (Member Author):

For routing table updates, this is important. We don't want a spurious failure to leave the routing table stale and cause inference issues. So, if the failure is due to an intermittent master node communication issue, we should retry.
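
For reference, the waitForNewMasterAndRetry pattern referenced above generally takes the shape sketched below. This is modelled on the persistent tasks service rather than copied from this PR; the client and clusterService fields and the usual imports are assumed.

private <Req extends ActionRequest, Resp extends ActionResponse> void waitForNewMasterAndRetry(
    ClusterStateObserver observer,
    ActionType<Resp> action,
    Req request,
    ActionListener<Resp> listener,
    Predicate<ClusterState> changePredicate
) {
    observer.waitForNextChange(new ClusterStateObserver.Listener() {
        @Override
        public void onNewClusterState(ClusterState state) {
            // a new master has been elected; re-send the request
            client.execute(action, request, listener);
        }

        @Override
        public void onClusterServiceClose() {
            listener.onFailure(new NodeClosedException(clusterService.localNode()));
        }

        @Override
        public void onTimeout(TimeValue timeout) {
            // should not happen when no timeout is set on the observer
            listener.onFailure(new IllegalStateException("timed out waiting for a new master node"));
        }
    }, changePredicate);
}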

@@ -176,7 +168,7 @@ public void stopDeployment(TrainedModelDeploymentTask task) {
public void infer(TrainedModelDeploymentTask task,
String input, TimeValue timeout,
ActionListener<InferenceResults> listener) {
ProcessContext processContext = processContextByAllocation.get(task.getAllocationId());
ProcessContext processContext = processContextByAllocation.get(task.getId());
benwtrent (Member Author):

Task IDs are monotonically increasing for the lifetime of the TaskManager, so this is safe.

@dimitris-athanasiou (Contributor) left a comment:

Posting these comments so they don't get lost. I'm halfway through, so I'll come back with more. Also, some of the comments I've made might make no sense once I understand this fully, so bear with me.

);
}

private final StartTrainedModelDeploymentAction.TaskParams taskParams;
Contributor:

I wonder if it makes sense to keep these as TaskParams. I would consider getting rid of that object and having the model ID, index, and model bytes as flat fields in this class.

Contributor:

Alternatively, those could be renamed to ModelParams or something like that, if we intend to capture information about the model into a single object.

benwtrent (Member Author):

FWIW, these params are still used to create a Task object. Though I do think bringing them out of StartTrainedModelDeploymentAction makes sense.
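
As a purely illustrative sketch of the alternative being discussed (the ModelParams name comes from the suggestion above; everything else is assumed):

// Hypothetical flat params holder instead of reusing the action's TaskParams.
public class ModelParams {
    private final String modelId;
    private final String index;
    private final long modelBytes;

    public ModelParams(String modelId, String index, long modelBytes) {
        this.modelId = modelId;
        this.index = index;
        this.modelBytes = modelBytes;
    }
}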




@Override
public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
logger.trace("updated model allocations based on node changes in the cluster");
Contributor:

Shall we also print the new routing table here? I think it would be useful in debug.

Comment on lines +86 to +90
// TODO this has a weird side-effect for allocating to nodes
// If the event indicates there were nodes added/removed, this method only looks at the current state and has
// no previous knowledge of existing nodes. Consequently, if a model was manually removed (task-kill) from a node
// it may get re-allocated to that node when another node is added/removed...
return addRemoveAllocationNodes(currentState);
Contributor:

I don't think we should worry about this for now. If a user manually cancels a task on a node, we can only expect they've done this during troubleshooting with our support and guidance. The idea of not then allowing the model to be allocated back to that node is a new feature that would cover the requirement to "manually designate suitable nodes for a model to be allocated on". We do not have that requirement currently.

@davidkyle (Member) left a comment:

The design is good and I can see how advanced features would be built on this.
In many ways it is a simplification, as a lot of code is removed.

this.taskParams = taskParams;
}

public Builder addNewRoutingEntry(String nodeId) {
Member:

I don't understand how you can get to FAILED without going through INITIALIZING first.


PersistentTasksCustomMetadata.Assignment assignment = persistentTask.getAssignment();

String reason = "__unknown__";
final Set<Map.Entry<String, RoutingStateAndReason>> nodesAndState = trainedModelAllocation
Member:

This and the logic below to get failed nodes could be a member method on TrainedModelAllocation. Easier to test and reuse.

benwtrent (Member Author):

I couldn't figure out a nice way to get them both in one loop (with routing reasons intact).
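
For concreteness, the kind of member method being suggested could look roughly like the sketch below. It is hypothetical, and assumes TrainedModelAllocation exposes the node-id to RoutingStateAndReason map via a getNodeRoutingTable() accessor.

// Hypothetical helper on TrainedModelAllocation, not code from this PR.
public Map<String, String> failedRoutesByNodeId() {
    Map<String, String> failures = new HashMap<>();
    for (Map.Entry<String, RoutingStateAndReason> routeEntry : getNodeRoutingTable().entrySet()) {
        if (RoutingState.FAILED.equals(routeEntry.getValue().getState())) {
            failures.put(routeEntry.getKey(), routeEntry.getValue().getReason());
        }
    }
    return failures;
}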

Comment on lines +86 to +90
// TODO this has a weird side-effect for allocating to nodes
// If the event indicates there were nodes added/removed, this method only looks at the current state and has
// no previous knowledge of existing nodes. Consequently, if a model was manually removed (task-kill) from a node
// it may get re-allocated to that node when another node is added/removed...
return addRemoveAllocationNodes(currentState);
Member:

I would argue this 'works as intended', at least while allocations are all nodes or none.

@benwtrent (Member Author):

@elasticmachine update branch

@@ -101,7 +102,17 @@ protected void doExecute(Task task, StopTrainedModelDeploymentAction.Request req
listener.onResponse(new StopTrainedModelDeploymentAction.Response(true));
return;
}
normalUndeploy(task, models.get(0).getModelId(), maybeAllocation.get(), request, listener);
final String modelId = models.get(0).getModelId();
trainedModelAllocationService.stopModelAllocation(modelId, ActionListener.wrap(
Contributor:

Instead of adding a whole new action for this, would it be an option to call the update allocation action and update its state to stopping through that?

benwtrent (Member Author):

> would it be an option to call the update allocation action and update its state to stopping through that?

Maybe, but there is no "update allocation" action. The only updates that occur throughout the lifetime of the allocation are route updates.

Contributor:

This is the one I meant: UpdateTrainedModelAllocationStateAction.

@benwtrent (Member Author), Aug 2, 2021:

> UpdateTrainedModelAllocationStateAction

The focus of that request is only routes. Possibly I should rename it, but right now that action is only requested by the node service to update a route in the cluster service.

benwtrent (Member Author):

I updated the deployment stop action to update the allocation state to stopping

@@ -249,6 +253,10 @@ static ClusterState updateModelRoutingTable(ClusterState currentState, UpdateTra
if (existingAllocation == null) {
throw new ResourceNotFoundException("allocation for model with id [" + modelId + "] not found");
}
// If we are stopping, don't update anything
if (existingAllocation.getAllocationState().equals(AllocationState.STOPPING)) {
return currentState;
Contributor:

A debug log statement might be helpful here.

benwtrent (Member Author):

👍
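
A rough sketch of what the agreed debug statement could look like (wording hypothetical; logger is assumed to be the class's logger):

// If we are stopping, don't update anything
if (existingAllocation.getAllocationState().equals(AllocationState.STOPPING)) {
    logger.debug(() -> new ParameterizedMessage(
        "[{}] requested routing update [{}] while allocation is stopping; ignoring update",
        modelId,
        request.getRoutingState()));
    return currentState;
}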

…com:benwtrent/elasticsearch into feature/ml-trained-model-allocation-service
@@ -152,7 +152,7 @@ public void clusterStateProcessed(String source, ClusterState oldState, ClusterS
});
}

public void stopModelAllocation(String modelId, ActionListener<AcknowledgedResponse> listener) {
public void setModelAllocationToStopping(String modelId, ActionListener<AcknowledgedResponse> listener) {
clusterService.submitStateUpdateTask("stop model allocation", new ClusterStateUpdateTask() {
Contributor:

Also update the name of the source

@dimitris-athanasiou (Contributor) left a comment:

LGTM 🚀

*/
static boolean isNodeShuttingDown(final ClusterState state, final String nodeId) {
// Right now we make no distinction between the type of shutdown, but maybe in the future we might?
return NodesShutdownMetadata.getShutdowns(state)
Member:

nit: sometimes this method is called in a loop; it would be worth locally caching the result of getAllNodeMetadataMap().
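
A sketch of the caching being suggested, from a hypothetical caller; the accessor names come from the code and comment above, and the value type is assumed.

// Fetch the shutdown map once per cluster state instead of once per node.
Map<String, SingleNodeShutdownMetadata> shutdowns = Optional.ofNullable(NodesShutdownMetadata.getShutdowns(state))
    .map(NodesShutdownMetadata::getAllNodeMetadataMap)
    .orElse(Collections.emptyMap());
for (DiscoveryNode node : state.nodes()) {
    boolean shuttingDown = shutdowns.containsKey(node.getId());
    // ... use shuttingDown when deciding whether the node can receive new model routes ...
}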

TrainedModelAllocation.Builder allocation = modelRoutingEntries.get(modelId);
if (allocation == null) {
throw new ResourceNotFoundException(
"unable to add node [{}] to model [{}] routing table as allocation does not exist",
Member:

Suggested change:
- "unable to add node [{}] to model [{}] routing table as allocation does not exist",
+ "unable to add failed node [{}] to model [{}] routing table as allocation does not exist",

private TaskAwareRequest taskAwareRequest(StartTrainedModelDeploymentAction.TaskParams params) {
final TrainedModelAllocationNodeService trainedModelAllocationNodeService = this;
return new TaskAwareRequest() {
final TaskId parentTaskId = new TaskId(nodeId, taskIdGenerator.incrementAndGet());
Member:

What is the parent task? Why not use TaskId#EMPTY_TASK_ID, as suggested in the comment in TaskAwareRequest.java?

observer.waitForNextChange(new ClusterStateObserver.Listener() {
@Override
public void onNewClusterState(ClusterState state) {
listener.onResponse(TrainedModelAllocationMetadata.allocationForModelId(clusterState, modelId).orElse(null));
Member:

Suggested change:
- listener.onResponse(TrainedModelAllocationMetadata.allocationForModelId(clusterState, modelId).orElse(null));
+ listener.onResponse(TrainedModelAllocationMetadata.allocationForModelId(state, modelId).orElse(null));

Shouldn't the cluster state from the method parameter be used here?

@benwtrent requested a review from davidkyle on August 3, 2021 13:57
@davidkyle (Member) left a comment:

LGTM2 :shipit:

@benwtrent merged commit b11c15b into elastic:master on Aug 3, 2021
@benwtrent deleted the feature/ml-trained-model-allocation-service branch on August 3, 2021 17:06
benwtrent added a commit that referenced this pull request Aug 4, 2021
Trained model deployment memory usage is no longer determinable via persistent tasks.

The new way is to look into the trained model allocation metadata.

This PR updates this and removes some unused code.

relates: #75778
Labels: :ml (Machine learning), >non-issue, Team:ML (Meta label for the ML team), v8.0.0-alpha1