[ML] Wait for model process to stop in stop deployment #83644

Merged · 3 commits · Feb 8, 2022

Conversation

davidkyle (Member)

When stopping a deployment, the tasks API was used to wait for the model task to finish, but the action used in the request did not match the model task's action, so the request returned success without waiting. The code then continued and deleted the model allocation. The node running the task would notice the deleted allocation and, if the task had not yet stopped, stop it again, resulting in a double stop and an error being logged because that is unexpected. The bug is minor and only manifests in the log files; the stop deployment request still succeeds.

The fix is to add a listener to the task stop API that responds only after the task is stopped.
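A minimal sketch of that listener-based stop, assuming the pattern shown in the diff further down; the enclosing stopAndWait method and the import paths for the x-pack classes are illustrative, not part of this PR's actual change.

```java
import org.elasticsearch.action.ActionListener;
// Assumed package paths for the x-pack classes referenced in the diff below.
import org.elasticsearch.xpack.core.ml.action.StopTrainedModelDeploymentAction;
import org.elasticsearch.xpack.ml.inference.deployment.TrainedModelDeploymentTask;

class StopDeploymentSketch {
    // Respond to the stop request only after the task reports it has actually
    // stopped, instead of responding immediately and letting the allocation
    // deletion race with the still-running task.
    void stopAndWait(TrainedModelDeploymentTask task, ActionListener<StopTrainedModelDeploymentAction.Response> listener) {
        task.stop(
            "undeploy_trained_model (api)",
            ActionListener.wrap(
                r -> listener.onResponse(new StopTrainedModelDeploymentAction.Response(true)), // task has stopped
                listener::onFailure                                                            // propagate stop errors
            )
        );
    }
}
```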

elasticmachine added the Team:ML (Meta label for the ML team) label on Feb 8, 2022
elasticmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

elasticsearchmachine (Collaborator)

Hi @davidkyle, I've created a changelog YAML for you.

benwtrent self-requested a review on February 8, 2022, 12:19
@@ -80,15 +82,11 @@ public TaskParams getParams() {
return params;
}

public void stop(String reason) {
logger.debug("[{}] Stopping due to reason [{}]", getModelId(), reason);
licensedFeature.stopTracking(licenseState, "model-" + params.getModelId());
Member

I think this still needs to be called. If you are concerned, maybe wrap the listener and call this on response/failure?
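One possible shape of that wrapping, as a rough sketch; whether stopTracking should be called here at all is what the rest of this thread settles, and the names come from the diff above.

```java
// Rough sketch of "wrap the listener and call this on response/failure":
// ActionListener.runBefore runs the callback before delegating to either
// onResponse or onFailure, so license tracking stops on both paths.
ActionListener<Void> wrapped = ActionListener.runBefore(
    listener,
    () -> licensedFeature.stopTracking(licenseState, "model-" + params.getModelId())
);
```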

davidkyle (Member Author)

Yeah, I'm not happy with this pattern. The task asks the node service to stop, then the node service calls back to TrainedModelDeploymentTask::markAsStopped from stopDeploymentAsync. This means markAsStopped is a public method, which doesn't make much sense as part of the public API.

The problem is that there are a few ways TrainedModelAllocationNodeService::stopDeploymentAsync can be called, such as the service noticing the deployment has been deleted or the task being cancelled.

I deleted these lines because markAsStopped (formerly stopWithoutNotification) was being called anyway and the same work occurs there.

Member

I deleted these lines because markAsStopped (formerly stopWithoutNotification) was being called anyway and the same work occurs there.

Gotcha, we just need to be careful and make sure that stopTracking is being called.

listener.onResponse(new StopTrainedModelDeploymentAction.Response(true));
task.stop(
"undeploy_trained_model (api)",
ActionListener.wrap(r -> listener.onResponse(new StopTrainedModelDeploymentAction.Response(true)), listener::onFailure)
Member

This is much cleaner and the execution path is easier to read.

.prepareListTasks(nodesOfConcern.toArray(String[]::new))
.setDetailed(true)
.setWaitForCompletion(true)
.setActions(modelId)
Member

I think this bug was added in: #81259

The actions used to contain the model id. Regardless, the new stopping path is much cleaner.
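For illustration only, a sketch of how that tasks-API wait is meant to be filtered: setActions matches the running task's action name, so passing just the model id matches nothing and the wait returns immediately. MODEL_TASK_ACTION and waitListener below are placeholders, and this PR sidesteps the issue by waiting on the task's own stop listener instead.

```java
// Illustrative only: the filter must be the deployment task's action name,
// not the model id, for setWaitForCompletion(true) to have anything to wait on.
client.admin()
    .cluster()
    .prepareListTasks(nodesOfConcern.toArray(String[]::new))
    .setDetailed(true)
    .setWaitForCompletion(true)
    .setActions(MODEL_TASK_ACTION)   // placeholder for the real task action constant
    .execute(waitListener);          // waitListener: an assumed ActionListener<ListTasksResponse>
```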

davidkyle changed the title from "[ML] Wait for model process to be stop in stop deployment" to "[ML] Wait for model process to stop in stop deployment" on Feb 8, 2022
elasticsearchmachine (Collaborator)

💚 Backport successful

Branch: 8.1

weizijun added a commit to weizijun/elasticsearch that referenced this pull request Feb 9, 2022
* upstream/master: (166 commits)
  Bind host all instead of just _site_ when needed (elastic#83145)
  [DOCS] Fix min/max agg snippets for histograms (elastic#83695)
  [DOCS] Add deprecation notice for system indices (elastic#83688)
  Cache ILM policy name on IndexMetadata (elastic#83603)
  [DOCS] Fix 8.0 breaking changes sort order (elastic#83685)
  [ML] fix random sampling background query consistency (elastic#83676)
  Move internal APIs into their own namespace '_internal'
  Runtime fields core-with-mapped tests support tsdb (elastic#83577)
  Optimize calculating the presence of a quorum (elastic#83638)
  Use switch expressions in EnableAllocationDecider and NodeShutdownAllocationDecider (elastic#83641)
  Note libffi error message in tmpdir docs (elastic#83662)
  Fix TransportDesiredNodesActionsIT batch tests (elastic#83659)
  [DOCS] Remove unused upgrade doc files (elastic#83617)
  [ML] Wait for model process to stop in stop deployment (elastic#83644)
  [ML] Fix submit after shutdown in process worker service (elastic#83645)
  Remove req/resp classes associated with HLRC (elastic#83599)
  Introduce index.version.compatibility setting (elastic#83264)
  Rename InternalTestCluster#getMasterNodeInstance (elastic#83407)
  Mute TimeSeriesIndexSearcherTests testCollectInOrderAcrossSegments (elastic#83648)
  Add rollover add max_primary_shard_docs condition (elastic#80981)
  ...

# Conflicts:
#	x-pack/plugin/rollup/build.gradle
#	x-pack/plugin/rollup/src/test/java/org/elasticsearch/xpack/rollup/v2/RollupActionSingleNodeTests.java
Labels
>bug · :ml Machine learning · Team:ML (Meta label for the ML team) · v8.1.0 · v8.2.0
4 participants