
[ML][Data Frame] fixing _start?force=true bug #45660


Merged


benwtrent
Member

The following scenario currently blows up in the user's face with confusing errors:

  • A transform task fails and the cluster state is updated to reflect the failure
  • The node running the transform task is shut down and the task is reallocated to another node
  • The user then tries to run _start?force=true against the transform
  • The transform is not fully initialized and tells the user to "try again later"
  • "Later" never arrives
  • Billions of years pass
  • The universe experiences heat-death

This happens because we exit the initialization process early for a failed task, so its state and stats are never fully initialized and the task can never be started.

Since PR #45627, we check for the FAILED state in the start method. This lets us fully initialize the task and rely on start to fail because of the state that was loaded when the task was created.
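To illustrate the idea, here is a minimal sketch, not the actual task class. The class `InitThenStartSketch`, its `initialize` method, and the exception messages are hypothetical; only the principle comes from the description above: initialization always completes, and `start` is the single place that rejects a task whose loaded state is FAILED.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified stand-in for the transform task; all names are illustrative.
class InitThenStartSketch {
    enum TaskState { STOPPED, STARTED, FAILED }

    private final AtomicReference<TaskState> taskState = new AtomicReference<>(TaskState.STOPPED);
    private volatile boolean initialized = false;

    // Always finish initialization, even when the persisted state is FAILED,
    // so that later calls see a fully initialized task.
    synchronized void initialize(TaskState persistedState) {
        taskState.set(persistedState);
        initialized = true;
    }

    // start() rejects a task that is not yet initialized or that was loaded as FAILED.
    synchronized void start() {
        if (!initialized) {
            throw new IllegalStateException("task is not fully initialized");
        }
        if (taskState.get() == TaskState.FAILED) {
            throw new IllegalStateException("task is in FAILED state");
        }
        taskState.set(TaskState.STARTED);
    }

    TaskState state() {
        return taskState.get();
    }
}
```

With this shape, a task reallocated in the FAILED state answers `start` with a clear failure message instead of reporting forever that it is still initializing.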

@elasticmachine
Collaborator

Pinging @elastic/ml-core

@@ -235,7 +235,7 @@ public long getInProgressCheckpoint() {
}
}

- public void setTaskStateStopped() {
+ public synchronized void setTaskStateStopped() {
Member Author

I made this synchronized because start, stop, and markAsFailed are all synchronized, and we do not want the task state flipping between states while we are attempting to transition it.

doSaveState calls setTaskStateStopped, and since synchronized methods use reentrant locks, stop calling doSaveState directly in certain scenarios should not be a problem.
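The reentrancy argument can be checked with a small sketch. The class `ReentrantLockingSketch` and its fields are hypothetical and only the method names mirror the ones discussed here. A thread that already holds an object's monitor can enter another synchronized method on the same object, so the nested calls below do not deadlock.

```java
// Hypothetical, simplified task: only the locking behaviour mirrors the discussion above.
class ReentrantLockingSketch {
    private String taskState = "STARTED";
    private int savedStates = 0;

    // Innermost synchronized method: acquires the monitor a third time.
    synchronized void setTaskStateStopped() {
        taskState = "STOPPED";
    }

    // Acquires the same monitor again while stop() already holds it.
    synchronized void doSaveState() {
        setTaskStateStopped();
        savedStates++; // stands in for persisting the state
    }

    // Outermost synchronized method: takes the monitor first.
    synchronized void stop() {
        doSaveState();
    }

    synchronized String taskState() {
        return taskState;
    }

    synchronized int savedStates() {
        return savedStates;
    }
}
```

Calling `stop()` runs through all three synchronized methods on the same thread and leaves the state as STOPPED.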

// fully initialized.
// If we are NOT failed, then we can assume that `start` was just called early in the process.
String msg = taskState.get() == DataFrameTransformTaskState.FAILED ?
"It failed during the initialization process; force stop to allow reinitialization." :
Member Author

If we failed before even being fully initialized, there was probably something REALLY wrong with the cluster at the time. We should indicate to the user that _stop?force=true needs to be called before attempting to start again.

@@ -409,6 +417,13 @@ void persistStateToClusterState(DataFrameTransformState state,
}

synchronized void markAsFailed(String reason, ActionListener<Void> listener) {
// If we are already flagged as failed, this probably means that a second trigger started firing while we were attempting to
// flag the previously triggered indexer as failed. Exit early as we are already flagged as failed.
if (taskState.get() == DataFrameTransformTaskState.FAILED) {
Member Author

The only reason to move this check earlier in the method is to prevent needless checks. Additionally, we don't want to accidentally return weird error messages, as a task could potentially (however rarely) be FAILED while the indexer is in the STOPPED or STOPPING state.
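A minimal sketch of this early-exit guard, under stated assumptions: the class `FailOnceSketch`, the `Runnable` callback (standing in for the `ActionListener<Void>`), and the `persistCount` counter are hypothetical. Whether the real method notifies the listener on the early exit is not shown in this diff, so that part is an assumption too.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified task illustrating an idempotent markAsFailed.
class FailOnceSketch {
    enum TaskState { STARTED, FAILED }

    private final AtomicReference<TaskState> taskState = new AtomicReference<>(TaskState.STARTED);
    private final AtomicReference<String> stateReason = new AtomicReference<>();
    private int persistCount = 0;

    synchronized void markAsFailed(String reason, Runnable onDone) {
        // Already failed: a second trigger fired while the first failure was being recorded.
        // Exit before any other check so the first failure reason is kept.
        if (taskState.get() == TaskState.FAILED) {
            onDone.run(); // assumption: the caller is still notified
            return;
        }
        taskState.set(TaskState.FAILED);
        stateReason.set(reason);
        persistCount++; // stands in for persisting the new state
        onDone.run();
    }

    synchronized String reason() {
        return stateReason.get();
    }

    synchronized int persistCount() {
        return persistCount;
    }
}
```

A second call with a different reason leaves both the stored reason and the persist count unchanged.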

taskState.set(DataFrameTransformTaskState.FAILED);
stateReason.set(reason);
DataFrameTransformState newState = getState();
Member Author

This is safe to call like this now, since setTaskStateStopped is synchronized.

@benwtrent
Member Author

run elasticsearch-ci/2

Member

@davidkyle davidkyle left a comment

LGTM

@benwtrent benwtrent merged commit ce6106f into elastic:master Aug 20, 2019
@benwtrent benwtrent deleted the feature/ml-df-fix-force_start-state-failure branch August 20, 2019 12:52
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Aug 20, 2019
* [ML][Data Frame] fixing _start?force=true bug

* removing unused import

* removing old TODO
benwtrent added a commit that referenced this pull request Aug 20, 2019
* [ML][Data Frame] fixing _start?force=true bug

* removing unused import

* removing old TODO