[ML][Data Frame] fixing _start?force=true bug #45660
Conversation
Pinging @elastic/ml-core
```diff
@@ -235,7 +235,7 @@ public long getInProgressCheckpoint() {
         }
     }

-    public void setTaskStateStopped() {
+    public synchronized void setTaskStateStopped() {
```
I made this `synchronized` because `start`, `stop`, and `markAsFailed` are all `synchronized`, and we do not want the task state flipping between states while we are attempting to transition it. `doSaveState` calls `setTaskStateStopped`, and since `synchronized` methods use reentrant locks, `stop` calling `doSaveState` directly in certain scenarios should not be a problem.
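For illustration, here is a minimal self-contained sketch of the reentrancy being relied on; the class and field names are hypothetical stand-ins, not the actual `DataFrameTransformTask` code:

```java
// Hypothetical sketch of the locking shape described above.
public class ReentrancySketch {

    private volatile String taskState = "STARTED";

    // stop() and setTaskStateStopped() both synchronize on `this`.
    public synchronized void stop() {
        doSaveState();                  // called while already holding the monitor
    }

    void doSaveState() {
        // ... persist state ...
        setTaskStateStopped();          // re-acquires the monitor the calling thread already holds
    }

    public synchronized void setTaskStateStopped() {
        taskState = "STOPPED";          // no deadlock: intrinsic locks are reentrant
    }

    public static void main(String[] args) {
        ReentrancySketch task = new ReentrancySketch();
        task.stop();
        System.out.println(task.taskState);   // prints STOPPED
    }
}
```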
```java
            // fully initialized.
            // If we are NOT failed, then we can assume that `start` was just called early in the process.
            String msg = taskState.get() == DataFrameTransformTaskState.FAILED ?
                "It failed during the initialization process; force stop to allow reinitialization." :
```
If we failed before even being fully initialized, there was probably something REALLY wrong with the cluster at the time. We should indicate to the user that `_stop?force=true` needs to be called before attempting to start again.
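As a self-contained sketch of the message selection shown in the diff above: the FAILED-branch text comes from the diff, while the enum and the wording of the non-failed branch are stand-ins for illustration only.

```java
// Sketch of choosing the user-facing reason based on whether the task is FAILED.
public class StartMessageSketch {

    enum TaskState { STARTED, STOPPED, FAILED }

    static String startRefusalMessage(TaskState taskState) {
        return taskState == TaskState.FAILED
            ? "It failed during the initialization process; force stop to allow reinitialization."
            : "The task has not been fully initialized yet; wait and try starting it again.";  // assumed wording
    }

    public static void main(String[] args) {
        System.out.println(startRefusalMessage(TaskState.FAILED));
        System.out.println(startRefusalMessage(TaskState.STARTED));
    }
}
```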
```diff
@@ -409,6 +417,13 @@ void persistStateToClusterState(DataFrameTransformState state,
     }

+    synchronized void markAsFailed(String reason, ActionListener<Void> listener) {
+        // If we are already flagged as failed, this probably means that a second trigger started firing while we were attempting to
+        // flag the previously triggered indexer as failed. Exit early as we are already flagged as failed.
+        if (taskState.get() == DataFrameTransformTaskState.FAILED) {
```
The only reason to move this up earlier in the stack is to prevent needless checks. Additionally, we don't want to accidentally return weird error messages, as a task could potentially (however rarely) be `FAILED` while the indexer is in a `STOPPED` or `STOPPING` state.
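To make the early-exit guard concrete, here is a hypothetical, simplified sketch of its shape; the real method takes an `ActionListener<Void>` and persists state to the cluster, which is omitted here.

```java
import java.util.function.Consumer;

// Simplified stand-in for the markAsFailed early-exit guard discussed above.
public class MarkAsFailedSketch {

    enum TaskState { STARTED, FAILED }

    private volatile TaskState taskState = TaskState.STARTED;
    private volatile String stateReason = null;

    synchronized void markAsFailed(String reason, Consumer<Void> onDone) {
        // Already failed, e.g. a second trigger fired while the first failure was still being handled:
        // exit early instead of re-running the checks below and risking a misleading error message.
        if (taskState == TaskState.FAILED) {
            onDone.accept(null);
            return;
        }
        taskState = TaskState.FAILED;
        stateReason = reason;
        // ... audit the failure and persist the new state before responding ...
        onDone.accept(null);
    }

    public static void main(String[] args) {
        MarkAsFailedSketch task = new MarkAsFailedSketch();
        task.markAsFailed("irrecoverable failure", v -> System.out.println("failure handled"));
        task.markAsFailed("duplicate trigger", v -> System.out.println("early exit"));
    }
}
```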
```java
        taskState.set(DataFrameTransformTaskState.FAILED);
        stateReason.set(reason);
        DataFrameTransformState newState = getState();
```
This is safe to call like this now, since `setTaskStateStopped` is `synchronized`.
run elasticsearch-ci/2
LGTM
* [ML][Data Frame] fixing _start?force=true bug
* removing unused import
* removing old TODO
The following scenario currently blows up in the user's face with confusing errors: calling `_start?force=true` against the transform. This is because we exit the process early and don't allow state/stats to be fully initialized in a failed task so that it can be started.

Since PR #45627 we now check for the `FAILED` state in the `start` method. This allows us to fully initialize the task and rely on `start` to fail due to the previously loaded state from its creation.
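A hypothetical, simplified sketch of the start-time guard the description refers to; the real check lives in the transform task's start handling and reports the error through an `ActionListener` rather than by throwing.

```java
// Stand-in for rejecting start on an already-FAILED task once it is fully initialized.
public class StartGuardSketch {

    enum TaskState { STOPPED, STARTED, FAILED }

    private volatile TaskState taskState = TaskState.FAILED;
    private volatile String stateReason = "task encountered irrecoverable failure";

    synchronized void start() {
        if (taskState == TaskState.FAILED) {
            // Because the task is fully initialized, we can give a precise error instead of a confusing one.
            throw new IllegalStateException("Unable to start transform; task is in a failed state with reason ["
                + stateReason + "]. Use _stop?force=true to stop it before restarting.");
        }
        taskState = TaskState.STARTED;
    }

    public static void main(String[] args) {
        StartGuardSketch task = new StartGuardSketch();
        try {
            task.start();
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```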