[ML][Data Frame] fixing _start?force=true bug #45660

benwtrent · 2019-08-16T15:33:02Z

The following scenario currently blows up in the user's face with confusing errors:

The transform task fails and the cluster state is updated to reflect this
The node running the transform task gets shut down and the task gets re-allocated to another node
The user then tries to do _start?force=true against the transform
The transform is not fully initialized and tells the user to "try again later"
"Later" never arrives
Billions of years pass
The universe experiences heat-death

This is because we exit the initialization process early for a failed task, so its state/stats are never fully initialized and the task can never be started.

Since PR #45627, we check for the FAILED state in the start method. This allows us to fully initialize the task and rely on start to fail because of the state that was loaded when the task was created.
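
A rough sketch of that flow, under stated assumptions: `StartCheckSketch`, its `TaskState` enum, and the exception messages are hypothetical stand-ins, not the real DataFrameTransformTask code. It assumes initialization always completes, even when the persisted state is FAILED, and that `start` is then the place where a FAILED task is rejected unless `force` is set.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified stand-in for the transform task (not the real class).
class StartCheckSketch {
    enum TaskState { STOPPED, STARTED, FAILED }

    private final AtomicReference<TaskState> taskState;
    private boolean initialized;

    // The persisted state is loaded when the task is created.
    StartCheckSketch(TaskState persistedState) {
        this.taskState = new AtomicReference<>(persistedState);
    }

    // Assumption: initialization no longer bails out early for a FAILED task.
    synchronized void initialize() {
        initialized = true;
    }

    // start is where the FAILED check lives; force is assumed to bypass it.
    synchronized void start(boolean force) {
        if (!initialized) {
            throw new IllegalStateException("not fully initialized; try again later");
        }
        if (taskState.get() == TaskState.FAILED && !force) {
            throw new IllegalStateException("task is FAILED; use force to start");
        }
        taskState.set(TaskState.STARTED);
    }

    TaskState getState() {
        return taskState.get();
    }

    public static void main(String[] args) {
        StartCheckSketch task = new StartCheckSketch(TaskState.FAILED);
        task.initialize();   // completes even though the loaded state is FAILED
        task.start(true);    // force start is accepted once initialized
        System.out.println(task.getState());
    }
}
```

With this shape, "try again later" is only returned while initialization is genuinely still pending, so a re-allocated failed task cannot get stuck behind it.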

elasticmachine · 2019-08-16T15:33:05Z

Pinging @elastic/ml-core

benwtrent · 2019-08-16T15:38:13Z

...frame/src/main/java/org/elasticsearch/xpack/dataframe/transforms/DataFrameTransformTask.java

@@ -235,7 +235,7 @@ public long getInProgressCheckpoint() {
        }
    }

-    public void setTaskStateStopped() {
+    public synchronized void setTaskStateStopped() {


I made this synchronized as start, stop, and markAsFailed are all synchronized, and we do not want the task state flipping between states while we are attempting to transition it.

doSaveState calls setTaskStateStopped, and since synchronized methods use reentrant locks, stop calling doSaveState directly in certain scenarios should not be a problem.
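
To illustrate the reentrancy point, here is a small sketch. `ReentrantSyncSketch` and its string-valued state are hypothetical simplifications; only the method names mirror the ones discussed above. A thread that already holds the object's monitor inside `stop` can enter the synchronized `setTaskStateStopped` on the same object without deadlocking.

```java
// Hypothetical, simplified stand-in; only the method names mirror the real task.
class ReentrantSyncSketch {
    private String state = "STARTED";

    // Synchronized so the state cannot flip while start/stop/markAsFailed run.
    public synchronized void setTaskStateStopped() {
        state = "STOPPED";
    }

    // Not synchronized itself; it ends up calling the synchronized setter.
    void doSaveState() {
        setTaskStateStopped();
    }

    // Holds the monitor, then re-enters it through doSaveState -> setTaskStateStopped.
    public synchronized void stop() {
        doSaveState();
    }

    public synchronized String getState() {
        return state;
    }

    public static void main(String[] args) {
        ReentrantSyncSketch task = new ReentrantSyncSketch();
        task.stop(); // returns normally: intrinsic locks are reentrant for the owning thread
        System.out.println(task.getState());
    }
}
```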

benwtrent · 2019-08-16T15:39:23Z

...frame/src/main/java/org/elasticsearch/xpack/dataframe/transforms/DataFrameTransformTask.java

+            // fully initialized.
+            // If we are NOT failed, then we can assume that `start` was just called early in the process.
+            String msg = taskState.get() == DataFrameTransformTaskState.FAILED ?
+                "It failed during the initialization process; force stop to allow reinitialization." :


If we failed before even being fully initialized, there was probably something REALLY wrong with the cluster at the time. We should indicate to the user that _stop?force=true needs to be called before attempting to start again.

benwtrent · 2019-08-16T15:40:27Z

...frame/src/main/java/org/elasticsearch/xpack/dataframe/transforms/DataFrameTransformTask.java

@@ -409,6 +417,13 @@ void persistStateToClusterState(DataFrameTransformState state,
    }

    synchronized void markAsFailed(String reason, ActionListener<Void> listener) {
+        // If we are already flagged as failed, this probably means that a second trigger started firing while we were attempting to
+        // flag the previously triggered indexer as failed. Exit early as we are already flagged as failed.
+        if (taskState.get() == DataFrameTransformTaskState.FAILED) {


The only reason to move this up earlier in the stack is to prevent needless checks. Additionally, we don't want to accidentally return weird error messages, as a task could potentially (however rarely) be FAILED while the indexer is in the STOPPED or STOPPING state.
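
A minimal sketch of the early-exit guard, assuming a stripped-down task: `MarkAsFailedSketch`, the `persistCount` counter, and the absence of a listener are hypothetical simplifications and not the real markAsFailed signature. The point is that a second trigger arriving after the task is already FAILED returns immediately and leaves the original failure reason untouched.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified stand-in for the task's failure handling.
class MarkAsFailedSketch {
    enum TaskState { STARTED, STOPPED, FAILED }

    private final AtomicReference<TaskState> taskState = new AtomicReference<>(TaskState.STARTED);
    private final AtomicReference<String> stateReason = new AtomicReference<>();
    private int persistCount; // stands in for persisting state to the cluster

    synchronized void markAsFailed(String reason) {
        // Already flagged as failed: a second trigger fired while the first
        // failure was being recorded. Exit before any further checks.
        if (taskState.get() == TaskState.FAILED) {
            return;
        }
        taskState.set(TaskState.FAILED);
        stateReason.set(reason);
        persistCount++;
    }

    TaskState getState() { return taskState.get(); }
    String getReason() { return stateReason.get(); }
    synchronized int getPersistCount() { return persistCount; }

    public static void main(String[] args) {
        MarkAsFailedSketch task = new MarkAsFailedSketch();
        task.markAsFailed("first failure");
        task.markAsFailed("second trigger"); // ignored by the guard
        System.out.println(task.getReason());
    }
}
```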

benwtrent · 2019-08-16T15:40:51Z

...frame/src/main/java/org/elasticsearch/xpack/dataframe/transforms/DataFrameTransformTask.java

        taskState.set(DataFrameTransformTaskState.FAILED);
        stateReason.set(reason);
+        DataFrameTransformState newState = getState();


This is safe to call like this now, since setTaskStateStopped is synchronized.

benwtrent · 2019-08-16T16:47:00Z

run elasticsearch-ci/2

davidkyle

LGTM

* [ML][Data Frame] fixing _start?force=true bug * removing unused import * removing old TODO

[ML][Data Frame] fixing _start?force=true bug

f339d96

benwtrent added >bug v8.0.0 :ml/Transform Transform v7.4.0 labels Aug 16, 2019

benwtrent added 2 commits August 16, 2019 10:46

removing unused import

21fb66f

removing old TODO

31972b5

Merge branch 'master' into feature/ml-df-fix-force_start-state-failure

3009494

davidkyle approved these changes Aug 19, 2019

benwtrent merged commit ce6106f into elastic:master Aug 20, 2019

benwtrent deleted the feature/ml-df-fix-force_start-state-failure branch August 20, 2019 12:52

benwtrent mentioned this pull request Aug 20, 2019

[7.x] [ML][Data Frame] fixing _start?force=true bug (#45660) #45734

Merged

benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Aug 20, 2019

[ML][Data Frame] fixing _start?force=true bug (elastic#45660)

15302ff

* [ML][Data Frame] fixing _start?force=true bug * removing unused import * removing old TODO

benwtrent added a commit that referenced this pull request Aug 20, 2019

[ML][Data Frame] fixing _start?force=true bug (#45660) (#45734)

43bb592

* [ML][Data Frame] fixing _start?force=true bug * removing unused import * removing old TODO

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
