[CI] XPackRestIT test {p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled} failing #74141

Closed
droberts195 opened this issue Jun 15, 2021 · 9 comments · Fixed by #74415 or #74976
Labels: :ml Machine learning · Team:ML Meta label for the ML team · >test-failure Triaged test failures from CI

@droberts195 (Contributor)

I am pretty sure this is a side effect of #73908. Resetting jobs on assignment to a new node doesn't seem to interact reliably with ML upgrade mode.

Build scan:
https://gradle-enterprise.elastic.co/s/cu5xpsi2gbnkg/tests/:x-pack:plugin:yamlRestTest/org.elasticsearch.xpack.test.rest.XPackRestIT/test%20%7Bp0=ml%2Fset_upgrade_mode%2FSetting%20upgrade%20mode%20to%20disabled%20from%20enabled%7D

Reproduction line:
gradlew ':x-pack:plugin:yamlRestTest' --tests "org.elasticsearch.xpack.test.rest.XPackRestIT.test {p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled}" -Dtests.seed=1E2DD794B9477D7B -Dtests.locale=th-TH -Dtests.timezone=America/St_Barthelemy -Druntime.java=11

Applicable branches:
master, 7.x

Reproduces locally?:
No

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.test.rest.XPackRestIT&tests.test=test%20%7Bp0%3Dml/set_upgrade_mode/Setting%20upgrade%20mode%20to%20disabled%20from%20enabled%7D

Failure excerpt:

java.lang.AssertionError: Failure at [ml/set_upgrade_mode:53]: expected [2xx] status code but api [ml.open_job] returned [500 Internal Server Error] [{"error":{"root_cause":[{"type":"exception","reason":"Unexpected job state [failed] with reason [Failed to connect to autodetect for job set-upgrade-mode-job] while waiting for job to be opened","stack_trace":"org.elasticsearch.ElasticsearchException: Unexpected job state [failed] with reason [Failed to connect to autodetect for job set-upgrade-mode-job] while waiting for job to be opened\r\n\tat org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.serverError(ExceptionsHelper.java:61)\r\n\tat org.elasticsearch.xpack.ml.action.TransportOpenJobAction$JobPredicate.test(TransportOpenJobAction.java:321)\r\n\tat org.elasticsearch.xpack.ml.action.TransportOpenJobAction$JobPredicate.test(TransportOpenJobAction.java:266)\r\n\tat org.elasticsearch.persistent.PersistentTasksService.lambda$waitForPersistentTaskCondition$1(PersistentTasksService.java:142)\r\n\tat org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.clusterChanged(ClusterStateObserver.java:180)\r\n\tat org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListener(ClusterApplierService.java:509)\r\n\tat org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListeners(ClusterApplierService.java:500)\r\n\tat org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:467)\r\n\tat org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:407)\r\n\tat org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:151)\r\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:669)\r\n\tat org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:241)\r\n\tat org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:204)\r\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\r\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\r\n\tat java.base/java.lang.Thread.run(Thread.java:834)\r\n"}],"type":"exception","reason":"Unexpected job state [failed] with reason [Failed to connect to autodetect for job set-upgrade-mode-job] while waiting for job to be opened","stack_trace":"org.elasticsearch.ElasticsearchException: Unexpected job state [failed] with reason [Failed to connect to autodetect for job set-upgrade-mode-job] while waiting for job to be opened\r\n\tat org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.serverError(ExceptionsHelper.java:61)\r\n\tat org.elasticsearch.xpack.ml.action.TransportOpenJobAction$JobPredicate.test(TransportOpenJobAction.java:321)\r\n\tat org.elasticsearch.xpack.ml.action.TransportOpenJobAction$JobPredicate.test(TransportOpenJobAction.java:266)\r\n\tat org.elasticsearch.persistent.PersistentTasksService.lambda$waitForPersistentTaskCondition$1(PersistentTasksService.java:142)\r\n\tat org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.clusterChanged(ClusterStateObserver.java:180)\r\n\tat org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListener(ClusterApplierService.java:509)\r\n\tat 
org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListeners(ClusterApplierService.java:500)\r\n\tat org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:467)\r\n\tat org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:407)\r\n\tat org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:151)\r\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:669)\r\n\tat org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:241)\r\n\tat org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:204)\r\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\r\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\r\n\tat java.base/java.lang.Thread.run(Thread.java:834)\r\n"},"status":500}]

  at __randomizedtesting.SeedInfo.seed([1E2DD794B9477D7B:9679E84E17BB1083]:0)
  at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:451)
  at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:419)
  at jdk.internal.reflect.GeneratedMethodAccessor13.invoke(null:-1)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:566)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
  at java.lang.Thread.run(Thread.java:834)

  Caused by: java.lang.AssertionError: expected [2xx] status code but api [ml.open_job] returned [500 Internal Server Error] [same error body and embedded server-side stack trace as quoted above]

    at org.junit.Assert.fail(Assert.java:88)
    at org.elasticsearch.test.rest.yaml.section.DoSection.execute(DoSection.java:354)
    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:444)
    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:419)
    at jdk.internal.reflect.GeneratedMethodAccessor13.invoke(null:-1)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:566)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
    at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
    at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
    at java.lang.Thread.run(Thread.java:834)

@droberts195 droberts195 added :ml Machine learning >test-failure Triaged test failures from CI labels Jun 15, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Jun 15, 2021
@elasticmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@droberts195 (Contributor, Author)

The three tests that failed in https://gradle-enterprise.elastic.co/s/cu5xpsi2gbnkg all fail because of this error in the server-side logs:

org.elasticsearch.ElasticsearchException: Failed to connect to autodetect for job set-upgrade-mode-job
...
Caused by: java.io.FileNotFoundException: \\.\pipe\autodetect_set-upgrade-mode-job_log_7376 (All pipe instances are busy)

This is happening because the job's C++ process is still running from a previous invocation, but the Java code has lost track of it. The root cause is this:

[2021-06-15T10:46:55,249][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Disabling upgrade mode, must wait for tasks to not have AWAITING_UPGRADE assignment
[2021-06-15T10:46:55,299][WARN ][o.e.x.m.MlAssignmentNotifier] [yamlRestTest-0] [set-upgrade-mode-job] No node found to start datafeed [set-upgrade-mode-job-datafeed]. Reasons [cannot start datafeed [set-upgrade-mode-job-datafeed], because the job's [set-upgrade-mode-job] state is stale]
[2021-06-15T10:46:55,303][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Done waiting for tasks to be out of AWAITING_UPGRADE
[2021-06-15T10:46:55,303][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Completed upgrade mode request
[2021-06-15T10:46:55,307][INFO ][o.e.x.m.MachineLearning  ] [yamlRestTest-0] Starting machine learning feature reset
[2021-06-15T10:46:55,310][INFO ][o.e.x.m.j.t.OpenJobPersistentTasksExecutor] [yamlRestTest-0] [set-upgrade-mode-job] job has running datafeed task; resetting as no snapshot exists
[2021-06-15T10:46:55,370][INFO ][o.e.x.m.a.TransportResetJobAction] [yamlRestTest-0] [set-upgrade-mode-job] Resetting job
[2021-06-15T10:46:55,378][INFO ][o.e.x.m.j.p.JobDataDeleter] [yamlRestTest-0] [set-upgrade-mode-job] running delete by query on [.ml-anomalies-shared]
[2021-06-15T10:46:55,481][WARN ][o.e.x.m.MlAssignmentNotifier] [yamlRestTest-0] [set-upgrade-mode-job] No node found to start datafeed [set-upgrade-mode-job-datafeed]. Reasons [persistent task will not be assigned as a feature reset is in progress.]
[2021-06-15T10:46:55,595][INFO ][o.e.x.m.a.TransportKillProcessAction] [yamlRestTest-0] [set-upgrade-mode-job] Killing job
[2021-06-15T10:46:55,596][TRACE][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Killing process: awaitCompletion = [true]; reason = [null]
[2021-06-15T10:46:55,596][TRACE][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Marking job task as completed
[2021-06-15T10:46:55,712][INFO ][o.e.x.m.a.TransportResetJobAction] [yamlRestTest-0] [set-upgrade-mode-job] Reset has successfully completed
[2021-06-15T10:46:55,712][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Opening job [set-upgrade-mode-job]
[2021-06-15T10:46:55,716][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Loading model snapshot [N/A], job latest_record_timestamp [N/A]
[2021-06-15T10:46:55,790][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-anomalies-shared/qM3FmrYkSx2htlRAOhXdnQ] deleting index
[2021-06-15T10:46:55,790][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-state-000001/ahSPTYdyQXqTKkO8iiR15g] deleting index
[2021-06-15T10:46:55,790][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-notifications-000001/C6WOJp9MT0aU9t_AC8M9Bw] deleting index
[2021-06-15T10:46:55,790][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-annotations-6/EjixjpAhQFudeuw1JnYY9A] deleting index
[2021-06-15T10:46:55,790][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-config/roiUOnKQTJewriDQ91fkGQ] deleting index
[2021-06-15T10:46:55,971][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [airline-data/OQq0HfjLTQK2HkrxqKk9GQ] deleting index
[2021-06-15T10:46:56,111][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [yamlRestTest-0] adding index lifecycle policy [.fleet-actions-results-ilm-policy]
[2021-06-15T10:46:56,209][WARN ][o.e.p.PersistentTasksClusterService] [yamlRestTest-0] trying to update state on non-existing task job-set-upgrade-mode-job
[2021-06-15T10:46:56,210][ERROR][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Could not set job state to [opened] for job [set-upgrade-mode-job]
org.elasticsearch.ResourceNotFoundException: the task with id job-set-upgrade-mode-job and allocation id 71 doesn't exist

Since we are seeing this problem for the first time after adding a reset call to the node assignment chain, it seems very likely that something there needs to change. Could you have a look, @dimitris-athanasiou? I think this will affect users in 7.14 if we don't fix it before release.

@dimitris-athanasiou (Contributor)

dimitris-athanasiou commented Jun 18, 2021

I took some time looking into this. I believe the problem is caused by something triggering a feature reset. Before the job reset started, we can see in the logs:

[2021-06-15T10:46:55,307][INFO ][o.e.x.m.MachineLearning  ] [yamlRestTest-0] Starting machine learning feature reset

This is what later kills the job process and, most likely, causes the job to fail to connect to the process.

What could be triggering a feature reset in the middle of the test?

cc @benwtrent

@droberts195 (Contributor, Author)

What could be triggering a feature reset in the middle of the test?

It doesn't help that none of the three tests that failed were running at the time the root cause of the problem occurred. Looking at the test start and stop times in the console log, we see that the tests running around the time of the problem were:

  1> [2564-06-15T06:46:50,736][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=runtime_fields/30_double/terms agg] before test
  1> [2564-06-15T06:46:51,415][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=runtime_fields/30_double/terms agg] after test
  1> [2564-06-15T06:46:51,433][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=ml/datafeeds_crud/Test put datafeed whose id is already taken] before test
  1> [2564-06-15T06:46:53,014][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=ml/datafeeds_crud/Test put datafeed whose id is already taken] after test
  1> [2564-06-15T06:46:53,031][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=ml/set_upgrade_mode/Setting upgrade_mode to enabled] before test
  1> [2564-06-15T06:46:56,377][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=ml/set_upgrade_mode/Setting upgrade_mode to enabled] after test
  1> [2564-06-15T06:46:56,396][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=unsigned_long/40_different_numeric/Aggregation of different numeric types is supported] before test
  1> [2564-06-15T06:46:57,226][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=unsigned_long/40_different_numeric/Aggregation of different numeric types is supported] after test

Therefore, I think that a feature reset didn't run in the middle of a test, but a feature reset was cleaning up at the end of ml/set_upgrade_mode/Setting upgrade_mode to enabled.

That test is a bit unusual in that it ends with upgrade mode enabled, so our end-of-test cleanup then has to work while upgrade mode is enabled.

@droberts195 (Contributor, Author)

droberts195 commented Jun 18, 2021

Here's a bit more of the server-side log (the bit that comes immediately before what's included in #74141 (comment)):

[2021-06-15T10:46:54,672][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Successfully set job state to [opened] for job [set-upgrade-mode-job]
[2021-06-15T10:46:54,717][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [yamlRestTest-0] [set-upgrade-mode-job] [autodetect/9768] [CResourceMonitor.cc@77] Setting model memory limit to 1024 MB
[2021-06-15T10:46:54,883][INFO ][o.e.x.m.d.DatafeedJob    ] [yamlRestTest-0] [set-upgrade-mode-job] Datafeed started (from: 1970-01-01T00:00:00.000Z to: real-time) with frequency [600000ms]
[2021-06-15T10:46:54,888][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Starting to set [upgrade_mode] to [true] from [false]
[2021-06-15T10:46:54,891][DEBUG][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Flushing job set-upgrade-mode-job
[2021-06-15T10:46:54,891][TRACE][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Executing cluster state update
[2021-06-15T10:46:54,956][TRACE][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Cluster update response built: true
[2021-06-15T10:46:54,956][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Enabling upgrade mode, must isolate datafeeds
[2021-06-15T10:46:54,956][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Isolating datafeeds: [set-upgrade-mode-job-datafeed]
[2021-06-15T10:46:54,960][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Isolated the datafeeds
[2021-06-15T10:46:54,960][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Un-assigning persistent tasks : [ datafeed-set-upgrade-mode-job-datafeed, job-set-upgrade-mode-job ]
[2021-06-15T10:46:55,011][WARN ][o.e.x.m.MlAssignmentNotifier] [yamlRestTest-0] [set-upgrade-mode-job] No node found to start datafeed [set-upgrade-mode-job-datafeed]. Reasons [persistent task cannot be assigned while upgrade mode is enabled.]
[2021-06-15T10:46:55,018][INFO ][o.e.x.m.d.DatafeedRunner ] [yamlRestTest-0] [task has been removed, cancelling locally] attempt to stop datafeed [set-upgrade-mode-job-datafeed] [68]
[2021-06-15T10:46:55,019][INFO ][o.e.x.m.d.DatafeedRunner ] [yamlRestTest-0] [task has been removed, cancelling locally] attempt to stop datafeed [set-upgrade-mode-job-datafeed] for job [set-upgrade-mode-job]
[2021-06-15T10:46:55,019][INFO ][o.e.x.m.d.DatafeedRunner ] [yamlRestTest-0] [task has been removed, cancelling locally] try lock [0s] to stop datafeed [set-upgrade-mode-job-datafeed] for job [set-upgrade-mode-job]...
[2021-06-15T10:46:55,019][INFO ][o.e.x.m.d.DatafeedRunner ] [yamlRestTest-0] [task has been removed, cancelling locally] stopping datafeed [set-upgrade-mode-job-datafeed] for job [set-upgrade-mode-job], acquired [true]...
[2021-06-15T10:46:55,021][INFO ][o.e.x.m.d.DatafeedRunner ] [yamlRestTest-0] [task has been removed, cancelling locally] datafeed [set-upgrade-mode-job-datafeed] for job [set-upgrade-mode-job] has been stopped
[2021-06-15T10:46:55,068][TRACE][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Killing process: awaitCompletion = [false]; reason = [task has been removed, cancelling locally]
[2021-06-15T10:46:55,068][INFO ][o.e.x.m.j.p.a.ProcessContext] [yamlRestTest-0] Killing job [set-upgrade-mode-job], because [task has been removed, cancelling locally]
[2021-06-15T10:46:55,071][INFO ][o.e.x.m.j.p.a.o.AutodetectResultProcessor] [yamlRestTest-0] [set-upgrade-mode-job] 0 buckets parsed from autodetect output
[2021-06-15T10:46:55,171][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Waited for tasks to be unassigned
[2021-06-15T10:46:55,171][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Completed upgrade mode request
[2021-06-15T10:46:55,193][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Starting to set [upgrade_mode] to [false] from [true]
[2021-06-15T10:46:55,193][TRACE][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Executing cluster state update
[2021-06-15T10:46:55,249][TRACE][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Cluster update response built: true

The [2021-06-15T10:46:55,193][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Starting to set [upgrade_mode] to [false] from [true] entry appears because there's a teardown section in set_upgrade_mode.yml that unsets upgrade mode.

That test is a bit unusual in that it ends with upgrade mode enabled, so our end-of-test cleanup then has to work while upgrade mode is enabled.

So actually that statement was wrong. After the teardown completes, the end-of-test cleanup has to run with upgrade mode disabled, but only just disabled, with the effects of disabling it still playing out.

@droberts195 (Contributor, Author)

We discussed this more on Slack.

The problem appears to be that feature reset kills the job while the job persistent task executor is performing the job reset before starting the process. The kill request seems to report success, which is fair enough, as no process was running at the time. But then, after the job reset completes, the job persistent task executor moves on to start the C++ process. This then becomes an invisible ninja process that the Java side has no knowledge of, but which causes problems later on.

If a kill request is received while the job persistent task executor is going through its open sequence but before the process is running, then it must not start the process: it needs to remember that the process was "killed" before it was even running.
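
To make that guard concrete, here is a minimal Java sketch of the idea, assuming a hypothetical ProcessLifecycle holder rather than the real AutodetectProcessManager internals: the kill and the launch race on a single atomic state, so a kill that arrives before the launch permanently prevents the process from starting.

import java.util.concurrent.atomic.AtomicReference;

// Illustrative only - these names do not exist in Elasticsearch.
final class ProcessLifecycle {

    enum State { NOT_STARTED, RUNNING, KILLED }

    private final AtomicReference<State> state = new AtomicReference<>(State.NOT_STARTED);

    // Called by the open sequence once the job reset has finished.
    void startProcessIfNotKilled(Runnable launchNativeProcess) {
        // Atomically move NOT_STARTED -> RUNNING; if a kill already
        // flipped the state to KILLED, the launch is skipped entirely.
        if (state.compareAndSet(State.NOT_STARTED, State.RUNNING)) {
            launchNativeProcess.run();
        }
        // Otherwise the process was "killed" before it ever ran, so
        // there is nothing to start and nothing to track.
    }

    // Called by the kill action; succeeds whether or not a process runs.
    void kill(Runnable killNativeProcess) {
        State previous = state.getAndSet(State.KILLED);
        if (previous == State.RUNNING) {
            killNativeProcess.run();
        }
        // If previous == NOT_STARTED we only record the kill, which is
        // exactly what stops the invisible ninja process from spawning.
    }
}

The single compareAndSet is the important part: checking a flag and then launching as two separate steps would reintroduce the same race.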

@davidkyle (Member)

@droberts195 (Contributor, Author)

Muted on master in #74351

Although it was the first test in the suite to fail when this issue was first opened, I think ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled is the wrong test to mute for this - see #74141 (comment):

Therefore, I think that a feature reset didn't run in the middle of a test, but a feature reset was cleaning up at the end of ml/set_upgrade_mode/Setting upgrade_mode to enabled.

What happens is that the cleanup after ml/set_upgrade_mode/Setting upgrade_mode to enabled causes an inconsistency, and then all the other tests in the suite that happen to run afterwards fail (and the test order is random, so it varies). So I think the one to mute is ml/set_upgrade_mode/Setting upgrade_mode to enabled.

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Jun 22, 2021
This commit checks whether the job has been requested to close after
the reset action completes as part of allocating the job to a new node.
This ensures we do not proceed to start the job process when the job
has been requested to close.

Closes elastic#74141
dimitris-athanasiou added a commit that referenced this issue Jun 22, 2021
While the job is opening, it is possible that the kill process action
is called. If the kill process action is received before the job
process has started, we currently start the process anyway. The
process will eventually time out trying to connect to anything and
will exit. However, it may cause an unexpected failure if the job is
opened again, as it won't be able to launch a process because one
would already exist.

This commit ensures that JobTask.isClosing() reports true when the
kill process action has been called, in order to abort opening the
process.

Closes #74141
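
As a hedged sketch of the behaviour this commit message describes (the real JobTask lives in org.elasticsearch.xpack.ml and is more involved), the change amounts to folding the kill signal into the closing check that the open sequence already consults:

// Illustrative stand-in for the real JobTask; field and method
// names here are assumptions, not the actual implementation.
final class SketchJobTask {

    private volatile boolean closing;
    private volatile boolean killed;

    void close() {
        closing = true;
    }

    void kill() {
        killed = true;
    }

    // The open sequence checks this before launching the native
    // process; reporting true for a killed task aborts the launch.
    boolean isClosing() {
        return closing || killed;
    }
}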
dimitris-athanasiou added a commit that referenced this issue Jun 22, 2021
…#74441)

While the job is opening, it is possible that the kill process action
is called. If the kill process action is received before the job
process has started, we currently start the process anyway. The
process will eventually time out trying to connect to anything and
will exit. However, it may cause an unexpected failure if the job is
opened again, as it won't be able to launch a process because one
would already exist.

This commit ensures that JobTask.isClosing() reports true when the
kill process action has been called, in order to abort opening the
process.

Closes #74141

Backport of #74415
@droberts195 (Contributor, Author)

There's still a problem here.

The failure in https://gradle-enterprise.elastic.co/s/eaysuxbqtxyp4 is an evolution of the original race condition following the fix.

The server-side log shows this:

[2021-07-05T19:07:03,994][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Disabling upgrade mode, must wait for tasks to not have AWAITING_UPGRADE assignment
[2021-07-05T19:07:04,007][WARN ][o.e.x.m.MlAssignmentNotifier] [yamlRestTest-0] [set-upgrade-mode-job] No node found to start datafeed [set-upgrade-mode-job-datafeed]. Reasons [cannot start datafeed [set-upgrade-mode-job-datafeed], because the job's [set-upgrade-mode-job] state is stale]
[2021-07-05T19:07:04,009][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Done waiting for tasks to be out of AWAITING_UPGRADE
[2021-07-05T19:07:04,009][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Completed upgrade mode request
[2021-07-05T19:07:04,013][INFO ][o.e.x.m.MachineLearning  ] [yamlRestTest-0] Starting machine learning feature reset
[2021-07-05T19:07:04,014][INFO ][o.e.x.m.j.t.OpenJobPersistentTasksExecutor] [yamlRestTest-0] [set-upgrade-mode-job] job has running datafeed task; resetting as no snapshot exists
[2021-07-05T19:07:04,025][INFO ][o.e.x.m.a.TransportResetJobAction] [yamlRestTest-0] [set-upgrade-mode-job] Resetting job
[2021-07-05T19:07:04,033][INFO ][o.e.x.m.j.p.JobDataDeleter] [yamlRestTest-0] [set-upgrade-mode-job] running delete by query on [.ml-anomalies-shared]
[2021-07-05T19:07:04,075][WARN ][o.e.x.m.MlAssignmentNotifier] [yamlRestTest-0] [set-upgrade-mode-job] No node found to start datafeed [set-upgrade-mode-job-datafeed]. Reasons [persistent task will not be assigned as a feature reset is in progress.]
[2021-07-05T19:07:04,122][INFO ][o.e.x.m.a.TransportResetJobAction] [yamlRestTest-0] [set-upgrade-mode-job] Reset has successfully completed
[2021-07-05T19:07:04,122][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Opening job [set-upgrade-mode-job] 
[2021-07-05T19:07:04,125][INFO ][o.e.x.m.a.TransportKillProcessAction] [yamlRestTest-0] [set-upgrade-mode-job] Killing job
[2021-07-05T19:07:04,125][TRACE][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Killing process: awaitCompletion = [true]; reason = [kill process (api)] 
[2021-07-05T19:07:04,125][DEBUG][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Aborted opening job [set-upgrade-mode-job] as it has been closed
[2021-07-05T19:07:04,152][DEBUG][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Attempting to close job [set-upgrade-mode-job], because [close job (api)]
[2021-07-05T19:07:04,152][DEBUG][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Cannot close job [set-upgrade-mode-job] as it has already been closed or is closing

What has happened is that the test code of ml/set_upgrade_mode/Setting upgrade_mode to enabled ran without problems. Then the teardown section ran, and disabled upgrade mode, causing the job to reopen. Then the generic post-test cleanup code ran, which closes all the jobs, and this hung.

So it appears that we violate the "it's not an error to close a job twice" rule when we go through the "Cannot close job [xxx] as it has already been closed or is closing" code path.
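
In other words, the close path needs to be idempotent. A minimal sketch of that contract, with entirely hypothetical names, looks like this: a second close of an already-closed (or killed-before-started) job must still complete its listener instead of waiting on a persistent task that will never disappear.

import java.util.function.Consumer;

// Illustrative only; not the actual Elasticsearch close path.
final class IdempotentCloseSketch {

    enum JobState { OPENED, CLOSING, CLOSED }

    private volatile JobState state = JobState.OPENED;

    void closeJob(Consumer<Boolean> onDone) {
        if (state == JobState.CLOSED || state == JobState.CLOSING) {
            // "Not an error to close a job twice": answer the caller
            // immediately rather than leaving the request to hang.
            onDone.accept(true);
            return;
        }
        state = JobState.CLOSING;
        // ... stop the native process and persist results here ...
        state = JobState.CLOSED;
        onDone.accept(true);
    }
}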

@droberts195 droberts195 reopened this Jul 6, 2021
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Jul 6, 2021
There was a point during the job opening sequence where performing
a feature reset could hang.

This happened when the kill request issued by feature reset was
executed after the job's persistent task was assigned but before the
job's native process was started. The persistent task was incorrectly
left running in this situation, yet the job opening sequence was
aborted, which meant the subsequent close request issued by feature
reset would wait for a very long time for the persistent task to
disappear.

The fix is to make the kill process request cancel the persistent
task consistently based on its request parameters and not on the
current state of the task.

Fixes elastic#74141
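
A rough sketch of what "based on its request parameters and not on the current state of the task" means in practice (all names here are invented for illustration, not the real TransportKillProcessAction):

// Illustrative only; the real kill-process transport action differs.
final class KillProcessSketch {

    interface PersistentTasks {
        void cancelTask(String taskId);
    }

    static void handleKill(String taskId,
                           boolean removeTask,     // from the request
                           boolean processRunning, // current state
                           Runnable killNativeProcess,
                           PersistentTasks tasks) {
        if (processRunning) {
            killNativeProcess.run();
        }
        // Before the fix, cancellation was effectively skipped when no
        // process was running yet, leaving the persistent task behind
        // for the subsequent close request to wait on indefinitely.
        if (removeTask) {
            tasks.cancelTask(taskId);
        }
    }
}
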
droberts195 added a commit that referenced this issue Jul 7, 2021
There was a point during the job opening sequence where performing
a feature reset could hang.

This happened when the kill request issued by feature reset was
executed after the job's persistent task was assigned but before the
job's native process was started. The persistent task was incorrectly
left running in this situation, yet the job opening sequence was
aborted, which meant the subsequent close request issued by feature
reset would wait for a very long time for the persistent task to
disappear.

The fix is to make the kill process request cancel the persistent
task consistently based on its request parameters and not on the
current state of the task.

Fixes #74141