[CI] XPackRestIT test {p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled} failing #74141

Closed
droberts195 opened this issue Jun 15, 2021 · 9 comments · Fixed by #74415 or #74976
Labels: :ml Machine learning · Team:ML Meta label for the ML team · >test-failure Triaged test failures from CI

@droberts195 (Contributor)

I am pretty sure this is a side effect of #73908. Resetting jobs on assignment to a new node doesn't seem to interact reliably with ML upgrade mode.

Build scan:
https://gradle-enterprise.elastic.co/s/cu5xpsi2gbnkg/tests/:x-pack:plugin:yamlRestTest/org.elasticsearch.xpack.test.rest.XPackRestIT/test%20%7Bp0=ml%2Fset_upgrade_mode%2FSetting%20upgrade%20mode%20to%20disabled%20from%20enabled%7D

Reproduction line:
gradlew ':x-pack:plugin:yamlRestTest' --tests "org.elasticsearch.xpack.test.rest.XPackRestIT.test {p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled}" -Dtests.seed=1E2DD794B9477D7B -Dtests.locale=th-TH -Dtests.timezone=America/St_Barthelemy -Druntime.java=11

Applicable branches:
master, 7.x

Reproduces locally?:
No

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.test.rest.XPackRestIT&tests.test=test%20%7Bp0%3Dml/set_upgrade_mode/Setting%20upgrade%20mode%20to%20disabled%20from%20enabled%7D

Failure excerpt:

java.lang.AssertionError: Failure at [ml/set_upgrade_mode:53]: expected [2xx] status code but api [ml.open_job] returned [500 Internal Server Error] [{"error":{"root_cause":[{"type":"exception","reason":"Unexpected job state [failed] with reason [Failed to connect to autodetect for job set-upgrade-mode-job] while waiting for job to be opened","stack_trace":"org.elasticsearch.ElasticsearchException: Unexpected job state [failed] with reason [Failed to connect to autodetect for job set-upgrade-mode-job] while waiting for job to be opened\r\n\tat org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.serverError(ExceptionsHelper.java:61)\r\n\tat org.elasticsearch.xpack.ml.action.TransportOpenJobAction$JobPredicate.test(TransportOpenJobAction.java:321)\r\n\tat org.elasticsearch.xpack.ml.action.TransportOpenJobAction$JobPredicate.test(TransportOpenJobAction.java:266)\r\n\tat org.elasticsearch.persistent.PersistentTasksService.lambda$waitForPersistentTaskCondition$1(PersistentTasksService.java:142)\r\n\tat org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.clusterChanged(ClusterStateObserver.java:180)\r\n\tat org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListener(ClusterApplierService.java:509)\r\n\tat org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListeners(ClusterApplierService.java:500)\r\n\tat org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:467)\r\n\tat org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:407)\r\n\tat org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:151)\r\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:669)\r\n\tat org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:241)\r\n\tat org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:204)\r\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\r\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\r\n\tat java.base/java.lang.Thread.run(Thread.java:834)\r\n"}],"type":"exception","reason":"Unexpected job state [failed] with reason [Failed to connect to autodetect for job set-upgrade-mode-job] while waiting for job to be opened","stack_trace":"org.elasticsearch.ElasticsearchException: Unexpected job state [failed] with reason [Failed to connect to autodetect for job set-upgrade-mode-job] while waiting for job to be opened\r\n\tat org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.serverError(ExceptionsHelper.java:61)\r\n\tat org.elasticsearch.xpack.ml.action.TransportOpenJobAction$JobPredicate.test(TransportOpenJobAction.java:321)\r\n\tat org.elasticsearch.xpack.ml.action.TransportOpenJobAction$JobPredicate.test(TransportOpenJobAction.java:266)\r\n\tat org.elasticsearch.persistent.PersistentTasksService.lambda$waitForPersistentTaskCondition$1(PersistentTasksService.java:142)\r\n\tat org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.clusterChanged(ClusterStateObserver.java:180)\r\n\tat org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListener(ClusterApplierService.java:509)\r\n\tat 
org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListeners(ClusterApplierService.java:500)\r\n\tat org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:467)\r\n\tat org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:407)\r\n\tat org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:151)\r\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:669)\r\n\tat org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:241)\r\n\tat org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:204)\r\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\r\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\r\n\tat java.base/java.lang.Thread.run(Thread.java:834)\r\n"},"status":500}]

  at __randomizedtesting.SeedInfo.seed([1E2DD794B9477D7B:9679E84E17BB1083]:0)
  at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:451)
  at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:419)
  at jdk.internal.reflect.GeneratedMethodAccessor13.invoke(null:-1)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:566)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
  at java.lang.Thread.run(Thread.java:834)

  Caused by: java.lang.AssertionError: expected [2xx] status code but api [ml.open_job] returned [500 Internal Server Error] [same error body and embedded server-side stack trace as quoted above]

    at org.junit.Assert.fail(Assert.java:88)
    at org.elasticsearch.test.rest.yaml.section.DoSection.execute(DoSection.java:354)
    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:444)
    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:419)
    at jdk.internal.reflect.GeneratedMethodAccessor13.invoke(null:-1)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:566)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
    at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
    at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
    at java.lang.Thread.run(Thread.java:834)

@droberts195 droberts195 added :ml Machine learning >test-failure Triaged test failures from CI labels Jun 15, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Jun 15, 2021
@elasticmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

@droberts195 (Contributor, Author)

The three tests that failed in https://gradle-enterprise.elastic.co/s/cu5xpsi2gbnkg all fail because of this error in the server-side logs:

org.elasticsearch.ElasticsearchException: Failed to connect to autodetect for job set-upgrade-mode-job
...
Caused by: java.io.FileNotFoundException: \\.\pipe\autodetect_set-upgrade-mode-job_log_7376 (All pipe instances are busy)

This is happening because the job's C++ process is still running from a previous invocation, but the Java code has lost track of it. The root cause is this:

[2021-06-15T10:46:55,249][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Disabling upgrade mode, must wait for tasks to not have AWAITING_UPGRADE assignment
[2021-06-15T10:46:55,299][WARN ][o.e.x.m.MlAssignmentNotifier] [yamlRestTest-0] [set-upgrade-mode-job] No node found to start datafeed [set-upgrade-mode-job-datafeed]. Reasons [cannot start datafeed [set-upgrade-mode-job-datafeed], because the job's [set-upgrade-mode-job] state is stale]
[2021-06-15T10:46:55,303][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Done waiting for tasks to be out of AWAITING_UPGRADE
[2021-06-15T10:46:55,303][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Completed upgrade mode request
[2021-06-15T10:46:55,307][INFO ][o.e.x.m.MachineLearning  ] [yamlRestTest-0] Starting machine learning feature reset
[2021-06-15T10:46:55,310][INFO ][o.e.x.m.j.t.OpenJobPersistentTasksExecutor] [yamlRestTest-0] [set-upgrade-mode-job] job has running datafeed task; resetting as no snapshot exists
[2021-06-15T10:46:55,370][INFO ][o.e.x.m.a.TransportResetJobAction] [yamlRestTest-0] [set-upgrade-mode-job] Resetting job
[2021-06-15T10:46:55,378][INFO ][o.e.x.m.j.p.JobDataDeleter] [yamlRestTest-0] [set-upgrade-mode-job] running delete by query on [.ml-anomalies-shared]
[2021-06-15T10:46:55,481][WARN ][o.e.x.m.MlAssignmentNotifier] [yamlRestTest-0] [set-upgrade-mode-job] No node found to start datafeed [set-upgrade-mode-job-datafeed]. Reasons [persistent task will not be assigned as a feature reset is in progress.]
[2021-06-15T10:46:55,595][INFO ][o.e.x.m.a.TransportKillProcessAction] [yamlRestTest-0] [set-upgrade-mode-job] Killing job
[2021-06-15T10:46:55,596][TRACE][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Killing process: awaitCompletion = [true]; reason = [null]
[2021-06-15T10:46:55,596][TRACE][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Marking job task as completed
[2021-06-15T10:46:55,712][INFO ][o.e.x.m.a.TransportResetJobAction] [yamlRestTest-0] [set-upgrade-mode-job] Reset has successfully completed
[2021-06-15T10:46:55,712][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Opening job [set-upgrade-mode-job]
[2021-06-15T10:46:55,716][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Loading model snapshot [N/A], job latest_record_timestamp [N/A]
[2021-06-15T10:46:55,790][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-anomalies-shared/qM3FmrYkSx2htlRAOhXdnQ] deleting index
[2021-06-15T10:46:55,790][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-state-000001/ahSPTYdyQXqTKkO8iiR15g] deleting index
[2021-06-15T10:46:55,790][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-notifications-000001/C6WOJp9MT0aU9t_AC8M9Bw] deleting index
[2021-06-15T10:46:55,790][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-annotations-6/EjixjpAhQFudeuw1JnYY9A] deleting index
[2021-06-15T10:46:55,790][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-config/roiUOnKQTJewriDQ91fkGQ] deleting index
[2021-06-15T10:46:55,971][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [airline-data/OQq0HfjLTQK2HkrxqKk9GQ] deleting index
[2021-06-15T10:46:56,111][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [yamlRestTest-0] adding index lifecycle policy [.fleet-actions-results-ilm-policy]
[2021-06-15T10:46:56,209][WARN ][o.e.p.PersistentTasksClusterService] [yamlRestTest-0] trying to update state on non-existing task job-set-upgrade-mode-job
[2021-06-15T10:46:56,210][ERROR][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Could not set job state to [opened] for job [set-upgrade-mode-job]
org.elasticsearch.ResourceNotFoundException: the task with id job-set-upgrade-mode-job and allocation id 71 doesn't exist

Since we are seeing this problem for the first time after adding a reset call to the node assignment chain, it seems very likely that something there needs to change. Could you have a look, @dimitris-athanasiou? I think this will affect users in 7.14 if we don't fix it before release.

@dimitris-athanasiou (Contributor)

dimitris-athanasiou commented Jun 18, 2021

I took some time looking into this. I believe the problem is caused by something triggering a feature reset. Before the job reset started, we can see in the logs:

[2021-06-15T10:46:55,307][INFO ][o.e.x.m.MachineLearning  ] [yamlRestTest-0] Starting machine learning feature reset

This is what later kills the job process and, most likely, causes the job to fail to connect to the process.

What could be triggering a feature reset in the middle of the test?

cc @benwtrent

@droberts195 (Contributor, Author)

What could be triggering a feature reset in the middle of the test?

It doesn't help that none of the three tests that failed were running at the time the root cause of the problem occurred. Looking at the test start and stop times in the console log, we see that the tests running around the time of the problem were:

  1> [2564-06-15T06:46:50,736][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=runtime_fields/30_double/terms agg] before test
  1> [2564-06-15T06:46:51,415][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=runtime_fields/30_double/terms agg] after test
  1> [2564-06-15T06:46:51,433][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=ml/datafeeds_crud/Test put datafeed whose id is already taken] before test
  1> [2564-06-15T06:46:53,014][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=ml/datafeeds_crud/Test put datafeed whose id is already taken] after test
  1> [2564-06-15T06:46:53,031][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=ml/set_upgrade_mode/Setting upgrade_mode to enabled] before test
  1> [2564-06-15T06:46:56,377][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=ml/set_upgrade_mode/Setting upgrade_mode to enabled] after test
  1> [2564-06-15T06:46:56,396][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=unsigned_long/40_different_numeric/Aggregation of different numeric types is supported] before test
  1> [2564-06-15T06:46:57,226][INFO ][o.e.x.t.r.XPackRestIT    ] [test] [p0=unsigned_long/40_different_numeric/Aggregation of different numeric types is supported] after test

Therefore, I think that a feature reset didn't run in the middle of a test, but a feature reset was cleaning up at the end of ml/set_upgrade_mode/Setting upgrade_mode to enabled.

That test is a bit unusual in that it ends with upgrade mode enabled, so our end-of-test cleanup then has to work while upgrade mode is enabled.

@droberts195 (Contributor, Author)

droberts195 commented Jun 18, 2021

Here's a bit more of the server-side log (the bit that comes immediately before what's included in #74141 (comment)):

[2021-06-15T10:46:54,672][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Successfully set job state to [opened] for job [set-upgrade-mode-job]
[2021-06-15T10:46:54,717][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [yamlRestTest-0] [set-upgrade-mode-job] [autodetect/9768] [CResourceMonitor.cc@77] Setting model memory limit to 1024 MB
[2021-06-15T10:46:54,883][INFO ][o.e.x.m.d.DatafeedJob    ] [yamlRestTest-0] [set-upgrade-mode-job] Datafeed started (from: 1970-01-01T00:00:00.000Z to: real-time) with frequency [600000ms]
[2021-06-15T10:46:54,888][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Starting to set [upgrade_mode] to [true] from [false]
[2021-06-15T10:46:54,891][DEBUG][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Flushing job set-upgrade-mode-job
[2021-06-15T10:46:54,891][TRACE][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Executing cluster state update
[2021-06-15T10:46:54,956][TRACE][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Cluster update response built: true
[2021-06-15T10:46:54,956][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Enabling upgrade mode, must isolate datafeeds
[2021-06-15T10:46:54,956][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Isolating datafeeds: [set-upgrade-mode-job-datafeed]
[2021-06-15T10:46:54,960][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Isolated the datafeeds
[2021-06-15T10:46:54,960][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Un-assigning persistent tasks : [ datafeed-set-upgrade-mode-job-datafeed, job-set-upgrade-mode-job ]
[2021-06-15T10:46:55,011][WARN ][o.e.x.m.MlAssignmentNotifier] [yamlRestTest-0] [set-upgrade-mode-job] No node found to start datafeed [set-upgrade-mode-job-datafeed]. Reasons [persistent task cannot be assigned while upgrade mode is enabled.]
[2021-06-15T10:46:55,018][INFO ][o.e.x.m.d.DatafeedRunner ] [yamlRestTest-0] [task has been removed, cancelling locally] attempt to stop datafeed [set-upgrade-mode-job-datafeed] [68]
[2021-06-15T10:46:55,019][INFO ][o.e.x.m.d.DatafeedRunner ] [yamlRestTest-0] [task has been removed, cancelling locally] attempt to stop datafeed [set-upgrade-mode-job-datafeed] for job [set-upgrade-mode-job]
[2021-06-15T10:46:55,019][INFO ][o.e.x.m.d.DatafeedRunner ] [yamlRestTest-0] [task has been removed, cancelling locally] try lock [0s] to stop datafeed [set-upgrade-mode-job-datafeed] for job [set-upgrade-mode-job]...
[2021-06-15T10:46:55,019][INFO ][o.e.x.m.d.DatafeedRunner ] [yamlRestTest-0] [task has been removed, cancelling locally] stopping datafeed [set-upgrade-mode-job-datafeed] for job [set-upgrade-mode-job], acquired [true]...
[2021-06-15T10:46:55,021][INFO ][o.e.x.m.d.DatafeedRunner ] [yamlRestTest-0] [task has been removed, cancelling locally] datafeed [set-upgrade-mode-job-datafeed] for job [set-upgrade-mode-job] has been stopped
[2021-06-15T10:46:55,068][TRACE][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Killing process: awaitCompletion = [false]; reason = [task has been removed, cancelling locally]
[2021-06-15T10:46:55,068][INFO ][o.e.x.m.j.p.a.ProcessContext] [yamlRestTest-0] Killing job [set-upgrade-mode-job], because [task has been removed, cancelling locally]
[2021-06-15T10:46:55,071][INFO ][o.e.x.m.j.p.a.o.AutodetectResultProcessor] [yamlRestTest-0] [set-upgrade-mode-job] 0 buckets parsed from autodetect output
[2021-06-15T10:46:55,171][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Waited for tasks to be unassigned
[2021-06-15T10:46:55,171][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Completed upgrade mode request
[2021-06-15T10:46:55,193][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Starting to set [upgrade_mode] to [false] from [true]
[2021-06-15T10:46:55,193][TRACE][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Executing cluster state update
[2021-06-15T10:46:55,249][TRACE][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Cluster update response built: true

The [2021-06-15T10:46:55,193][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Starting to set [upgrade_mode] to [false] from [true] entry appears because there's a teardown section in set_upgrade_mode.yml that unsets upgrade mode.

That test is a bit unusual in that it ends with upgrade mode enabled, so our end-of-test cleanup then has to work while upgrade mode is enabled.

So actually that statement was wrong. After the teardown completes, the end-of-test cleanup has to run with upgrade mode disabled, but only just disabled, with the effects of disabling it still playing out.

@droberts195 (Contributor, Author)

We discussed this more on Slack.

The problem appears to be that feature reset kills the job while the job persistent task executor is performing the job reset before starting the process. The kill request seems to report success, which is fair enough, as no process was running at the time. But then, after the job reset completes, the job persistent task executor moves on to start the C++ process. This then becomes an invisible ninja process that the Java side has no knowledge of, but which causes problems later on.

If a kill request is received while the job persistent task executor is going through its open sequence but before the process is running, then it must not start the process: it needs to remember that the process was "killed" before it was even running.
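
To make that guard concrete, here is a minimal Java sketch of the idea, assuming a hypothetical ProcessLifecycle holder rather than the real AutodetectProcessManager internals: the kill and the launch race on a single atomic state, so a kill that arrives before the launch permanently prevents the process from starting.

import java.util.concurrent.atomic.AtomicReference;

// Illustrative only - these names do not exist in Elasticsearch.
final class ProcessLifecycle {

    enum State { NOT_STARTED, RUNNING, KILLED }

    private final AtomicReference<State> state = new AtomicReference<>(State.NOT_STARTED);

    // Called by the open sequence once the job reset has finished.
    void startProcessIfNotKilled(Runnable launchNativeProcess) {
        // Atomically move NOT_STARTED -> RUNNING; if a kill already
        // flipped the state to KILLED, the launch is skipped entirely.
        if (state.compareAndSet(State.NOT_STARTED, State.RUNNING)) {
            launchNativeProcess.run();
        }
        // Otherwise the process was "killed" before it ever ran, so
        // there is nothing to start and nothing to track.
    }

    // Called by the kill action; succeeds whether or not a process runs.
    void kill(Runnable killNativeProcess) {
        State previous = state.getAndSet(State.KILLED);
        if (previous == State.RUNNING) {
            killNativeProcess.run();
        }
        // If previous == NOT_STARTED we only record the kill, which is
        // exactly what stops the invisible ninja process from spawning.
    }
}

The single compareAndSet is the important part: checking a flag and then launching as two separate steps would reintroduce the same race.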

@davidkyle (Member)

@droberts195 (Contributor, Author)

Muted on master in #74351

Although it was the first test in the suite to fail when this issue was first opened, I think ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled is the wrong test to mute for this - see #74141 (comment):

Therefore, I think that a feature reset didn't run in the middle of a test, but a feature reset was cleaning up at the end of ml/set_upgrade_mode/Setting upgrade_mode to enabled.

What happens is that the cleanup after ml/set_upgrade_mode/Setting upgrade_mode to enabled causes an inconsistency, and then all the other tests in the suite that happen to run afterwards fail (and the test order is random, so it varies). So I think the one to mute is ml/set_upgrade_mode/Setting upgrade_mode to enabled.

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Jun 22, 2021
This commit checks whether the job has been requested to close after
the reset action completes as part of allocating the job to a new node.
This ensures we do not proceed to start the job process when the job
has been requested to close.

Closes elastic#74141
dimitris-athanasiou added a commit that referenced this issue Jun 22, 2021
While the job is opening, it is possible that the kill process action
is called. If the kill process action is received before the job
process has started, we currently start the process anyway. The
process will eventually time out trying to connect to anything and
will exit. However, it may cause an unexpected failure if the job is
opened again, as it won't be able to launch a process because one
would already exist.

This commit ensures that JobTask.isClosing() reports true when the
kill process action has been called, in order to abort opening the
process.

Closes #74141
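
As a hedged sketch of the behaviour this commit message describes (the real JobTask lives in org.elasticsearch.xpack.ml and is more involved), the change amounts to folding the kill signal into the closing check that the open sequence already consults:

// Illustrative stand-in for the real JobTask; field and method
// names here are assumptions, not the actual implementation.
final class SketchJobTask {

    private volatile boolean closing;
    private volatile boolean killed;

    void close() {
        closing = true;
    }

    void kill() {
        killed = true;
    }

    // The open sequence checks this before launching the native
    // process; reporting true for a killed task aborts the launch.
    boolean isClosing() {
        return closing || killed;
    }
}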
dimitris-athanasiou added a commit that referenced this issue Jun 22, 2021
…#74441)

While the job is opening, it is possible that the kill process action
is called. If the kill process action is received before the job
process has started, we currently start the process anyway. The
process will eventually time out trying to connect to anything and
will exit. However, it may cause an unexpected failure if the job is
opened again, as it won't be able to launch a process because one
would already exist.

This commit ensures that JobTask.isClosing() reports true when the
kill process action has been called, in order to abort opening the
process.

Closes #74141

Backport of #74415
@droberts195 (Contributor, Author)

There's still a problem here.

The failure in https://gradle-enterprise.elastic.co/s/eaysuxbqtxyp4 is an evolution of the original race condition following the fix.

The server-side log shows this:

[2021-07-05T19:07:03,994][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Disabling upgrade mode, must wait for tasks to not have AWAITING_UPGRADE assignment
[2021-07-05T19:07:04,007][WARN ][o.e.x.m.MlAssignmentNotifier] [yamlRestTest-0] [set-upgrade-mode-job] No node found to start datafeed [set-upgrade-mode-job-datafeed]. Reasons [cannot start datafeed [set-upgrade-mode-job-datafeed], because the job's [set-upgrade-mode-job] state is stale]
[2021-07-05T19:07:04,009][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Done waiting for tasks to be out of AWAITING_UPGRADE
[2021-07-05T19:07:04,009][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Completed upgrade mode request
[2021-07-05T19:07:04,013][INFO ][o.e.x.m.MachineLearning  ] [yamlRestTest-0] Starting machine learning feature reset
[2021-07-05T19:07:04,014][INFO ][o.e.x.m.j.t.OpenJobPersistentTasksExecutor] [yamlRestTest-0] [set-upgrade-mode-job] job has running datafeed task; resetting as no snapshot exists
[2021-07-05T19:07:04,025][INFO ][o.e.x.m.a.TransportResetJobAction] [yamlRestTest-0] [set-upgrade-mode-job] Resetting job
[2021-07-05T19:07:04,033][INFO ][o.e.x.m.j.p.JobDataDeleter] [yamlRestTest-0] [set-upgrade-mode-job] running delete by query on [.ml-anomalies-shared]
[2021-07-05T19:07:04,075][WARN ][o.e.x.m.MlAssignmentNotifier] [yamlRestTest-0] [set-upgrade-mode-job] No node found to start datafeed [set-upgrade-mode-job-datafeed]. Reasons [persistent task will not be assigned as a feature reset is in progress.]
[2021-07-05T19:07:04,122][INFO ][o.e.x.m.a.TransportResetJobAction] [yamlRestTest-0] [set-upgrade-mode-job] Reset has successfully completed
[2021-07-05T19:07:04,122][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Opening job [set-upgrade-mode-job] 
[2021-07-05T19:07:04,125][INFO ][o.e.x.m.a.TransportKillProcessAction] [yamlRestTest-0] [set-upgrade-mode-job] Killing job
[2021-07-05T19:07:04,125][TRACE][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Killing process: awaitCompletion = [true]; reason = [kill process (api)] 
[2021-07-05T19:07:04,125][DEBUG][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Aborted opening job [set-upgrade-mode-job] as it has been closed
[2021-07-05T19:07:04,152][DEBUG][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Attempting to close job [set-upgrade-mode-job], because [close job (api)]
[2021-07-05T19:07:04,152][DEBUG][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Cannot close job [set-upgrade-mode-job] as it has already been closed or is closing

What has happened is that the test code of ml/set_upgrade_mode/Setting upgrade_mode to enabled ran without problems. Then the teardown section ran, and disabled upgrade mode, causing the job to reopen. Then the generic post-test cleanup code ran, which closes all the jobs, and this hung.

So it appears that we violate the "it's not an error to close a job twice" rule when we go through the "Cannot close job [xxx] as it has already been closed or is closing" code path.
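
In other words, the close path needs to be idempotent. A minimal sketch of that contract, with entirely hypothetical names, looks like this: a second close of an already-closed (or killed-before-started) job must still complete its listener instead of waiting on a persistent task that will never disappear.

import java.util.function.Consumer;

// Illustrative only; not the actual Elasticsearch close path.
final class IdempotentCloseSketch {

    enum JobState { OPENED, CLOSING, CLOSED }

    private volatile JobState state = JobState.OPENED;

    void closeJob(Consumer<Boolean> onDone) {
        if (state == JobState.CLOSED || state == JobState.CLOSING) {
            // "Not an error to close a job twice": answer the caller
            // immediately rather than leaving the request to hang.
            onDone.accept(true);
            return;
        }
        state = JobState.CLOSING;
        // ... stop the native process and persist results here ...
        state = JobState.CLOSED;
        onDone.accept(true);
    }
}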

@droberts195 droberts195 reopened this Jul 6, 2021
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Jul 6, 2021
There was a point during the job opening sequence where performing
a feature reset could hang.

This happened when the kill request issued by feature reset was
executed after the job's persistent task was assigned but before the
job's native process was started. The persistent task was incorrectly
left running in this situation, yet the job opening sequence was
aborted, which meant the subsequent close request issued by feature
reset would wait for a very long time for the persistent task to
disappear.

The fix is to make the kill process request cancel the persistent
task consistently based on its request parameters and not on the
current state of the task.

Fixes elastic#74141
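
A rough sketch of what "based on its request parameters and not on the current state of the task" means in practice (all names here are invented for illustration, not the real TransportKillProcessAction):

// Illustrative only; the real kill-process transport action differs.
final class KillProcessSketch {

    interface PersistentTasks {
        void cancelTask(String taskId);
    }

    static void handleKill(String taskId,
                           boolean removeTask,     // from the request
                           boolean processRunning, // current state
                           Runnable killNativeProcess,
                           PersistentTasks tasks) {
        if (processRunning) {
            killNativeProcess.run();
        }
        // Before the fix, cancellation was effectively skipped when no
        // process was running yet, leaving the persistent task behind
        // for the subsequent close request to wait on indefinitely.
        if (removeTask) {
            tasks.cancelTask(taskId);
        }
    }
}
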
droberts195 added a commit that referenced this issue Jul 7, 2021
There was a point during the job opening sequence where performing
a feature reset could hang.

This happened when the kill request issued by feature reset was
executed after the job's persistent task was assigned but before the
job's native process was started. The persistent task was incorrectly
left running in this situation, yet the job opening sequence was
aborted, which meant the subsequent close request issued by feature
reset would wait for a very long time for the persistent task to
disappear.

The fix is to make the kill process request cancel the persistent
task consistently based on its request parameters and not on the
current state of the task.

Fixes #74141