[CI] XPackRestIT test {p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled} failing #74497

Closed
droberts195 opened this issue Jun 23, 2021 · 3 comments
Labels
:ml Machine learning Team:ML Meta label for the ML team >test-failure Triaged test failures from CI

Comments

@droberts195
Contributor

This is different from #74141. In that case a zombie process was left behind. In this case the process is killed, but the feature reset API never responds to its caller, leading to an exception during cleanup after the test.

Build scan:
https://gradle-enterprise.elastic.co/s/kazbfwxoxbv7q/tests/:x-pack:plugin:yamlRestTest/org.elasticsearch.xpack.test.rest.XPackRestIT/test%20%7Bp0=ml%2Fset_upgrade_mode%2FSetting%20upgrade%20mode%20to%20disabled%20from%20enabled%7D

Reproduction line:
./gradlew ':x-pack:plugin:yamlRestTest' --tests "org.elasticsearch.xpack.test.rest.XPackRestIT.test {p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled}" -Dtests.seed=FE10E6AAED50C1A7 -Dtests.locale=hi-IN -Dtests.timezone=America/Nipigon -Druntime.java=8

Applicable branches:
7.13

Reproduces locally?:
No

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.test.rest.XPackRestIT&tests.test=test%20%7Bp0%3Dml/set_upgrade_mode/Setting%20upgrade%20mode%20to%20disabled%20from%20enabled%7D

Failure excerpt:

java.net.SocketTimeoutException: ६०,००० milliseconds timeout on connection http-outgoing-0 [ACTIVE]

  at __randomizedtesting.SeedInfo.seed([FE10E6AAED50C1A7:7644D97043ACAC5F]:0)
  at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:866)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:283)
  at org.elasticsearch.client.RestClient.performRequest(RestClient.java:270)
  at org.elasticsearch.xpack.core.ml.integration.MlRestTestStateCleaner.clearMlMetadata(MlRestTestStateCleaner.java:36)
  at org.elasticsearch.xpack.test.rest.AbstractXPackRestTest.clearMlState(AbstractXPackRestTest.java:222)
  at org.elasticsearch.xpack.test.rest.AbstractXPackRestTest.cleanup(AbstractXPackRestTest.java:206)
  at sun.reflect.GeneratedMethodAccessor16.invoke(null:-1)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
  at java.lang.Thread.run(Thread.java:748)

  Caused by: java.net.SocketTimeoutException: ६०,००० milliseconds timeout on connection http-outgoing-0 [ACTIVE]

    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39)
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:502)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
    at java.lang.Thread.run(Thread.java:748)

@droberts195 droberts195 added :ml Machine learning >test-failure Triaged test failures from CI labels Jun 23, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Jun 23, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@droberts195
Contributor Author

at org.elasticsearch.xpack.core.ml.integration.MlRestTestStateCleaner.clearMlMetadata(MlRestTestStateCleaner.java:36)

This is the line of the test cleanup that calls the feature reset API. So the test was reported as failing because the feature reset API, running in the external cluster, never responded to its caller in the test harness.
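The failure mode can be sketched in miniature: when a server accepts a request but never answers, the client eventually hits its socket timeout, exactly as the test harness's 60,000 ms `RestClient` timeout fired here. The sketch below is illustrative only (Python stdlib sockets, not the actual harness or `RestClient` code), with a short timeout standing in for the real 60-second one.

```python
import socket
import threading

def silent_server(sock):
    """Accept a connection but never send a response (the hung feature reset)."""
    conn, _ = sock.accept()
    threading.Event().wait(2)  # hold the connection open, say nothing
    conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
threading.Thread(target=silent_server, args=(server,), daemon=True).start()

client = socket.socket()
client.settimeout(0.5)  # stands in for the harness's 60,000 ms timeout
client.connect(server.getsockname())
client.sendall(b"POST /_features/_reset HTTP/1.1\r\n\r\n")
try:
    client.recv(1024)  # blocks: the server never replies
    timed_out = False
except socket.timeout:
    timed_out = True
print(timed_out)  # → True
```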

[2021-06-23T10:32:02,308][INFO ][o.e.x.t.r.XPackRestIT ] [test] There are still tasks running after this test that might break subsequent tests [cluster:admin/features/reset, cluster:admin/xpack/ml/job/close, xpack/ml/job[c]].

This suggests that the reason the feature reset API never responded is that it was waiting for a close job request to complete, but that close job request never completed.

The server side log for the relevant period is this:

[2021-06-23T14:31:00,987][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Opening job [set-upgrade-mode-job]
[2021-06-23T14:31:00,988][INFO ][o.e.x.c.m.u.MlIndexAndAlias] [yamlRestTest-0] About to create first concrete index [.ml-state-000001] with alias [.ml-state-write]
[2021-06-23T14:31:00,991][INFO ][o.e.c.m.MetadataCreateIndexService] [yamlRestTest-0] [.ml-state-000001] creating index, cause [api], templates [.ml-state], shards [1]/[1]
[2021-06-23T14:31:00,991][INFO ][o.e.c.r.a.AllocationService] [yamlRestTest-0] updating number_of_replicas to [0] for indices [.ml-state-000001]
[2021-06-23T14:31:01,008][INFO ][o.e.x.i.IndexLifecycleTransition] [yamlRestTest-0] moving index [.ml-state-000001] from [null] to [{"phase":"new","action":"complete","name":"complete"}] in policy [ml-size-based-ilm-policy]
[2021-06-23T14:31:01,024][INFO ][o.e.c.r.a.AllocationService] [yamlRestTest-0] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.ml-state-000001][0]]]).
[2021-06-23T14:31:01,044][INFO ][o.e.x.i.IndexLifecycleTransition] [yamlRestTest-0] moving index [.ml-state-000001] from [{"phase":"new","action":"complete","name":"complete"}] to [{"phase":"hot","action":"unfollow","name":"branch-check-unfollow-prerequisites"}] in policy [ml-size-based-ilm-policy]
[2021-06-23T14:31:01,049][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Loading model snapshot [N/A], job latest_record_timestamp [N/A]
[2021-06-23T14:31:01,055][INFO ][o.e.x.i.IndexLifecycleTransition] [yamlRestTest-0] moving index [.ml-state-000001] from [{"phase":"hot","action":"unfollow","name":"branch-check-unfollow-prerequisites"}] to [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}] in policy [ml-size-based-ilm-policy]
[2021-06-23T14:31:01,630][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [yamlRestTest-0] [set-upgrade-mode-job] [autodetect/110452] [CResourceMonitor.cc@77] Setting model memory limit to 1024 MB
[2021-06-23T14:31:01,651][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Successfully set job state to [opened] for job [set-upgrade-mode-job]
[2021-06-23T14:31:01,765][INFO ][o.e.x.m.d.DatafeedJob    ] [yamlRestTest-0] [set-upgrade-mode-job] Datafeed started (from: 1970-01-01T00:00:00.000Z to: real-time) with frequency [600000ms]
[2021-06-23T14:31:01,776][DEBUG][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Flushing job set-upgrade-mode-job
[2021-06-23T14:31:01,785][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Isolating datafeeds: [set-upgrade-mode-job-datafeed]
[2021-06-23T14:31:01,791][INFO ][o.e.x.m.a.TransportSetUpgradeModeAction] [yamlRestTest-0] Un-assigning persistent tasks : [ datafeed-set-upgrade-mode-job-datafeed, job-set-upgrade-mode-job ]
[2021-06-23T14:31:01,806][WARN ][o.e.x.m.MlAssignmentNotifier] [yamlRestTest-0] [set-upgrade-mode-job] No node found to start datafeed [set-upgrade-mode-job-datafeed]. Reasons [persistent task cannot be assigned while upgrade mode is enabled.]
[2021-06-23T14:31:01,808][INFO ][o.e.x.m.d.DatafeedManager] [yamlRestTest-0] [task has been removed, cancelling locally] attempt to stop datafeed [set-upgrade-mode-job-datafeed] [160]
[2021-06-23T14:31:01,808][INFO ][o.e.x.m.d.DatafeedManager] [yamlRestTest-0] [task has been removed, cancelling locally] attempt to stop datafeed [set-upgrade-mode-job-datafeed] for job [set-upgrade-mode-job]
[2021-06-23T14:31:01,808][INFO ][o.e.x.m.d.DatafeedManager] [yamlRestTest-0] [task has been removed, cancelling locally] try lock [0s] to stop datafeed [set-upgrade-mode-job-datafeed] for job [set-upgrade-mode-job]...
[2021-06-23T14:31:01,808][INFO ][o.e.x.m.d.DatafeedManager] [yamlRestTest-0] [task has been removed, cancelling locally] stopping datafeed [set-upgrade-mode-job-datafeed] for job [set-upgrade-mode-job], acquired [true]...
[2021-06-23T14:31:01,810][INFO ][o.e.x.m.d.DatafeedManager] [yamlRestTest-0] [task has been removed, cancelling locally] datafeed [set-upgrade-mode-job-datafeed] for job [set-upgrade-mode-job] has been stopped
[2021-06-23T14:31:01,827][TRACE][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Killing process: awaitCompletion = [false]; reason = [task has been removed, cancelling locally]
[2021-06-23T14:31:01,848][INFO ][o.e.x.m.j.p.a.o.AutodetectResultProcessor] [yamlRestTest-0] [set-upgrade-mode-job] 0 buckets parsed from autodetect output
[2021-06-23T14:31:01,870][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [yamlRestTest-0] [controller/14079] [CDetachedProcessSpawner.cc@177] Child process with PID 110452 was terminated by signal 15
[2021-06-23T14:31:02,038][WARN ][o.e.x.m.MlAssignmentNotifier] [yamlRestTest-0] [set-upgrade-mode-job] No node found to start datafeed [set-upgrade-mode-job-datafeed]. Reasons [cannot start datafeed [set-upgrade-mode-job-datafeed], because the job's [set-upgrade-mode-job] state is stale]
[2021-06-23T14:31:02,039][INFO ][o.e.x.m.j.t.OpenJobPersistentTasksExecutor] [yamlRestTest-0] [set-upgrade-mode-job] job has running datafeed task; reverting to current snapshot
[2021-06-23T14:31:02,044][INFO ][o.e.x.m.a.TransportRevertModelSnapshotAction] [yamlRestTest-0] Reverting to snapshot 'empty'
[2021-06-23T14:31:02,068][INFO ][o.e.x.m.MachineLearning  ] [yamlRestTest-0] Starting machine learning feature reset
[2021-06-23T14:31:02,138][INFO ][o.e.c.m.MetadataMappingService] [yamlRestTest-0] [.ml-anomalies-shared/9m4d6kobTYG127TeDFcLdA] update_mapping [_doc]
[2021-06-23T14:31:02,255][INFO ][o.e.x.m.a.TransportRevertModelSnapshotAction] [yamlRestTest-0] [set-upgrade-mode-job] Removing intervening records after reverting model: deleting results after [1970-01-01T00:00:00.000+0000]
[2021-06-23T14:31:02,257][INFO ][o.e.x.m.a.TransportRevertModelSnapshotAction] [yamlRestTest-0] [set-upgrade-mode-job] Removing intervening annotations after reverting model: deleting annotations after [1970-01-01T00:00:00.000+0000]
[2021-06-23T14:31:02,259][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Opening job [set-upgrade-mode-job]
[2021-06-23T14:31:02,266][WARN ][o.e.x.m.MlAssignmentNotifier] [yamlRestTest-0] [set-upgrade-mode-job] No node found to start datafeed [set-upgrade-mode-job-datafeed]. Reasons [persistent task will not be assigned as a feature reset is in progress.]
[2021-06-23T14:31:02,279][INFO ][o.e.x.m.a.TransportKillProcessAction] [yamlRestTest-0] [set-upgrade-mode-job] Killing job
[2021-06-23T14:31:02,279][TRACE][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Killing process: awaitCompletion = [true]; reason = [null]
[2021-06-23T14:31:02,279][DEBUG][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Aborted opening job [set-upgrade-mode-job] as it has been closed
[2021-06-23T14:31:02,304][DEBUG][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Attempting to close job [set-upgrade-mode-job], because [close job (api)]
[2021-06-23T14:31:02,304][DEBUG][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Cannot close job [set-upgrade-mode-job] as it has already been closed or is closing
[2021-06-23T14:32:02,193][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-state-000001/FXSe2_vsS3qriaJZ7xpegQ] deleting index
[2021-06-23T14:32:02,193][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-annotations-6/1ufMG9gFSqSz-NfW0pRigw] deleting index
[2021-06-23T14:32:02,193][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-config/aAmlvMg6Qh-KqZJIYHQ5uw] deleting index
[2021-06-23T14:32:02,194][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-anomalies-shared/9m4d6kobTYG127TeDFcLdA] deleting index
[2021-06-23T14:32:02,194][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [airline-data/ircHAX_FSROSZdfE28k_mQ] deleting index
[2021-06-23T14:32:02,194][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-notifications-000001/ed-gG-cgSJm8ul97Pn_eaw] deleting index

In particular, this line is of interest:

[2021-06-23T14:31:02,304][DEBUG][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Cannot close job [set-upgrade-mode-job] as it has already been closed or is closing

Possibly we have a bug where, if we return early from the AutodetectProcessManager close sequence in this situation, we don't remove the persistent task.
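The suspected bug shape is a classic one: an early return on the "already been closed or is closing" path skips a cleanup step that every other exit path performs. The sketch below is purely hypothetical (invented names, not the real AutodetectProcessManager code) and just shows why such an early return would leave the persistent task behind for the feature reset API to wait on forever.

```python
persistent_tasks = {"job-set-upgrade-mode-job"}

def close_job_buggy(task, already_closing):
    if already_closing:
        # Early return: the persistent task removal below is skipped,
        # so anything waiting for the task to disappear hangs.
        return
    persistent_tasks.discard(task)

close_job_buggy("job-set-upgrade-mode-job", already_closing=True)
leaked = len(persistent_tasks)  # the task was never removed

def close_job_fixed(task, already_closing):
    # Remove the persistent task on every exit path before returning.
    persistent_tasks.discard(task)
    if already_closing:
        return

close_job_fixed("job-set-upgrade-mode-job", already_closing=True)
print(leaked, len(persistent_tasks))  # → 1 0
```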

@droberts195
Contributor Author

Possibly we have a bug where, if we return early from the AutodetectProcessManager close sequence in this situation, we don't remove the persistent task.

This should have been fixed by the work that had to be done to get CI to pass with the changes of #75188.
