[CI] XPackRestIT failing on various YAML tests #75069

Closed

romseygeek opened this issue Jul 7, 2021 · 14 comments
Assignees
Labels
:Data Management/Indices APIs · low-risk · :ml · Team:Data Management · Team:ML · >test-failure

Comments

@romseygeek
Contributor

got unexpected warning header [
	299 Elasticsearch-7.15.0-SNAPSHOT-f97d65771f221acca4b0a9a9441844a4392028fd "this request accesses system indices: [.ml-config], but in a future major version, direct access to system indices will be prevented by default"
]

Build scan:
https://gradle-enterprise.elastic.co/s/svd6gy2xzpfn4/tests/:x-pack:plugin:yamlRestTest/org.elasticsearch.xpack.test.rest.XPackRestIT/test%20%7Bp0=data_stream%2F140_data_stream_aliases%2FCreate%20data%20stream%20alias%7D

Reproduction line:
./gradlew ':x-pack:plugin:yamlRestTest' --tests "org.elasticsearch.xpack.test.rest.XPackRestIT.test {p0=data_stream/140_data_stream_aliases/Create data stream alias}" -Dtests.seed=9F9D630801B00349 -Dtests.locale=it-IT -Dtests.timezone=Asia/Ashkhabad -Druntime.java=8

Applicable branches:
7.x, master

Reproduces locally?:
No

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.test.rest.XPackRestIT&tests.test=test%20%7Bp0%3Ddata_stream/140_data_stream_aliases/Create%20data%20stream%20alias%7D

Failure excerpt:

java.lang.AssertionError: Failure at [data_stream/140_data_stream_aliases:63]: got unexpected warning header [
	299 Elasticsearch-7.15.0-SNAPSHOT-f97d65771f221acca4b0a9a9441844a4392028fd "this request accesses system indices: [.ml-config], but in a future major version, direct access to system indices will be prevented by default"
]

  at __randomizedtesting.SeedInfo.seed([9F9D630801B00349:17C95CD2AF4C6EB1]:0)
  at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:462)
  at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:439)
  at sun.reflect.GeneratedMethodAccessor15.invoke(null:-1)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
  at java.lang.Thread.run(Thread.java:748)

  Caused by: java.lang.AssertionError: got unexpected warning header [
  	299 Elasticsearch-7.15.0-SNAPSHOT-f97d65771f221acca4b0a9a9441844a4392028fd "this request accesses system indices: [.ml-config], but in a future major version, direct access to system indices will be prevented by default"
  ]

    at org.junit.Assert.fail(Assert.java:88)
    at org.elasticsearch.test.rest.yaml.section.DoSection.checkWarningHeaders(DoSection.java:476)
    at org.elasticsearch.test.rest.yaml.section.DoSection.execute(DoSection.java:350)
    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.executeSection(ESClientYamlSuiteTestCase.java:455)
    at org.elasticsearch.test.rest.yaml.ESClientYamlSuiteTestCase.test(ESClientYamlSuiteTestCase.java:439)
    at sun.reflect.GeneratedMethodAccessor15.invoke(null:-1)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
    at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
    at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
    at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
    at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
    at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
    at java.lang.Thread.run(Thread.java:748)

@romseygeek romseygeek added :ml Machine learning >test-failure Triaged test failures from CI labels Jul 7, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Jul 7, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@romseygeek romseygeek added the :Data Management/Indices APIs APIs to create and manage indices and templates label Jul 7, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Jul 7, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

@droberts195
Contributor

I think #73734 would fix this. As an interim fix we should probably just ignore this warning.
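For reference, the YAML REST test framework can tolerate a specific warning via `allowed_warnings` in a `do` section. A sketch of what muting this one could look like — the request shown is illustrative, not the actual body of `140_data_stream_aliases`:

```yaml
- do:
    # Hypothetical request; the real test section differs.
    allowed_warnings:
      - "this request accesses system indices: [.ml-config], but in a future major version, direct access to system indices will be prevented by default"
    indices.get_alias:
      name: my-alias
```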

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Jul 8, 2021
This is a followup to elastic#74976.

The changes of elastic#74976 reverted many of the changes of elastic#71656
because elastic#74415 made them redundant. elastic#74415 did this by marking
killed jobs as closing so that the standard "job closed immediately
after open" functionality was used instead of reissuing the kill
immediately after opening. However, it turns out that this
"job closed immediately after open" functionality is not
perfect for the case of a job that is killed while it is opening.
It causes AutodetectCommunicator.close() to be called instead
of AutodetectCommunicator.killProcess(). Both do a lot of the
same things, but AutodetectCommunicator.close() finalizes
the job, and this can cause problems if the job is being killed
as part of a feature reset.

This change reinstates some of the functionality of elastic#71656
but in a different place that hopefully won't reintroduce the
problems that led to elastic#74415.

We can detect that a kill has happened early on during an
open or close operation by checking if the task's allocation
ID has been removed from the map after ProcessContext.setDying()
returns true. If ProcessContext.setDying() returns true this
means the job has not been previously closed, so it must have
been killed. Then we can call AutodetectCommunicator.killProcess()
instead of AutodetectCommunicator.close() during the cleanup
that happens when we detect that a recently started process is
no longer wanted.

Relates elastic#75069
@romseygeek
Contributor Author

romseygeek commented Jul 8, 2021

A related failure in docs yaml tests:
https://gradle-enterprise.elastic.co/s/g2vrkxovl2ws2

[2564-07-08T15:10:11,580][INFO ][o.e.s.DocsClientYamlTestSuiteIT] [test] [yaml=reference/query-dsl/exists-query/line_56] before test
ก.ค. ๐๘, ๒๕๖๔ ๓:๑๐:๑๑ หลังเที่ยง org.elasticsearch.client.RestClient logResponse
WARNING: request [GET http://[::1]:43489/_search?error_trace=true] returned 1 warnings: [299 Elasticsearch-7.15.0-SNAPSHOT-a049585f298d10ed58c64ecac6902df5c4b86280 "this request accesses system indices: [.async-search], but in a future major version, direct access to system indices will be prevented by default"]

@droberts195
Contributor

I think this is an evolution of #74141. After fixing some of the race conditions between ML job open, close and kill the races move on to code that runs later.

In the failure from this issue's description the relevant bit of the server side log is this:

[2021-07-07T14:19:35,848][INFO ][o.e.x.m.MachineLearning  ] [yamlRestTest-0] Starting machine learning feature reset
[2021-07-07T14:19:35,850][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Loading model snapshot [N/A], job latest_record_timestamp [N/A]
[2021-07-07T14:19:35,919][INFO ][o.e.x.m.a.TransportKillProcessAction] [yamlRestTest-0] [set-upgrade-mode-job] Killing job
[2021-07-07T14:19:35,919][TRACE][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] [set-upgrade-mode-job] Killing process: awaitCompletion = [true]; reason = [kill process (api)]
[2021-07-07T14:19:35,975][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-state-000001/wlkDnNKlQrqyEX_SoHKDcg] deleting index
[2021-07-07T14:19:35,975][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-notifications-000002/UdDUIn_vSGW4XpL0OXCb-w] deleting index
[2021-07-07T14:19:35,975][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-config/-HtKKmzATrqhExujwSN9BQ] deleting index
[2021-07-07T14:19:35,975][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-anomalies-shared/ZP7tVtBfQCeUXamDP4Tj7w] deleting index
[2021-07-07T14:19:35,976][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [.ml-annotations-6/lpA4DM6aRauiVpREJX9vhw] deleting index
[2021-07-07T14:19:36,130][INFO ][o.e.c.m.MetadataDeleteIndexService] [yamlRestTest-0] [airline-data/Zqw03jwRQFmmK-DuatKkGg] deleting index
[2021-07-07T14:19:36,266][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [yamlRestTest-0] adding index lifecycle policy [.fleet-actions-results-ilm-policy]
[2021-07-07T14:19:36,318][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [yamlRestTest-0] adding index lifecycle policy [synthetics]
[2021-07-07T14:19:36,368][INFO ][o.e.x.i.a.TransportPutLifecycleAction] [yamlRestTest-0] adding index lifecycle policy [.deprecation-indexing-ilm-policy]
[2021-07-07T14:19:36,392][INFO ][o.e.x.m.j.p.a.AutodetectProcessManager] [yamlRestTest-0] Closing job [set-upgrade-mode-job], because [job is already closing]
[2021-07-07T14:19:36,396][WARN ][o.e.x.m.j.p.a.o.AutodetectResultProcessor] [yamlRestTest-0] [set-upgrade-mode-job] some results not processed due to the process being killed
[2021-07-07T14:19:36,397][INFO ][o.e.x.m.j.p.a.AutodetectCommunicator] [yamlRestTest-0] [set-upgrade-mode-job] job closed
[2021-07-07T14:19:36,407][INFO ][o.e.c.m.MetadataCreateIndexService] [yamlRestTest-0] [.ml-config] creating index, cause [auto(update api)], templates [], shards [1]/[1]
[2021-07-07T14:19:36,408][INFO ][o.e.c.r.a.AllocationService] [yamlRestTest-0] updating number_of_replicas to [0] for indices [.ml-config]
[2021-07-07T14:19:36,472][INFO ][o.e.c.m.MetadataCreateIndexService] [yamlRestTest-0] [.ml-annotations-6] creating index, cause [api], templates [], shards [1]/[1]
[2021-07-07T14:19:36,473][INFO ][o.e.c.r.a.AllocationService] [yamlRestTest-0] updating number_of_replicas to [0] for indices [.ml-annotations-6]

Or, in plain English:

  1. Job is opening at the time the test ends
  2. Feature cleanup starts at the end of the test
  3. Job gets killed before its C++ process has started
  4. Indices get deleted
  5. Job's C++ process starts and we realise that it shouldn't be running because it was requested to be killed before it started
  6. We shut down the process again immediately
  7. Because of the way we shut down the process it tries to update documents in indices, causing these indices to be recreated after they've been deleted

#75113 should fix this by changing the way we shut down the process in step 7 so that it should be free of side-effects.
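A minimal standalone sketch of the detection logic the fix describes: the names echo the real ones (`ProcessContext.setDying()`, the allocation-ID map) but this is an illustration of the idea, not the actual Elasticsearch implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: if setDying() reports the job was never closed normally,
// yet its allocation ID is already gone from the map, the job must have been
// killed while opening, so use the side-effect-free kill path instead of close().
public class KillDetectionSketch {

    static class ProcessContext {
        private boolean dying = false;

        /** Returns true the first time it is called, i.e. the job was not already dying. */
        synchronized boolean setDying() {
            boolean firstCall = dying == false;
            dying = true;
            return firstCall;
        }
    }

    // allocation ID -> process context; a kill removes the entry
    static final Map<Long, ProcessContext> processByAllocation = new ConcurrentHashMap<>();

    /** Decide how to shut down: a kill with no side effects, or a close that finalizes the job. */
    static String shutDown(long allocationId, ProcessContext context) {
        if (context.setDying() && processByAllocation.containsKey(allocationId) == false) {
            // Job was never closed, but the allocation is gone: it was killed while opening.
            return "killProcess"; // writes no documents, so deleted indices stay deleted
        }
        return "close"; // normal path: finalizes the job and persists results
    }

    public static void main(String[] args) {
        // Killed while opening: entry already removed from the map.
        System.out.println(shutDown(1L, new ProcessContext()));

        // Still tracked: normal close.
        ProcessContext open = new ProcessContext();
        processByAllocation.put(2L, open);
        System.out.println(shutDown(2L, open));
    }
}
```

The key property for step 7 above is that the kill path never updates job documents, so the indices deleted by feature reset are not auto-recreated.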

@droberts195 droberts195 self-assigned this Jul 8, 2021
elasticsearchmachine pushed a commit that referenced this issue Jul 8, 2021
elasticsearchmachine pushed a commit to elasticsearchmachine/elasticsearch that referenced this issue Jul 8, 2021
elasticsearchmachine pushed a commit to elasticsearchmachine/elasticsearch that referenced this issue Jul 8, 2021
elasticsearchmachine added a commit that referenced this issue Jul 8, 2021
…5116)

elasticsearchmachine added a commit that referenced this issue Jul 8, 2021
…5117)

This is a followup to #74976.

The changes of #74976 reverted many of the changes of #71656
because #74415 made them redundant. #74415 did this by making
killed jobs as closing so that the standard "job closed immediately
after open" functionality was used instead of reissuing the kill
immediately after opening. However, it turns out that this
"job closed immediately after open" functionality is not
perfect for the case of a job that is killed while it is opening.
It causes AutodetectCommunicator.close() to be called instead
of AutodetectCommunicator.killProcess(). Both do a lot of the
same things, but AutodetectCommunicator.close() finalizes
the job, and this can cause problems if the job is being killed
as part of a feature reset.

This change reinstates some of the functionality of #71656
but in a different place that hopefully won't reintroduce the
problems that led to #74415.

We can detect that a kill has happened early on during an
open or close operation by checking if the task's allocation
ID has been removed from the map after ProcessContext.setDying()
returns true. If ProcessContext.setDying() returns true this
means the job has not been previously closed, so it must have
been killed. Then we can call AutodetectCommunicator.killProcess()
instead of AutodetectCommunicator.close() during the cleanup
that happens when we detect that a recently started process is
no longer wanted.

Relates #75069

Co-authored-by: David Roberts <[email protected]>
@droberts195
Contributor

Another one just happened in a docs test: https://gradle-enterprise.elastic.co/s/q6lylbs4wg3we

However, this one is for the .async-search index, not .ml-config, and I don't think ML is responsible for this one.

The relevant bit of the server-side log is:

[2021-07-08T13:14:21,548][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] [.async-search] creating index, cause [api], templates [], shards [1]/[0]
[2021-07-08T13:14:21,636][INFO ][o.e.c.m.MetadataDeleteIndexService] [node-0] [library/wnvB3FnqThqM1VDYR62u4A] deleting index
[2021-07-08T13:14:21,637][INFO ][o.e.c.m.MetadataDeleteIndexService] [node-0] [.async-search/-C8hXOI_QOuU488GHnhuKQ] deleting index
[2021-07-08T13:14:21,704][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] [.async-search] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]
[2021-07-08T13:14:21,819][INFO ][o.e.c.m.MetadataMappingService] [node-0] [.async-search/mkUDT9-hRDqacmjcBBo8aQ] create_mapping [_doc]
[2021-07-08T13:14:22,008][INFO ][o.e.c.m.MetadataCreateIndexService] [node-0] [sales] creating index, cause [api], templates [], shards [1]/[1]
[2021-07-08T13:14:22,180][INFO ][o.e.c.m.MetadataDeleteIndexService] [node-0] [.async-search/mkUDT9-hRDqacmjcBBo8aQ] deleting index

So it looks like the .async-search index is being auto-created after it was deleted at the end of the previous test, then gets deleted again by the "delete all indices" cleanup code, and this second deletion triggers a deprecation warning that fails the test.

@droberts195
Contributor

This is another one that failed due to .async-search being a system index: https://gradle-enterprise.elastic.co/s/udfnzxfgyd37k

In this case the test that failed was reference/query-dsl/match-query/line_219. However, if you look in the full console log you can see that the test that ran immediately before was reference/sql/endpoints/rest/line_576. That was added recently in #74975 and uses async search, so that is what caused the .async-search index to be created at an unexpected moment. So this explains why these have started failing recently.

@williamrandolph do you think you would be able to merge #73734 soon with all the nice-to-haves deferred to a followup? I think that would be good enough for the docs tests. The only other solution would be to disable snippet testing for all the snippets added in #74975 that use async search.

/cc @jrodewig

@droberts195 droberts195 removed their assignment Jul 8, 2021
@cbuescher
Member

I see a lot of failures in yaml tests with this "unexpected warning header" today; it's one of the biggest sources of failures at the moment (although the overall frequency is bearable), so I'm +1 on getting #73734 in as soon as @williamrandolph thinks it's possible, to get the noise level down.

@jrodewig
Contributor

Thanks for the ping here. #73734 is a better long-term fix, but I've merged #75237 to mute snippet tests and reduce noise in the meantime.

@williamrandolph
Contributor

@droberts195 Thanks for the ping. I will bring PR #73734 up to date and merge it as soon as I can.

@dakrone
Member

dakrone commented Dec 10, 2021

@williamrandolph are you still planning to merge #73734 to address this?

@williamrandolph
Contributor

Yes, now that we've finished our features for 8.0, I need to take another pass at that PR. Thanks for the reminder.

@droberts195 droberts195 added the low-risk An open issue or test failure that is a low risk to future releases label Oct 9, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@joegallo joegallo self-assigned this Mar 8, 2024
joegallo added a commit to joegallo/elasticsearch that referenced this issue Mar 8, 2024
@joegallo
Contributor

joegallo commented Mar 8, 2024

I talked to @williamrandolph and between #94130 and #94191 we think this one's probably closable. I put up #106140 to revert #75237, and once that's merged I'm going to close this ticket.

joegallo added a commit that referenced this issue Mar 8, 2024
@joegallo joegallo closed this as completed Mar 8, 2024