Skip to content

[CI] ClassificationIT and RegressionIT testStopAndRestart failing on Windows #70698

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
mark-vieira opened this issue Mar 22, 2021 · 14 comments · Fixed by #70918 or #72412
Closed

[CI] ClassificationIT and RegressionIT testStopAndRestart failing on Windows #70698

mark-vieira opened this issue Mar 22, 2021 · 14 comments · Fixed by #70918 or #72412
Assignees
Labels
:ml Machine learning Team:ML Meta label for the ML team >test-failure Triaged test failures from CI

Comments

@mark-vieira
Copy link
Contributor

This failed back to back on Windows.

Build scan:
https://gradle-enterprise.elastic.co/s/lkoaua3founn4/tests/:x-pack:plugin:ml:qa:native-multi-node-tests:javaRestTest/org.elasticsearch.xpack.ml.integration.ClassificationIT/testStopAndRestart

Reproduction line:
EPRODUCE WITH: gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:javaRestTest' --tests "org.elasticsearch.xpack.ml.integration.ClassificationIT.testStopAndRestart" -Dtests.seed=C0225758D7D53E28 -Dtests.security.manager=true -Dtests.locale=et-EE -Dtests.timezone=US/Arizona -Druntime.java=11

Applicable branches:
master

Reproduces locally?:
Didn't try

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.ml.integration.ClassificationIT&tests.test=testStopAndRestart

Failure excerpt:

  java.lang.AssertionError: Stats were: {"id":"classification_stop_and_restart","state":"started","progress":[{"phase":"reindexing","progress_percent":100},{"phase":"loading_data","progress_percent":100},{"phase":"feature_selection","progress_percent":0},{"phase":"coarse_parameter_search","progress_percent":0},{"phase":"fine_tuning_parameters","progress_percent":0},{"phase":"final_training","progress_percent":0},{"phase":"writing_results","progress_percent":0},{"phase":"inference","progress_percent":0}],"data_counts":{"training_docs_count":350,"test_docs_count":0,"skipped_docs_count":0},"memory_usage":{"timestamp":1616450803027,"peak_usage_bytes":28542,"status":"ok"},"node":{"id":"4ltUeffQSGGFxbYNiQprQw","name":"javaRestTest-2","ephemeral_id":"5knpfoKGTg2kwsU0EEL8aw","transport_address":"127.0.0.1:53204","attributes":{"testattr":"test","ml.machine_memory":"103077892096","ml.max_open_jobs":"20","xpack.installed":"true","ml.max_jvm_size":"518979584"}},"assignment_explanation":""}
  Expected: <stopped>
       but: was <started>

  at __randomizedtesting.SeedInfo.seed([C0225758D7D53E28:B76AD5A6C33F3888]:0)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
  at org.junit.Assert.assertThat(Assert.java:956)
  at org.elasticsearch.xpack.ml.integration.MlNativeDataFrameAnalyticsIntegTestCase.assertIsStopped(MlNativeDataFrameAnalyticsIntegTestCase.java:224)
  at org.elasticsearch.xpack.ml.integration.MlNativeDataFrameAnalyticsIntegTestCase.lambda$waitUntilAnalyticsIsStopped$0(MlNativeDataFrameAnalyticsIntegTestCase.java:160)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:958)
  at org.elasticsearch.xpack.ml.integration.MlNativeDataFrameAnalyticsIntegTestCase.waitUntilAnalyticsIsStopped(MlNativeDataFrameAnalyticsIntegTestCase.java:160)
  at org.elasticsearch.xpack.ml.integration.MlNativeDataFrameAnalyticsIntegTestCase.waitUntilAnalyticsIsStopped(MlNativeDataFrameAnalyticsIntegTestCase.java:156)
  at org.elasticsearch.xpack.ml.integration.ClassificationIT.testStopAndRestart(ClassificationIT.java:508)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:566)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1758)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:946)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:982)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:824)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:475)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:955)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:840)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:891)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:902)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:375)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.lambda$forkTimeoutingTask$0(ThreadLeakControl.java:831)
  at java.lang.Thread.run(Thread.java:834)

@mark-vieira mark-vieira added :ml Machine learning >test-failure Triaged test failures from CI labels Mar 22, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Mar 22, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@mark-vieira mark-vieira changed the title [CI] ClassificationIT testStopAndRestart failing [CI] ClassificationIT testStopAndRestart failing on Windows Mar 22, 2021
@mark-vieira
Copy link
Contributor Author

This continues to fail, should we mute? I'm not sure if we have a good pattern for conditionally muting something on a given platform.

@dimitris-athanasiou
Copy link
Contributor

I will add some logging. Can we please hold off muting until we get some logging for the failures?

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Mar 26, 2021
When a data frame analytics job is stopped because of
a call to the _stop API, if the process is running it is killed.
Depending on the OS, it may take some time to delete all the
used named pipes. This means that in a scenario where the job
is restarted immediately after it is possible that the old
named pipes are used which results to the new process not
properly communicating with java.

This has been the underlying issue of elastic#70698 and elastic#67581.

This commit fixes it by using unique identifiers for the named
pipes.

Closes elastic#70698
dimitris-athanasiou added a commit that referenced this issue Mar 26, 2021
When a data frame analytics job is stopped because of
a call to the _stop API, if the process is running it is killed.
Depending on the OS, it may take some time to delete all the
used named pipes. This means that in a scenario where the job
is restarted immediately after it is possible that the old
named pipes are used which results to the new process not
properly communicating with java.

This has been the underlying issue of #70698 and #67581.

This commit fixes it by using unique identifiers for the named
pipes.

Closes #70698
dimitris-athanasiou added a commit that referenced this issue Mar 26, 2021
…70926)

When a data frame analytics job is stopped because of
a call to the _stop API, if the process is running it is killed.
Depending on the OS, it may take some time to delete all the
used named pipes. This means that in a scenario where the job
is restarted immediately after it is possible that the old
named pipes are used which results to the new process not
properly communicating with java.

This has been the underlying issue of #70698 and #67581.

This commit fixes it by using unique identifiers for the named
pipes.

Closes #70698

Backport of #70918
@mark-vieira
Copy link
Contributor Author

Awesome, thanks for the fix. It's funny how many weird timing issues we catch on Windows because everything generally runs slower. I suppose that's a feature not a bug.

@dimitris-athanasiou
Copy link
Contributor

Unfortunately we got another failure of this after using uniquely named pipes.

https://gradle-enterprise.elastic.co/s/eep2yodligwhg

I will further investigate.

@mark-vieira
Copy link
Contributor Author

@dnhatn muted this in 7.x with 29d10ac.

@mark-vieira
Copy link
Contributor Author

This is still pretty consistently failing on Windows on master as well so I've muted there with 38189bb.

@dnhatn
Copy link
Member

dnhatn commented Mar 30, 2021

Another instance on 7.x and Windows: https://gradle-enterprise.elastic.co/s/ydvmuwliqbu72

@dimitris-athanasiou dimitris-athanasiou changed the title [CI] ClassificationIT testStopAndRestart failing on Windows [CI] ClassificationIT and RegressionIT testStopAndRestart failing on Windows Mar 31, 2021
@dimitris-athanasiou
Copy link
Contributor

dimitris-athanasiou commented Mar 31, 2021

Thanks for the mutes @mark-vieira and @dnhatn. This test has been a pain on Windows. Back to the drawing board.

@davidkyle
Copy link
Member

@dimitris-athanasiou ClassificationIT failed again with your logging changes https://gradle-enterprise.elastic.co/s/7xuuvxnu3apbc

@davidkyle
Copy link
Member

@cbuescher
Copy link
Member

Another one on master: https://gradle-enterprise.elastic.co/s/g3wok27dfhk2m

@cbuescher
Copy link
Member

@dimitris-athanasiou can you advise if we should mute this again on master or if you are still looking for clues via logging?

@dimitris-athanasiou
Copy link
Contributor

@cbuescher I have already pushed the mute.

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Apr 28, 2021
When we flush the input stream the java side writes
enough spaces to fill the input stream buffer.
However, in the case of data frame analytics, this may
cause the job to freeze. The reason is that java
writes the data and flushes in the same thread that
goes on to then restore the state. However, when
c++ reads in the end-of-data control message, it stops
reading from the stream and goes on to perform the analysis.
If the 8KB of spaces do not fit in the OS buffer for the
names pipe, the java side blocks. It never proceeds with
restoring the state and this causes a job that is
being restarted and has state to freeze.

In elastic/ml-cpp#1881 the buffer has been reduced to 2KB.
This means the buffer is smaller than the buffer of all supported OS.
Note that it is 4KB on Windows.

Thus in this commit we also reduce the number of spaces we write
in order to flush the buffer to match that of the buffer size.

Closes elastic#70698
dimitris-athanasiou added a commit that referenced this issue Apr 29, 2021
When we flush the input stream the java side writes
enough spaces to fill the input stream buffer.
However, in the case of data frame analytics, this may
cause the job to freeze. The reason is that java
writes the data and flushes in the same thread that
goes on to then restore the state. However, when
c++ reads in the end-of-data control message, it stops
reading from the stream and goes on to perform the analysis.
If the 8KB of spaces do not fit in the OS buffer for the
names pipe, the java side blocks. It never proceeds with
restoring the state and this causes a job that is
being restarted and has state to freeze.

In elastic/ml-cpp#1881 the buffer has been reduced to 2KB.
This means the buffer is smaller than the buffer of all supported OS.
Note that it is 4KB on Windows.

Thus in this commit we also reduce the number of spaces we write
in order to flush the buffer to match that of the buffer size.

Closes #70698
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Apr 29, 2021
When we flush the input stream the java side writes
enough spaces to fill the input stream buffer.
However, in the case of data frame analytics, this may
cause the job to freeze. The reason is that java
writes the data and flushes in the same thread that
goes on to then restore the state. However, when
c++ reads in the end-of-data control message, it stops
reading from the stream and goes on to perform the analysis.
If the 8KB of spaces do not fit in the OS buffer for the
names pipe, the java side blocks. It never proceeds with
restoring the state and this causes a job that is
being restarted and has state to freeze.

In elastic/ml-cpp#1881 the buffer has been reduced to 2KB.
This means the buffer is smaller than the buffer of all supported OS.
Note that it is 4KB on Windows.

Thus in this commit we also reduce the number of spaces we write
in order to flush the buffer to match that of the buffer size.

Closes elastic#70698

Backport of elastic#72412
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Apr 29, 2021
When we flush the input stream the java side writes
enough spaces to fill the input stream buffer.
However, in the case of data frame analytics, this may
cause the job to freeze. The reason is that java
writes the data and flushes in the same thread that
goes on to then restore the state. However, when
c++ reads in the end-of-data control message, it stops
reading from the stream and goes on to perform the analysis.
If the 8KB of spaces do not fit in the OS buffer for the
names pipe, the java side blocks. It never proceeds with
restoring the state and this causes a job that is
being restarted and has state to freeze.

In elastic/ml-cpp#1881 the buffer has been reduced to 2KB.
This means the buffer is smaller than the buffer of all supported OS.
Note that it is 4KB on Windows.

Thus in this commit we also reduce the number of spaces we write
in order to flush the buffer to match that of the buffer size.

Closes elastic#70698

Backport of elastic#72412
dimitris-athanasiou added a commit that referenced this issue Apr 29, 2021
When we flush the input stream the java side writes
enough spaces to fill the input stream buffer.
However, in the case of data frame analytics, this may
cause the job to freeze. The reason is that java
writes the data and flushes in the same thread that
goes on to then restore the state. However, when
c++ reads in the end-of-data control message, it stops
reading from the stream and goes on to perform the analysis.
If the 8KB of spaces do not fit in the OS buffer for the
names pipe, the java side blocks. It never proceeds with
restoring the state and this causes a job that is
being restarted and has state to freeze.

In elastic/ml-cpp#1881 the buffer has been reduced to 2KB.
This means the buffer is smaller than the buffer of all supported OS.
Note that it is 4KB on Windows.

Thus in this commit we also reduce the number of spaces we write
in order to flush the buffer to match that of the buffer size.

Closes #70698

Backport of #72412
dimitris-athanasiou added a commit that referenced this issue Apr 29, 2021
When we flush the input stream the java side writes
enough spaces to fill the input stream buffer.
However, in the case of data frame analytics, this may
cause the job to freeze. The reason is that java
writes the data and flushes in the same thread that
goes on to then restore the state. However, when
c++ reads in the end-of-data control message, it stops
reading from the stream and goes on to perform the analysis.
If the 8KB of spaces do not fit in the OS buffer for the
names pipe, the java side blocks. It never proceeds with
restoring the state and this causes a job that is
being restarted and has state to freeze.

In elastic/ml-cpp#1881 the buffer has been reduced to 2KB.
This means the buffer is smaller than the buffer of all supported OS.
Note that it is 4KB on Windows.

Thus in this commit we also reduce the number of spaces we write
in order to flush the buffer to match that of the buffer size.

Closes #70698

Backport of #72412
dimitris-athanasiou added a commit that referenced this issue Apr 29, 2021
When we flush the input stream the java side writes
enough spaces to fill the input stream buffer.
However, in the case of data frame analytics, this may
cause the job to freeze. The reason is that java
writes the data and flushes in the same thread that
goes on to then restore the state. However, when
c++ reads in the end-of-data control message, it stops
reading from the stream and goes on to perform the analysis.
If the 8KB of spaces do not fit in the OS buffer for the
names pipe, the java side blocks. It never proceeds with
restoring the state and this causes a job that is
being restarted and has state to freeze.

In elastic/ml-cpp#1881 the buffer has been reduced to 2KB.
This means the buffer is smaller than the buffer of all supported OS.
Note that it is 4KB on Windows.

Thus in this commit we also reduce the number of spaces we write
in order to flush the buffer to match that of the buffer size.

Closes #70698

Backport of #72412
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning Team:ML Meta label for the ML team >test-failure Triaged test failures from CI
Projects
None yet
6 participants