[CI] ClassificationIT and RegressionIT testStopAndRestart failing on Windows #70698
Comments
Pinging @elastic/ml-core (Team:ML)
This continues to fail. Should we mute? I'm not sure if we have a good pattern for conditionally muting something on a given platform.
I will add some logging. Can we please hold off muting until we get some logging for the failures?
When a data frame analytics job is stopped by a call to the _stop API, the process is killed if it is running. Depending on the OS, it may take some time to delete all the named pipes it used. If the job is restarted immediately afterwards, the old named pipes may be reused, which results in the new process not communicating properly with Java. This has been the underlying issue of elastic#70698 and elastic#67581. This commit fixes it by using unique identifiers for the named pipes. Closes elastic#70698
…70926) Backport of #70918: uses unique identifiers for the named pipes so that a job restarted immediately after a stop does not pick up the old pipes. Closes #70698
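For readers following the thread, the idea of unique pipe identifiers can be sketched roughly as below. This is a minimal illustration and not the code from the commit: the class `UniquePipeNames`, the method `pipeName`, the counter-based suffix, and the `analytics_` prefix are all hypothetical names chosen for the example.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: every process launch gets pipe names that were never used before,
// so a restarted job cannot connect to pipes left behind by the killed process.
final class UniquePipeNames {

    // Assumption: a per-JVM counter is enough to make names unique for this illustration.
    private static final AtomicLong LAUNCH_COUNTER = new AtomicLong();

    /** Returns a fresh identifier for one process launch. */
    static long nextLaunchId() {
        return LAUNCH_COUNTER.incrementAndGet();
    }

    /** Builds a pipe name such as "analytics_my-job_1_input". */
    static String pipeName(String prefix, String jobId, long launchId, String purpose) {
        return prefix + jobId + "_" + launchId + "_" + purpose;
    }

    public static void main(String[] args) {
        long firstLaunch = nextLaunchId();
        long secondLaunch = nextLaunchId();
        // Same job, two launches: the input pipe names differ.
        System.out.println(pipeName("analytics_", "my-job", firstLaunch, "input"));
        System.out.println(pipeName("analytics_", "my-job", secondLaunch, "input"));
    }
}
```

All pipes belonging to one launch share the same launch id, while a restart of the same job draws a new id and therefore never reuses a name that the OS may still be cleaning up.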
Awesome, thanks for the fix. It's funny how many weird timing issues we catch on Windows because everything generally runs slower. I suppose that's a feature not a bug.
Unfortunately we got another failure of this after using uniquely named pipes. https://gradle-enterprise.elastic.co/s/eep2yodligwhg I will further investigate.
This is still pretty consistently failing on Windows on
Another instance on 7.x and Windows: https://gradle-enterprise.elastic.co/s/ydvmuwliqbu72
Thanks for the mutes @mark-vieira and @dnhatn. This test has been a pain on Windows. Back to the drawing board.
@dimitris-athanasiou ClassificationIT failed again with your logging changes https://gradle-enterprise.elastic.co/s/7xuuvxnu3apbc
Another one on master: https://gradle-enterprise.elastic.co/s/g3wok27dfhk2m
@dimitris-athanasiou can you advise if we should mute this again on master or if you are still looking for clues via logging?
@cbuescher I have already pushed the mute.
When we flush the input stream, the Java side writes enough spaces to fill the input stream buffer. However, in the case of data frame analytics, this may cause the job to freeze. The reason is that Java writes the data and flushes in the same thread that then goes on to restore the state. However, when C++ reads the end-of-data control message, it stops reading from the stream and goes on to perform the analysis. If the 8KB of spaces do not fit in the OS buffer for the named pipe, the Java side blocks. It never proceeds with restoring the state, and this causes a job that is being restarted and has state to freeze. In elastic/ml-cpp#1881 the buffer has been reduced to 2KB. This means the buffer is smaller than the OS named pipe buffer on all supported operating systems. Note that it is 4KB on Windows. Thus in this commit we also reduce the number of spaces we write to flush the buffer so that it matches the buffer size. Closes elastic#70698
Backport of elastic#72412: reduces the number of spaces written to flush the input stream so that it matches the 2KB buffer from elastic/ml-cpp#1881. Closes elastic#70698
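To make the flush change concrete, here is a rough sketch of padding a stream with a fixed number of spaces sized to the reader's buffer. It is not the actual Elasticsearch code: the class `FlushPadding`, the constant `FLUSH_SPACES_LENGTH`, and the method `writeFlushPadding` are hypothetical, and the 2048 value is taken from the 2KB figure quoted in the commit message.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical sketch: pad with exactly as many spaces as the reader's buffer holds.
final class FlushPadding {

    // Assumption: matches the 2KB reader buffer mentioned for elastic/ml-cpp#1881.
    // Keeping this below the OS pipe buffer (4KB on Windows per the commit message)
    // means the write can complete even if the reader has stopped reading.
    static final int FLUSH_SPACES_LENGTH = 2048;

    /** Writes the padding spaces and flushes the stream. */
    static void writeFlushPadding(OutputStream out) {
        byte[] spaces = new byte[FLUSH_SPACES_LENGTH];
        Arrays.fill(spaces, (byte) ' ');
        try {
            out.write(spaces);
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // An in-memory stream stands in for the named pipe in this illustration.
        ByteArrayOutputStream pipe = new ByteArrayOutputStream();
        writeFlushPadding(pipe);
        System.out.println("padding bytes written: " + pipe.size());
    }
}
```

With 8KB of padding and a 4KB pipe buffer, the equivalent write would block once the reader stopped consuming, which is the freeze described above; sizing the padding to the reader's buffer avoids that in this sketch.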
This failed back to back on Windows.
Build scan:
https://gradle-enterprise.elastic.co/s/lkoaua3founn4/tests/:x-pack:plugin:ml:qa:native-multi-node-tests:javaRestTest/org.elasticsearch.xpack.ml.integration.ClassificationIT/testStopAndRestart
Reproduction line:
REPRODUCE WITH: gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:javaRestTest' --tests "org.elasticsearch.xpack.ml.integration.ClassificationIT.testStopAndRestart" -Dtests.seed=C0225758D7D53E28 -Dtests.security.manager=true -Dtests.locale=et-EE -Dtests.timezone=US/Arizona -Druntime.java=11
Applicable branches:
master
Reproduces locally?:
Didn't try
Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.ml.integration.ClassificationIT&tests.test=testStopAndRestart
Failure excerpt: