[ML] ResultsPersisterService may delay the node to close #65890
Comments
Pinging @elastic/ml-core (:ml) |
We should also investigate reusing |
Having a thread sleep in a recurring action may cause issues on node shutdown: the thread may be asleep while a graceful shutdown is in progress. Since these retry timeouts can grow quite long, we should instead use scheduled tasks + the threadpool. This allows pending retries to be cancelled instead of waiting for a thread to wake back up. closes elastic#65890
* [ML] remove thread sleep from results persister. Having a thread sleep in a recurring action may cause issues on node shutdown: the thread may be asleep while a graceful shutdown is in progress. Since these retry timeouts can grow quite long, we should instead use scheduled tasks + the threadpool. This allows pending retries to be cancelled instead of waiting for a thread to wake back up. closes #65890
The test `testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown` in `MlDistributedFailureIT` was failing because the node timed out while shutting down. This was addressed in elastic#65904, so we should be able to unmute the test. Fixes elastic#65710
@benwtrent I hit another failure on this despite the fix that went in. It still looks like the results persister service keeps retrying despite the node closing. The failure is here https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-2/14332/testReport/junit/org.elasticsearch.xpack.ml.integration/MlDistributedFailureIT/testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown/. Look for |
Our `ResultsPersisterService` schedules the next retry by calling `Thread.sleep`. This is dodgy as it blocks a thread and may delay the node closing down. This has manifested in a CI failure in #65710. We should refactor it to use the thread pool and schedule the next retry instead of sleeping.
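To illustrate the idea, here is a minimal sketch of a retry loop that schedules its next attempt instead of sleeping. It is not the actual `ResultsPersisterService` code: it uses the JDK's `ScheduledExecutorService` as a stand-in for the Elasticsearch thread pool, and `ScheduledRetry`, `start`, `cancel` and `attemptCount` are hypothetical names made up for this example.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration only; not the real ResultsPersisterService.
public class ScheduledRetry {
    private final ScheduledExecutorService scheduler;
    private final BooleanSupplier action; // returns true once the action has succeeded
    private final int maxAttempts;
    private final long delayMillis;
    private final AtomicInteger attempts = new AtomicInteger();
    private volatile ScheduledFuture<?> pending;
    private volatile boolean cancelled;

    public ScheduledRetry(ScheduledExecutorService scheduler, BooleanSupplier action,
                          int maxAttempts, long delayMillis) {
        this.scheduler = scheduler;
        this.action = action;
        this.maxAttempts = maxAttempts;
        this.delayMillis = delayMillis;
    }

    // Run the first attempt on the scheduler; no thread ever sleeps between attempts.
    public void start() {
        scheduler.execute(this::attempt);
    }

    // Stop retrying: mark as cancelled and cancel the pending scheduled attempt, if any.
    public void cancel() {
        cancelled = true;
        ScheduledFuture<?> future = pending;
        if (future != null) {
            future.cancel(false);
        }
    }

    public int attemptCount() {
        return attempts.get();
    }

    private void attempt() {
        if (cancelled) {
            return;
        }
        int made = attempts.incrementAndGet();
        if (action.getAsBoolean() || made >= maxAttempts) {
            return; // succeeded, or out of attempts
        }
        // Schedule the next attempt instead of calling Thread.sleep.
        ScheduledFuture<?> next = scheduler.schedule(this::attempt, delayMillis, TimeUnit.MILLISECONDS);
        pending = next;
        if (cancelled) {
            next.cancel(false); // cancel() raced with scheduling; do not leave the task queued
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // The action always fails, so a retry is scheduled 60 seconds out.
        ScheduledRetry retry = new ScheduledRetry(scheduler, () -> false, 10, 60_000L);
        retry.start();
        Thread.sleep(100);
        retry.cancel();          // e.g. called when the node starts closing
        scheduler.shutdownNow();
        System.out.println("terminated: " + scheduler.awaitTermination(1, TimeUnit.SECONDS));
    }
}
```

With this shape, shutting down only needs to cancel the pending future. With `Thread.sleep`, the sleeping thread would have to be interrupted, or the node would wait for it to wake up.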