[ML] ResultsPersisterService may delay the node to close #65890

dimitris-athanasiou · 2020-12-04T14:53:46Z

Our ResultsPersisterService schedules the next retry by calling Thread.sleep. This is dodgy as it blocks a thread and may delay the node from shutting down. This has manifested as a CI failure in #65710.

We should refactor it to use the thread pool and schedule the next retry instead of sleeping.
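To illustrate the proposed direction, here is a minimal sketch of retrying through a scheduler rather than sleeping. It uses plain `java.util.concurrent` as a stand-in for the Elasticsearch thread pool; the class `ScheduledRetrySketch`, the method `tryWithRetries`, and the doubling backoff are all hypothetical and are not taken from ResultsPersisterService.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class ScheduledRetrySketch {

    /**
     * Runs one attempt. On failure, schedules the next attempt after delayMillis
     * (doubled each time) instead of sleeping, so no thread is blocked between tries.
     */
    static void tryWithRetries(ScheduledExecutorService scheduler, BooleanSupplier attempt,
                               int retriesLeft, long delayMillis, CompletableFuture<Boolean> result) {
        boolean succeeded;
        try {
            succeeded = attempt.getAsBoolean();
        } catch (RuntimeException e) {
            result.completeExceptionally(e);
            return;
        }
        if (succeeded) {
            result.complete(true);
            return;
        }
        if (retriesLeft == 0) {
            result.complete(false);
            return;
        }
        try {
            scheduler.schedule(
                () -> tryWithRetries(scheduler, attempt, retriesLeft - 1, delayMillis * 2, result),
                delayMillis, TimeUnit.MILLISECONDS);
        } catch (RejectedExecutionException e) {
            // The scheduler was shut down in the meantime: give up instead of retrying.
            result.complete(false);
        }
    }

    /** Hypothetical demo: the action fails twice, then succeeds. Returns the number of attempts. */
    static int demo() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger calls = new AtomicInteger();
        CompletableFuture<Boolean> result = new CompletableFuture<>();
        try {
            tryWithRetries(scheduler, () -> calls.incrementAndGet() >= 3, 5, 10, result);
            if (!result.get(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("retries exhausted");
            }
            return calls.get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("attempts: " + demo());
    }
}
```

The first attempt runs on the calling thread and later ones on the scheduler thread; a real implementation would need to decide which executor the work itself belongs on.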


elasticmachine · 2020-12-04T15:00:08Z

Pinging @elastic/ml-core (:ml)

dimitris-athanasiou · 2020-12-04T15:53:14Z

We should also investigate reusing RetryableAction if possible.

Having a thread sleep in a recurring action may cause issues on node shutdown, because the thread may be asleep while a graceful shutdown is in progress. Since these retry timeouts can grow long, we should instead use scheduled tasks and the thread pool. This allows the retries to be canceled outright instead of waiting for a thread to wake back up. closes elastic#65890
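The cancellation benefit described above can be sketched as follows. This is only an illustration using `java.util.concurrent.ScheduledThreadPoolExecutor`, not the code from the linked change; the class name `CancellableRetrySketch` and the 60-second delay are assumptions chosen for the example.

```java
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class CancellableRetrySketch {

    /**
     * Schedules a retry far in the future, then simulates shutdown by cancelling it.
     * Returns true if the retry was cancelled, never ran, and the pool stopped promptly.
     */
    static boolean demo() {
        ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);
        // Drop cancelled tasks from the queue immediately so they cannot hold up shutdown.
        scheduler.setRemoveOnCancelPolicy(true);

        AtomicBoolean retried = new AtomicBoolean(false);
        // A pending retry with a long backoff; a sleeping thread would block for this long.
        ScheduledFuture<?> pendingRetry =
            scheduler.schedule(() -> retried.set(true), 60, TimeUnit.SECONDS);

        // Simulated shutdown: cancel the pending retry instead of waiting for it.
        boolean cancelled = pendingRetry.cancel(false);
        scheduler.shutdown();

        boolean terminated;
        try {
            terminated = scheduler.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            terminated = false;
        }
        return cancelled && terminated && !retried.get();
    }

    public static void main(String[] args) {
        System.out.println("shut down promptly: " + demo());
    }
}
```

With `Thread.sleep`, the equivalent shutdown would have to interrupt the sleeping thread or wait out the full delay.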

The test `testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown` in `MlDistributedFailureIT` was failing because the node timed out while shutting down. This was addressed in elastic#65904, so we should be able to unmute the test. Fixes elastic#65710

dimitris-athanasiou · 2020-12-09T10:37:51Z

@benwtrent I hit another failure on this despite the fix that went in. It still looks like the results persister service keeps retrying despite the node closing. The failure is here https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-2/14332/testReport/junit/org.elasticsearch.xpack.ml.integration/MlDistributedFailureIT/testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown/.

Look for test-node-goes-down-while-running-job and notice how node_t3 keeps retrying to persist data_counts after we initiated shutdown for the node.

dimitris-athanasiou added the >bug and needs:triage labels Dec 4, 2020

dimitris-athanasiou assigned benwtrent Dec 4, 2020

dimitris-athanasiou added the :ml label and removed the needs:triage label Dec 4, 2020

benwtrent mentioned this issue Dec 4, 2020

[ML] remove thread sleep from results persister #65904

Merged

benwtrent closed this as completed in #65904 Dec 8, 2020
