Skip to content

[ML] ResultsPersisterService may delay the node to close #65890

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
dimitris-athanasiou opened this issue Dec 4, 2020 · 3 comments · Fixed by #65904
Closed

[ML] ResultsPersisterService may delay the node to close #65890

dimitris-athanasiou opened this issue Dec 4, 2020 · 3 comments · Fixed by #65904
Assignees
Labels
>bug :ml Machine learning

Comments

@dimitris-athanasiou
Copy link
Contributor

Our ResultsPersisterService achieves scheduling the next retry by calling Thread.sleep. This is dodgy as it blocks a thread and may cause the node to delay closing down. This has manifested in a CI failure in #65710.

We should refactor it to use the thread pool and schedule the next retry instead of sleeping.

@dimitris-athanasiou dimitris-athanasiou added >bug needs:triage Requires assignment of a team area label labels Dec 4, 2020
@dimitris-athanasiou dimitris-athanasiou added :ml Machine learning and removed needs:triage Requires assignment of a team area label labels Dec 4, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@dimitris-athanasiou
Copy link
Contributor Author

We should also investigate reusing RetryableAction if possible.

benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Dec 4, 2020
Having a thread sleep in a recurring action may cause issues on node shutdown.
What if the thread is sleeping while a nice shutdown is occurring? Since these retry timeouts
can extend to a larger period of time, we should instead use scheduled tasks + the threadpool.
This allows the retries to be effectively canceled instead of waiting for a thread to wake back up.

closes elastic#65890
benwtrent added a commit that referenced this issue Dec 8, 2020
* [ML] remove thread sleep from results persister
Having a thread sleep in a recurring action may cause issues on node shutdown.
What if the thread is sleeping while a nice shutdown is occurring? Since these retry timeouts
can extend to a larger period of time, we should instead use scheduled tasks + the threadpool.
This allows the retries to be effectively canceled instead of waiting for a thread to wake back up.

closes #65890
benwtrent added a commit to benwtrent/elasticsearch that referenced this issue Dec 8, 2020
* [ML] remove thread sleep from results persister
Having a thread sleep in a recurring action may cause issues on node shutdown.
What if the thread is sleeping while a nice shutdown is occurring? Since these retry timeouts
can extend to a larger period of time, we should instead use scheduled tasks + the threadpool.
This allows the retries to be effectively canceled instead of waiting for a thread to wake back up.

closes elastic#65890
benwtrent added a commit that referenced this issue Dec 8, 2020
* [ML] remove thread sleep from results persister
Having a thread sleep in a recurring action may cause issues on node shutdown.
What if the thread is sleeping while a nice shutdown is occurring? Since these retry timeouts
can extend to a larger period of time, we should instead use scheduled tasks + the threadpool.
This allows the retries to be effectively canceled instead of waiting for a thread to wake back up.

closes #65890
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Dec 9, 2020
The test `testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown`
in `MlDistributedFailureIT` was failing due to the node timing out to
shut down. This was addressed in elastic#65904 and we should be able to unmute
the test.

Fixes elastic#65710
@dimitris-athanasiou
Copy link
Contributor Author

@benwtrent I hit another failure on this despite the fix that went in. It still looks like the results persister service keeps retrying despite the node closing. The failure is here https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-2/14332/testReport/junit/org.elasticsearch.xpack.ml.integration/MlDistributedFailureIT/testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown/.

Look for test-node-goes-down-while-running-job and notice how node_t3 keeps retrying to persist data_counts after we initiated shutdown for the node.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :ml Machine learning
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants