[ML] Fix ML memory tracker lockup when inner step fails #44158


Merged

Conversation

droberts195
Contributor

When the ML memory tracker is asked to refresh while a refresh is
already in progress, the intention is that the second and
subsequent refresh requests receive the same response as
the in-progress refresh.

There was a bug whereby, if a refresh failed, the ML
memory tracker's view of whether a refresh was in progress
was not reset, so every subsequent request was
registered to receive a response that would never come.

This change makes the ML memory tracker pass on failures
as well as successes to all interested parties, and resets
the list of interested parties so that further refresh
attempts are possible after either a success or a failure.

This fixes problem 1 of #44156
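
For illustration, a minimal sketch of the queuing pattern and the fixed notification path described above. All names here (RefreshCoordinator, Listener, refreshCompletionListeners) are hypothetical; this is the shape of the fix, not the actual MlMemoryTracker code.

import java.util.ArrayList;
import java.util.List;

// Minimal sketch only; all names are hypothetical, not the MlMemoryTracker API.
class RefreshCoordinator {

    interface Listener {
        void onResponse(Void ignored);
        void onFailure(Exception e);
    }

    // Callers queued while a refresh is already in progress.
    private final List<Listener> refreshCompletionListeners = new ArrayList<>();

    void refresh(Listener listener) {
        synchronized (refreshCompletionListeners) {
            refreshCompletionListeners.add(listener);
            // If another refresh was already running, this caller simply
            // waits for that refresh's outcome.
            if (refreshCompletionListeners.size() > 1) {
                return;
            }
        }
        try {
            doRefresh(); // placeholder for the real refresh work
            notifyListeners(null);
        } catch (Exception e) {
            // The fix: failures, like successes, reach every queued listener,
            // and the queue is reset so later refresh attempts can proceed.
            notifyListeners(e);
        }
    }

    private void doRefresh() throws Exception {
        // the actual refresh of per-job memory requirements would happen here
    }

    private void notifyListeners(Exception failure) {
        List<Listener> toNotify;
        synchronized (refreshCompletionListeners) {
            toNotify = new ArrayList<>(refreshCompletionListeners);
            refreshCompletionListeners.clear();
        }
        for (Listener l : toNotify) {
            if (failure == null) {
                l.onResponse(null);
            } else {
                l.onFailure(failure);
            }
        }
    }
}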

@elasticmachine
Collaborator

Pinging @elastic/ml-core

@davidkyle (Member) left a comment


LGTM

@davidkyle
Member

run elasticsearch-ci/1

^ docs test failure

@droberts195 added the v7.2.2 label and removed the v7.2.1 label Jul 10, 2019
synchronized (fullRefreshCompletionListeners) {
    assert fullRefreshCompletionListeners.isEmpty() == false;
    for (ActionListener<Void> listener : fullRefreshCompletionListeners) {
        listener.onFailure(e);
Member


I was going to comment on how we don't signal onCompletion any more, but then I saw line 286. This is definitely a sneaky bug.

droberts195 (Contributor, Author)


Yes, as well as making it impossible to retry, the bug also meant that only one of the queued listeners was notified of failures. The others didn't receive any notification at all.
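
For illustration, a toy reproduction of the lockup described here (hypothetical names, not the real code): the first caller's listener hears about the failure, but the listeners queued behind it never hear anything, and because the queue is never reset every later refresh call waits forever.

import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy model of the pre-fix behaviour; names are illustrative only.
class BuggyTracker {
    // callbacks queued behind the refresh that is "in progress"
    private final List<Consumer<Exception>> waiting = new ArrayList<>();

    void refresh(Consumer<Exception> onDone) {
        synchronized (waiting) {
            waiting.add(onDone);
            if (waiting.size() > 1) {
                return; // a refresh is in progress: wait for its outcome
            }
        }
        Exception failure = new Exception("inner step failed");
        onDone.accept(failure); // bug: only this caller is notified; 'waiting'
                                // is never cleared, so every later refresh()
                                // returns early and its callback never fires
    }
}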

@droberts195 merged commit 7ae57d6 into elastic:master Jul 10, 2019
@droberts195 deleted the fix_ml_memory_tracker_on_failure branch July 10, 2019 14:46
droberts195 added a commit that referenced this pull request Jul 10, 2019
droberts195 added a commit that referenced this pull request Jul 10, 2019
droberts195 added a commit that referenced this pull request Jul 10, 2019
droberts195 added a commit that referenced this pull request Jul 10, 2019