[ML] Fix ML memory tracker lockup when inner step fails #44158
Conversation
When the ML memory tracker is refreshed and a refresh is already in progress, the idea is that the second and subsequent refresh requests receive the same response as the currently in-progress refresh. There was a bug: if a refresh failed, the ML memory tracker's view of whether a refresh was in progress was not reset, so every subsequent request was registered to receive a response that would never come. This change makes the ML memory tracker pass on failures as well as successes to all interested parties, and resets the list of interested parties so that further refresh attempts are possible after either a success or a failure. This fixes problem 1 of elastic#44156
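To make the described fix concrete, below is a minimal, hypothetical sketch of the listener-draining pattern this change applies. Only `fullRefreshCompletionListeners` and the `onFailure` loop are taken from the actual diff; the class and method names (`RefreshTracker`, `Listener`, `doRefreshAsync`) are simplified placeholders, not the real `MlMemoryTracker` code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified sketch of the pattern described in the PR text.
class RefreshTracker {

    interface Listener {
        void onResponse();
        void onFailure(Exception e);
    }

    // Listeners queued while a refresh is in flight; a non-empty list also
    // serves as the "refresh in progress" flag.
    private final List<Listener> fullRefreshCompletionListeners = new ArrayList<>();

    void refresh(Listener listener) {
        synchronized (fullRefreshCompletionListeners) {
            fullRefreshCompletionListeners.add(listener);
            if (fullRefreshCompletionListeners.size() > 1) {
                // A refresh is already running; this listener will be notified
                // with the same outcome when it completes.
                return;
            }
        }
        doRefreshAsync();
    }

    // Called when the underlying refresh succeeds.
    private void onRefreshSuccess() {
        synchronized (fullRefreshCompletionListeners) {
            for (Listener l : fullRefreshCompletionListeners) {
                l.onResponse();
            }
            // Clearing the list is what allows the next refresh to start.
            fullRefreshCompletionListeners.clear();
        }
    }

    // Called when the underlying refresh fails. Before the fix, the failure
    // path did not notify every queued listener and did not clear the list,
    // so later refresh requests queued behind a refresh that never completed.
    private void onRefreshFailure(Exception e) {
        synchronized (fullRefreshCompletionListeners) {
            for (Listener l : fullRefreshCompletionListeners) {
                l.onFailure(e);
            }
            fullRefreshCompletionListeners.clear();
        }
    }

    private void doRefreshAsync() {
        // Kick off the asynchronous refresh work (omitted in this sketch).
    }
}
```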
Pinging @elastic/ml-core
LGTM
run elasticsearch-ci/1
^ docs test failure
synchronized (fullRefreshCompletionListeners) {
    assert fullRefreshCompletionListeners.isEmpty() == false;
    for (ActionListener<Void> listener : fullRefreshCompletionListeners) {
        listener.onFailure(e);
I was going to comment on how we don't signal onCompletion any more, but then I saw line 286. This is definitely a sneaky bug.
Yes. As well as making it impossible to retry, the bug also meant that only one of the queued listeners got notified of failures. The others didn't receive any notification at all.