[ML] Fix race condition between job opening and feature reset #74976

Merged
droberts195 merged 8 commits into elastic:master from droberts195:fix_upgrade_mode_test_again on Jul 7, 2021

Conversation

droberts195 (Contributor)

There was a point during the job opening sequence where performing
a feature reset could hang.

This happened when the kill request issued by feature reset was
executed after the job's persistent task was assigned but before
the job's native process was started. The persistent task was
incorrectly left running in this situation, yet the job opening
sequence was aborted, which meant the subsequent close request
issued by feature reset would wait for a very long time for the
persistent task to disappear.

The fix is to make the kill process request cancel the persistent
task consistently, based on the request's parameters rather than
on the current state of the task.

Fixes #74141
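
To make the intended behaviour concrete, here is a minimal, self-contained Java sketch of the decision described above. All names in it (KillRequest, PersistentTask, NativeProcess, markAsCompleted) are illustrative stand-ins rather than the real Elasticsearch ML classes; the only point it demonstrates is that cancelling the persistent task depends on the kill request's parameters, not on whether the native process happens to exist yet.

```java
// Illustrative sketch only - simplified stand-ins for the real classes.
final class KillProcessSketch {

    /** Hypothetical kill request; feature reset would set removeTask = true. */
    record KillRequest(boolean removeTask, boolean awaitCompletion) {}

    interface PersistentTask {
        void markAsCompleted(); // cancels/removes the persistent task
    }

    interface NativeProcess {
        void kill(boolean awaitCompletion);
    }

    static void killProcess(KillRequest request, PersistentTask task, NativeProcess processOrNull) {
        if (processOrNull != null) {
            // Normal case: the native process has already started, so kill it.
            processOrNull.kill(request.awaitCompletion());
        }
        // Cancel the task based purely on the request parameters. Tying this step to
        // processOrNull != null is what allowed the race: a kill arriving after the
        // persistent task was assigned but before the process started left the task
        // running, so the later close request waited for it to disappear.
        if (request.removeTask()) {
            task.markAsCompleted();
        }
    }
}
```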

elasticmachine added the Team:ML (Meta label for the ML team) label on Jul 6, 2021
elasticmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

droberts195 (Contributor, Author)

Although there are no explicit tests to prove this fix works, we do a feature reset at the end of every integration test, so it gets a lot of test coverage that way.

davidkyle (Member) left a comment

LGTM

droberts195 merged commit ace2988 into elastic:master on Jul 7, 2021
droberts195 deleted the fix_upgrade_mode_test_again branch on July 7, 2021 at 10:04
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Jul 7, 2021
Backport of elastic#74976
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Jul 7, 2021
Backport of elastic#74976
elasticsearchmachine pushed a commit that referenced this pull request Jul 7, 2021
Backport of #74976
elasticsearchmachine pushed a commit that referenced this pull request Jul 7, 2021
Backport of #74976
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Jul 8, 2021
This is a followup to elastic#74976.

The changes of elastic#74976 reverted many of the changes of elastic#71656
because elastic#74415 made them redundant. elastic#74415 did this by marking
killed jobs as closing so that the standard "job closed immediately
after open" functionality was used instead of reissuing the kill
immediately after opening. However, it turns out that this
"job closed immediately after open" functionality is not
perfect for the case of a job that is killed while it is opening.
It causes AutodetectCommunicator.close() to be called instead
of AutodetectCommunicator.killProcess(). Both do a lot of the
same things, but AutodetectCommunicator.close() finalizes
the job, and this can cause problems if the job is being killed
as part of a feature reset.

This change reinstates some of the functionality of elastic#71656
but in a different place that hopefully won't reintroduce the
problems that led to elastic#74415.

We can detect that a kill has happened early on during an
open or close operation by checking if the task's allocation
ID has been removed from the map after ProcessContext.setDying()
returns true. If ProcessContext.setDying() returns true, this
means the job has not been previously closed, so it must have
been killed. Then we can call AutodetectCommunicator.killProcess()
instead of AutodetectCommunicator.close() during the cleanup
that happens when we detect that a recently started process is
no longer wanted.

Relates elastic#75069
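
The detection logic in this follow-up can be illustrated with a rough, self-contained Java sketch. The types and the method below are simplified, hypothetical stand-ins for the real ProcessContext/AutodetectCommunicator plumbing; they only show the idea that setDying() returning true while the allocation ID is already gone from the map implies the job was killed, so the cleanup uses killProcess() rather than close().

```java
import java.util.Map;

// Illustrative sketch only - not the actual Elasticsearch implementation.
final class KilledWhileOpeningSketch {

    interface Communicator {
        void close();       // finalizes the job
        void killProcess(); // kills without finalizing the job
    }

    interface ProcessContext {
        /** Returns true if the job was not already dying/closing. */
        boolean setDying();
        Communicator communicator();
    }

    /**
     * Cleanup for a recently started process that is no longer wanted.
     * If setDying() reports the job was never closed, yet its allocation ID has
     * already been removed from the map, the job must have been killed, so the
     * process is killed rather than closed (closing would finalize the job).
     */
    static void cleanUpUnwantedProcess(long allocationId,
                                       Map<Long, ProcessContext> processByAllocation,
                                       ProcessContext context) {
        boolean notPreviouslyClosed = context.setDying();
        boolean removedFromMap = processByAllocation.containsKey(allocationId) == false;
        if (notPreviouslyClosed && removedFromMap) {
            context.communicator().killProcess();
        } else {
            context.communicator().close();
        }
    }
}
```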
elasticsearchmachine pushed a commit that referenced this pull request Jul 8, 2021
Relates #75069
elasticsearchmachine pushed a commit to elasticsearchmachine/elasticsearch that referenced this pull request Jul 8, 2021
Relates elastic#75069
elasticsearchmachine pushed a commit to elasticsearchmachine/elasticsearch that referenced this pull request Jul 8, 2021
Relates elastic#75069
elasticsearchmachine added a commit that referenced this pull request Jul 8, 2021
…5116)

Relates #75069

Co-authored-by: David Roberts <[email protected]>
elasticsearchmachine added a commit that referenced this pull request Jul 8, 2021
…5117)

Relates #75069

Co-authored-by: David Roberts <[email protected]>
Labels
>bug, :ml Machine learning, Team:ML Meta label for the ML team, v7.14.0, v7.15.0, v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

[CI] XPackRestIT test {p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled} failing
5 participants