Skip to content

[ML] Fix race condition between job open, close and kill #75113

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

droberts195
Copy link
Contributor

This is a followup to #74976.

The changes of #74976 reverted many of the changes of #71656
because #74415 made them redundant. #74415 did this by making
killed jobs as closing so that the standard "job closed immediately
after open" functionality was used instead of reissuing the kill
immediately after opening. However, it turns out that this
"job closed immediately after open" functionality is not
perfect for the case of a job that is killed while it is opening.
It causes AutodetectCommunicator.close() to be called instead
of AutodetectCommunicator.killProcess(). Both do a lot of the
same things, but AutodetectCommunicator.close() finalizes
the job, and this can cause problems if the job is being killed
as part of a feature reset.

This change reinstates some of the functionality of #71656
but in a different place that hopefully won't reintroduce the
problems that led to #74415.

We can detect that a kill has happened early on during an
open or close operation by checking if the task's allocation
ID has been removed from the map after ProcessContext.setDying()
returns true. If ProcessContext.setDying() returns true this
means the job has not been previously closed, so it must have
been killed. Then we can call AutodetectCommunicator.killProcess()
instead of AutodetectCommunicator.close() during the cleanup
that happens when we detect that a recently started process is
no longer wanted.

Relates #75069

This is a followup to elastic#74976.

The changes of elastic#74976 reverted many of the changes of elastic#71656
because elastic#74415 made them redundant. elastic#74415 did this by making
killed jobs as closing so that the standard "job closed immediately
after open" functionality was used instead of reissuing the kill
immediately after opening. However, it turns out that this
"job closed immediately after open" functionality is not
perfect for the case of a job that is killed while it is opening.
It causes AutodetectCommunicator.close() to be called instead
of AutodetectCommunicator.killProcess(). Both do a lot of the
same things, but AutodetectCommunicator.close() finalizes
the job, and this can cause problems if the job is being killed
as part of a feature reset.

This change reinstates some of the functionality of elastic#71656
but in a different place that hopefully won't reintroduce the
problems that led to elastic#74415.

We can detect that a kill has happened early on during an
open or close operation by checking if the task's allocation
ID has been removed from the map after ProcessContext.setDying()
returns true. If ProcessContext.setDying() returns true this
means the job has not been previously closed, so it must have
been killed. Then we can call AutodetectCommunicator.killProcess()
instead of AutodetectCommunicator.close() during the cleanup
that happens when we detect that a recently started process is
no longer wanted.

Relates elastic#75069
@droberts195 droberts195 added >bug :ml Machine learning v8.0.0 auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) auto-backport Automatically create backport pull requests when merged v7.14.1 v7.15.0 labels Jul 8, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Jul 8, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

Copy link
Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@elasticsearchmachine elasticsearchmachine merged commit 0f493c3 into elastic:master Jul 8, 2021
@droberts195 droberts195 deleted the another_close_kill_race branch July 8, 2021 10:19
elasticsearchmachine pushed a commit to elasticsearchmachine/elasticsearch that referenced this pull request Jul 8, 2021
This is a followup to elastic#74976.

The changes of elastic#74976 reverted many of the changes of elastic#71656
because elastic#74415 made them redundant. elastic#74415 did this by making
killed jobs as closing so that the standard "job closed immediately
after open" functionality was used instead of reissuing the kill
immediately after opening. However, it turns out that this
"job closed immediately after open" functionality is not
perfect for the case of a job that is killed while it is opening.
It causes AutodetectCommunicator.close() to be called instead
of AutodetectCommunicator.killProcess(). Both do a lot of the
same things, but AutodetectCommunicator.close() finalizes
the job, and this can cause problems if the job is being killed
as part of a feature reset.

This change reinstates some of the functionality of elastic#71656
but in a different place that hopefully won't reintroduce the
problems that led to elastic#74415.

We can detect that a kill has happened early on during an
open or close operation by checking if the task's allocation
ID has been removed from the map after ProcessContext.setDying()
returns true. If ProcessContext.setDying() returns true this
means the job has not been previously closed, so it must have
been killed. Then we can call AutodetectCommunicator.killProcess()
instead of AutodetectCommunicator.close() during the cleanup
that happens when we detect that a recently started process is
no longer wanted.

Relates elastic#75069
elasticsearchmachine pushed a commit to elasticsearchmachine/elasticsearch that referenced this pull request Jul 8, 2021
This is a followup to elastic#74976.

The changes of elastic#74976 reverted many of the changes of elastic#71656
because elastic#74415 made them redundant. elastic#74415 did this by making
killed jobs as closing so that the standard "job closed immediately
after open" functionality was used instead of reissuing the kill
immediately after opening. However, it turns out that this
"job closed immediately after open" functionality is not
perfect for the case of a job that is killed while it is opening.
It causes AutodetectCommunicator.close() to be called instead
of AutodetectCommunicator.killProcess(). Both do a lot of the
same things, but AutodetectCommunicator.close() finalizes
the job, and this can cause problems if the job is being killed
as part of a feature reset.

This change reinstates some of the functionality of elastic#71656
but in a different place that hopefully won't reintroduce the
problems that led to elastic#74415.

We can detect that a kill has happened early on during an
open or close operation by checking if the task's allocation
ID has been removed from the map after ProcessContext.setDying()
returns true. If ProcessContext.setDying() returns true this
means the job has not been previously closed, so it must have
been killed. Then we can call AutodetectCommunicator.killProcess()
instead of AutodetectCommunicator.close() during the cleanup
that happens when we detect that a recently started process is
no longer wanted.

Relates elastic#75069
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
7.14
7.x

elasticsearchmachine added a commit that referenced this pull request Jul 8, 2021
…5116)

This is a followup to #74976.

The changes of #74976 reverted many of the changes of #71656
because #74415 made them redundant. #74415 did this by making
killed jobs as closing so that the standard "job closed immediately
after open" functionality was used instead of reissuing the kill
immediately after opening. However, it turns out that this
"job closed immediately after open" functionality is not
perfect for the case of a job that is killed while it is opening.
It causes AutodetectCommunicator.close() to be called instead
of AutodetectCommunicator.killProcess(). Both do a lot of the
same things, but AutodetectCommunicator.close() finalizes
the job, and this can cause problems if the job is being killed
as part of a feature reset.

This change reinstates some of the functionality of #71656
but in a different place that hopefully won't reintroduce the
problems that led to #74415.

We can detect that a kill has happened early on during an
open or close operation by checking if the task's allocation
ID has been removed from the map after ProcessContext.setDying()
returns true. If ProcessContext.setDying() returns true this
means the job has not been previously closed, so it must have
been killed. Then we can call AutodetectCommunicator.killProcess()
instead of AutodetectCommunicator.close() during the cleanup
that happens when we detect that a recently started process is
no longer wanted.

Relates #75069

Co-authored-by: David Roberts <[email protected]>
elasticsearchmachine added a commit that referenced this pull request Jul 8, 2021
…5117)

This is a followup to #74976.

The changes of #74976 reverted many of the changes of #71656
because #74415 made them redundant. #74415 did this by making
killed jobs as closing so that the standard "job closed immediately
after open" functionality was used instead of reissuing the kill
immediately after opening. However, it turns out that this
"job closed immediately after open" functionality is not
perfect for the case of a job that is killed while it is opening.
It causes AutodetectCommunicator.close() to be called instead
of AutodetectCommunicator.killProcess(). Both do a lot of the
same things, but AutodetectCommunicator.close() finalizes
the job, and this can cause problems if the job is being killed
as part of a feature reset.

This change reinstates some of the functionality of #71656
but in a different place that hopefully won't reintroduce the
problems that led to #74415.

We can detect that a kill has happened early on during an
open or close operation by checking if the task's allocation
ID has been removed from the map after ProcessContext.setDying()
returns true. If ProcessContext.setDying() returns true this
means the job has not been previously closed, so it must have
been killed. Then we can call AutodetectCommunicator.killProcess()
instead of AutodetectCommunicator.close() during the cleanup
that happens when we detect that a recently started process is
no longer wanted.

Relates #75069

Co-authored-by: David Roberts <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >bug :ml Machine learning Team:ML Meta label for the ML team v7.14.0 v7.15.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants