[ML] Fix race condition between job open, close and kill #75113

droberts195 · 2021-07-08T09:38:03Z

This is a followup to #74976.

The changes of #74976 reverted many of the changes of #71656
because #74415 made them redundant. #74415 did this by marking
killed jobs as closing so that the standard "job closed immediately
after open" functionality was used instead of reissuing the kill
immediately after opening. However, it turns out that this
"job closed immediately after open" functionality is not
perfect for the case of a job that is killed while it is opening.
It causes AutodetectCommunicator.close() to be called instead
of AutodetectCommunicator.killProcess(). Both do a lot of the
same things, but AutodetectCommunicator.close() finalizes
the job, and this can cause problems if the job is being killed
as part of a feature reset.

This change reinstates some of the functionality of #71656
but in a different place that hopefully won't reintroduce the
problems that led to #74415.

We can detect that a kill has happened early on during an
open or close operation by checking if the task's allocation
ID has been removed from the map after ProcessContext.setDying()
returns true. If ProcessContext.setDying() returns true this
means the job has not been previously closed, so it must have
been killed. Then we can call AutodetectCommunicator.killProcess()
instead of AutodetectCommunicator.close() during the cleanup
that happens when we detect that a recently started process is
no longer wanted.
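
The detection logic described above can be illustrated with a minimal, self-contained sketch. It is not the code in this PR: `Context`, `Communicator`, `Manager`, `startOpening`, `requestKill` and `cleanUp` are hypothetical stand-ins for `ProcessContext`, `AutodetectCommunicator` and the manager that owns the allocation ID map. The sketch also assumes that a kill removes the allocation ID from the map and that nothing needs doing when `setDying()` returns false.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

public class KillDuringOpenSketch {

    /** Hypothetical stand-in for ProcessContext: only tracks the "dying" flag. */
    static class Context {
        private final AtomicBoolean dying = new AtomicBoolean(false);

        /** Returns true only for the first caller, i.e. when nothing had closed the job before. */
        boolean setDying() {
            return dying.compareAndSet(false, true);
        }
    }

    /** Hypothetical stand-in for AutodetectCommunicator: records which path was taken. */
    static class Communicator {
        String lastAction = "none";

        void close() {
            lastAction = "close"; // the real close() also finalizes the job
        }

        void killProcess() {
            lastAction = "kill"; // the real killProcess() does not finalize the job
        }
    }

    /** Hypothetical stand-in for the manager that owns the allocation ID map. */
    static class Manager {
        final Map<Long, Context> contextByAllocation = new ConcurrentHashMap<>();

        Context startOpening(long allocationId) {
            Context context = new Context();
            contextByAllocation.put(allocationId, context);
            return context;
        }

        /** Assumption: a kill request removes the allocation ID from the map. */
        void requestKill(long allocationId) {
            contextByAllocation.remove(allocationId);
        }

        /** Cleanup for a recently started process that is no longer wanted. */
        String cleanUp(long allocationId, Context context, Communicator communicator) {
            if (context.setDying() == false) {
                return "already-dying"; // the job was closed earlier; nothing to do here
            }
            if (contextByAllocation.containsKey(allocationId) == false) {
                communicator.killProcess(); // allocation ID gone: the job was killed
            } else {
                contextByAllocation.remove(allocationId);
                communicator.close(); // ordinary "job closed immediately after open"
            }
            return communicator.lastAction;
        }
    }

    public static void main(String[] args) {
        Manager manager = new Manager();

        Context killed = manager.startOpening(1L);
        manager.requestKill(1L); // kill arrives while the job is still opening
        System.out.println(manager.cleanUp(1L, killed, new Communicator()));

        Context closed = manager.startOpening(2L);
        System.out.println(manager.cleanUp(2L, closed, new Communicator()));
    }
}
```

In the first scenario the allocation ID is already gone when cleanup runs, so the kill path is taken and the job is not finalized. In the second the ID is still present, so the ordinary close path is used.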

Relates #75069

elasticmachine · 2021-07-08T09:38:07Z

Pinging @elastic/ml-core (Team:ML)

dimitris-athanasiou

LGTM

elasticsearchmachine · 2021-07-08T10:20:57Z

💚 Backport successful

Status	Branch	Result
✅	7.14
✅	7.x

droberts195 added the >bug, :ml, v8.0.0, auto-merge-without-approval, auto-backport, v7.14.1 and v7.15.0 labels Jul 8, 2021

elasticmachine added the Team:ML label Jul 8, 2021

droberts195 mentioned this pull request Jul 8, 2021

[CI] XPackRestIT failing on various YAML tests #75069

Closed

dimitris-athanasiou approved these changes Jul 8, 2021

elasticsearchmachine merged commit 0f493c3 into elastic:master Jul 8, 2021

droberts195 deleted the another_close_kill_race branch July 8, 2021 10:19

elasticsearchmachine mentioned this pull request Jul 8, 2021

[7.14] [ML] Fix race condition between job open, close and kill (#75113) #75116

Merged

elasticsearchmachine mentioned this pull request Jul 8, 2021

[7.x] [ML] Fix race condition between job open, close and kill (#75113) #75117

Merged

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

probakowski added v7.14.0 and removed v7.14.1 labels Jul 27, 2021
