[ML] Fix race condition between job opening and feature reset #74976

Merged
droberts195 merged 8 commits into elastic:master from droberts195:fix_upgrade_mode_test_again on Jul 7, 2021

Conversation

droberts195 (Contributor)

There was a point during the job opening sequence where performing
a feature reset could hang.

This happened when the kill request issued by feature reset was
executed after the job's persistent task was assigned but before
the job's native process was started. The persistent task was
incorrectly left running in this situation, yet the job opening
sequence was aborted, which meant the subsequent close request
issued by feature reset would wait for a very long time for the
persistent task to disappear.

The fix is to make the kill process request cancel the persistent
task consistently, based on the request's parameters rather than
on the current state of the task.

Fixes #74141
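
To make the intended behaviour concrete, here is a minimal, self-contained Java sketch of the decision described above. All names in it (KillRequest, PersistentTask, NativeProcess, markAsCompleted) are illustrative stand-ins rather than the real Elasticsearch ML classes; the only point it demonstrates is that cancelling the persistent task depends on the kill request's parameters, not on whether the native process happens to exist yet.

```java
// Illustrative sketch only - simplified stand-ins for the real classes.
final class KillProcessSketch {

    /** Hypothetical kill request; feature reset would set removeTask = true. */
    record KillRequest(boolean removeTask, boolean awaitCompletion) {}

    interface PersistentTask {
        void markAsCompleted(); // cancels/removes the persistent task
    }

    interface NativeProcess {
        void kill(boolean awaitCompletion);
    }

    static void killProcess(KillRequest request, PersistentTask task, NativeProcess processOrNull) {
        if (processOrNull != null) {
            // Normal case: the native process has already started, so kill it.
            processOrNull.kill(request.awaitCompletion());
        }
        // Cancel the task based purely on the request parameters. Tying this step to
        // processOrNull != null is what allowed the race: a kill arriving after the
        // persistent task was assigned but before the process started left the task
        // running, so the later close request waited for it to disappear.
        if (request.removeTask()) {
            task.markAsCompleted();
        }
    }
}
```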

elasticmachine added the Team:ML (Meta label for the ML team) label on Jul 6, 2021
elasticmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

droberts195 (Contributor, Author)

Although there are no explicit tests to prove this fix works, we do a feature reset at the end of every integration test, so it gets a lot of test coverage that way.

davidkyle (Member) left a comment

LGTM

droberts195 merged commit ace2988 into elastic:master on Jul 7, 2021
droberts195 deleted the fix_upgrade_mode_test_again branch on July 7, 2021 at 10:04
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Jul 7, 2021
Backport of elastic#74976
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Jul 7, 2021
Backport of elastic#74976
elasticsearchmachine pushed a commit that referenced this pull request Jul 7, 2021
Backport of #74976
elasticsearchmachine pushed a commit that referenced this pull request Jul 7, 2021
Backport of #74976
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Jul 8, 2021
This is a followup to elastic#74976.

The changes of elastic#74976 reverted many of the changes of elastic#71656
because elastic#74415 made them redundant. elastic#74415 did this by marking
killed jobs as closing so that the standard "job closed immediately
after open" functionality was used instead of reissuing the kill
immediately after opening. However, it turns out that this
"job closed immediately after open" functionality is not
perfect for the case of a job that is killed while it is opening.
It causes AutodetectCommunicator.close() to be called instead
of AutodetectCommunicator.killProcess(). Both do a lot of the
same things, but AutodetectCommunicator.close() finalizes
the job, and this can cause problems if the job is being killed
as part of a feature reset.

This change reinstates some of the functionality of elastic#71656
but in a different place that hopefully won't reintroduce the
problems that led to elastic#74415.

We can detect that a kill has happened early on during an
open or close operation by checking if the task's allocation
ID has been removed from the map after ProcessContext.setDying()
returns true. If ProcessContext.setDying() returns true, this
means the job has not been previously closed, so it must have
been killed. Then we can call AutodetectCommunicator.killProcess()
instead of AutodetectCommunicator.close() during the cleanup
that happens when we detect that a recently started process is
no longer wanted.

Relates elastic#75069
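
The detection logic in this follow-up can be illustrated with a rough, self-contained Java sketch. The types and the method below are simplified, hypothetical stand-ins for the real ProcessContext/AutodetectCommunicator plumbing; they only show the idea that setDying() returning true while the allocation ID is already gone from the map implies the job was killed, so the cleanup uses killProcess() rather than close().

```java
import java.util.Map;

// Illustrative sketch only - not the actual Elasticsearch implementation.
final class KilledWhileOpeningSketch {

    interface Communicator {
        void close();       // finalizes the job
        void killProcess(); // kills without finalizing the job
    }

    interface ProcessContext {
        /** Returns true if the job was not already dying/closing. */
        boolean setDying();
        Communicator communicator();
    }

    /**
     * Cleanup for a recently started process that is no longer wanted.
     * If setDying() reports the job was never closed, yet its allocation ID has
     * already been removed from the map, the job must have been killed, so the
     * process is killed rather than closed (closing would finalize the job).
     */
    static void cleanUpUnwantedProcess(long allocationId,
                                       Map<Long, ProcessContext> processByAllocation,
                                       ProcessContext context) {
        boolean notPreviouslyClosed = context.setDying();
        boolean removedFromMap = processByAllocation.containsKey(allocationId) == false;
        if (notPreviouslyClosed && removedFromMap) {
            context.communicator().killProcess();
        } else {
            context.communicator().close();
        }
    }
}
```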
elasticsearchmachine pushed a commit that referenced this pull request Jul 8, 2021
Relates #75069
elasticsearchmachine pushed a commit to elasticsearchmachine/elasticsearch that referenced this pull request Jul 8, 2021
Relates elastic#75069
elasticsearchmachine pushed a commit to elasticsearchmachine/elasticsearch that referenced this pull request Jul 8, 2021
Relates elastic#75069
elasticsearchmachine added a commit that referenced this pull request Jul 8, 2021
…5116)

Relates #75069

Co-authored-by: David Roberts <[email protected]>
elasticsearchmachine added a commit that referenced this pull request Jul 8, 2021
…5117)

Relates #75069

Co-authored-by: David Roberts <[email protected]>
Labels
>bug, :ml Machine learning, Team:ML Meta label for the ML team, v7.14.0, v7.15.0, v8.0.0-alpha1
Development

Successfully merging this pull request may close these issues.

[CI] XPackRestIT test {p0=ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled} failing
5 participants