Skip to content

Preserve Task Id for ML Datafeed #54943

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Apr 9, 2020

Conversation

albertzaharovits
Copy link
Contributor

Task ids are a way to express a relationship between related internal requests.
This change preserves the task id for internal requests of the StartDatafeedPersistentTask, similarly to how RollupJobPersistentTask does it.

Relates #52314

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@davidkyle
Copy link
Member

davidkyle commented Apr 8, 2020

Hi @albertzaharovits we have other places we are using a client - specifically indexing the results and model state- should we be using ParentTaskAssigningClient everywhere.

For example: https://github.com/elastic/elasticsearch/blob/master/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/job/persistence/JobResultsPersister.java#L84

@droberts195
Copy link
Contributor

should we be using ParentTaskAssigningClient everywhere

Maybe we should in theory, but I think this is a far reaching and high risk change. I suspect we will be spending a lot of time clearing up side effects. Our tasks are very complex, fire off numerous sub-tasks, and we have dedicated open/close/stop/start APIs for orchestrating them. I strongly suspect we will get into a mess with task cancellation if we try to link our controlling job/datafeed tasks to every sub-operation they perform. @dimitris-athanasiou has already run into this with the reindex task that data frame analytics jobs kick off.

@albertzaharovits
Copy link
Contributor Author

I have not yet looked into the data frame analytics tasks and the transform tasks, but I will do that in a follow-up. My end goal is to track all client requests for all persistent tasks in the same way, but I don't see an overarching way to achieve it, without looking into every case. Do you reckon this is a valid strategy?

This PR is only for the DataFeed persistent task, do you reckon this achieves the scope?

I wonder how should we test it, I don't have a good plan for it.

@davidkyle
Copy link
Member

In terms of tracking tasks I think this is a very useful change and valuable for debugging Datafeed issues as we should now know which queries belong to which datafeed.

We already handle task cancellation for the Datafeed and AD jobs which gracefully stops the datafeed waiting any executing query to finish.

How does task cancellation work? If you cancel the parent task are all child tasks then cancelled or is it up to the cancellation handler of the parent task to manage the children?

I agree managing the AD tasks and processes is a very complicated can of worms and echo @droberts195 caution about side effects but if this change is just about tracking and doesn't affect how task cancellation works then it is good

@davidkyle
Copy link
Member

I don't see anything in the docs but looking at TransportCancelTasksAction I think we have a problem in that the child tasks are also cancelled if CancellableTask.shouldCancelChildrenOnCancellation() == true unfortunately the implementation in AllocatedPersistentTask.shouldCancelChildrenOnCancellation() returns true.

My understanding of the current situation for datafeeds is that because the search tasks are not child tasks of the datafeed they will not be cancelled automatically and the graceful shutdown works.

We could change shouldCancelChildrenOnCancellation() to return false for the DatafeedTasks which should preserve the graceful shutdown behaviour once the search becomes a child task.

@albertzaharovits @droberts195 what do you think? I think it is worth trying for the benefit of tracking which tasks are started by the datafeed

@albertzaharovits
Copy link
Contributor Author

@elasticmachine run elasticsearch-ci/2

@albertzaharovits
Copy link
Contributor Author

My understanding of the current situation for datafeeds is that because the search tasks are not child tasks of the datafeed they will not be cancelled automatically and the graceful shutdown works.
We could change shouldCancelChildrenOnCancellation() to return false for the DatafeedTasks which should preserve the graceful shutdown behaviour once the search becomes a child task.

Good point @davidkyle ! I also agree that because Datafeeds already implement graceful task cancellation we should make shouldCancelChildrenOnCancellation() return false, i.e. use task ids only for tracking, and not cancellation purposes.

@droberts195
Copy link
Contributor

Going back to the reason why this change is being done, #52314, it looks like it's most important that the sub-tasks that get parented to the persistent task are the ones that run with the credentials of the user that created the configuration. The sub-tasks that run as an internal user don't need to be changed to assist with accurate auditing, but maybe could be changed for consistency.

We have had so many problems over the years with obscure edge cases causing ML jobs/datafeeds to not stop. Ironically the failure in elasticsearch-ci/2 in the latest PR CI run for this PR is an example of this in org.elasticsearch.xpack.ml.integration.ClassificationIT.testStopAndRestart - since that's in data frame analytics rather than anomaly detection it's almost certainly not caused by the changes in this PR, so it's an example of the general problems we have with starting and stopping tasks that have extremely complex internal states.

we should make shouldCancelChildrenOnCancellation() return false

Yes, I agree. This is the least risky change from where we are today. Please can you try adding that to this PR and see if any tests related to datafeeds fail as a result. If not then I think this PR is good.

To clean the audit logs similar PRs will probably be needed for data frame analytics and transforms.

@albertzaharovits
Copy link
Contributor Author

Thank you for the feedback @droberts195 ! Very much appreciated.

it looks like it's most important that the sub-tasks that get parented to the persistent task are the ones that run with the credentials of the user that created the configuration. The sub-tasks that run as an internal user don't need to be changed to assist with accurate auditing, but maybe could be changed for consistency.

Yes, this is what I had in mind as well. Trace the requests using the client's credentials, but don't try too hard to avoid the other internal requests doing the same.

We have had so many problems over the years with obscure edge cases causing ML jobs/datafeeds to not stop. Ironically the failure in elasticsearch-ci/2 in the latest PR CI run for this PR is an example of this in org.elasticsearch.xpack.ml.integration.ClassificationIT.testStopAndRestart - since that's in data frame analytics rather than anomaly detection it's almost certainly not caused by the changes in this PR, so it's an example of the general problems we have with starting and stopping tasks that have extremely complex internal states.

LOL

@albertzaharovits
Copy link
Contributor Author

@droberts195 I've made the changes as suggested.

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@albertzaharovits albertzaharovits merged commit 1d9096f into elastic:master Apr 9, 2020
@albertzaharovits albertzaharovits deleted the ml_jobs_task_id branch April 9, 2020 09:07
albertzaharovits added a commit that referenced this pull request Apr 9, 2020
This change preserves the task id for internal requests for the `StartDatafeedPersistentTask`.

Task ids are a way to express a relationship between related internal requests.
In this particular case, the task ids are used for debugging and (soon) security auditing,
but not for task cancellation, because there is already a graceful-shutdown of child
internal requests (given a task id) in place.
albertzaharovits added a commit that referenced this pull request Apr 10, 2020
This change makes sure that all internal client requests spawned by the
data frame analytics persistent task executor and that use the end user
security credentials, have the parent task id assigned. The objective here
is to permit auditing (as well as tracking for debugging purposes) of all
the end-user requests executed on its behalf by persistent tasks.
Because data frame analytics taks already implements graceful shutdown
of child tasks, this change does not interfere with it by opting out of
the persistent task cancellation of child tasks.

Relates #54943 #52314
albertzaharovits added a commit that referenced this pull request Apr 10, 2020
This change makes sure that all internal client requests spawned by the
data frame analytics persistent task executor and that use the end user
security credentials, have the parent task id assigned. The objective here
is to permit auditing (as well as tracking for debugging purposes) of all
the end-user requests executed on its behalf by persistent tasks.
Because data frame analytics taks already implements graceful shutdown
of child tasks, this change does not interfere with it by opting out of
the persistent task cancellation of child tasks.

Relates #54943 #52314
albertzaharovits added a commit that referenced this pull request Apr 14, 2020
This change ensures that internal client requests spawned by the
transform persistent task executor and that use the end user security
credentials, have the parent task id assigned. The objective here is
to permit auditing (as well as tracking for debugging purposes) of all
the end-user requests executed on its behalf by persistent tasks.
Because transform tasks already implements graceful shutdown of the
child tasks, this change does not interfere with that by opting out of
the persistent task cancellation of child tasks.

Relates #55046 #54943 #52314
Closes #54957
albertzaharovits added a commit that referenced this pull request Apr 14, 2020
This change ensures that internal client requests spawned by the
transform persistent task executor and that use the end user security
credentials, have the parent task id assigned. The objective here is
to permit auditing (as well as tracking for debugging purposes) of all
the end-user requests executed on its behalf by persistent tasks.
Because transform tasks already implements graceful shutdown of the
child tasks, this change does not interfere with that by opting out of
the persistent task cancellation of child tasks.

Relates #55046 #54943 #52314
Closes #54957
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants