
[ML] External components cannot accurately tell when model snapshot upgrade is running #81519


Closed · droberts195 opened this issue Dec 8, 2021 · 3 comments · Fixed by #81641

Labels: >bug · :ml Machine learning · Team:ML Meta label for the ML team · v7.16.0

Comments

droberts195 (Contributor) commented:

The model snapshot upgrade endpoint needs to integrate well with the code in upgrade assistant that uses it.

Currently, on a multi-node cluster, the upgrade assistant reports that model snapshot upgrades failed when they actually succeeded:

[Screenshot: upgrade assistant reporting model snapshot upgrades as failed]

The most likely reason this happens on multi-node clusters but not single-node clusters is that the upgrade assistant monitors the progress of the model snapshot upgrade via the tasks API, which returns information about local tasks, whereas the upgrade model snapshot endpoint returns as soon as the persistent task exists. In a multi-node cluster there is a delay between the persistent task existing and the local task existing if the assigned node is not the master node. If the upgrade assistant checks progress during this window, it assumes an unexpected server error has occurred.
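The race can be sketched with a toy model (illustrative only; the class and function names here are hypothetical and do not correspond to actual Elasticsearch or Kibana code):

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model: the persistent task is created first; the local task
    appears on the assigned node only after a delay."""
    persistent_tasks: set = field(default_factory=set)
    local_tasks: set = field(default_factory=set)


def start_snapshot_upgrade(cluster: Cluster, task_id: str) -> None:
    # The endpoint returns as soon as the persistent task exists...
    cluster.persistent_tasks.add(task_id)
    # ...but the local task on the assigned node has not started yet.


def poll_upgrade_status(cluster: Cluster, task_id: str) -> str:
    # The upgrade assistant consults only the tasks API (local tasks).
    if task_id in cluster.local_tasks:
        return "in_progress"
    # No local task found: the monitor wrongly concludes the upgrade
    # failed, even though the persistent task exists and will start soon.
    return "error"


cluster = Cluster()
start_snapshot_upgrade(cluster, "upgrade-job-1")
status = poll_upgrade_status(cluster, "upgrade-job-1")  # "error" during the window
```

On a single-node cluster the local task exists effectively immediately, so the window in which `poll_upgrade_status` misreports is never observed.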

The ideal way to fix this would be to introduce a stats API for model snapshot upgrades. However, doing that would mean completely changing the way the upgrade assistant monitors snapshot upgrade progress, which would be tricky at this stage of development.

Therefore, the pragmatic solution to this problem is to change the upgrade model snapshot endpoint so that it only returns to the caller once the local task exists as well as the persistent task. This will entail the master periodically polling the assigned node for the existence of the local task until it appears.
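The proposed polling loop amounts to something like the following sketch (`check_local_task` stands in for a tasks-API lookup against the assigned node; it is not real Elasticsearch code, and the timeout and interval values are illustrative):

```python
import time


def wait_for_local_task(check_local_task, timeout_s: float = 30.0,
                        interval_s: float = 0.5) -> bool:
    """Poll until the local task exists or the timeout elapses.

    Returns True once check_local_task() reports the task is present,
    False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check_local_task():
            return True
        time.sleep(interval_s)
    return False


# Example: the local task becomes visible on the third poll.
calls = {"n": 0}

def check_local_task():
    calls["n"] += 1
    return calls["n"] >= 3

assert wait_for_local_task(check_local_task, timeout_s=5.0, interval_s=0.01)
```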

@droberts195 droberts195 added >bug :ml Machine learning v7.16.0 labels Dec 8, 2021
@droberts195 droberts195 self-assigned this Dec 8, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Dec 8, 2021
elasticmachine (Collaborator) commented:

Pinging @elastic/ml-core (Team:ML)

benwtrent (Member) commented:

> Therefore, the pragmatic solution to this problem is to change the upgrade model snapshot endpoint so that it only returns to caller once the local task exists as well as the persistent task. This will entail polling periodically from the master for the existence of the local task on the assigned node until it exists.

This could take a long time and possibly require an autoscaling event. I am not 100% sure this is the way.

droberts195 (Contributor, Author) commented:

> This could take a long time and possibly require an autoscaling event

Yes, good point. But then the only way would seem to be to introduce a new endpoint to get model snapshot upgrade stats. Then the UI would poll that instead of the tasks API. I guess we could add that for 7.17.

The part of the code that would need changing in the Kibana upgrade assistant is then: https://github.com/elastic/kibana/pull/100066/files#diff-b072d906e4d7b04d2c87a7343f9b23dc887c35544ddf71f6fdfb258b699bd325R205-R239
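With a dedicated stats endpoint, the UI could poll upgrade state directly rather than inferring it from the tasks API. A hedged sketch of what the consuming side might look like — the `state` values here are assumptions about what a stats document could report, not the confirmed schema:

```python
def classify(stats: dict) -> str:
    """Map a hypothetical snapshot-upgrade stats document to what the
    UI should display. Unknown or missing states are surfaced as
    "unknown" instead of being misreported as a server error."""
    state = stats.get("state")
    if state in ("loading_old_state", "saving_new_state"):
        return "in_progress"
    if state == "stopped":
        return "complete"
    return "unknown"


assert classify({"state": "saving_new_state"}) == "in_progress"
assert classify({"state": "stopped"}) == "complete"
assert classify({}) == "unknown"
```

The key design difference from the tasks-API approach is that an absent local task no longer maps to an error state: the stats endpoint is authoritative about each upgrade, including ones whose local task has not yet started.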

@droberts195 droberts195 changed the title [ML] Model snapshot upgrade shouldn't return until local task exists [ML] External components cannot accurately tell when model snapshot upgrade is running Dec 9, 2021
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Dec 13, 2021
Previously the ML model snapshot upgrade endpoint did not
provide a way to reliably monitor progress. This could lead
to the upgrade assistant UI thinking that a model snapshot
upgrade had finished when it actually hadn't.

This change adds a new "stats" API that allows external
interested parties to find out the status of each model
snapshot upgrade and which node (if any) each is running on.

Fixes elastic#81519
droberts195 added a commit that referenced this issue Dec 14, 2021
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Dec 14, 2021
elasticsearchmachine pushed a commit that referenced this issue Dec 14, 2021