
[ML] External components cannot accurately tell when model snapshot upgrade is running #81519


Closed · droberts195 opened this issue Dec 8, 2021 · 3 comments · Fixed by #81641

Labels: >bug · :ml Machine learning · Team:ML Meta label for the ML team · v7.16.0

Comments

droberts195 (Contributor) commented:

The model snapshot upgrade endpoint needs to integrate well with the code in upgrade assistant that uses it.

Currently, on a multi-node cluster, the upgrade assistant reports that model snapshot upgrades failed when they actually succeeded:

[Screenshot: upgrade assistant reporting model snapshot upgrades as failed]

The most likely reason this happens on multi-node clusters but not single-node clusters is that the upgrade assistant monitors the progress of the model snapshot upgrade via the tasks API, which returns information about local tasks, whereas the upgrade model snapshot endpoint returns as soon as the persistent task exists. In a multi-node cluster there is a delay between the persistent task existing and the local task existing if the assigned node is not the master node. If the upgrade assistant checks progress during this window, it assumes an unexpected server error has occurred.
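The race can be sketched with a toy model (illustrative only; the class and function names here are hypothetical and do not correspond to actual Elasticsearch or Kibana code):

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model: the persistent task is created first; the local task
    appears on the assigned node only after a delay."""
    persistent_tasks: set = field(default_factory=set)
    local_tasks: set = field(default_factory=set)


def start_snapshot_upgrade(cluster: Cluster, task_id: str) -> None:
    # The endpoint returns as soon as the persistent task exists...
    cluster.persistent_tasks.add(task_id)
    # ...but the local task on the assigned node has not started yet.


def poll_upgrade_status(cluster: Cluster, task_id: str) -> str:
    # The upgrade assistant consults only the tasks API (local tasks).
    if task_id in cluster.local_tasks:
        return "in_progress"
    # No local task found: the monitor wrongly concludes the upgrade
    # failed, even though the persistent task exists and will start soon.
    return "error"


cluster = Cluster()
start_snapshot_upgrade(cluster, "upgrade-job-1")
status = poll_upgrade_status(cluster, "upgrade-job-1")  # "error" during the window
```

On a single-node cluster the local task exists effectively immediately, so the window in which `poll_upgrade_status` misreports is never observed.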

The ideal way to fix this would be to introduce a stats API for model snapshot upgrades. However, doing that would mean completely changing the way the upgrade assistant monitors snapshot upgrade progress, which would be tricky at this stage of development.

Therefore, the pragmatic solution to this problem is to change the upgrade model snapshot endpoint so that it only returns to the caller once the local task exists as well as the persistent task. This will entail the master periodically polling the assigned node for the existence of the local task until it appears.
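The proposed polling loop amounts to something like the following sketch (`check_local_task` stands in for a tasks-API lookup against the assigned node; it is not real Elasticsearch code, and the timeout and interval values are illustrative):

```python
import time


def wait_for_local_task(check_local_task, timeout_s: float = 30.0,
                        interval_s: float = 0.5) -> bool:
    """Poll until the local task exists or the timeout elapses.

    Returns True once check_local_task() reports the task is present,
    False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check_local_task():
            return True
        time.sleep(interval_s)
    return False


# Example: the local task becomes visible on the third poll.
calls = {"n": 0}

def check_local_task():
    calls["n"] += 1
    return calls["n"] >= 3

assert wait_for_local_task(check_local_task, timeout_s=5.0, interval_s=0.01)
```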

@droberts195 droberts195 added >bug :ml Machine learning v7.16.0 labels Dec 8, 2021
@droberts195 droberts195 self-assigned this Dec 8, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Dec 8, 2021
elasticmachine (Collaborator) commented:

Pinging @elastic/ml-core (Team:ML)

benwtrent (Member) commented:

> Therefore, the pragmatic solution to this problem is to change the upgrade model snapshot endpoint so that it only returns to caller once the local task exists as well as the persistent task. This will entail polling periodically from the master for the existence of the local task on the assigned node until it exists.

This could take a long time and possibly require an autoscaling event. I am not 100% sure this is the way.

droberts195 (Contributor, Author) commented:

> This could take a long time and possibly require an autoscaling event

Yes, good point. But then the only way would seem to be to introduce a new endpoint to get model snapshot upgrade stats. Then the UI would poll that instead of the tasks API. I guess we could add that for 7.17.

The part of the code that would need changing in the Kibana upgrade assistant is then: https://github.com/elastic/kibana/pull/100066/files#diff-b072d906e4d7b04d2c87a7343f9b23dc887c35544ddf71f6fdfb258b699bd325R205-R239
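With a dedicated stats endpoint, the UI could poll upgrade state directly rather than inferring it from the tasks API. A hedged sketch of what the consuming side might look like — the `state` values here are assumptions about what a stats document could report, not the confirmed schema:

```python
def classify(stats: dict) -> str:
    """Map a hypothetical snapshot-upgrade stats document to what the
    UI should display. Unknown or missing states are surfaced as
    "unknown" instead of being misreported as a server error."""
    state = stats.get("state")
    if state in ("loading_old_state", "saving_new_state"):
        return "in_progress"
    if state == "stopped":
        return "complete"
    return "unknown"


assert classify({"state": "saving_new_state"}) == "in_progress"
assert classify({"state": "stopped"}) == "complete"
assert classify({}) == "unknown"
```

The key design difference from the tasks-API approach is that an absent local task no longer maps to an error state: the stats endpoint is authoritative about each upgrade, including ones whose local task has not yet started.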

@droberts195 droberts195 changed the title [ML] Model snapshot upgrade shouldn't return until local task exists [ML] External components cannot accurately tell when model snapshot upgrade is running Dec 9, 2021
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Dec 13, 2021
Previously the ML model snapshot upgrade endpoint did not
provide a way to reliably monitor progress. This could lead
to the upgrade assistant UI thinking that a model snapshot
upgrade had finished when it actually hadn't.

This change adds a new "stats" API that allows external
interested parties to find out the status of each model
snapshot upgrade and which node (if any) each is running on.

Fixes elastic#81519
droberts195 added a commit that referenced this issue Dec 14, 2021
droberts195 added a commit to droberts195/elasticsearch that referenced this issue Dec 14, 2021
elasticsearchmachine pushed a commit that referenced this issue Dec 14, 2021