[ML] External components cannot accurately tell when model snapshot upgrade is running #81519
Comments
Pinging @elastic/ml-core (Team:ML)
This could take a long time and possibly require an autoscaling event. I am not 100% sure this is the way.
Yes, good point. But then the only way would seem to be to introduce a new endpoint to get model snapshot upgrade stats. Then the UI would poll that instead of the tasks API. I guess we could add that for 7.17. The part of the code that would need changing in the Kibana upgrade assistant is then: https://github.com/elastic/kibana/pull/100066/files#diff-b072d906e4d7b04d2c87a7343f9b23dc887c35544ddf71f6fdfb258b699bd325R205-R239
Previously the ML model snapshot upgrade endpoint did not provide a way to reliably monitor progress. This could lead to the upgrade assistant UI thinking that a model snapshot upgrade had finished when it actually hadn't. This change adds a new "stats" API that allows external interested parties to find out the status of each model snapshot upgrade and which node (if any) each is running on. Fixes elastic#81519
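To make the idea concrete, here is a minimal Python sketch of how an external caller might interpret a response from such a stats API. The response shape (a `model_snapshot_upgrades` list whose entries carry a `state` and an optional `node` object with a `name`), the example state string, and the function name `summarize_upgrade` are all assumptions made for illustration; nothing in this issue confirms them. The point is only that "no entry" and "entry without a node" are treated as ordinary states rather than as server errors.

```python
from typing import Optional


def summarize_upgrade(stats_response: dict) -> dict:
    """Interpret a (hypothetical) model snapshot upgrade stats response.

    Assumed shape, not confirmed by the issue:
    {"model_snapshot_upgrades": [{"state": "...", "node": {"name": "..."}}]}
    """
    upgrades = stats_response.get("model_snapshot_upgrades", [])
    if not upgrades:
        # No upgrade is known for this snapshot: finished or never started.
        return {"in_progress": False, "state": None, "node": None}

    entry = upgrades[0]
    node = entry.get("node")
    # A missing node means the upgrade exists but is not yet assigned,
    # which is exactly the window the tasks API could not represent.
    node_name: Optional[str] = node.get("name") if node else None
    return {"in_progress": True, "state": entry.get("state"), "node": node_name}


# Example: an upgrade that exists but has not been assigned to a node yet.
unassigned = summarize_upgrade(
    {"model_snapshot_upgrades": [{"state": "loading_old_state"}]}
)
```

A caller polling this would keep waiting while `in_progress` is true, regardless of whether `node` is set yet.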
The model snapshot upgrade endpoint needs to integrate well with the code in the upgrade assistant that uses it.
Currently, on a multi-node cluster, the upgrade assistant reports that model snapshot upgrades failed when they actually succeeded.
The most likely reason this happens on multi-node clusters but not on single-node clusters is a timing gap. The upgrade assistant monitors the progress of a model snapshot upgrade through the tasks API, which returns information about local tasks, but the upgrade model snapshot endpoint returns as soon as the persistent task exists. In a multi-node cluster, if the assigned node is not the master node, there is a delay between the persistent task existing and the local task existing. If the upgrade assistant checks progress during this window, it assumes an unexpected server error has occurred.
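The timing gap can be illustrated with a toy Python model. This is not Elasticsearch code: `SimulatedCluster`, `naive_status`, and the task ID are hypothetical names invented for the sketch. It only shows that a monitor looking solely at local tasks sees "not found" between the persistent task being created and the assigned node creating the local task.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedCluster:
    """Toy model: persistent tasks are cluster-wide, local tasks live on a node."""

    persistent_tasks: set = field(default_factory=set)
    local_tasks: set = field(default_factory=set)

    def start_upgrade(self, task_id: str) -> None:
        # The upgrade endpoint returns once the persistent task is registered.
        self.persistent_tasks.add(task_id)

    def assigned_node_picks_up(self, task_id: str) -> None:
        # Some time later, the assigned node creates the matching local task.
        if task_id in self.persistent_tasks:
            self.local_tasks.add(task_id)


def naive_status(cluster: SimulatedCluster, task_id: str) -> str:
    """Mimics a monitor that only consults local tasks."""
    return "running" if task_id in cluster.local_tasks else "error: task not found"


cluster = SimulatedCluster()
task_id = "upgrade-job1-snapshot1"  # hypothetical task ID

cluster.start_upgrade(task_id)
early = naive_status(cluster, task_id)  # checked inside the gap

cluster.assigned_node_picks_up(task_id)
late = naive_status(cluster, task_id)  # checked after the local task exists
```

On a single-node cluster the two steps happen on the same node with effectively no gap, which would explain why the problem does not show up there.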
The ideal way to fix this would be to introduce a stats API for model snapshot upgrades. However, doing that would mean completely changing the way the upgrade assistant monitors snapshot upgrade progress, which would be tricky at this stage of development.
Therefore, the pragmatic solution to this problem is to change the upgrade model snapshot endpoint so that it only returns to the caller once the local task exists as well as the persistent task. This will entail polling periodically from the master node for the existence of the local task on the assigned node until it exists.
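The polling described here could look roughly like the following Python sketch. It is an illustration of the control flow only, not the actual server-side implementation: `wait_for_local_task`, its parameters, and the default timeout and interval are assumptions. A timeout is included because waiting for a node assignment may take a long time, so an unbounded wait would not be safe.

```python
import time
from typing import Callable


def wait_for_local_task(
    local_task_exists: Callable[[], bool],
    timeout_s: float = 30.0,        # assumed value, for illustration only
    poll_interval_s: float = 0.5,   # assumed value, for illustration only
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the local task exists; return False if the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if local_task_exists():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)


# Example: a fake check that succeeds on the third poll.
calls = {"n": 0}


def fake_check() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


found = wait_for_local_task(fake_check, sleep=lambda _seconds: None)
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; a real implementation would respond to the caller only after this returns true, or report that the task is still unassigned on timeout.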