[ML] Model snapshot upgrade needs a stats endpoint #81641
Conversation
Previously the ML model snapshot upgrade endpoint did not provide a way to reliably monitor progress. This could lead to the upgrade assistant UI thinking that a model snapshot upgrade had finished when it actually hadn't. This change adds a new "stats" API that allows external interested parties to find out the status of each model snapshot upgrade and which node (if any) each is running on. Fixes elastic#81519
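For context, a minimal sketch of how a caller such as the upgrade assistant might poll the new endpoint using the low-level Java REST client. The URL follows the spec file added in this PR, but treat the exact path, the `_all` wildcards, the example job ID and the response shape described in the comments as assumptions rather than a definitive reference:

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class SnapshotUpgradeStatsPoll {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Hypothetical job and snapshot IDs; "_all" is assumed to be accepted for both.
            Request request = new Request("GET",
                "/_ml/anomaly_detectors/my-job/model_snapshots/_all/_upgrade/_stats");
            Response response = client.performRequest(request);
            // The body is expected to list only upgrades that are currently in progress,
            // each with the node it is assigned to and its upgrade state, so a caller
            // polls until the upgrade it cares about no longer appears.
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```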
Pinging @elastic/ml-core (Team:ML)
good stuff. Some minor things
== {api-description-title}

{anomaly-detect-cap} job model snapshot upgrades are ephemeral. Only
upgrades that are in progress are the time this API is called will be
Suggested change:
upgrades that are in progress at the time this API is called will be
return statsBuilder.setNode(state.getNodes().get(t.getExecutorNode()))
    .setUpgradeState(MlTasks.getSnapshotUpgradeState(t))
    .setAssignmentExplanation(t.getAssignment().getExplanation())
    .build();
t.getExecutorNode() could be null if the task isn't yet assigned. We should handle the case where t.getExecutorNode() == null.
I don't think it will make much difference to performance, but I'll change it if it makes the code clearer
state.getNodes().get(t.getExecutorNode())
I am more worried that state.getNodes().get(null) throws.
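For illustration, a minimal sketch of the kind of guard being discussed, reusing the names from the snippet above; whether the surrounding builder code can simply skip setNode for an unassigned task is an assumption:

```java
// Only resolve the discovery node once the task has actually been assigned;
// as noted above, state.getNodes().get(null) would throw otherwise.
if (t.getExecutorNode() != null) {
    statsBuilder.setNode(state.getNodes().get(t.getExecutorNode()));
}
return statsBuilder.setUpgradeState(MlTasks.getSnapshotUpgradeState(t))
    .setAssignmentExplanation(t.getAssignment().getExplanation())
    .build();
```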
LGTM
"url":"https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-get-job-model-snapshot-upgrade-stats.html", | ||
"description":"Gets stats for anomaly detection job model snapshot upgrades that are in progress." | ||
}, | ||
"stability":"stable", |
bold
I don't think it's that strange. When we add whole swathes of new functionality like data frame analytics, transforms or trained models, we start those areas off as experimental or beta. But this is adding a minor new API to an area that has already been classed as stable. To be honest, it's extremely unlikely that any software apart from the Kibana upgrade assistant would use the model snapshot upgrade functionality, but if it did, then it's been shown that model snapshot upgrade cannot be used reliably without this API. So it doesn't make sense for model snapshot upgrade to be a required part of upgrading to version 8 while this part of it is classed as unsupported.
@@ -17,17 +17,96 @@ setup:
        }
      }

# It's too hard to create a genuine model snapshot in a YAML test.
# All we can do is create the descriptor doc that will allow the
# endpoints to get past an existence chech. Then the actual snapshot
Suggested change:
# endpoints to get past an existence check. Then the actual snapshot
Pinging @elastic/clients-team (Team:Clients)
A few comments: This PR is tagged as 7.16.2 and 8.0.0, which are past feature freeze. Is that intentional? We typically don't add new APIs in patch releases.
One thought from me is that this API might be better as get_model_snapshots_stats instead of get_job_model_snapshot_upgrade_stats, and then all flavors of stats about a model snapshot could be included in this API, each in their own nested object. I note this because I remember seeing a separate trained model stats API getting rolled into get_trained_model_stats. Does that also make sense here?
Yes, this has to go into 7.last. We're requiring users to upgrade old anomaly detection job model snapshots before moving to version 8, but #81519 shows that the functionality we have so far for this doesn't work. We tried to think of another way to make it work that didn't involve adding another endpoint, but couldn't think of one. Having said that, I don't think it's critical for the clients to add support for this API immediately. The only place I can think of where it will be critical to call it is in the Kibana upgrade assistant, and I believe that will be able to make an ES call that doesn't have high-level client support. So it's fine if it's made available in the clients in 8.1 or later.
There's quite a lot of history here. In the code the word "job" refers to what we call "anomaly detection jobs" in external documentation. "Trained models" are used with inference and are completely different to anomaly detection models, so we wouldn't want an API that combined the two in any way. As for whether we have upgrade stats or just stats: the thing about upgrades is that they cause an ML C++ process to run on one node in the cluster, and this endpoint is primarily about finding out whether such a process is running or is supposed to be running. We expect these upgrades to be very few and far between. At the moment these "stats" don't even contain any numerical stats - the only "stats" are the node and state. This is the bare minimum necessary to make the upgrade assistant work in 7.last. Possibly in the future there could be some numerical stats too, and the pattern of returning the node and state from a "stats" endpoint is established by all our other endpoints that return these (get job stats, get datafeed stats, get data frame analytics stats).
@@ -0,0 +1,40 @@
{
  "ml.get_job_snapshot_upgrade_stats":{
After discussing more with @sethmlarson we've decided to change this to ml.get_model_snapshot_upgrade_stats. The job_snapshot used originally here is consistent with ml.upgrade_job_snapshot, but model_snapshot is consistent with all the other endpoints that relate to model snapshots.
Changes look good, one non-blocking comment:
snapshot_id: "9999999999"

---
"Test stats all snapshots":
Currently there are no requests to the get_model_snapshot_upgrade_stats API in YAML tests that aren't either a 404 or skipped. Could we add a test that isn't skipped and has a valid response, so the clients team can create specs for and test the response structure?
Adding that test in this PR would involve fixing #81578 in this same PR. I was going to fix that and unmute the muted test at the end of the file in a separate followup PR this week, just to keep the sizes of the PRs manageable.
The YAML test framework isn't rich enough to test upgrade of a model snapshot that's created by the test itself. But the test at the end that gets the stats for upgrade of the corrupt model snapshot will return the same fields that getting stats for upgrade of a valid snapshot would get. So could you wait until the end of the week before creating specs and testing the response structure?
Merging this and fixing the test later works for me! I figured that was the plan.
@sethmlarson the test that should give the flight recorder a response it can get the structure from is unmuted in #81831 - see https://github.com/elastic/elasticsearch/pull/81831/files#diff-0a85b50d1fe60d5fcfbfabef2bb1a3cc7047b5f86374e8279f82a343ee4e6f03L96-L98.
💔 Backport failed
You can use sqren/backport to manually backport by running
* [7.16] [ML] Model snapshot upgrade needs a stats endpoint (backport of #81641) * Fixing compilation
Docs preview: https://elasticsearch_81641.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/master/ml-get-job-model-snapshot-upgrade-stats.html