Skip to content

[ML] Model snapshot upgrade needs a stats endpoint #81641

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Dec 14, 2021

Conversation

droberts195
Copy link
Contributor

@droberts195 droberts195 commented Dec 13, 2021

Previously the ML model snapshot upgrade endpoint did not
provide a way to reliably monitor progress. This could lead
to the upgrade assistant UI thinking that a model snapshot
upgrade had finished when it actually hadn't.

This change adds a new "stats" API that allows external
interested parties to find out the status of each model
snapshot upgrade and which node (if any) each is running on.

Fixes #81519

Docs preview: https://elasticsearch_81641.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/master/ml-get-job-model-snapshot-upgrade-stats.html

Previously the ML model snapshot upgrade endpoint did not
provide a way to reliably monitor progress. This could lead
to the upgrade assistant UI thinking that a model snapshot
upgrade had finished when it actually hadn't.

This change adds a new "stats" API that allows external
interested parties to find out the status of each model
snapshot upgrade and which node (if any) each is running on.

Fixes elastic#81519
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Dec 13, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good stuff. Some minor things

== {api-description-title}

{anomaly-detect-cap} job model snapshot upgrades are ephemeral. Only
upgrades that are in progress are the time this API is called will be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
upgrades that are in progress are the time this API is called will be
upgrades that are in progress at the time this API is called will be

Comment on lines 93 to 96
return statsBuilder.setNode(state.getNodes().get(t.getExecutorNode()))
.setUpgradeState(MlTasks.getSnapshotUpgradeState(t))
.setAssignmentExplanation(t.getAssignment().getExplanation())
.build();
Copy link
Member

@benwtrent benwtrent Dec 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t.getExecutorNode() could be null if the task isn't yet assigned. We should handle when if (t.getExecutorNode() == null)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it will make much difference to performance, but I'll change it if it makes the code clearer

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

state.getNodes().get(t.getExecutorNode())

I am more worried that state.getNodes().get(null) throws

Copy link
Member

@davidkyle davidkyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

"url":"https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-get-job-model-snapshot-upgrade-stats.html",
"description":"Gets stats for anomaly detection job model snapshot upgrades that are in progress."
},
"stability":"stable",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bold

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's that strange. When we add whole swathes of new functionality like data frame analytics, transforms or trained models we start off those whole areas as experimental or beta. But this is adding a minor new API to an area that has already been classed as stable. To be honest it's extremely unlikely that any software apart from Kibana upgrade assistant would use the model snapshot upgrade functionality, but if it did then it's been shown that the model snapshot upgrade functionality cannot be reliably used without this API. So it doesn't make sense for upgrade model snapshot to be a required part of the process of upgrading to version 8, yet this part of it isn't supported.

@@ -17,17 +17,96 @@ setup:
}
}

# It's too hard to create a genuine model snapshot in a YAML test.
# All we can do is create the descriptor doc that will allow the
# endpoints to get past an existence chech. Then the actual snapshot
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# endpoints to get past an existence chech. Then the actual snapshot
# endpoints to get past an existence check. Then the actual snapshot

@sethmlarson sethmlarson added the Team:Clients Meta label for clients team label Dec 13, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/clients-team (Team:Clients)

Copy link
Contributor

@sethmlarson sethmlarson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments: This PR is tagged as 7.16.2 and 8.0.0 which are past feature freeze, is that intentional? We typically don't add new APIs in patch releases.

One thought from me is this API seems like it might be better as get_model_snapshots_stats instead of get_job_model_snapshot_upgrade_stats and then all flavors of stats about a model snapshot could be included in this API, each in their own nested object.

I note this because I remember seeing a separate trained model stats API getting rolled into get_trained_model_stats. Does that also make sense here?

@droberts195
Copy link
Contributor Author

@sethmlarson

This PR is tagged as 7.16.2 and 8.0.0 which are past feature freeze, is that intentional?

Yes, this has to go into 7.last. We're requiring users upgrade old anomaly detection job model snapshots before moving to version 8, but #81519 shows that the functionality we have so far for this doesn't work. We tried to think of another way to make it work that didn't involve adding another endpoint, but couldn't think of one.

Having said that, I don't think it's critical for the clients to add support for this API immediately. The only place I can think of where it will be critical to call it is in the Kibana upgrade assistant, and I believe that will be able to make an ES call that doesn't have high level client support. So it's fine if it's made available in the clients in 8.1 or later.

One thought from me is this API seems like it might be better as get_model_snapshots_stats instead of get_job_model_snapshot_upgrade_stats and then all flavors of stats about a model snapshot could be included in this API, each in their own nested object.

I note this because I remember seeing a separate trained model stats API getting rolled into get_trained_model_stats. Does that also make sense here?

There's quite a lot of history here. In the code the word "job" refers to what we refer to externally as "anomaly detection jobs" in documentation and anomaly_detectors in endpoint URLs. It dates back to the days when we only had one type of job. So the word job in get_job_model_snapshot_upgrade_stats should arguably be anomaly_detector, but then it would be inconsistent with all the other files in that directory that have job in their names. Given where we are it's probably the less of two evils to maintain the current pattern than to try to switch but only for the most obscure and specialized late-added endpoints.

"Trained models" are used with inference, and are completely different to anomaly detection models. We wouldn't want an API that combined both in any way.

As for whether we have upgrade stats or just stats, the thing about upgrades is that they cause an ML C++ process to run on one node in the cluster. This endpoint is primarily about finding out if such a process is running/is supposed to be running. We expect these upgrades to be very few and far between. At the moment these "stats" don't even contain any numerical stats - the only "stats" are the node and state. This is the bare minimum necessary make upgrade assistant work in 7.last. Possibly in the future there could be some numerical stats too, and the pattern that the node and state are returned by the "stats" endpoint is established by all our other endpoints that return these (get job stats, get datafeed stats, get data frame analytics stats).

@@ -0,0 +1,40 @@
{
"ml.get_job_snapshot_upgrade_stats":{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussing more with @sethmlarson we've decided to change this to ml.get_model_snapshot_upgrade_stats.

job_snapshot used originally here is consistent with ml.upgrade_job_snapshot.

But model_snapshot is consistent with all the other endpoints that relate to model snapshots.

Copy link
Contributor

@sethmlarson sethmlarson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good, one non-blocking comment:

snapshot_id: "9999999999"

---
"Test stats all snapshots":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently there are no requests to the get_model_snapshot_upgrade_stats API in YAML tests that aren't either a 404 or skipped. Could we add a test that isn't skipped that has a valid response so the clients team can create specs for and test the response structure?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding that test in this PR would involve fixing #81578 in this same PR. I was going to fix that and unmute the muted test at the end of the file in a separate followup PR this week, just to keep the sizes of the PRs manageable.

The YAML test framework isn't rich enough to test upgrade of a model snapshot that's created by the test itself. But the test at the end that gets the stats for upgrade of the corrupt model snapshot will return the same fields that getting stats for upgrade of a valid snapshot would get. So could you wait until the end of the week before creating specs and testing the response structure?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging this and fixing the test later works for me! I figured that was the plan.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sethmlarson the test that should give the flight recorder a response it can get the structure from is unmuted in #81831 - see https://github.com/elastic/elasticsearch/pull/81831/files#diff-0a85b50d1fe60d5fcfbfabef2bb1a3cc7047b5f86374e8279f82a343ee4e6f03L96-L98.

@droberts195 droberts195 merged commit 0559dd0 into elastic:master Dec 14, 2021
@droberts195 droberts195 deleted the add_stats_for_model_upgrade branch December 14, 2021 08:31
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Dec 14, 2021
Previously the ML model snapshot upgrade endpoint did not
provide a way to reliably monitor progress. This could lead
to the upgrade assistant UI thinking that a model snapshot
upgrade had finished when it actually hadn't.

This change adds a new "stats" API that allows external
interested parties to find out the status of each model
snapshot upgrade and which node (if any) each is running on.

Fixes elastic#81519
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.0
7.16 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 81641

elasticsearchmachine pushed a commit that referenced this pull request Dec 14, 2021
Previously the ML model snapshot upgrade endpoint did not
provide a way to reliably monitor progress. This could lead
to the upgrade assistant UI thinking that a model snapshot
upgrade had finished when it actually hadn't.

This change adds a new "stats" API that allows external
interested parties to find out the status of each model
snapshot upgrade and which node (if any) each is running on.

Fixes #81519
elasticsearchmachine pushed a commit that referenced this pull request Dec 14, 2021
* [7.16] [ML] Model snapshot upgrade needs a stats endpoint

Previously the ML model snapshot upgrade endpoint did not
provide a way to reliably monitor progress. This could lead
to the upgrade assistant UI thinking that a model snapshot
upgrade had finished when it actually hadn't.

This change adds a new "stats" API that allows external
interested parties to find out the status of each model
snapshot upgrade and which node (if any) each is running on.

Backport of #81641

* Fixing compilation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :ml Machine learning Team:Clients Meta label for clients team Team:ML Meta label for the ML team v7.16.2 v8.0.0 v8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ML] External components cannot accurately tell when model snapshot upgrade is running
7 participants