[ML] Model snapshot upgrade needs a stats endpoint #81641

droberts195 · 2021-12-13T10:09:01Z

Previously the ML model snapshot upgrade endpoint did not
provide a way to reliably monitor progress. This could lead
to the upgrade assistant UI thinking that a model snapshot
upgrade had finished when it actually hadn't.

This change adds a new "stats" API that allows external
interested parties to find out the status of each model
snapshot upgrade and which node (if any) each is running on.

Fixes #81519

Docs preview: https://elasticsearch_81641.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/master/ml-get-job-model-snapshot-upgrade-stats.html

Previously the ML model snapshot upgrade endpoint did not provide a way to reliably monitor progress. This could lead to the upgrade assistant UI thinking that a model snapshot upgrade had finished when it actually hadn't. This change adds a new "stats" API that allows external interested parties to find out the status of each model snapshot upgrade and which node (if any) each is running on. Fixes elastic#81519

elasticmachine · 2021-12-13T10:09:07Z

Pinging @elastic/ml-core (Team:ML)

benwtrent

good stuff. Some minor things

benwtrent · 2021-12-13T13:02:34Z

docs/reference/ml/anomaly-detection/apis/get-job-model-snapshot-upgrade-stats.asciidoc

+== {api-description-title}
+
+{anomaly-detect-cap} job model snapshot upgrades are ephemeral. Only
+upgrades that are in progress are the time this API is called will be


Suggested change

upgrades that are in progress are the time this API is called will be

upgrades that are in progress at the time this API is called will be

benwtrent · 2021-12-13T13:12:51Z

.../java/org/elasticsearch/xpack/ml/action/TransportGetJobModelSnapshotsUpgradeStatsAction.java

+                    return statsBuilder.setNode(state.getNodes().get(t.getExecutorNode()))
+                        .setUpgradeState(MlTasks.getSnapshotUpgradeState(t))
+                        .setAssignmentExplanation(t.getAssignment().getExplanation())
+                        .build();


t.getExecutorNode() could be null if the task isn't yet assigned. We should handle when if (t.getExecutorNode() == null)

I don't think it will make much difference to performance, but I'll change it if it makes the code clearer

state.getNodes().get(t.getExecutorNode())

I am more worried that state.getNodes().get(null) throws

davidkyle

LGTM

davidkyle · 2021-12-13T12:41:08Z

rest-api-spec/src/main/resources/rest-api-spec/api/ml.get_job_snapshot_upgrade_stats.json

+      "url":"https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-get-job-model-snapshot-upgrade-stats.html",
+      "description":"Gets stats for anomaly detection job model snapshot upgrades that are in progress."
+    },
+    "stability":"stable",


I don't think it's that strange. When we add whole swathes of new functionality like data frame analytics, transforms or trained models we start off those whole areas as experimental or beta. But this is adding a minor new API to an area that has already been classed as stable. To be honest it's extremely unlikely that any software apart from Kibana upgrade assistant would use the model snapshot upgrade functionality, but if it did then it's been shown that the model snapshot upgrade functionality cannot be reliably used without this API. So it doesn't make sense for upgrade model snapshot to be a required part of the process of upgrading to version 8, yet this part of it isn't supported.

davidkyle · 2021-12-13T12:57:51Z

x-pack/plugin/src/yamlRestTest/resources/rest-api-spec/test/ml/upgrade_job_snapshot.yml

@@ -17,17 +17,96 @@ setup:
            }
          }

+  # It's too hard to create a genuine model snapshot in a YAML test.
+  # All we can do is create the descriptor doc that will allow the
+  # endpoints to get past an existence chech. Then the actual snapshot


Suggested change

# endpoints to get past an existence chech. Then the actual snapshot

# endpoints to get past an existence check. Then the actual snapshot

elasticmachine · 2021-12-13T13:33:14Z

Pinging @elastic/clients-team (Team:Clients)

sethmlarson

Few comments: This PR is tagged as 7.16.2 and 8.0.0 which are past feature freeze, is that intentional? We typically don't add new APIs in patch releases.

One thought from me is this API seems like it might be better as get_model_snapshots_stats instead of get_job_model_snapshot_upgrade_stats and then all flavors of stats about a model snapshot could be included in this API, each in their own nested object.

I note this because I remember seeing a separate trained model stats API getting rolled into get_trained_model_stats. Does that also make sense here?

droberts195 · 2021-12-13T14:53:52Z

@sethmlarson

This PR is tagged as 7.16.2 and 8.0.0 which are past feature freeze, is that intentional?

Yes, this has to go into 7.last. We're requiring users upgrade old anomaly detection job model snapshots before moving to version 8, but #81519 shows that the functionality we have so far for this doesn't work. We tried to think of another way to make it work that didn't involve adding another endpoint, but couldn't think of one.

Having said that, I don't think it's critical for the clients to add support for this API immediately. The only place I can think of where it will be critical to call it is in the Kibana upgrade assistant, and I believe that will be able to make an ES call that doesn't have high level client support. So it's fine if it's made available in the clients in 8.1 or later.

One thought from me is this API seems like it might be better as get_model_snapshots_stats instead of get_job_model_snapshot_upgrade_stats and then all flavors of stats about a model snapshot could be included in this API, each in their own nested object.

I note this because I remember seeing a separate trained model stats API getting rolled into get_trained_model_stats. Does that also make sense here?

There's quite a lot of history here. In the code the word "job" refers to what we refer to externally as "anomaly detection jobs" in documentation and anomaly_detectors in endpoint URLs. It dates back to the days when we only had one type of job. So the word job in get_job_model_snapshot_upgrade_stats should arguably be anomaly_detector, but then it would be inconsistent with all the other files in that directory that have job in their names. Given where we are it's probably the less of two evils to maintain the current pattern than to try to switch but only for the most obscure and specialized late-added endpoints.

"Trained models" are used with inference, and are completely different to anomaly detection models. We wouldn't want an API that combined both in any way.

As for whether we have upgrade stats or just stats, the thing about upgrades is that they cause an ML C++ process to run on one node in the cluster. This endpoint is primarily about finding out if such a process is running/is supposed to be running. We expect these upgrades to be very few and far between. At the moment these "stats" don't even contain any numerical stats - the only "stats" are the node and state. This is the bare minimum necessary make upgrade assistant work in 7.last. Possibly in the future there could be some numerical stats too, and the pattern that the node and state are returned by the "stats" endpoint is established by all our other endpoints that return these (get job stats, get datafeed stats, get data frame analytics stats).

droberts195 · 2021-12-13T18:15:55Z

rest-api-spec/src/main/resources/rest-api-spec/api/ml.get_job_snapshot_upgrade_stats.json

@@ -0,0 +1,40 @@
+{
+  "ml.get_job_snapshot_upgrade_stats":{


After discussing more with @sethmlarson we've decided to change this to ml.get_model_snapshot_upgrade_stats.

job_snapshot used originally here is consistent with ml.upgrade_job_snapshot.

But model_snapshot is consistent with all the other endpoints that relate to model snapshots.

sethmlarson

Changes look good, one non-blocking comment:

sethmlarson · 2021-12-13T18:29:25Z

x-pack/plugin/src/yamlRestTest/resources/rest-api-spec/test/ml/upgrade_job_snapshot.yml

+        snapshot_id: "9999999999"
+
+---
+"Test stats all snapshots":


Currently there are no requests to the get_model_snapshot_upgrade_stats API in YAML tests that aren't either a 404 or skipped. Could we add a test that isn't skipped that has a valid response so the clients team can create specs for and test the response structure?

Adding that test in this PR would involve fixing #81578 in this same PR. I was going to fix that and unmute the muted test at the end of the file in a separate followup PR this week, just to keep the sizes of the PRs manageable.

The YAML test framework isn't rich enough to test upgrade of a model snapshot that's created by the test itself. But the test at the end that gets the stats for upgrade of the corrupt model snapshot will return the same fields that getting stats for upgrade of a valid snapshot would get. So could you wait until the end of the week before creating specs and testing the response structure?

Merging this and fixing the test later works for me! I figured that was the plan.

@sethmlarson the test that should give the flight recorder a response it can get the structure from is unmuted in #81831 - see https://github.com/elastic/elasticsearch/pull/81831/files#diff-0a85b50d1fe60d5fcfbfabef2bb1a3cc7047b5f86374e8279f82a343ee4e6f03L96-L98.

Previously the ML model snapshot upgrade endpoint did not provide a way to reliably monitor progress. This could lead to the upgrade assistant UI thinking that a model snapshot upgrade had finished when it actually hadn't. This change adds a new "stats" API that allows external interested parties to find out the status of each model snapshot upgrade and which node (if any) each is running on. Fixes elastic#81519

elasticsearchmachine · 2021-12-14T08:33:13Z

💔 Backport failed

Status	Branch	Result
✅	8.0
❌	7.16	Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 81641

Previously the ML model snapshot upgrade endpoint did not provide a way to reliably monitor progress. This could lead to the upgrade assistant UI thinking that a model snapshot upgrade had finished when it actually hadn't. This change adds a new "stats" API that allows external interested parties to find out the status of each model snapshot upgrade and which node (if any) each is running on. Fixes #81519

* [7.16] [ML] Model snapshot upgrade needs a stats endpoint Previously the ML model snapshot upgrade endpoint did not provide a way to reliably monitor progress. This could lead to the upgrade assistant UI thinking that a model snapshot upgrade had finished when it actually hadn't. This change adds a new "stats" API that allows external interested parties to find out the status of each model snapshot upgrade and which node (if any) each is running on. Backport of #81641 * Fixing compilation

droberts195 added >bug :ml Machine learning v8.0.0 v8.1.0 v7.16.2 labels Dec 13, 2021

elasticmachine added the Team:ML Meta label for the ML team label Dec 13, 2021

Formatting

b473485

droberts195 requested a review from benwtrent December 13, 2021 10:24

droberts195 added 2 commits December 13, 2021 11:13

Add new action to non-operator actions

44285bd

Add utility method

171e621

benwtrent reviewed Dec 13, 2021

View reviewed changes

davidkyle approved these changes Dec 13, 2021

View reviewed changes

sethmlarson added the Team:Clients Meta label for clients team label Dec 13, 2021

sethmlarson reviewed Dec 13, 2021

View reviewed changes

droberts195 added 2 commits December 13, 2021 14:07

Addressing review comments

a4058e1

Add a null check

30b4c99

benwtrent approved these changes Dec 13, 2021

View reviewed changes

droberts195 added the auto-backport-and-merge label Dec 13, 2021

droberts195 commented Dec 13, 2021

View reviewed changes

droberts195 added 2 commits December 13, 2021 18:17

Merge branch 'master' into add_stats_for_model_upgrade

6c56a41

Renaming spec file

08ac379

sethmlarson approved these changes Dec 13, 2021

View reviewed changes

[DOCS] Adds API to navigation

af1c215

lcawl mentioned this pull request Dec 13, 2021

[DOCS] Fix formatting in get anomaly job API #81682

Merged

droberts195 merged commit 0559dd0 into elastic:master Dec 14, 2021

droberts195 deleted the add_stats_for_model_upgrade branch December 14, 2021 08:31

droberts195 mentioned this pull request Dec 14, 2021

[8.0] [ML] Model snapshot upgrade needs a stats endpoint (#81641) #81704

Merged

droberts195 mentioned this pull request Dec 14, 2021

[7.16] [ML] Model snapshot upgrade needs a stats endpoint #81706

Merged

droberts195 mentioned this pull request Dec 15, 2021

Fix upgrade assistant tracking of ML job model snapshot upgrade status elastic/kibana#121313

Closed

This was referenced Dec 20, 2021

[Upgrade Assistant] Fix ml model upgrade status check elastic/kibana#121617

Closed

[Upgrade Assistant] Fix ml upgrade status check elastic/kibana#121646

Merged

	upgrades that are in progress are the time this API is called will be
	upgrades that are in progress at the time this API is called will be

	# endpoints to get past an existence chech. Then the actual snapshot
	# endpoints to get past an existence check. Then the actual snapshot

[ML] Model snapshot upgrade needs a stats endpoint #81641

[ML] Model snapshot upgrade needs a stats endpoint #81641

Uh oh!

Conversation

droberts195 commented Dec 13, 2021 • edited by lcawl Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

elasticmachine commented Dec 13, 2021

Uh oh!

benwtrent left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

benwtrent Dec 13, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

davidkyle left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

elasticmachine commented Dec 13, 2021

Uh oh!

sethmlarson left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

droberts195 commented Dec 13, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sethmlarson left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

elasticsearchmachine commented Dec 14, 2021

💔 Backport failed

Uh oh!

Uh oh!

droberts195 commented Dec 13, 2021 •

edited by lcawl

Loading

benwtrent Dec 13, 2021 •

edited

Loading

sethmlarson left a comment •

edited

Loading