Since the ML config migration work done for 6.6 in #36702, the modelSnapshotMinVersion field of the Job class has not been populated.
The problem it was intended to solve was this:
1. Job is opened with a version 6.8 model snapshot in a pure 6.8 cluster.
2. Cluster has a rolling upgrade to 7.15.
3. Job moves to a 7.15 ML node during the rolling upgrade.
4. Job persists a new model snapshot that can only be loaded on nodes from 7.6 and above.
5. The 7.15 node the job is on dies due to hardware failure.
6. Job gets reassigned to a different node, and the node that is chosen must be another 7.15 node, not a 6.8 node.
(Obviously the exact versions in that example can vary, but the principle is that new model snapshots shouldn't get loaded on old nodes because the C++ code won't understand them.)
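To make that principle concrete, here is a minimal, self-contained sketch of the kind of node-assignment guard the field was meant to enable. The types and method below (NodeVersion, SnapshotAwareNodeSelector, selectNode) are simplified stand-ins for illustration only, not Elasticsearch classes.

```java
import java.util.List;
import java.util.Optional;

// Simplified stand-in for a node and its version; not an Elasticsearch class.
record NodeVersion(String nodeId, int major, int minor) {
    boolean onOrAfter(int reqMajor, int reqMinor) {
        return major > reqMajor || (major == reqMajor && minor >= reqMinor);
    }
}

class SnapshotAwareNodeSelector {

    // Pick the first candidate node that is new enough to load the job's active
    // model snapshot. minMajor/minMinor would come from the minimum-compatible
    // version recorded for that snapshot.
    static Optional<NodeVersion> selectNode(List<NodeVersion> candidates, int minMajor, int minMinor) {
        return candidates.stream()
                .filter(n -> n.onOrAfter(minMajor, minMinor))
                .findFirst();
    }

    public static void main(String[] args) {
        List<NodeVersion> nodes = List.of(
                new NodeVersion("old-node", 6, 8),
                new NodeVersion("new-node", 7, 15));
        // A snapshot that requires 7.6+ must not be assigned to the 6.8 node.
        System.out.println(selectNode(nodes, 7, 6)); // -> Optional[NodeVersion[nodeId=new-node, ...]]
    }
}
```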
There hasn't been a single bug reported caused by this gap in protection, so it's very much a theoretical problem that doesn't affect many (or any) users in reality.
We should think about what to do about this in the long term though. Some options are:
1. Assume that we don't need protection against this scenario, and remove the field. It hasn't been present in any job configs for jobs created since 6.6 and nobody has complained. We have to tolerate it for BWC purposes, but we could adopt a parse-and-discard approach in the code and mark it deprecated in the docs (see the parse-and-discard sketch after the discussion below).
2. Implement protection against this scenario using the existing field. This would involve setting the model_snapshot_min_version field on the job config from its active model snapshot document when opening the job and when it moves nodes. The "when it moves nodes" part is difficult, as it's not easy to do a search from a cluster state listener.
3. Implement protection against this scenario using a new field on the job persistent task params. On job open this would get populated by looking at the active model snapshot, and it would then be updated each time the job persists a new model snapshot. The tricky part here is updating the job persistent task params when a new model snapshot is persisted: there is no official way to update persistent task params for a running persistent task. We could potentially add a params update mechanism, or hack an update by modifying the persistent tasks cluster state without going through the official persistent tasks methods.
4. Implement protection against this scenario using a new field on the job persistent task state. On job open this would get populated by looking at the active model snapshot, and it would then be updated each time the job persists a new model snapshot. In contrast to option 3, updating the persistent task state is a supported and common thing to do (a minimal sketch follows this list).
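If option 4 were pursued, the shape of the change might look roughly like the sketch below. JobTaskState and JobTask here are simplified stand-ins invented for illustration, and the update path only loosely models the persistent tasks framework's supported state-update mechanism; this is not the real Elasticsearch API.

```java
import java.util.concurrent.atomic.AtomicReference;

// Simplified stand-in for the job's persistent task state, extended with the
// minimum node version needed to load the job's latest model snapshot.
record JobTaskState(String jobState, long allocationId, String minSnapshotVersion) {}

class JobTask {
    private final AtomicReference<JobTaskState> state = new AtomicReference<>();

    // In Elasticsearch this update would go through the persistent tasks
    // framework, which persists the new state in cluster state and broadcasts it.
    void updateTaskState(JobTaskState newState) {
        state.set(newState);
    }

    // On job open: seed the state from the active model snapshot document.
    void onJobOpened(long allocationId, String activeSnapshotMinVersion) {
        updateTaskState(new JobTaskState("opened", allocationId, activeSnapshotMinVersion));
    }

    // Each time a new model snapshot is persisted: refresh the state so that
    // node selection can avoid nodes that are too old to load it.
    void onModelSnapshotPersisted(String snapshotMinVersion) {
        JobTaskState current = state.get();
        updateTaskState(new JobTaskState(current.jobState(), current.allocationId(), snapshotMinVersion));
    }

    JobTaskState currentState() {
        return state.get();
    }
}
```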
These options are not completely mutually exclusive, because option 1 can be done as well as option 3 or 4. So the possibilities are 1, 2, 3, 4, 1 + 3 or 1 + 4. Option 2 seems like a bad idea: it is complex to implement without giving any unique benefit. I think we should do option 1 and then consider adding option 4 afterwards.
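For the parse-and-discard part of option 1, the idea is simply to keep accepting the field in stored configs and requests for BWC, but drop it rather than populating anything. A minimal, self-contained illustration follows; it uses a hand-rolled parser over a flat map rather than Elasticsearch's real parsing infrastructure, and JobConfig/JobConfigParser are hypothetical names.

```java
import java.util.Map;

// Simplified stand-in for the parsed job configuration; not the real Job class.
record JobConfig(String jobId) {}

class JobConfigParser {

    // Parse a job config from a flat map of fields. The deprecated
    // model_snapshot_min_version field is still accepted for BWC, but its
    // value is deliberately discarded rather than stored on the config.
    static JobConfig parse(Map<String, Object> source) {
        String jobId = (String) source.get("job_id");

        if (source.containsKey("model_snapshot_min_version")) {
            // Parse-and-discard: tolerate the field so old configs still
            // deserialize, but do not populate anything from it.
            // (In real code this is also where a deprecation warning could be issued.)
        }

        return new JobConfig(jobId);
    }

    public static void main(String[] args) {
        JobConfig config = JobConfigParser.parse(
                Map.of("job_id", "my-job", "model_snapshot_min_version", "6.8.0"));
        System.out.println(config); // -> JobConfig[jobId=my-job]
    }
}
```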