[ML] Job.getModelSnapshotMinVersion() is not populated as expected #78377

Open
droberts195 opened this issue Sep 28, 2021 · 1 comment
Labels
>bug, :ml (Machine learning), Team:ML (Meta label for the ML team), team-discuss

Comments

@droberts195
Contributor

Since the ML config migration work done for 6.6 in #36702, the modelSnapshotMinVersion field of the Job class has not been populated.

The problem it was intended to solve was this:

  1. Job is opened with a version 6.8 model snapshot in a pure 6.8 cluster
  2. Cluster has a rolling upgrade to 7.15
  3. Job moves to a 7.15 ML node during the rolling upgrade
  4. Job persists a new model snapshot that can only be loaded on nodes from 7.6 and above
  5. The 7.15 node the job is on dies due to hardware failure
  6. Job gets reassigned to a different node and the node that is chosen must be another 7.15 node, not a 6.8 node

(Obviously the exact versions in that example can vary, but the principle is that new model snapshots shouldn't get loaded on old nodes because the C++ code won't understand them.)
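
For illustration, here is a minimal sketch of the check that job assignment would need to make: a candidate node is only eligible if its version is at least the minimum version recorded for the job's active model snapshot. This is plain Java with hypothetical names, not the actual node-selection code.

```java
import java.lang.Runtime.Version;
import java.util.List;

// Illustrative sketch only; the real check would live in the ML job node selector.
public class SnapshotVersionGuard {

    /** A candidate ML node, reduced to just its version for this sketch. */
    record MlNode(String id, Version version) {}

    /**
     * A node can host the job only if it is at least as new as the minimum
     * version required by the job's active model snapshot.
     */
    static boolean canLoadSnapshot(MlNode node, Version modelSnapshotMinVersion) {
        return node.version().compareTo(modelSnapshotMinVersion) >= 0;
    }

    public static void main(String[] args) {
        Version snapshotMinVersion = Version.parse("7.6.0"); // min version of the new snapshot
        List<MlNode> candidates = List.of(
            new MlNode("node-a", Version.parse("6.8.0")),
            new MlNode("node-b", Version.parse("7.15.0")));

        candidates.forEach(n -> System.out.printf("%s eligible: %b%n",
            n.id(), canLoadSnapshot(n, snapshotMinVersion)));
        // node-a eligible: false, node-b eligible: true
    }
}
```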

Not a single bug has been reported that was caused by this gap in protection, so it's very much a theoretical problem that doesn't affect many (or any) users in reality.

We should think about what to do about this in the long term though. Some options are:

  1. Assume that we don't need protection against this scenario, and remove the field. It hasn't been present in any job configs for jobs created since 6.6 and nobody has complained. We have to tolerate it for BWC purposes, but we could adopt a parse-and-discard approach in the code (see the sketch after this list) and mark it deprecated in the docs.
  2. Implement protection against this scenario using the existing field. This would involve setting the model_snapshot_min_version field on the job config from its active model snapshot document when opening it and when it moves nodes. The "when it moves nodes" part is difficult, as it's not easy to do a search from a cluster state listener.
  3. Implement protection against this scenario using a new field on the job persistent task params. On job open this would get populated by looking at the active model snapshot, and then each time the job persists a new model snapshot it would get updated. The tricky part here is updating the job persistent task params when a new model snapshot is persisted. There is no official way to update persistent task params for a running persistent task. Potentially we could add a params update mechanism, or hack an update by modifying the persistent tasks cluster state without going through official persistent tasks methods.
  4. Implement protection against this scenario using a new field on the job persistent task state. On job open this would get populated by looking at the active model snapshot, and then each time the job persists a new model snapshot it would get updated. In contrast to option 3, updating the persistent task state is a supported and common thing to do.
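
As a rough illustration of the parse-and-discard idea in option 1: the legacy field is still accepted so that old configs parse, a deprecation is noted, and the value is simply never stored. This is a standalone sketch with hypothetical names, not the real Job parser, which is built on Elasticsearch's XContent parsing machinery.

```java
import java.util.Map;

// Sketch of parse-and-discard for BWC; not the actual Elasticsearch parser code.
final class JobConfigSketch {

    record ParsedJob(String jobId) {}  // no modelSnapshotMinVersion field any more

    static ParsedJob fromSource(Map<String, Object> source) {
        // Accept the legacy field so documents written by old versions still parse...
        if (source.containsKey("model_snapshot_min_version")) {
            // ...but deliberately ignore its value and surface a deprecation notice.
            System.err.println("model_snapshot_min_version is deprecated and ignored");
        }
        return new ParsedJob((String) source.get("job_id"));
    }

    public static void main(String[] args) {
        ParsedJob job = fromSource(Map.of(
            "job_id", "my-job",
            "model_snapshot_min_version", "6.3.0"));
        System.out.println(job); // ParsedJob[jobId=my-job]
    }
}
```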

These options are not completely mutually exclusive, because option 1 can be done as well as option 3 or 4. So the possibilities are 1, 2, 3, 4, 1 + 3, and 1 + 4. Option 2 seems like a bad idea: it is complex to implement without giving any unique benefit. I think we should do 1 and then consider adding 4 afterwards (sketched below).
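
To make option 4 concrete, here's a rough sketch of the extra piece of state it would add. The names are hypothetical; the real implementation would live in the ML job's persistent task state class, which implements Elasticsearch's PersistentTaskState, not a plain record.

```java
import java.lang.Runtime.Version;

// Hypothetical shape of the job task state under option 4.
record JobTaskStateSketch(String jobState, Version modelSnapshotMinVersion) {

    /** New state to publish after the job persists a new model snapshot. */
    JobTaskStateSketch withSnapshotMinVersion(Version newMinVersion) {
        return new JobTaskStateSketch(jobState, newMinVersion);
    }
}

class Option4Demo {
    public static void main(String[] args) {
        // On job open: read the active model snapshot and record its min version.
        JobTaskStateSketch state = new JobTaskStateSketch("opened", Version.parse("6.8.0"));

        // Each time the job persists a new snapshot, the running task would publish
        // an updated state through the persistent tasks framework (a supported operation,
        // unlike updating the task params in option 3).
        state = state.withSnapshotMinVersion(Version.parse("7.6.0"));

        System.out.println(state.modelSnapshotMinVersion()); // prints 7.6.0
    }
}
```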

@elasticmachine added the Team:ML (Meta label for the ML team) label on Sep 28, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)
