Since the ML config migration work done for 6.6 in #36702, the modelSnapshotMinVersion field of the Job class has not been populated.
The problem it was intended to solve was this:
1. Job is opened with a version 6.8 model snapshot in a pure 6.8 cluster.
2. Cluster has a rolling upgrade to 7.15.
3. Job moves to a 7.15 ML node during the rolling upgrade.
4. Job persists a new model snapshot that can only be loaded on nodes from 7.6 and above.
5. The 7.15 node the job is on dies due to hardware failure.
6. Job gets reassigned to a different node, and the node that is chosen must be another 7.15 node, not a 6.8 node.
(Obviously the exact versions in that example can vary, but the principle is that new model snapshots shouldn't get loaded on old nodes because the C++ code won't understand them.)
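To make that principle concrete, here is a minimal, self-contained sketch of the kind of node-assignment guard the field was meant to enable. The types and method below (NodeVersion, SnapshotAwareNodeSelector, selectNode) are simplified stand-ins for illustration only, not Elasticsearch classes.

```java
import java.util.List;
import java.util.Optional;

// Simplified stand-in for a node and its version; not an Elasticsearch class.
record NodeVersion(String nodeId, int major, int minor) {
    boolean onOrAfter(int reqMajor, int reqMinor) {
        return major > reqMajor || (major == reqMajor && minor >= reqMinor);
    }
}

class SnapshotAwareNodeSelector {

    // Pick the first candidate node that is new enough to load the job's active
    // model snapshot. minMajor/minMinor would come from the minimum-compatible
    // version recorded for that snapshot.
    static Optional<NodeVersion> selectNode(List<NodeVersion> candidates, int minMajor, int minMinor) {
        return candidates.stream()
                .filter(n -> n.onOrAfter(minMajor, minMinor))
                .findFirst();
    }

    public static void main(String[] args) {
        List<NodeVersion> nodes = List.of(
                new NodeVersion("old-node", 6, 8),
                new NodeVersion("new-node", 7, 15));
        // A snapshot that requires 7.6+ must not be assigned to the 6.8 node.
        System.out.println(selectNode(nodes, 7, 6)); // -> Optional[NodeVersion[nodeId=new-node, ...]]
    }
}
```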
There hasn't been a single bug reported caused by this gap in protection, so it's very much a theoretical problem that doesn't affect many (or any) users in reality.
We should think about what to do about this in the long term though. Some options are:
1. Assume that we don't need protection against this scenario, and remove the field. It hasn't been present in any job configs for jobs created since 6.6 and nobody has complained. We have to tolerate it for BWC purposes, but we could adopt a parse-and-discard approach in the code and mark it deprecated in the docs (see the parse-and-discard sketch after the discussion below).
2. Implement protection against this scenario using the existing field. This would involve setting the model_snapshot_min_version field on the job config from its active model snapshot document when opening the job and when it moves nodes. The "when it moves nodes" part is difficult, as it's not easy to do a search from a cluster state listener.
3. Implement protection against this scenario using a new field on the job persistent task params. On job open this would get populated by looking at the active model snapshot, and it would then be updated each time the job persists a new model snapshot. The tricky part here is updating the job persistent task params when a new model snapshot is persisted: there is no official way to update persistent task params for a running persistent task. We could potentially add a params update mechanism, or hack an update by modifying the persistent tasks cluster state without going through the official persistent tasks methods.
4. Implement protection against this scenario using a new field on the job persistent task state. On job open this would get populated by looking at the active model snapshot, and it would then be updated each time the job persists a new model snapshot. In contrast to option 3, updating the persistent task state is a supported and common thing to do (a minimal sketch follows this list).
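If option 4 were pursued, the shape of the change might look roughly like the sketch below. JobTaskState and JobTask here are simplified stand-ins invented for illustration, and the update path only loosely models the persistent tasks framework's supported state-update mechanism; this is not the real Elasticsearch API.

```java
import java.util.concurrent.atomic.AtomicReference;

// Simplified stand-in for the job's persistent task state, extended with the
// minimum node version needed to load the job's latest model snapshot.
record JobTaskState(String jobState, long allocationId, String minSnapshotVersion) {}

class JobTask {
    private final AtomicReference<JobTaskState> state = new AtomicReference<>();

    // In Elasticsearch this update would go through the persistent tasks
    // framework, which persists the new state in cluster state and broadcasts it.
    void updateTaskState(JobTaskState newState) {
        state.set(newState);
    }

    // On job open: seed the state from the active model snapshot document.
    void onJobOpened(long allocationId, String activeSnapshotMinVersion) {
        updateTaskState(new JobTaskState("opened", allocationId, activeSnapshotMinVersion));
    }

    // Each time a new model snapshot is persisted: refresh the state so that
    // node selection can avoid nodes that are too old to load it.
    void onModelSnapshotPersisted(String snapshotMinVersion) {
        JobTaskState current = state.get();
        updateTaskState(new JobTaskState(current.jobState(), current.allocationId(), snapshotMinVersion));
    }

    JobTaskState currentState() {
        return state.get();
    }
}
```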
These options are not completely mutually exclusive, because option 1 can be done as well as option 3 or 4. So the possibilities are 1, 2, 3, 4, 1 + 3 or 1 + 4. Option 2 seems like a bad idea: it is complex to implement without giving any unique benefit. I think we should do option 1 and then consider adding option 4 afterwards.
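For the parse-and-discard part of option 1, the idea is simply to keep accepting the field in stored configs and requests for BWC, but drop it rather than populating anything. A minimal, self-contained illustration follows; it uses a hand-rolled parser over a flat map rather than Elasticsearch's real parsing infrastructure, and JobConfig/JobConfigParser are hypothetical names.

```java
import java.util.Map;

// Simplified stand-in for the parsed job configuration; not the real Job class.
record JobConfig(String jobId) {}

class JobConfigParser {

    // Parse a job config from a flat map of fields. The deprecated
    // model_snapshot_min_version field is still accepted for BWC, but its
    // value is deliberately discarded rather than stored on the config.
    static JobConfig parse(Map<String, Object> source) {
        String jobId = (String) source.get("job_id");

        if (source.containsKey("model_snapshot_min_version")) {
            // Parse-and-discard: tolerate the field so old configs still
            // deserialize, but do not populate anything from it.
            // (In real code this is also where a deprecation warning could be issued.)
        }

        return new JobConfig(jobId);
    }

    public static void main(String[] args) {
        JobConfig config = JobConfigParser.parse(
                Map.of("job_id", "my-job", "model_snapshot_min_version", "6.8.0"));
        System.out.println(config); // -> JobConfig[jobId=my-job]
    }
}
```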