About models, ITransform, and IDataLoader, saving/loading #3025

TomFinley · 2019-03-19T23:24:09Z

So, I had a conversation with @eerhardt and @yaeldekel, about the nature of models, in particular relating to the saving and loading routines. This is very important for us to get right, since the artifacts of what we learn and how we transform data, and their persistability, is probably the most important thing we have to do correctly.

We take the view initially that the model is the ITransformer (note that a chain of ITransformers is itself an ITransformer). But, by itself this is an insufficient description, was we saw in in #2663 and its subsequent "child" #2735, from the point of view of model being practically "the stuff you need to keep," there's a lot more to a machine learning model than merely just the ITransformer chain -- you also need to preserve some notion of what the input is to that. So we added these things to take either a loader, or the input schema, to be saved as part of the model.

Yet, is the loader a model itself? Sometimes that's precisely what we call it:

machinelearning/src/Microsoft.ML.Data/Model/ModelOperationsCatalog.cs

Line 36 in c38f81b

public void Save<TSource>(IDataLoader<TSource> model, Stream stream)

And in the same file we call it something else:

machinelearning/src/Microsoft.ML.Data/Model/ModelOperationsCatalog.cs

Line 65 in c38f81b

    
           public void Save<TSource>(IDataLoader<TSource> loader, ITransformer model, Stream stream) =>

It is a model in one sense because it is yet another things that takes input data and produces output data -- the fact that ITransformer does it specifically over IDataView as input specifically does not give it some magical, special status to allow it to be called a model, to the exclusion of other candidates. If I take a loader, and append a transform to it, then the whole aggregate thing is still a loader. If it isn't a model, it only isn't one by the mere skin of its teeth. Hence the presence of the original thing, and why we have in the model operations catalog operations to save and load IDataLoader itself specifically.

But at the same time this duality of the term "model" is, as I understand @eerhardt, confusing. We have two things we're calling model. In an ideal world, I feel like if we can get away with just one story of what the model is, we should take it, and if there must be only one it must be ITransformer. We even have the situation where if you have mlContext.Model.Save(, the first thing that pops up is the IDataLoader thing, which is kind of strange.

I am not quite sure what I think, since in this case I agree with whoever talks to me last with an even a vaguely convincing argument. But I think in this case I will see about getting rid of the IDataLoader-only APIs -- people can, if it is important, continue to save and load such things by using empty ITransformer chains (again, any chain of ITransformer is itself an ITransformer, including the empty chain).

Since we are approaching v1, I view it as a bit more important to be conservative w.r.t. potentially confusing additions to the API, especially around something as central as the saving and loading of models. We might be able to add it back later if there's some really compelling scenario for them, that we somehow did not anticipate.

We will of course retain the saving and loading of transformers with loaders, since that is really important to be able to capture, but I think being consistent around the story that "models are transformers" as we are most places is kind of important.

The text was updated successfully, but these errors were encountered:

TomFinley · 2019-03-20T02:04:13Z

A side effect of this work is that we still want to support a loader only pipeline -- we just don't want to encourage that as being somehow the primary use case. (Which, against our desires, has arguably happened with how intellisense is presenting things.) So I will support as a convenience null as an input for ITransformer, with the documentation that this is intended as a shorthand for an empty chain (which is to say, the trivial transformer that returns its input as output).

There's also an annoying inconsistency in the API: we describe the inputs in different positions. Here we see the input coming first.

machinelearning/src/Microsoft.ML.Data/Model/ModelOperationsCatalog.cs

Line 65 in c38f81b

    
           public void Save<TSource>(IDataLoader<TSource> loader, ITransformer model, Stream stream) =>

Here it comes second.

machinelearning/src/Microsoft.ML.Data/Model/ModelOperationsCatalog.cs

Line 106 in c38f81b

    
           public void Save(ITransformer model, DataViewSchema inputSchema, string filePath)

I feel like we ought to be consistent. While the order in which things are done suggests putting the input first (since input feeds into transformer), I wonder if emphasizing the common thing they all have in common (they take ITransformer) as the more fundamental object is not more correct. I do not have a strong opinion here, but will provisionally perhaps put ITransformer first.

yaeldekel · 2019-03-20T16:31:46Z

I agree that too many save/load APIs can be confusing (especially if the first one that shows up in intellisense is the one least commonly used). I added the IDataLoader-only API for convenience since sometimes the whole pipeline can be contained in an IDataLoader of type CompositeDataLoader, so if it makes things inconvenient, it should definitely not be there :-).
I would perhaps suggest that we add a property TransformerChain.Empty. It may be a bit more convenient in the case of saving a loader when there are no transformers in the pipeline. It would be this:

ml.Model.Save(loader, TransformerChain.Empty)

instead of this:

ml.Model.Save(loader, new TransformerChain<ITransformer>());

Although if there is a way to put the Empty property somewhere more discoverable, that would be nice. I can't think of anywhere...

Sorry @TomFinley, I guess I should have read your comment before writing mine... passing null and instantiating the empty transform chain internally is even easier than what I suggested :-).

TomFinley · 2019-03-20T17:00:37Z

Sorry @TomFinley, I guess I should have read your comment before writing mine... passing null and instantiating the empty transform chain internally is even easier than what I suggested :-).

Cool thanks @yaeldekel.

Separate note: Another thing I notice is that when we sometimes save a transformer, when we deserialize we wind up with a chain of that transformer, which is not quite what I expected, so I'm going to make it so that the loader and saver save the same type of thing.

TomFinley · 2019-03-20T18:25:21Z

Oh boy... and while doing this work I see right smack dab in the middle of this, this:

machinelearning/src/Microsoft.ML.Data/DataLoadSave/TransformerChain.cs

Line 64 in e00d19d

ITransformer[] ITransformerChainAccessor.Transformers => _transformers;

A mutable array, right in the middle of a transformer. Given the tight time constraints I'm somewhat disinclined to file a separate issue and PR for that specifically.

Edit: Whoops, explicit interface. It's fine. 😄

TomFinley · 2019-03-20T19:21:41Z

Separate note: Another thing I notice is that when we sometimes save a transformer, when we deserialize we wind up with a chain of that transformer, which is not quite what I expected, so I'm going to make it so that the loader and saver save the same type of thing.

This badness actually worked both ways: when you saved an IDataLoader that happened to be a composite loader, then used the load, it would actually decompose the loader for you, even though you had explicitly told it to save a single loader! Anyway, that is also now fixed.

glebuk · 2019-03-21T00:06:28Z

Note that with the current code, if you load model without IDataLoader using the wrong .Load() method,
you would get a very "helpful" exception:
System.InvalidOperationException Message=Model does not contain an IDataLoader FormatException: **Corrupt model file**

This is not expected and completely confusing.

TomFinley · 2019-03-21T14:35:01Z

Note that with the current code, if you load model without IDataLoader using the wrong .Load() method,
you would get a very "helpful" exception:
System.InvalidOperationException Message=Model does not contain an IDataLoader FormatException: **Corrupt model file**

This is not expected and completely confusing.

Well, anyway, I'm getting rid of that method as described above, and LoadWithDataLoader will be the only way to service this scenario going forward, and the load/save methods will treat ITransformer as the first class citizen from now on, with description of the inputs (whether loaders or schemas) being considered secondary information. So hopefully the situation will be less inherently confusing, which I agree it was. (I generally find overloads that return different types are often confusing.)

On top of that, I also made the advice in the exception message more descriptive, to indicate why this could be so, and to suggest an alternate way.

throw _env.Except(ex, "Model does not contain an " + nameof(IDataLoader<IMultiStreamSource>) +
    ". Perhaps this was saved with an " + nameof(DataViewSchema) + ", or even no information on its input at all. " +
    "Consider using the " + nameof(Load) + " method instead.");

Regarding whether it was confusing or not, I'm not sure. Certainly the message was completely accurate, but as we see I added a lot more description and details. Anyway I'll post the PR soon...

TomFinley self-assigned this Mar 20, 2019

This was referenced Mar 21, 2019

Remove model saving/loading inconsistencies #3044

Merged

SlotNames for TextLoader are lost #2663

Closed

shauheen added this to the 0319 milestone Mar 21, 2019

TomFinley closed this as completed in #3044 Mar 22, 2019

TomFinley mentioned this issue Mar 26, 2019

Rename Save API for DataLoader to SaveDataLoader. #3019

Closed

ghost locked as resolved and limited conversation to collaborators Mar 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

About models, ITransform, and IDataLoader, saving/loading #3025

About models, ITransform, and IDataLoader, saving/loading #3025

TomFinley commented Mar 19, 2019 •

edited

Loading

TomFinley commented Mar 20, 2019

yaeldekel commented Mar 20, 2019 •

edited

Loading

TomFinley commented Mar 20, 2019 •

edited

Loading

TomFinley commented Mar 20, 2019 •

edited

Loading

TomFinley commented Mar 20, 2019

glebuk commented Mar 21, 2019

TomFinley commented Mar 21, 2019

About models, ITransform, and IDataLoader, saving/loading #3025

About models, ITransform, and IDataLoader, saving/loading #3025

Comments

TomFinley commented Mar 19, 2019 • edited Loading

TomFinley commented Mar 20, 2019

yaeldekel commented Mar 20, 2019 • edited Loading

TomFinley commented Mar 20, 2019 • edited Loading

TomFinley commented Mar 20, 2019 • edited Loading

TomFinley commented Mar 20, 2019

glebuk commented Mar 21, 2019

TomFinley commented Mar 21, 2019

TomFinley commented Mar 19, 2019 •

edited

Loading

yaeldekel commented Mar 20, 2019 •

edited

Loading

TomFinley commented Mar 20, 2019 •

edited

Loading

TomFinley commented Mar 20, 2019 •

edited

Loading