Skip to content

About models, ITransform, and IDataLoader, saving/loading #3025

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
TomFinley opened this issue Mar 19, 2019 · 7 comments
Closed

About models, ITransform, and IDataLoader, saving/loading #3025

TomFinley opened this issue Mar 19, 2019 · 7 comments
Assignees
Milestone

Comments

@TomFinley
Copy link
Contributor

TomFinley commented Mar 19, 2019

So, I had a conversation with @eerhardt and @yaeldekel, about the nature of models, in particular relating to the saving and loading routines. This is very important for us to get right, since the artifacts of what we learn and how we transform data, and their persistability, is probably the most important thing we have to do correctly.

We take the view initially that the model is the ITransformer (note that a chain of ITransformers is itself an ITransformer). But, by itself this is an insufficient description, was we saw in in #2663 and its subsequent "child" #2735, from the point of view of model being practically "the stuff you need to keep," there's a lot more to a machine learning model than merely just the ITransformer chain -- you also need to preserve some notion of what the input is to that. So we added these things to take either a loader, or the input schema, to be saved as part of the model.

Yet, is the loader a model itself? Sometimes that's precisely what we call it:

public void Save<TSource>(IDataLoader<TSource> model, Stream stream)

And in the same file we call it something else:

public void Save<TSource>(IDataLoader<TSource> loader, ITransformer model, Stream stream) =>

It is a model in one sense because it is yet another things that takes input data and produces output data -- the fact that ITransformer does it specifically over IDataView as input specifically does not give it some magical, special status to allow it to be called a model, to the exclusion of other candidates. If I take a loader, and append a transform to it, then the whole aggregate thing is still a loader. If it isn't a model, it only isn't one by the mere skin of its teeth. Hence the presence of the original thing, and why we have in the model operations catalog operations to save and load IDataLoader itself specifically.

But at the same time this duality of the term "model" is, as I understand @eerhardt, confusing. We have two things we're calling model. In an ideal world, I feel like if we can get away with just one story of what the model is, we should take it, and if there must be only one it must be ITransformer. We even have the situation where if you have mlContext.Model.Save(, the first thing that pops up is the IDataLoader thing, which is kind of strange.

I am not quite sure what I think, since in this case I agree with whoever talks to me last with an even a vaguely convincing argument. But I think in this case I will see about getting rid of the IDataLoader-only APIs -- people can, if it is important, continue to save and load such things by using empty ITransformer chains (again, any chain of ITransformer is itself an ITransformer, including the empty chain).

Since we are approaching v1, I view it as a bit more important to be conservative w.r.t. potentially confusing additions to the API, especially around something as central as the saving and loading of models. We might be able to add it back later if there's some really compelling scenario for them, that we somehow did not anticipate.

We will of course retain the saving and loading of transformers with loaders, since that is really important to be able to capture, but I think being consistent around the story that "models are transformers" as we are most places is kind of important.

@TomFinley
Copy link
Contributor Author

A side effect of this work is that we still want to support a loader only pipeline -- we just don't want to encourage that as being somehow the primary use case. (Which, against our desires, has arguably happened with how intellisense is presenting things.) So I will support as a convenience null as an input for ITransformer, with the documentation that this is intended as a shorthand for an empty chain (which is to say, the trivial transformer that returns its input as output).

There's also an annoying inconsistency in the API: we describe the inputs in different positions. Here we see the input coming first.

public void Save<TSource>(IDataLoader<TSource> loader, ITransformer model, Stream stream) =>

Here it comes second.

public void Save(ITransformer model, DataViewSchema inputSchema, string filePath)

I feel like we ought to be consistent. While the order in which things are done suggests putting the input first (since input feeds into transformer), I wonder if emphasizing the common thing they all have in common (they take ITransformer) as the more fundamental object is not more correct. I do not have a strong opinion here, but will provisionally perhaps put ITransformer first.

@TomFinley TomFinley self-assigned this Mar 20, 2019
@yaeldekel
Copy link

yaeldekel commented Mar 20, 2019

I agree that too many save/load APIs can be confusing (especially if the first one that shows up in intellisense is the one least commonly used). I added the IDataLoader-only API for convenience since sometimes the whole pipeline can be contained in an IDataLoader of type CompositeDataLoader, so if it makes things inconvenient, it should definitely not be there :-).
I would perhaps suggest that we add a property TransformerChain.Empty. It may be a bit more convenient in the case of saving a loader when there are no transformers in the pipeline. It would be this:

ml.Model.Save(loader, TransformerChain.Empty)

instead of this:

ml.Model.Save(loader, new TransformerChain<ITransformer>());

Although if there is a way to put the Empty property somewhere more discoverable, that would be nice. I can't think of anywhere...

Sorry @TomFinley, I guess I should have read your comment before writing mine... passing null and instantiating the empty transform chain internally is even easier than what I suggested :-).

@TomFinley
Copy link
Contributor Author

TomFinley commented Mar 20, 2019

Sorry @TomFinley, I guess I should have read your comment before writing mine... passing null and instantiating the empty transform chain internally is even easier than what I suggested :-).

Cool thanks @yaeldekel.

Separate note: Another thing I notice is that when we sometimes save a transformer, when we deserialize we wind up with a chain of that transformer, which is not quite what I expected, so I'm going to make it so that the loader and saver save the same type of thing.

@TomFinley
Copy link
Contributor Author

TomFinley commented Mar 20, 2019

Oh boy... and while doing this work I see right smack dab in the middle of this, this:

ITransformer[] ITransformerChainAccessor.Transformers => _transformers;

A mutable array, right in the middle of a transformer. Given the tight time constraints I'm somewhat disinclined to file a separate issue and PR for that specifically.

Edit: Whoops, explicit interface. It's fine. 😄

@TomFinley
Copy link
Contributor Author

Separate note: Another thing I notice is that when we sometimes save a transformer, when we deserialize we wind up with a chain of that transformer, which is not quite what I expected, so I'm going to make it so that the loader and saver save the same type of thing.

This badness actually worked both ways: when you saved an IDataLoader that happened to be a composite loader, then used the load, it would actually decompose the loader for you, even though you had explicitly told it to save a single loader! Anyway, that is also now fixed.

@glebuk
Copy link
Contributor

glebuk commented Mar 21, 2019

Note that with the current code, if you load model without IDataLoader using the wrong .Load() method,
you would get a very "helpful" exception:
System.InvalidOperationException Message=Model does not contain an IDataLoader FormatException: **Corrupt model file**

This is not expected and completely confusing.

@TomFinley
Copy link
Contributor Author

Note that with the current code, if you load model without IDataLoader using the wrong .Load() method,
you would get a very "helpful" exception:
System.InvalidOperationException Message=Model does not contain an IDataLoader FormatException: **Corrupt model file**

This is not expected and completely confusing.

Well, anyway, I'm getting rid of that method as described above, and LoadWithDataLoader will be the only way to service this scenario going forward, and the load/save methods will treat ITransformer as the first class citizen from now on, with description of the inputs (whether loaders or schemas) being considered secondary information. So hopefully the situation will be less inherently confusing, which I agree it was. (I generally find overloads that return different types are often confusing.)

On top of that, I also made the advice in the exception message more descriptive, to indicate why this could be so, and to suggest an alternate way.

throw _env.Except(ex, "Model does not contain an " + nameof(IDataLoader<IMultiStreamSource>) +
    ". Perhaps this was saved with an " + nameof(DataViewSchema) + ", or even no information on its input at all. " +
    "Consider using the " + nameof(Load) + " method instead.");

Regarding whether it was confusing or not, I'm not sure. Certainly the message was completely accurate, but as we see I added a lot more description and details. Anyway I'll post the PR soon...

@shauheen shauheen added this to the 0319 milestone Mar 21, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants