
Add save/load APIs for IDataLoader #2858


Merged 18 commits into dotnet:master on Mar 18, 2019

Conversation

yaeldekel

Fixes #2735.


yaeldekel commented Mar 5, 2019

Copying all the comments from #2850. #Resolved


yaeldekel commented Mar 5, 2019

    public void SaveTo(IHostEnvironment env, Stream outputStream)

Do we want this public API on CompositeDataLoader?
#Resolved


Refers to: src/Microsoft.ML.Data/DataLoadSave/CompositeDataLoader.cs:104 in 495740e. [](commit_id = 495740e, deletion_comment = False)

@@ -32,13 +32,34 @@ internal ModelOperationsCatalog(IHostEnvironment env)
/// <param name="stream">A writeable, seekable stream to save to.</param>
public void Save(ITransformer model, Stream stream) => model.SaveTo(_env, stream);

public void Save<TSource>(IDataLoader<TSource> model, Stream stream)
@yaeldekel yaeldekel Mar 5, 2019
Save [](start = 20, length = 4)

Add XML comments for new public APIs. #Resolved

/// <summary>
/// Load the model from the stream.
/// </summary>
/// <param name="stream">A readable, seekable stream to load from.</param>
/// <returns>The loaded model.</returns>
public ITransformer Load(Stream stream) => TransformerChain.LoadFrom(_env, stream);

public IDataLoader<IMultiStreamSource> LoadAsCompositeDataLoader(Stream stream)
@yaeldekel yaeldekel Mar 5, 2019

IDataLoader [](start = 15, length = 11)

The subtle differences between usage here and usage of the above might be a bit of a hard pill to swallow. #Closed


yaeldekel commented Mar 5, 2019

@TomFinley commented in #2850: Were we going to try to address the problem of saving schemas? Not all data sources we train a transform chain with will be coming from a loader. #Resolved

@yaeldekel

Yes, I will address that in the next iterations.


In reply to: 469851370 [](ancestors = 469851370)

/// <summary>
/// Load the model from the stream.
/// </summary>
/// <param name="stream">A readable, seekable stream to load from.</param>
/// <returns>The loaded model.</returns>
public ITransformer Load(Stream stream) => TransformerChain.LoadFrom(_env, stream);

public IDataLoader<IMultiStreamSource> LoadAsCompositeDataLoader(Stream stream)
@yaeldekel yaeldekel Mar 5, 2019

public IDataLoader LoadAsCompositeDataLoader(Stream stream) [](start = 8, length = 79)

Should this return CompositeDataLoader instead of IDataLoader? The name is: LoadAsCompositeDataLoader, which makes me think it should return a CompositeDataLoader
#Resolved

@@ -57,19 +57,19 @@
"Naive Calibration Executor",
NaiveCalibrator.LoaderSignature)]

[assembly: LoadableClass(typeof(ValueMapperCalibratedModelParameters<IPredictorProducing<float>, ICalibrator>), null, typeof(SignatureLoadModel),
[assembly: LoadableClass(typeof(ValueMapperCalibratedModelParameters<object, ICalibrator>), null, typeof(SignatureLoadModel),
@yaeldekel yaeldekel Mar 5, 2019

object [](start = 69, length = 6)

@TomFinley's comment from #2850:
This is a bit stronger medicine than what I was suggesting! (I don't object per se, but it did entail a bit more work.) I suppose it's a bit more "honest" than the old way -- marker interfaces are more or less a promise you can't keep -- but they are still useful on internal code where you have complete control over the implementation (as we do as they are internal).

So I think this should be fine, I just wonder, are we fully comfortable with this? #Resolved

Author

But this is in the public API, since when we load it from a serialized model this would be the type of the object. And then if you want to access SubModel (like in the test I added LoadModelAndExtractPredictor) you can't do it if it is generic on IPredictorProducing.

Another option is to introduce an abstract class CalibratedModelParametersBase that is not generic, where SubModel is an object (I made this change to one of the classes in the last commit, so we can decide which option we prefer).

@TomFinley, @eerhardt


In reply to: 262684849 [](ancestors = 262684849)

Author

If we decide to go with the abstract base class, I can do something similar for the PredictionTransformers, except there it would have to be an interface... between ITransformer and IPredictionTransformer<TModel>.


In reply to: 262739008 [](ancestors = 262739008,262684849)

@TomFinley TomFinley Mar 6, 2019

The other way of course is to just have the type restriction. Then these things become IPredictionTransformer<object> via covariance/contravariance (can't remember which this is at the moment). In fact I'd kind of like that even if we go this route, and introduce a "super-interface."
#Resolved

@TomFinley TomFinley Mar 6, 2019

I think we kind of needed it in the case you mentioned though, because calibrated models are a class, not an interface, and there's no such thing as these in/out modifiers on type parameters for classes as there are with interfaces. If we can use that to our advantage to avoid complicating the type hierarchy, I think I'd prefer to. I also feel pretty strongly that given that we know absolutely nothing about model parameters in more situations than just this, we have to at least be able to say that they are object. I think that's a reasonable ask? #Resolved

@TomFinley TomFinley Mar 6, 2019

Actually I'm thinking about this now, and I don't like the idea of there being a super-interface, because we already have this hierarchy:

public interface IPredictionTransformer<out TModel> : ITransformer

And here:

public interface ISingleFeaturePredictionTransformer<out TModel> : IPredictionTransformer<TModel>

We could have another interface, another non-generic IPredictionTransformer with object Model { get; }, and another non-generic ISingleFeaturePredictionTransformer that descends from it, with the generic IPredictionTransformer. Or we could just make things so that we know we can always call them IPredictionTransformer<object>. #Resolved
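The covariance option can be seen in a small self-contained sketch. The types below are minimal stand-ins for illustration, not the real ML.NET definitions: because TModel is declared `out`, any concrete IPredictionTransformer<T> (for reference-type T) already converts to IPredictionTransformer<object>, so no extra non-generic "super-interface" is needed.

```csharp
using System;

// Hypothetical stand-ins for the interfaces under discussion.
public interface ITransformer { }

// `out TModel` makes the interface covariant in TModel.
public interface IPredictionTransformer<out TModel> : ITransformer
    where TModel : class
{
    TModel Model { get; }
}

public sealed class LinearModelParameters { }

public sealed class LinearPredictionTransformer
    : IPredictionTransformer<LinearModelParameters>
{
    public LinearModelParameters Model { get; } = new LinearModelParameters();
}

public static class Program
{
    public static void Main()
    {
        // Pretend this came back from deserialization, typed only as ITransformer.
        ITransformer loaded = new LinearPredictionTransformer();

        // Covariance lets us recover the model parameters as `object`,
        // without knowing the concrete TModel.
        if (loaded is IPredictionTransformer<object> pt)
            Console.WriteLine(pt.Model.GetType().Name); // LinearModelParameters
    }
}
```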

@@ -32,13 +32,34 @@ internal ModelOperationsCatalog(IHostEnvironment env)
/// <param name="stream">A writeable, seekable stream to save to.</param>
public void Save(ITransformer model, Stream stream) => model.SaveTo(_env, stream);

public void Save<TSource>(IDataLoader<TSource> model, Stream stream)
@yaeldekel yaeldekel Mar 5, 2019

public void Save(IDataLoader model, Stream stream) [](start = 8, length = 68)

@TomFinley's comment from #2850:
This, I think, is not correct... we need to save both the loader (or, failing that, the input schema) as well as the transform model to a stream. Not two separate things.

I recall the mistake that was made with predictor models and transform models in entry-points, where we viewed these as single things saved to a stream, as opposed to having to capture the entire pipeline. This mistake was inherited by ML.NET v0.1, which led to all sorts of grief when people realized, yes, saving the loader is really freakin' important. So I think we're just making the same mistake again. A model isn't just a transform model, or a loader, or an input schema, it I think needs to capture all of them, and I think we need a change a bit more fundamental than merely adding a few methods here and there. We could have done that after v1.0. I think the problem is that the API here is just misleading.

#Resolved

Author

This is an API to save an IDataLoader, but it can be used to save the entire model like this:

var loader = ml.Model.CreateTextLoader(...);
var pipeline = loader.Append(/* define a transformer here */);
var model = pipeline.Fit(/* give it an IMultiStreamSource */);
ml.Model.Save(model, fs);

model here is a CompositeDataLoader.
Perhaps I should define this method to take a CompositeDataLoader instead of an IDataLoader?


In reply to: 262685374 [](ancestors = 262685374)

Author

I will add the code above to the test.


In reply to: 262756744 [](ancestors = 262756744,262685374)

@TomFinley TomFinley Mar 6, 2019

I could be misunderstanding you, let me know if so, but I think I'm looking at the problem a bit differently. I believe what you're saying is that it's possible to do the right thing, and I acknowledge that, but I'd like our API to be changed in such a way that doing the right thing (saving the input schema with the model) becomes easy and convenient, and doing the wrong thing (not saving the input schema) becomes less convenient, or even impossible. (Since if an ITransformer was trained, surely there must have been an input schema somewhere in that process, and that is really important information.)

So, what I am saying is the current scheme for saving and loading models that takes a stream and an ITransformer and that's it, should go away, since it is confusing and leading people down the wrong path. It's not that I think your change doesn't make the right thing possible, I'm saying that by having the old way alongside it we're going to forever have the question of, "hey, I don't know my input features now after loading a model, how do I get them," and we have to forever explain, "yeah, we have this booby trap in our code, you should have used this other method, your model is now incomplete forever until you find some way to independently recreate the input schema." This will continue to be a problem, because, on the face of it, it may not be obvious to people why merely saving an ITransformer is insufficient. I'd rather never get that question, and I'd do that by removing the current API and replacing it with something better, that always takes a schema or loader. (And on the flip side of loading, always returns a loader or schema.) #Resolved
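The surface being argued for might look roughly like the following declaration sketch. All names and signatures here are illustrative assumptions, not the API that was merged: saving always captures the input side (a loader, a schema, or a deliberate null), and loading always hands it back.

```csharp
using System.IO;

// Stand-in types for illustration only.
public sealed class DataViewSchema { }
public interface ITransformer { }
public interface IMultiStreamSource { }
public interface IDataLoader<TSource> { }

// Hypothetical catalog surface: there is no overload that saves an
// ITransformer alone, so callers must decide what to do about the schema.
public interface IModelOperations
{
    void Save(ITransformer model, DataViewSchema inputSchema, Stream stream);
    void Save<TSource>(IDataLoader<TSource> loader, ITransformer model, Stream stream);

    // inputSchema may be null if the model was deliberately saved without one.
    ITransformer Load(Stream stream, out DataViewSchema inputSchema);
}
```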


codecov bot commented Mar 5, 2019

Codecov Report

Merging #2858 into master will increase coverage by 0.09%.
The diff coverage is 88.56%.

@@            Coverage Diff             @@
##           master    #2858      +/-   ##
==========================================
+ Coverage   72.27%   72.36%   +0.09%     
==========================================
  Files         802      803       +1     
  Lines      142888   143257     +369     
  Branches    16132    16151      +19     
==========================================
+ Hits       103273   103675     +402     
+ Misses      35196    35161      -35     
- Partials     4419     4421       +2
Flag Coverage Δ
#Debug 72.36% <88.56%> (+0.09%) ⬆️
#production 68.08% <80.93%> (+0.07%) ⬆️
#test 88.53% <95.23%> (+0.05%) ⬆️
Impacted Files Coverage Δ
src/Microsoft.ML.Core/Data/IEstimator.cs 84.53% <ø> (ø) ⬆️
src/Microsoft.ML.TimeSeries/PredictionFunction.cs 87.2% <ø> (ø) ⬆️
...ansforms/PermutationFeatureImportanceExtensions.cs 97.93% <ø> (ø) ⬆️
...soft.ML.Transforms/PermutationFeatureImportance.cs 60% <ø> (ø) ⬆️
...c/Microsoft.ML.Data/Utilities/ComponentCreation.cs 42.98% <ø> (+1.8%) ⬆️
...rc/Microsoft.ML.FastTree/TreeEnsembleFeaturizer.cs 88.64% <0%> (ø) ⬆️
...crosoft.ML.StandardTrainers/Standard/SdcaBinary.cs 72.68% <0%> (ø) ⬆️
...icrosoft.ML.Data/EntryPoints/PredictorModelImpl.cs 72.22% <0%> (ø) ⬆️
...StandardTrainers/Standard/StochasticTrainerBase.cs 88.37% <0%> (ø) ⬆️
...icrosoft.ML.Data/EntryPoints/SummarizePredictor.cs 72.54% <100%> (ø) ⬆️
... and 50 more

@yaeldekel yaeldekel requested review from TomFinley and eerhardt March 5, 2019 23:44
@artidoro artidoro self-requested a review March 6, 2019 00:22
@@ -236,7 +236,7 @@ IRowToRowMapper ITransformer.GetRowToRowMapper(DataViewSchema inputSchema)
/// <summary>
/// Saving/loading routines for transformer chains.
/// </summary>
internal static class TransformerChain
public static class TransformerChain
@TomFinley TomFinley Mar 6, 2019

TransformerChain [](start = 24, length = 16)

Why is this now public? This is definitely not something we intended for public consumption. We'd prefer for these things to be done via the MLContext's Model property, I'd think. This is not a suitable public API. #Resolved

@@ -32,13 +32,34 @@ internal ModelOperationsCatalog(IHostEnvironment env)
/// <param name="stream">A writeable, seekable stream to save to.</param>
public void Save(ITransformer model, Stream stream) => model.SaveTo(_env, stream);
@TomFinley TomFinley Mar 6, 2019

Save [](start = 20, length = 4)

I'd also get rid of this. #Resolved

using Xunit;

namespace Microsoft.ML.Tests.Scenarios.Api
{
@artidoro artidoro Mar 6, 2019

Should these go in Microsoft.ML.Functional.Tests?
They seem to be in the same spirit of Rogan's tests. #Resolved

}

[Fact]
public void SaveAndLoadModelWithLoader()
@artidoro artidoro Mar 6, 2019

Could we also have a test for the scenario where the loader is part of the pipeline and we try to save the entire pipeline? Something that looks like the following:

var pipeline = mlContext.Data.CreateTextLoader(....)
                        .Append(mlContext.Transforms.Conversions.Hash(...)
                        .Append(....)

var model = pipeline.Fit(stream);

mlContext.Model.Save(model, fs);

#Resolved

Contributor

Would be good to make sure we can then use LoadAsCompositeDataLoader.


In reply to: 263153288 [](ancestors = 263153288)

@@ -13,6 +13,7 @@
[assembly: InternalsVisibleTo(assemblyName: "Microsoft.ML.InferenceTesting" + PublicKey.TestValue)]
[assembly: InternalsVisibleTo(assemblyName: "Microsoft.ML.StaticPipelineTesting" + PublicKey.TestValue)]
[assembly: InternalsVisibleTo(assemblyName: "Microsoft.ML.OnnxTransformerTest" + PublicKey.TestValue)]
[assembly: InternalsVisibleTo(assemblyName: "Microsoft.ML.Functional.Tests" + PublicKey.TestValue)]
Contributor

So, the express purpose of this "functional tests" project is that it does not have access to internals. This is to avoid the problem of us inadvertently adding something we need to internals, so we have one project where we can be sure: "these tests are using the actual power of the public API." See ref #2470 and its motivations. Maybe @rogancarr and @eerhardt can express more about this, though I think the motivations are captured adequately by the PR and its linked issue.

Member

YES! This should be noted to ALL devs - we must NEVER have InternalsVisibleTo the Functional.Tests. That defeats the purpose.

Note the comment in the Functional.Tests.csproj:

<PropertyGroup>
<!-- We are turning off strong naming to ensure we never add `InternalsVisibleTo` for these tests -->
<SignAssembly>false</SignAssembly>
<PublicSign>false</PublicSign>
</PropertyGroup>

Maybe we need to make that comment stand out more. @yaeldekel - can you make it ALL CAPS or something that will catch people's eye, so they don't make the same change/mistake you just did? Maybe put it before and after the SignAssembly property.

  <PropertyGroup>
    <!-- DON'T CHANGE THIS!!! We are turning off strong naming to ensure we never add `InternalsVisibleTo` for these tests -->
    <SignAssembly>false</SignAssembly>
    <!-- DON'T CHANGE THIS!!! We are turning off strong naming to ensure we never add `InternalsVisibleTo` for these tests -->
    <PublicSign>false</PublicSign>
  </PropertyGroup>

@@ -250,13 +250,15 @@ private static TransformerChain<ITransformer> Create(IHostEnvironment env, Model
public static void SaveTo(this ITransformer transformer, IHostEnvironment env, Stream outputStream)
=> new TransformerChain<ITransformer>(transformer).SaveTo(env, outputStream);

public static TransformerChain<ITransformer> LoadFrom(IHostEnvironment env, Stream stream)
public static ITransformer LoadFrom(IHostEnvironment env, Stream stream)
@TomFinley TomFinley Mar 12, 2019

ITransformer [](start = 22, length = 12)

Would it have been bad, in the case where the source wasn't a chain (the @"Model" branch below), to just put it in a chain? As far as I understand, this is for backcompat. #Resolved

Author

I ended up doing the opposite of that - so the transform chain is never nested in the Model directory.


In reply to: 264740060 [](ancestors = 264740060)

ModelLoadContext.LoadModel<TransformerChain<ITransformer>, SignatureLoadModel>(env, out var transformerChain, rep, LoaderSignature);
ModelLoadContext.LoadModelOrNull<ITransformer, SignatureLoadModel>(env, out var transformerChain, rep, LoaderSignature);
if (transformerChain == null)
ModelLoadContext.LoadModel<ITransformer, SignatureLoadModel>(env, out transformerChain, rep, $@"Model\{LoaderSignature}");
@TomFinley TomFinley Mar 12, 2019

$ [](start = 117, length = 1)

I think this formatting is somewhat locale sensitive. I might have preferred the simpler "Model\\" + LoaderSignature. #Resolved
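As an aside, sketched with a hypothetical signature value: for plain string operands the interpolated form and the suggested concatenation produce identical strings; culture dependence only enters once formattable values such as numbers are interpolated.

```csharp
using System;

public static class Program
{
    public static void Main()
    {
        const string LoaderSignature = "PIPELODR"; // hypothetical value

        // Verbatim interpolated string vs. plain concatenation: same result.
        string a = $@"Model\{LoaderSignature}";
        string b = "Model\\" + LoaderSignature;
        Console.WriteLine(a == b); // True

        // Culture matters only when formatting values like numbers:
        Console.WriteLine($"{1.5}"); // "1.5" or "1,5" depending on CurrentCulture
    }
}
```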

/// <returns>The loaded model.</returns>
public ITransformer Load(Stream stream) => TransformerChain.LoadFrom(_env, stream);
public ITransformer Load(Stream stream, out DataViewSchema inputSchema)
@TomFinley TomFinley Mar 12, 2019

DataViewSchema inputSchema [](start = 52, length = 26)

Sometimes this will not be present. Should we clarify that this can be null in some cases? #Resolved

/// <returns>The loaded model.</returns>
public ITransformer Load(Stream stream) => TransformerChain.LoadFrom(_env, stream);
public ITransformer Load(Stream stream, out DataViewSchema inputSchema)
@TomFinley TomFinley Mar 12, 2019

Load(Stream stream, out DataViewSchema inputSchema [](start = 28, length = 50)

So I see above we have methods to save either a loader, or a schema, with a transform. I see this method to load the transform, with an input schema. What about to load the transformer, with the loader? #Resolved

Author

I am adding a Load overload that produces a transformer and a loader. If the model was saved by saving the transformer+schema, do we prefer throwing an exception in this case (since there is no loader), or defining this method as TryLoad?


In reply to: 264742808 [](ancestors = 264742808)

Contributor

The usage of Try suggests there's something wrong with the approach, and that the absence of a schema represents failure. I'm not sure I agree with that -- I think the situation where you have a transformer with no loader or input schema is a valid one we should allow, just not encourage (since it is a minority scenario). Now then, I think we should encourage people to think about saving it, and I think a good way to encourage people to make a deliberate decision is to force them to consciously feed us a null, as suggested above. But that also means that we have to allow for the possibility upon loading that the schema will be null, and I'd think we'd call this not an error condition. What you think?


In reply to: 265196459 [](ancestors = 265196459,264742808)

@@ -13,6 +17,11 @@ namespace Microsoft.ML
/// </summary>
public sealed class ModelOperationsCatalog : IInternalCatalog
{
internal const string LoaderDirectory = "Loader";
@TomFinley TomFinley Mar 12, 2019

LoaderDirectory [](start = 30, length = 15)

So, as far as I can see, these are used entirely from CompositeDataLoader, though maybe I missed something. So why are they here? #Resolved

@sfilipi sfilipi added this to the 0319 milestone Mar 12, 2019
/// <param name="stream">A readable, seekable stream to load from.</param>
/// <param name="loader">The data loader from the model stream.</param>
/// <returns>The transformer model from the model stream.</returns>
public ITransformer Load(Stream stream, out IDataLoader<IMultiStreamSource> loader)
@TomFinley TomFinley Mar 18, 2019

Load [](start = 28, length = 4)

It would be nice if this was LoadWithDataLoader, so that we could have mlContext.Model.Load(stream, out var thing) without ambiguity. By having it be the same name, out var won't work, which requires callers to declare a variable with a fairly elaborate type ahead of time. #Resolved
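The naming point can be illustrated with stand-in types (hypothetical, not ML.NET's): if both methods were named Load and differed only in their out parameter type, a call like Load(stream, out var x) could not infer the type of x and would fail overload resolution; distinct names keep out var usable.

```csharp
using System;

// Hypothetical stand-ins for the two "input side" results.
public sealed class Schema { public override string ToString() => "schema"; }
public sealed class Loader { public override string ToString() => "loader"; }

public static class ModelCatalog
{
    public static string Load(string stream, out Schema inputSchema)
    {
        inputSchema = new Schema();
        return "transformer";
    }

    // If this were also named Load, `out var` calls would be ambiguous.
    public static string LoadWithDataLoader(string stream, out Loader loader)
    {
        loader = new Loader();
        return "transformer";
    }
}

public static class Program
{
    public static void Main()
    {
        // `out var` works because each name has exactly one overload.
        var t1 = ModelCatalog.Load("model.zip", out var schema);
        var t2 = ModelCatalog.LoadWithDataLoader("model.zip", out var loader);
        Console.WriteLine($"{t1} {schema} {t2} {loader}");
    }
}
```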

@@ -46,6 +46,57 @@ internal static class DataViewConstructionUtils
return new StreamingDataView<TRow>(env, data, internalSchemaDefn);
}

public static StreamingDataView<TRow> CreateFromEnumerable<TRow>(IHostEnvironment env, IEnumerable<TRow> data,
Contributor

CreateFromEnumerable [](start = 46, length = 20)

So, I don't object exactly (it seems positive), but I'm not sure how this relates to the original issue?

Author

I added these so that I could add tests showing what you can do with the schema once you load it with the new API.


In reply to: 266546399 [](ancestors = 266546399)

@TomFinley TomFinley left a comment

Thanks @yaeldekel !

@Ivanidzo4ka Ivanidzo4ka left a comment

:shipit:

/// <param name="stream">A readable, seekable stream to load from.</param>
/// <returns>A model of type <see cref="CompositeDataLoader{IMultiStreamSource, ITransformer}"/> containing the loader
/// and the transformer chain.</returns>
public IDataLoader<IMultiStreamSource> Load(Stream stream)
Contributor

IDataLoader [](start = 15, length = 11)

It's kind of sad that we now have IDataLoader alongside ITransformer, since it increases complexity. It would be nice to have some documentation/samples explaining in which cases you need to save/load one thing, and in which cases the other.

Also, can I now have slot names in my model?

@yaeldekel yaeldekel merged commit 665a366 into dotnet:master Mar 18, 2019
@yaeldekel yaeldekel deleted the saveloader branch March 18, 2019 22:08
@ghost ghost locked as resolved and limited conversation to collaborators Mar 23, 2022
7 participants