Add save/load APIs for IDataLoader #2858
Copying all the comments from #2850. #Resolved
```diff
@@ -32,13 +32,34 @@ internal ModelOperationsCatalog(IHostEnvironment env)
 /// <param name="stream">A writeable, seekable stream to save to.</param>
 public void Save(ITransformer model, Stream stream) => model.SaveTo(_env, stream);

+public void Save<TSource>(IDataLoader<TSource> model, Stream stream)
```
> `Save`
Add XML comments for the new public APIs. #Resolved
```diff
 /// <summary>
 /// Load the model from the stream.
 /// </summary>
 /// <param name="stream">A readable, seekable stream to load from.</param>
 /// <returns>The loaded model.</returns>
 public ITransformer Load(Stream stream) => TransformerChain.LoadFrom(_env, stream);

+public IDataLoader<IMultiStreamSource> LoadAsCompositeDataLoader(Stream stream)
```
> `IDataLoader`
The subtle differences between usage here and usage of the above might be a bit of a hard pill to swallow. #Closed
@TomFinley commented in #2850: Were we going to try to address the problem of saving schemas? Not all data sources we train a transform chain with will be coming from a loader. #Resolved

Yes, I will address that in the next iterations.

In reply to: 469851370
> `public IDataLoader LoadAsCompositeDataLoader(Stream stream)`
Should this return `CompositeDataLoader` instead of `IDataLoader`? The name is `LoadAsCompositeDataLoader`, which makes me think it should return a `CompositeDataLoader`. #Resolved
```diff
@@ -57,19 +57,19 @@
 "Naive Calibration Executor",
 NaiveCalibrator.LoaderSignature)]

-[assembly: LoadableClass(typeof(ValueMapperCalibratedModelParameters<IPredictorProducing<float>, ICalibrator>), null, typeof(SignatureLoadModel),
+[assembly: LoadableClass(typeof(ValueMapperCalibratedModelParameters<object, ICalibrator>), null, typeof(SignatureLoadModel),
```
> `object`
@TomFinley's comment from #2850:
This is a bit stronger medicine than what I was suggesting! (I don't object per se, but it did entail a bit more work.) I suppose it's a bit more "honest" than the old way -- marker interfaces are more or less a promise you can't keep -- but they are still useful on internal code where you have complete control over the implementation (as we do as they are internal).
So I think this should be fine, I just wonder, are we fully comfortable with this? #Resolved
But this is in the public API, since when we load it from a serialized model this would be the type of the object. And then if you want to access `SubModel` (like in the test I added, `LoadModelAndExtractPredictor`), you can't do it if it is generic on `IPredictorProducing`.

Another option is to introduce an abstract class `CalibratedModelParametersBase` that is not generic, where `SubModel` is an `object` (I made this change to one of the classes in the last commit, so we can decide which option we prefer).

In reply to: 262684849
If we decide to go with the abstract base class, I can do something similar for the `PredictionTransformer`s, except there it would have to be an interface between `ITransformer` and `IPredictionTransformer<TModel>`.

In reply to: 262739008
The other way of course is to just have the type restriction. Then these things become `IPredictionTransformer<object>` via covariance/contravariance (can't remember which this is at the moment). In fact I'd kind of like that even if we go this route. #Resolved
The other way of course is to just have the type restriction. Then these things become `IPredictionTransformer<object>` via covariance/contravariance (can't remember which this is at the moment). In fact I'd kind of like that even if we go this route, and introduce a "super-interface." #Resolved
I think we kind of needed it in the case you mentioned though, because calibrated models are a class, not an interface, and there's no such thing as these `in`/`out` modifiers on type parameters for classes as there are with interfaces. If we can use that to our advantage to avoid complicating the type hierarchy, I think I'd prefer to. I also feel pretty strongly that, given that we know absolutely nothing about model parameters in more situations than just this, we have to at least be able to say that they are `object`. I think that's a reasonable ask? #Resolved
Actually I'm thinking about this now, and I don't like the idea of there being a super-interface, because we already have this hierarchy:

```csharp
public interface IPredictionTransformer<out TModel> : ITransformer
```

And here:

```csharp
public interface ISingleFeaturePredictionTransformer<out TModel> : IPredictionTransformer<TModel>
```

We could have another interface, another non-generic `IPredictionTransformer` with `object Model { get; }`, and another non-generic `ISingleFeaturePredictionTransformer` that descends from it, with the generic `IPredictionTransformer`. Or we could just make things so that we know we can always call them `IPredictionTransformer<object>`. #Resolved
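The covariance point being discussed can be reproduced in isolation. The sketch below is hypothetical: `LinearModel` and `LinearPredictionTransformer` are illustrative stand-ins, and the interface is a stripped-down version of the real ML.NET one, kept only to show why `out TModel` makes every prediction transformer viewable as `IPredictionTransformer<object>` without a super-interface:

```csharp
using System;

// Minimal stand-in for the real interface; TModel appears only in
// output position, so it can be declared covariant with `out`.
public interface IPredictionTransformer<out TModel>
{
    TModel Model { get; }
}

public sealed class LinearModel { }

public sealed class LinearPredictionTransformer : IPredictionTransformer<LinearModel>
{
    public LinearModel Model { get; } = new LinearModel();
}

public static class CovarianceDemo
{
    public static void Main()
    {
        IPredictionTransformer<LinearModel> typed = new LinearPredictionTransformer();

        // Because TModel is covariant, the typed transformer converts
        // implicitly to IPredictionTransformer<object>; no extra
        // non-generic "super-interface" is required.
        IPredictionTransformer<object> untyped = typed;

        Console.WriteLine(untyped.Model.GetType().Name); // prints "LinearModel"
    }
}
```

Note this only works for interfaces: as the comment above points out, classes (like the calibrated model parameters) cannot declare `out` type parameters, which is where the abstract-base-class option comes in.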
> `public void Save(IDataLoader model, Stream stream)`
@TomFinley's comment from #2850:

This I think is not correct... we need to save both the loader (or failing that, input schema) as well as the transform model to a stream. Not two separate things.

I recall the mistake that was made with predictor models and transform models in entry-points, where we viewed these as single things saved to a stream, as opposed to having to capture the entire pipeline. This mistake was inherited by ML.NET v0.1, which led to all sorts of grief when people realized, yes, saving the loader is really freakin' important. So I think we're just making the same mistake again. A model isn't just a transform model, or a loader, or an input schema; I think it needs to capture all of them, and I think we need a change a bit more fundamental than merely adding a few methods here and there. We could have done that after v1.0. I think the problem is that the API here is just misleading. #Resolved
This is an API to save an `IDataLoader`, but it can be used to save the entire model like this:

```csharp
var loader = ml.Model.CreateTextLoader(...);
var pipeline = loader.Append(/* define a transformer here */);
var model = pipeline.Fit(/* give it an IMultiStreamSource */);
ml.Model.Save(model, fs);
```

`model` here is a `CompositeDataLoader`. Perhaps I should define this method to take a `CompositeDataLoader` instead of an `IDataLoader`?

In reply to: 262685374
I could be misunderstanding you, let me know if so, but I think I'm looking at the problem a bit differently. I believe what you're saying is that it's possible to do the right thing, and I acknowledge that, but I'd like our API to be changed in such a way that doing the right thing (saving the input schema with the model) becomes easy and convenient, and doing the wrong thing (not saving the input schema) becomes less convenient, or even impossible. (Since if an `ITransformer` was trained, surely there must have been an input schema somewhere in that process, and that is really important information.)

So, what I am saying is that the current scheme for saving and loading models, which takes a stream and an `ITransformer` and that's it, should go away, since it is confusing and leads people down the wrong path. It's not that I think your change doesn't make the right thing possible. I'm saying that by having the old way alongside it, we're going to forever have the question of, "hey, I don't know my input features now after loading a model, how do I get them," and we'll have to forever explain, "yeah, we have this booby trap in our code, you should have used this other method, your model is now incomplete forever until you find some way to independently recreate the input schema." This will continue to be a problem because, on the face of it, it may not be obvious to people why merely saving an `ITransformer` is insufficient. I'd rather never get that question, and I'd do that by removing the current API and replacing it with something better that always takes a schema or loader. (And on the flip side of loading, always returns a loader or schema.) #Resolved
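A minimal sketch of the catalog shape this argues for, as one possible reading (the stub types and the `IModelSaveLoadSurface` name are hypothetical, not the actual ML.NET API): there is no transformer-only `Save` overload, so every save site must make a deliberate decision about the schema or loader, and every load hands the schema back, possibly null.

```csharp
using System.IO;

// Illustrative stub types standing in for the real ML.NET ones.
public interface ITransformerStub { }
public interface IDataLoaderStub { }
public sealed class DataViewSchemaStub { }

// Hypothetical surface: no Save(model, stream) overload exists, so a
// caller who has no schema must consciously pass null rather than
// silently omitting it.
public interface IModelSaveLoadSurface
{
    // Save with an input schema (which may be an explicit null).
    void Save(ITransformerStub model, DataViewSchemaStub inputSchema, Stream stream);

    // Save with the loader the model was trained from.
    void Save(ITransformerStub model, IDataLoaderStub loader, Stream stream);

    // Loading always surfaces the schema; it may come back null if the
    // model was deliberately saved without one.
    ITransformerStub Load(Stream stream, out DataViewSchemaStub inputSchema);
}
```

The design choice this encodes is the one argued above: the inconvenient path (no schema) stays possible but visible at the call site, rather than being the default.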
Codecov Report

```
@@            Coverage Diff             @@
##           master    #2858      +/-   ##
==========================================
+ Coverage   72.27%   72.36%   +0.09%
==========================================
  Files         802      803       +1
  Lines      142888   143257     +369
  Branches    16132    16151      +19
==========================================
+ Hits       103273   103675     +402
+ Misses      35196    35161      -35
- Partials     4419     4421       +2
```
```diff
@@ -236,7 +236,7 @@ IRowToRowMapper ITransformer.GetRowToRowMapper(DataViewSchema inputSchema)
 /// <summary>
 /// Saving/loading routines for transformer chains.
 /// </summary>
-internal static class TransformerChain
+public static class TransformerChain
```
> `TransformerChain`
Why is this now public? This is definitely not something we intended for public consumption. We'd prefer for these things to be done via the `MLContext`'s `Model` property, I'd think. This is not a suitable public API. #Resolved
```diff
@@ -32,13 +32,34 @@ internal ModelOperationsCatalog(IHostEnvironment env)
 /// <param name="stream">A writeable, seekable stream to save to.</param>
 public void Save(ITransformer model, Stream stream) => model.SaveTo(_env, stream);
```
> `Save`
I'd also get rid of this. #Resolved
```csharp
using Xunit;

namespace Microsoft.ML.Tests.Scenarios.Api
{
```
Should these go in Microsoft.ML.Functional.Tests? They seem to be in the same spirit as Rogan's tests. #Resolved
```csharp
}

[Fact]
public void SaveAndLoadModelWithLoader()
```
Could we also have a test for the scenario where the loader is part of the pipeline and we try to save the entire pipeline? Something that looks like the following:

```csharp
var pipeline = mlContext.Data.CreateTextLoader(....)
    .Append(mlContext.Transforms.Conversions.Hash(...))
    .Append(....);
var model = pipeline.Fit(stream);
mlContext.Model.Save(model, fs);
```

#Resolved
Would be good to make sure we can then use `LoadAsCompositeDataLoader`.

In reply to: 263153288
```diff
@@ -13,6 +13,7 @@
 [assembly: InternalsVisibleTo(assemblyName: "Microsoft.ML.InferenceTesting" + PublicKey.TestValue)]
 [assembly: InternalsVisibleTo(assemblyName: "Microsoft.ML.StaticPipelineTesting" + PublicKey.TestValue)]
 [assembly: InternalsVisibleTo(assemblyName: "Microsoft.ML.OnnxTransformerTest" + PublicKey.TestValue)]
+[assembly: InternalsVisibleTo(assemblyName: "Microsoft.ML.Functional.Tests" + PublicKey.TestValue)]
```
So, the express purpose of this "functional tests" project is that it does not have access to internals. This is to avoid the problem of us inadvertently adding something we need to internals, so we have one project where we can be sure, "these tests are using the actual power of the public API." See ref #2470 and its motivations. Maybe @rogancarr and @eerhardt can express more about this, though I think the motivations are captured adequately by the PR and its linked issue.
YES! This should be noted to ALL devs - we must NEVER add `InternalsVisibleTo` for the Functional.Tests. That defeats the purpose.

Note the comment in the Functional.Tests.csproj (machinelearning/test/Microsoft.ML.Functional.Tests/Microsoft.ML.Functional.Tests.csproj, lines 3 to 7 in 40abffc):

```xml
<PropertyGroup>
  <!-- We are turning off strong naming to ensure we never add `InternalsVisibleTo` for these tests -->
  <SignAssembly>false</SignAssembly>
  <PublicSign>false</PublicSign>
</PropertyGroup>
```
Maybe we need to make that comment stand out more. @yaeldekel - can you make it ALL CAPS or something that will catch people's eye, so they don't make the same change/mistake you just did? Maybe put it before and after the `SignAssembly` property.

```xml
<PropertyGroup>
  <!-- DON'T CHANGE THIS!!! We are turning off strong naming to ensure we never add `InternalsVisibleTo` for these tests -->
  <SignAssembly>false</SignAssembly>
  <!-- DON'T CHANGE THIS!!! We are turning off strong naming to ensure we never add `InternalsVisibleTo` for these tests -->
  <PublicSign>false</PublicSign>
</PropertyGroup>
```
```diff
@@ -250,13 +250,15 @@ private static TransformerChain<ITransformer> Create(IHostEnvironment env, Model
 public static void SaveTo(this ITransformer transformer, IHostEnvironment env, Stream outputStream)
     => new TransformerChain<ITransformer>(transformer).SaveTo(env, outputStream);

-public static TransformerChain<ITransformer> LoadFrom(IHostEnvironment env, Stream stream)
+public static ITransformer LoadFrom(IHostEnvironment env, Stream stream)
```
> `ITransformer`
Would it have been bad, in the case where the source wasn't a chain (the `@"Model"` branch below), to just put it in a chain? As far as I understand, this is for backcompat. #Resolved
I ended up doing the opposite of that - so the transform chain is never nested in the `Model` directory.

In reply to: 264740060
```diff
-ModelLoadContext.LoadModel<TransformerChain<ITransformer>, SignatureLoadModel>(env, out var transformerChain, rep, LoaderSignature);
+ModelLoadContext.LoadModelOrNull<ITransformer, SignatureLoadModel>(env, out var transformerChain, rep, LoaderSignature);
+if (transformerChain == null)
+    ModelLoadContext.LoadModel<ITransformer, SignatureLoadModel>(env, out transformerChain, rep, $@"Model\{LoaderSignature}");
```
> `$`
I think this formatting is somewhat locale sensitive. I might have preferred the simpler `"Model\\" + LoaderSignature`. #Resolved
```diff
 /// <returns>The loaded model.</returns>
 public ITransformer Load(Stream stream) => TransformerChain.LoadFrom(_env, stream);
+public ITransformer Load(Stream stream, out DataViewSchema inputSchema)
```
> `DataViewSchema inputSchema`
Sometimes this will not be present. Should we clarify that this can be null in some cases? #Resolved
> `Load(Stream stream, out DataViewSchema inputSchema)`
So I see above we have methods to save either a loader, or a schema, with a transform. I see this method to load the transform, with an input schema. What about to load the transformer, with the loader? #Resolved
I am adding a Load overload that produces a transformer and a loader. If the model was saved by saving the transformer+schema, do we prefer throwing an exception in this case (since there is no loader), or defining this method as `TryLoad`?

In reply to: 264742808
The usage of `Try` suggests there's something wrong with the approach, and that the absence of a schema represents failure. I'm not sure I agree with that -- I think the situation where you have a transformer with no loader or input schema is a valid one we should allow, just not encourage (since it is a minority scenario). Now then, I think we should encourage people to think about saving it, and I think a good way to encourage people to make a deliberate decision is to force them to consciously feed us a `null`, as suggested above. But that also means that we have to allow for the possibility upon loading that the schema will be `null`, and I'd think we'd call this not an error condition. What do you think?

In reply to: 265196459
```diff
@@ -13,6 +17,11 @@ namespace Microsoft.ML
 /// </summary>
 public sealed class ModelOperationsCatalog : IInternalCatalog
 {
+    internal const string LoaderDirectory = "Loader";
```
> `LoaderDirectory`
So, as far as I can see, these are used entirely from `CompositeDataLoader`, though maybe I missed something. So why are they here? #Resolved
```diff
 /// <param name="stream">A readable, seekable stream to load from.</param>
 /// <param name="loader">The data loader from the model stream.</param>
 /// <returns>The transformer model from the model stream.</returns>
+public ITransformer Load(Stream stream, out IDataLoader<IMultiStreamSource> loader)
```
> `Load`
It would be nice if this was `LoadWithDataLoader`, so that we could have `mlContext.Model.Load(stream, out var thing)` without ambiguity. By having it be the same name, the `var` won't work, which requires callers to declare ahead of time a fairly elaborate signature. #Resolved
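The overload-resolution point above can be shown with a toy reproduction. Everything here is hypothetical: `SchemaStub`, `LoaderStub`, and `ModelCatalogStub` are invented stand-ins, not ML.NET types, kept only to show why distinct method names let `out var` work:

```csharp
using System;
using System.IO;

// Hypothetical stand-ins for the two possible out-parameter types.
public sealed class SchemaStub { }
public sealed class LoaderStub { }

public static class ModelCatalogStub
{
    // Distinct names: each has a single signature, so the compiler can
    // infer the type of an implicitly typed out variable.
    public static object Load(Stream s, out SchemaStub schema)
    {
        schema = new SchemaStub();
        return new object();
    }

    public static object LoadWithDataLoader(Stream s, out LoaderStub loader)
    {
        loader = new LoaderStub();
        return new object();
    }
}

public static class NamingDemo
{
    public static void Main()
    {
        using var ms = new MemoryStream();

        // `out var` infers SchemaStub and LoaderStub respectively:
        var model1 = ModelCatalogStub.Load(ms, out var schema);
        var model2 = ModelCatalogStub.LoadWithDataLoader(ms, out var loader);

        // Had both been overloads named Load, `Load(ms, out var x)` could not
        // infer the type, forcing callers to spell out the full out type.
        Console.WriteLine($"{schema.GetType().Name}, {loader.GetType().Name}");
    }
}
```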
```diff
@@ -46,6 +46,57 @@ internal static class DataViewConstructionUtils
     return new StreamingDataView<TRow>(env, data, internalSchemaDefn);
 }

+public static StreamingDataView<TRow> CreateFromEnumerable<TRow>(IHostEnvironment env, IEnumerable<TRow> data,
```
> `CreateFromEnumerable`
So, I don't object exactly, it seems positive, but I'm not sure how this relates to the original issue?
I added these so that I could add tests that show what you can do with the schema once you load it with the new API.

In reply to: 266546399
Thanks @yaeldekel !
```diff
 /// <param name="stream">A readable, seekable stream to load from.</param>
 /// <returns>A model of type <see cref="CompositeDataLoader{IMultiStreamSource, ITransformer}"/> containing the loader
 /// and the transformer chain.</returns>
+public IDataLoader<IMultiStreamSource> Load(Stream stream)
```
> `IDataLoader`
It's kinda sad that now we have `IDataLoader` alongside `ITransformer`, since it increases complexity. Would be nice to have some documentation/samples which would explain in which case I need to save/load one thing, and in which the other.

Also, can I now have slot names in my model?
Fixes #2735.