-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Static typed Estimator/Transformer/Data #778
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
feed72b
to
55716ad
Compare
+ "1 1 2 4 15"; | ||
var dataSource = new BytesStreamSource(data); | ||
|
||
var text = TextLoader.CreateReader(env, ctx => ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
text [](start = 16, length = 4)
Unless you want to retain the reference to the reader, I'd rather have 'text' be already data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe. I think perhaps though this might be part of the consequence of separating model and data, unless you want to back off the ideas in and make data a possibly model bearing object again. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can think more about this. I feel like this is still in flux.
In reply to: 214165485 [](ancestors = 214165485)
var textData = text.Read(dataSource).Wrapped; | ||
|
||
var schema = textData.Schema; | ||
// First verify that the columns are there. There ought to be at least one column corresponding to the identifiers in the tuple. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// [](start = 12, length = 2)
maybe put below code into a separate test method, like 'CheckSmth'? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. And please tell me you're insisting on that name. :)
In reply to: 214165914 [](ancestors = 214165914)
// The next step where we shuffle the names around a little bit is one where we are | ||
// testing out the implicit usage of copy columns. | ||
|
||
var est = text.CreateEstimator(r => (text: r.label, label: r.numericFeatures)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Create [](start = 27, length = 6)
We wanted these to be 'Make' #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is true, but I wonder if we ought not to make this a fairly global change rather than doing it piecemeal and having a state riddled with inconsistencies. I do as you say prefer Make
to Create
, I like a world where two distinct words are used for the same purpose even worse. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// testing out the implicit usage of copy columns. | ||
|
||
var est = text.CreateEstimator(r => (text: r.label, label: r.numericFeatures)); | ||
var newText = text.Append(est); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Append [](start = 31, length = 6)
Will there be an AppendEstimator
(named 'Append) ? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, we can do that. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what we decided on is that we could have an Append
on Estimator
only, that is like the current CreateEstimator
. Then out of any of these "schema-bearing" objects we can somehow create a new pipeline, which will be done with a method looking like Pipline.MakeNew
or something that creates a new empty estimator. Let me know if you do not like it.
In reply to: 214167545 [](ancestors = 214167545)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe Estimator.MakeNew
to avoid promulgation of concepts?
In reply to: 214485968 [](ancestors = 214485968,214167545)
where TTransformer : class, ITransformer | ||
{ | ||
private readonly IHostEnvironment _env; | ||
public TTransformer Wrapped { get; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrapped [](start = 28, length = 7)
maybe rename these things to be Estimator
, Transformer
and Data
? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ | ||
internal static class PipelineColumnAnalyzer | ||
{ | ||
public static void Analyze<TIn, TOut>(Func<TIn, TOut> func) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Analyze [](start = 27, length = 7)
Rename to Initialize ;-) #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// reconciler will work with <see cref="BlockMaker{TTupleShape}.CreateEstimator{TTupleOutShape}(Func{TTupleShape, TTupleOutShape})"/> | ||
/// or other functions that involve the creation of transforms. | ||
/// </summary> | ||
public abstract class DataInputReconciler : Reconciler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataInputReconciler [](start = 26, length = 19)
maybe EstimatorReconciler
? #Resolved
// This holds the mappings of columns to names and back. Note that while the same column could be used on | ||
// the *output*, e.g., you could hypothetically have `(a: r.Foo, b: r.Foo)`, we treat that as the last thing | ||
// that is done. | ||
var nameMap = new InvDictionary<string, PipelineColumn>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
InvDictionary [](start = 30, length = 13)
BidirectionalDictionary? #Resolved
} | ||
} | ||
|
||
// REVIEW: This ought to be a assigned earlier in the case of a copy-columns being necessary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// REVIEW: This ought to be a assigned earlier in the case of a copy-columns being necessary. [](start = 11, length = 94)
no longer needed #Resolved
{ | ||
internal const string Category = "Bring the funk"; | ||
|
||
internal static class Funky |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Funky [](start = 30, length = 5)
... #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
55716ad
to
aaa7820
Compare
Hi @Zruty0 I have removed the WIP tag, in case you wanted to merge this. If it is merged in by the time I look at it later I will phrase a separate PR for the SDCA/image extension methods. |
Does this project have any code in it? Can we hold off on adding it until it does have code? Also, if this is intended to ship to customers, it shouldn't be under Refers to: tools-local/Microsoft.ML.Analyzer/Microsoft.ML.Analyzer.csproj:1 in 24b4ffb. [](commit_id = 24b4ffb, deletion_comment = False) |
@@ -5,7 +5,7 @@ | |||
</PropertyGroup> | |||
|
|||
<ItemGroup> | |||
<PackageReference Include="Microsoft.CodeAnalysis.CSharp.Workspaces" Version="2.8.2" /> | |||
<PackageReference Include="Microsoft.CodeAnalysis.CSharp.Workspaces" Version="2.9.0" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We usually put common version numbers in the build/Dependencies.props
file. That way they stay in sync across multiple projects. (like here and in Microsoft.ML.Analyzer
)
@@ -5,3 +5,4 @@ | |||
using System.Runtime.CompilerServices; | |||
|
|||
[assembly: InternalsVisibleTo(assemblyName: "Microsoft.ML.TestFramework, PublicKey=002400000480000094000000060200000024000052534131000400000100010015c01ae1f50e8cc09ba9eac9147cf8fd9fce2cfe9f8dce4f7301c4132ca9fb50ce8cbf1df4dc18dd4d210e4345c744ecb3365ed327efdbc52603faa5e21daa11234c8c4a73e51f03bf192544581ebe107adee3a34928e39d04e524a9ce729d5090bfd7dad9d10c722c0def9ccc08ff0a03790e48bcd1f9b6c476063e1966a1c4")] | |||
//[assembly: InternalsVisibleTo(assemblyName: "Microsoft.ML.Tests, PublicKey=002400000480000094000000060200000024000052534131000400000100010015c01ae1f50e8cc09ba9eac9147cf8fd9fce2cfe9f8dce4f7301c4132ca9fb50ce8cbf1df4dc18dd4d210e4345c744ecb3365ed327efdbc52603faa5e21daa11234c8c4a73e51f03bf192544581ebe107adee3a34928e39d04e524a9ce729d5090bfd7dad9d10c722c0def9ccc08ff0a03790e48bcd1f9b6c476063e1966a1c4")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be removed? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. #Resolved
Contracts.CheckParam(min >= 0, nameof(min), "min must be non-negative."); | ||
Contracts.CheckParam(max >= min, nameof(max), "max must be greater than or equal to min."); | ||
Contracts.CheckParam(min >= 0, nameof(min), "Must be non-negative"); | ||
Contracts.CheckParam(!(max < min), nameof(max), "If specified, must be greater than or equal to " + nameof(min)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be easier to read/understand if it was written like CheckParam(max == null || max.Value >= min)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Honestly I'm less enthusiastic about this suggestion than some of your other suggestions. Whom exactly is the suggestion meant to serve? That set of people that don't realize that int?
and int
are different types?
In reply to: 214523116 [](ancestors = 214523116)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whom exactly is the suggestion meant to serve?
People who are reading the code. IMO this uses a "trick" that null
is not less than a value, which isn't obvious.
It also aligns the code with the English wording: If specified, must be greater than or equal to
directly maps to max == null || max.Value >= min
In reply to: 214570724 [](ancestors = 214570724,214523116)
{ | ||
public sealed partial class TextLoader | ||
{ | ||
public static DataReader<IMultiStreamSource, TTupleShape> CreateReader<TTupleShape>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From an API perspective, it seems surprising to me that TextLoader.CreateReader
returns a DataReader
. I guess I would have expected it to return a TextLoader
. Is this a pattern to be used by all "loaders"/"readers".
Which kind of asks another question: What's the difference between a "loader" and a "reader"? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A loader is an old concept which will almost certainly be deprecated. A "reader" (name is not finalized) is a new concept. In the old world where this was a tool and not an API, the idea of the head of a pipeline being assumed to come from file(s) (as represented by IMultiStreamSource
) was a priviledged and central concept -- and this was called a loader. Now @Zruty0 has generalized this notion and the head of a pipeline is something called a reader, with a generic input parameter which may or may not be an IMultiStreamSource
, so he had to come up with a different name, which is called "reader."
The following will happen. We will replace IDataLoader
altogether with some sort of concept that takes a stream source not as something assumed to always be there, but as some type parameter that might be there or not -- that successor concept is, for the moment, termed IDataReader<IMultiStreamSource>
. (We may also choose to rename what we are temporarily calling a "reader," as well. Who knows. Maybe we'll name it Fred.) So when you see loader you are seeing something whose name will I think be changed. (At least, I would not welcome these two concepts being present in the code in the same time.)
At any rate I view this as merely a renaming question. What is perhaps a more urgent question is whether we got the shape of the API right, rather than its names. In particular, in this case, there were three choices we had when we created this object, that it would be a data reader estimator, or a data reader, or just the data. We chose data reader for various reasons. Is that itself confusing? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's sort of like this. Imagine hypothetically there was some early version of .NET that had this idea of an Iterable
as a central concept that involved iteration over integers specifically. But then some bright developer had the idea, "heck, what if I want to iterate over something else, like floats, or even some sort of reference type." So then they replace this old concept of IIterable
with IEnumerable<int>
. But, realistically, the pace of development being what it is possibly that cannot reasonably happen all at once. That is sort of how I view IDataLoader
and IDataReader<>
.
We're moving towards the generalization, but it's not something that will happen all at once. (And certainly not in this PR -- I am introducing here PiGSTy, and PiGSTy is deliberately engineered in such a way as to not change the paradigms of the underlying dynamic system, it merely makes them less hazardous to work with while cleaving to their idioms as closely as possible -- I don't even want to diverge as far as using a different name for an analogous method if I can possibly help it.) #Resolved
{ | ||
public sealed partial class TextLoader | ||
{ | ||
public static DataReader<IMultiStreamSource, TTupleShape> CreateReader<TTupleShape>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to add XML doc here. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, and also the public members of the Context
object.
In reply to: 214523219 [](ancestors = 214523219)
<Project Sdk="Microsoft.NET.Sdk" ToolsVersion="15.0"> | ||
<PropertyGroup> | ||
<DefineConstants>CORECLR</DefineConstants> | ||
<OldToolsVersion>2.0</OldToolsVersion> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this for? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something added by VS I guess. I'll remove it and see if VS complains.
In reply to: 214523228 [](ancestors = 214523228)
I don't know if I have a great suggestion yet, but just differentiating the "internal code analyzer" and the "public code analyzer" by the word What if we named the "internal code analyzer": Refers to: test/Microsoft.ML.CodeAnalyzer.Tests/Code/README.md:1 in 24b4ffb. [](commit_id = 24b4ffb, deletion_comment = False) |
The new classes DataView, Estimator, Transformer are introduced. We also introduce new methdos among these and the existing classes to produce the statically-typed variants of these, with Fit, Transform, and Append methods defined.
* schema assertion context * Some code reorganization and cleanup * Beginnings of verification of types * Removed direction instantiation of containing types in favor of `AssertStatic` methods so as to get a structure with tuple-names to enable checking
…put column names not configurable.
4896024
to
170c647
Compare
@@ -0,0 +1,13 @@ | |||
<Project Sdk="Microsoft.NET.Sdk"> | |||
|
|||
<PropertyGroup> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to get this analyzer in our Microsoft.ML
NuGet package, we will have to do some build authoring. Today our infrastructure only supports putting things in the build
, lib
and runtimes
folders. It shouldn't be much work, but someone will have to do it.
Let me know if you want me to pick the NuGet authoring up, or if you want to handle it yourself (I can give you some tips if necessary).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
using System; | ||
|
||
namespace Microsoft.ML.Data.StaticPipe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we gone through any alternatives to the namespace StaticPipe
?
public sealed class DataReaderEstimator<TIn, TTupleShape, TDataReader> : SchemaBearing<TTupleShape> | ||
where TDataReader : class, IDataReader<TIn> | ||
{ | ||
public IDataReaderEstimator<TIn, TDataReader> AsDynamic { get; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if AsDynamic
should be a method. Typically we don't have "As" properties...
public IEstimator<TTransformer> AsDynamic { get; } | ||
private readonly StaticSchemaShape _inShape; | ||
|
||
internal Estimator(IHostEnvironment env, IEstimator<TTransformer> estimator, StaticSchemaShape inShape, StaticSchemaShape outShape) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we give a better name for the estimator
parameter? It is confusing that this class is named Estimator
, and it takes an estimator
in its constructor. Is that the innerEstimator
? baseEstimator
? previousEstimator
? etc.
using Microsoft.ML.Runtime; | ||
using Microsoft.ML.Runtime.Data; | ||
|
||
namespace Microsoft.ML.Data.StaticPipe.Runtime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Is the idea that the
Runtime
namespace is separate from the rest of theStaticPipe
types because an end user isn't supposed to use types in this namespace? (Similar to how theMicrosoft.ML.Runtime
namespace isn't supposed to contain "user facing" types?) - In the .NET BCL team, we physically lay out our .cs files to match the namespace they are in. It makes the code easier to understand and navigate. Can the classes that are in the
Runtime
namespace be moved to a sub folder namedRuntime
?
/// object.</typeparam> | ||
public abstract class ReaderReconciler<TIn> : Reconciler | ||
{ | ||
public ReaderReconciler() : base() { } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit - both here and for EstimatorReconciler) - typically public
constructors on abstract
classes are frowned upon. These should be protected
. See https://docs.microsoft.com/en-us/visualstudio/code-quality/ca1012-abstract-types-should-not-have-constructors
/// use the input <see cref="SchemaAssertionContext"/> to declare a <see cref="ValueTuple"/> | ||
/// of the <see cref="PipelineColumn"/> indices, properly named</param> | ||
/// <returns>A statically typed wrapping of the input view</returns> | ||
public static DataView<T> AssertStatic<[IsShape] T>(this IDataView view, IHostEnvironment env, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if we should have methods named/prefixed Assert
but aren't in the same vein as Debug.Assert
. I think it may be confusing. How about EnsureStatic
?
/// type for communicating text. | ||
/// </summary> | ||
/// <returns>The basic type used to represent an item type in the static pipeline</returns> | ||
private static Type StaticKind(DataKind kind) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we already have a method that does this?
Fixes #632. The first version of the statically typed parallels to
IEstimator
,ITransformer
,IDataView
. Stub for the library's analyzer.