Skip to content

Static typed Estimator/Transformer/Data #778

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Sep 4, 2018

Conversation

TomFinley
Copy link
Contributor

Fixes #632. The first version of the statically typed parallels to IEstimator, ITransformer, IDataView. Stub for the library's analyzer.

@TomFinley TomFinley requested review from eerhardt and Zruty0 August 30, 2018 15:08
@Zruty0 Zruty0 added the API Issues pertaining the friendly API label Aug 30, 2018
+ "1 1 2 4 15";
var dataSource = new BytesStreamSource(data);

var text = TextLoader.CreateReader(env, ctx => (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

text [](start = 16, length = 4)

Unless you want to retain the reference to the reader, I'd rather have 'text' be already data.

Copy link
Contributor Author

@TomFinley TomFinley Aug 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe. I think perhaps though this might be part of the consequence of separating model and data, unless you want to back off the ideas in and make data a possibly model bearing object again. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can think more about this. I feel like this is still in flux.


In reply to: 214165485 [](ancestors = 214165485)

var textData = text.Read(dataSource).Wrapped;

var schema = textData.Schema;
// First verify that the columns are there. There ought to be at least one column corresponding to the identifiers in the tuple.
Copy link
Contributor

@Zruty0 Zruty0 Aug 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// [](start = 12, length = 2)

maybe put below code into a separate test method, like 'CheckSmth'? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. And please tell me you're insisting on that name. :)


In reply to: 214165914 [](ancestors = 214165914)

// The next step where we shuffle the names around a little bit is one where we are
// testing out the implicit usage of copy columns.

var est = text.CreateEstimator(r => (text: r.label, label: r.numericFeatures));
Copy link
Contributor

@Zruty0 Zruty0 Aug 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create [](start = 27, length = 6)

We wanted these to be 'Make' #Resolved

Copy link
Contributor Author

@TomFinley TomFinley Aug 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is true, but I wonder if we ought not to make this a fairly global change rather than doing it piecemeal and having a state riddled with inconsistencies. I do as you say prefer Make to Create, I like a world where two distinct words are used for the same purpose even worse. #Resolved

Copy link
Contributor

@Zruty0 Zruty0 Aug 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about 'all user-facing calls to be Make' ?


In reply to: 214353450 [](ancestors = 214353450)

// testing out the implicit usage of copy columns.

var est = text.CreateEstimator(r => (text: r.label, label: r.numericFeatures));
var newText = text.Append(est);
Copy link
Contributor

@Zruty0 Zruty0 Aug 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Append [](start = 31, length = 6)

Will there be an AppendEstimator (named 'Append) ? #Resolved

Copy link
Contributor Author

@TomFinley TomFinley Aug 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, we can do that. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what we decided on is that we could have an Append on Estimator only, that is like the current CreateEstimator. Then out of any of these "schema-bearing" objects we can somehow create a new pipeline, which will be done with a method looking like Pipline.MakeNew or something that creates a new empty estimator. Let me know if you do not like it.


In reply to: 214167545 [](ancestors = 214167545)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe Estimator.MakeNew to avoid promulgation of concepts?


In reply to: 214485968 [](ancestors = 214485968,214167545)

@safern safern closed this Aug 30, 2018
@safern safern reopened this Aug 30, 2018
where TTransformer : class, ITransformer
{
private readonly IHostEnvironment _env;
public TTransformer Wrapped { get; }
Copy link
Contributor

@Zruty0 Zruty0 Aug 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrapped [](start = 28, length = 7)

maybe rename these things to be Estimator, Transformer and Data ? #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.AsDynamic ?


In reply to: 214470782 [](ancestors = 214470782)

{
internal static class PipelineColumnAnalyzer
{
public static void Analyze<TIn, TOut>(Func<TIn, TOut> func)
Copy link
Contributor

@Zruty0 Zruty0 Aug 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Analyze [](start = 27, length = 7)

Rename to Initialize ;-) #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cc @yaeldekel


In reply to: 214473601 [](ancestors = 214473601)

/// reconciler will work with <see cref="BlockMaker{TTupleShape}.CreateEstimator{TTupleOutShape}(Func{TTupleShape, TTupleOutShape})"/>
/// or other functions that involve the creation of transforms.
/// </summary>
public abstract class DataInputReconciler : Reconciler
Copy link
Contributor

@Zruty0 Zruty0 Aug 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DataInputReconciler [](start = 26, length = 19)

maybe EstimatorReconciler ? #Resolved

// This holds the mappings of columns to names and back. Note that while the same column could be used on
// the *output*, e.g., you could hypothetically have `(a: r.Foo, b: r.Foo)`, we treat that as the last thing
// that is done.
var nameMap = new InvDictionary<string, PipelineColumn>();
Copy link
Contributor

@Zruty0 Zruty0 Aug 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InvDictionary [](start = 30, length = 13)

BidirectionalDictionary? #Resolved

}
}

// REVIEW: This ought to be a assigned earlier in the case of a copy-columns being necessary.
Copy link
Contributor

@Zruty0 Zruty0 Aug 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// REVIEW: This ought to be a assigned earlier in the case of a copy-columns being necessary. [](start = 11, length = 94)

no longer needed #Resolved

{
internal const string Category = "Bring the funk";

internal static class Funky
Copy link
Contributor

@Zruty0 Zruty0 Aug 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funky [](start = 30, length = 5)

... #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently the analyzer will be empty I guess.


In reply to: 214478390 [](ancestors = 214478390)

Copy link
Contributor

@Zruty0 Zruty0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@TomFinley TomFinley changed the title WIP Static typed Estimator/Transformer/Data Static typed Estimator/Transformer/Data Aug 31, 2018
@TomFinley
Copy link
Contributor Author

Hi @Zruty0 I have removed the WIP tag, in case you wanted to merge this. If it is merged in by the time I look at it later I will phrase a separate PR for the SDCA/image extension methods.

@TomFinley
Copy link
Contributor Author

@Zruty0 image is there, omitting VectorToImage since it is mostly for diagnostics and anyway has no estimators.


In reply to: 417806604 [](ancestors = 417806604)

@eerhardt
Copy link
Member

eerhardt commented Sep 1, 2018

Does this project have any code in it? Can we hold off on adding it until it does have code?

Also, if this is intended to ship to customers, it shouldn't be under tools-local. It should be under the src folder.


Refers to: tools-local/Microsoft.ML.Analyzer/Microsoft.ML.Analyzer.csproj:1 in 24b4ffb. [](commit_id = 24b4ffb, deletion_comment = False)

@@ -5,7 +5,7 @@
</PropertyGroup>

<ItemGroup>
<PackageReference Include="Microsoft.CodeAnalysis.CSharp.Workspaces" Version="2.8.2" />
<PackageReference Include="Microsoft.CodeAnalysis.CSharp.Workspaces" Version="2.9.0" />
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually put common version numbers in the build/Dependencies.props file. That way they stay in sync across multiple projects. (like here and in Microsoft.ML.Analyzer)

@@ -5,3 +5,4 @@
using System.Runtime.CompilerServices;

[assembly: InternalsVisibleTo(assemblyName: "Microsoft.ML.TestFramework, PublicKey=002400000480000094000000060200000024000052534131000400000100010015c01ae1f50e8cc09ba9eac9147cf8fd9fce2cfe9f8dce4f7301c4132ca9fb50ce8cbf1df4dc18dd4d210e4345c744ecb3365ed327efdbc52603faa5e21daa11234c8c4a73e51f03bf192544581ebe107adee3a34928e39d04e524a9ce729d5090bfd7dad9d10c722c0def9ccc08ff0a03790e48bcd1f9b6c476063e1966a1c4")]
//[assembly: InternalsVisibleTo(assemblyName: "Microsoft.ML.Tests, PublicKey=002400000480000094000000060200000024000052534131000400000100010015c01ae1f50e8cc09ba9eac9147cf8fd9fce2cfe9f8dce4f7301c4132ca9fb50ce8cbf1df4dc18dd4d210e4345c744ecb3365ed327efdbc52603faa5e21daa11234c8c4a73e51f03bf192544581ebe107adee3a34928e39d04e524a9ce729d5090bfd7dad9d10c722c0def9ccc08ff0a03790e48bcd1f9b6c476063e1966a1c4")]
Copy link
Member

@eerhardt eerhardt Sep 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be removed? #Resolved

Copy link
Contributor Author

@TomFinley TomFinley Sep 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. #Resolved

Contracts.CheckParam(min >= 0, nameof(min), "min must be non-negative.");
Contracts.CheckParam(max >= min, nameof(max), "max must be greater than or equal to min.");
Contracts.CheckParam(min >= 0, nameof(min), "Must be non-negative");
Contracts.CheckParam(!(max < min), nameof(max), "If specified, must be greater than or equal to " + nameof(min));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be easier to read/understand if it was written like CheckParam(max == null || max.Value >= min)

Copy link
Contributor Author

@TomFinley TomFinley Sep 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly I'm less enthusiastic about this suggestion than some of your other suggestions. Whom exactly is the suggestion meant to serve? That set of people that don't realize that int? and int are different types?


In reply to: 214523116 [](ancestors = 214523116)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whom exactly is the suggestion meant to serve?

People who are reading the code. IMO this uses a "trick" that null is not less than a value, which isn't obvious.

It also aligns the code with the English wording: If specified, must be greater than or equal to directly maps to max == null || max.Value >= min


In reply to: 214570724 [](ancestors = 214570724,214523116)

{
public sealed partial class TextLoader
{
public static DataReader<IMultiStreamSource, TTupleShape> CreateReader<TTupleShape>(
Copy link
Member

@eerhardt eerhardt Sep 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From an API perspective, it seems surprising to me that TextLoader.CreateReader returns a DataReader. I guess I would have expected it to return a TextLoader. Is this a pattern to be used by all "loaders"/"readers".

Which kind of asks another question: What's the difference between a "loader" and a "reader"? #Resolved

Copy link
Contributor Author

@TomFinley TomFinley Sep 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A loader is an old concept which will almost certainly be deprecated. A "reader" (name is not finalized) is a new concept. In the old world where this was a tool and not an API, the idea of the head of a pipeline being assumed to come from file(s) (as represented by IMultiStreamSource) was a priviledged and central concept -- and this was called a loader. Now @Zruty0 has generalized this notion and the head of a pipeline is something called a reader, with a generic input parameter which may or may not be an IMultiStreamSource, so he had to come up with a different name, which is called "reader."

The following will happen. We will replace IDataLoader altogether with some sort of concept that takes a stream source not as something assumed to always be there, but as some type parameter that might be there or not -- that successor concept is, for the moment, termed IDataReader<IMultiStreamSource>. (We may also choose to rename what we are temporarily calling a "reader," as well. Who knows. Maybe we'll name it Fred.) So when you see loader you are seeing something whose name will I think be changed. (At least, I would not welcome these two concepts being present in the code in the same time.)

At any rate I view this as merely a renaming question. What is perhaps a more urgent question is whether we got the shape of the API right, rather than its names. In particular, in this case, there were three choices we had when we created this object, that it would be a data reader estimator, or a data reader, or just the data. We chose data reader for various reasons. Is that itself confusing? #Resolved

Copy link
Contributor Author

@TomFinley TomFinley Sep 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's sort of like this. Imagine hypothetically there was some early version of .NET that had this idea of an Iterable as a central concept that involved iteration over integers specifically. But then some bright developer had the idea, "heck, what if I want to iterate over something else, like floats, or even some sort of reference type." So then they replace this old concept of IIterable with IEnumerable<int>. But, realistically, the pace of development being what it is possibly that cannot reasonably happen all at once. That is sort of how I view IDataLoader and IDataReader<>.

We're moving towards the generalization, but it's not something that will happen all at once. (And certainly not in this PR -- I am introducing here PiGSTy, and PiGSTy is deliberately engineered in such a way as to not change the paradigms of the underlying dynamic system, it merely makes them less hazardous to work with while cleaving to their idioms as closely as possible -- I don't even want to diverge as far as using a different name for an analogous method if I can possibly help it.) #Resolved

{
public sealed partial class TextLoader
{
public static DataReader<IMultiStreamSource, TTupleShape> CreateReader<TTupleShape>(
Copy link
Member

@eerhardt eerhardt Sep 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add XML doc here. #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, and also the public members of the Context object.


In reply to: 214523219 [](ancestors = 214523219)

<Project Sdk="Microsoft.NET.Sdk" ToolsVersion="15.0">
<PropertyGroup>
<DefineConstants>CORECLR</DefineConstants>
<OldToolsVersion>2.0</OldToolsVersion>
Copy link
Member

@eerhardt eerhardt Sep 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this for? #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something added by VS I guess. I'll remove it and see if VS complains.


In reply to: 214523228 [](ancestors = 214523228)

@eerhardt
Copy link
Member

eerhardt commented Sep 1, 2018

The tests in this directory are for testing the internal code analyzer (that is, the implementation of ML.NET itself), as opposed to general analyzer that ships with the library.

I don't know if I have a great suggestion yet, but just differentiating the "internal code analyzer" and the "public code analyzer" by the word Code doesn't seem like it is enough.

What if we named the "internal code analyzer": Microsoft.ML.InternalCodeAnalyzer?


Refers to: test/Microsoft.ML.CodeAnalyzer.Tests/Code/README.md:1 in 24b4ffb. [](commit_id = 24b4ffb, deletion_comment = False)

The new classes DataView, Estimator, Transformer are introduced.
We also introduce new methdos among these and the existing classes
to produce the statically-typed variants of these, with Fit,
Transform, and Append methods defined.
* schema assertion context
* Some code reorganization and cleanup
* Beginnings of verification of types
* Removed direction instantiation of containing types in favor of `AssertStatic` methods so as to get a structure with tuple-names to enable checking
@@ -0,0 +1,13 @@
<Project Sdk="Microsoft.NET.Sdk">

<PropertyGroup>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to get this analyzer in our Microsoft.ML NuGet package, we will have to do some build authoring. Today our infrastructure only supports putting things in the build, lib and runtimes folders. It shouldn't be much work, but someone will have to do it.

Let me know if you want me to pick the NuGet authoring up, or if you want to handle it yourself (I can give you some tips if necessary).

Copy link
Member

@sfilipi sfilipi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:


using System;

namespace Microsoft.ML.Data.StaticPipe
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we gone through any alternatives to the namespace StaticPipe?

@Zruty0 Zruty0 merged commit 6e0d8d0 into dotnet:master Sep 4, 2018
public sealed class DataReaderEstimator<TIn, TTupleShape, TDataReader> : SchemaBearing<TTupleShape>
where TDataReader : class, IDataReader<TIn>
{
public IDataReaderEstimator<TIn, TDataReader> AsDynamic { get; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if AsDynamic should be a method. Typically we don't have "As" properties...

public IEstimator<TTransformer> AsDynamic { get; }
private readonly StaticSchemaShape _inShape;

internal Estimator(IHostEnvironment env, IEstimator<TTransformer> estimator, StaticSchemaShape inShape, StaticSchemaShape outShape)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we give a better name for the estimator parameter? It is confusing that this class is named Estimator, and it takes an estimator in its constructor. Is that the innerEstimator? baseEstimator? previousEstimator? etc.

using Microsoft.ML.Runtime;
using Microsoft.ML.Runtime.Data;

namespace Microsoft.ML.Data.StaticPipe.Runtime
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Is the idea that the Runtime namespace is separate from the rest of the StaticPipe types because an end user isn't supposed to use types in this namespace? (Similar to how the Microsoft.ML.Runtime namespace isn't supposed to contain "user facing" types?)
  2. In the .NET BCL team, we physically lay out our .cs files to match the namespace they are in. It makes the code easier to understand and navigate. Can the classes that are in the Runtime namespace be moved to a sub folder named Runtime?

/// object.</typeparam>
public abstract class ReaderReconciler<TIn> : Reconciler
{
public ReaderReconciler() : base() { }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit - both here and for EstimatorReconciler) - typically public constructors on abstract classes are frowned upon. These should be protected. See https://docs.microsoft.com/en-us/visualstudio/code-quality/ca1012-abstract-types-should-not-have-constructors

/// use the input <see cref="SchemaAssertionContext"/> to declare a <see cref="ValueTuple"/>
/// of the <see cref="PipelineColumn"/> indices, properly named</param>
/// <returns>A statically typed wrapping of the input view</returns>
public static DataView<T> AssertStatic<[IsShape] T>(this IDataView view, IHostEnvironment env,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if we should have methods named/prefixed Assert but aren't in the same vein as Debug.Assert. I think it may be confusing. How about EnsureStatic?

/// type for communicating text.
/// </summary>
/// <returns>The basic type used to represent an item type in the static pipeline</returns>
private static Type StaticKind(DataKind kind)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we already have a method that does this?

@TomFinley TomFinley deleted the tfinley/StrongPipe branch September 4, 2018 23:33
@ghost ghost locked as resolved and limited conversation to collaborators Mar 29, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
API Issues pertaining the friendly API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants