
Refactoring of Options for ImagePixelExtractingEstimator #3033

Closed
wants to merge 5 commits into from

artidoro
Contributor

This PR is an example solution for #2884. Once I receive feedback on this, I will continue with the rest of the transforms.

The purpose of this PR is twofold:

  • does the groundwork that makes it easier to refactor Options in other transforms (commit 1)
    • most of the work is in the file ColumnBindingsBase.cs.
    • made OneToOneColumn public, removed empty class
    • renamed Name and Source to OutputColumnName and InputColumnName
    • added implicit operators to OneToOneColumn to simplify the multicolumn mapping scenario
  • refactors the Options class for ImagePixelExtractingEstimator (commit 2)
    • moved the immutable ColumnOptions to the transformer and renamed it ColumnInfo
    • moved Options and Column to estimator
    • refactored extension and constructors

The third commit fixes the tests and the entrypoint catalog.

Note that with the combination of the implicit operators on OneToOneColumn and the constructor taking OneToOneColumn in the Options class, it is easier to define the multicolumn mapping scenario where the columns don't require column specific settings. For an example of the behavior see the test TestImagePixelExtractOptions in ImagesTests.cs:

// options1 and 2 should be exactly the same.
var options1 = new ImagePixelExtractingEstimator.Options
{
    ColumnOptions = new[]
    {
        new ImagePixelExtractingEstimator.ColumnOptions { OutputColumnName = "outputColumn1", InputColumnName = "inputColumn1" },
        new ImagePixelExtractingEstimator.ColumnOptions { OutputColumnName = "outputColumn2", InputColumnName = "inputColumn2" },
        new ImagePixelExtractingEstimator.ColumnOptions { OutputColumnName = "outputColumn3", InputColumnName = "inputColumn3" }
    }
};
var options2 = new ImagePixelExtractingEstimator.Options(("outputColumn1", "inputColumn1"), ("outputColumn2", "inputColumn2"), ("outputColumn3", "inputColumn3"));
// options3 has the same OutputColumnName as the previous two.
var options3 = new ImagePixelExtractingEstimator.Options("outputColumn1", "outputColumn2", "outputColumn3");
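
A minimal sketch of how the pieces described above could fit together. The class bodies here are simplified stand-ins written for illustration (only the names OneToOneColumn, Options, OutputColumnName, InputColumnName and ColumnOptions come from this PR), and the fallback of a null input name to the output name is applied eagerly in the constructor as an assumption:

```csharp
// Illustrative stand-in for OneToOneColumn; not the actual ML.NET source.
public class OneToOneColumn
{
    public string OutputColumnName;
    public string InputColumnName;

    public OneToOneColumn(string outputColumnName, string inputColumnName = null)
    {
        OutputColumnName = outputColumnName;
        // Assumption: a null input name falls back to the output name right away.
        InputColumnName = inputColumnName ?? outputColumnName;
    }

    // Lets a (output, input) tuple be passed where a OneToOneColumn is expected.
    public static implicit operator OneToOneColumn((string outputColumnName, string inputColumnName) value)
        => new OneToOneColumn(value.outputColumnName, value.inputColumnName);

    // Lets a single column name be passed where a OneToOneColumn is expected.
    public static implicit operator OneToOneColumn(string outputColumnName)
        => new OneToOneColumn(outputColumnName);
}

// Illustrative stand-in for ImagePixelExtractingEstimator.Options.
public sealed class Options
{
    public OneToOneColumn[] ColumnOptions;

    public Options() { }

    // params plus the implicit operators: callers may pass tuples or plain strings.
    public Options(params OneToOneColumn[] columns) => ColumnOptions = columns;
}
```

With these stand-ins, `new Options(("outputColumn1", "inputColumn1"))` and `new Options("outputColumn1")` both compile because each argument is converted to a OneToOneColumn by one of the implicit operators.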

@artidoro artidoro added the API Issues pertaining the friendly API label Mar 20, 2019
@artidoro artidoro added this to the 0319 milestone Mar 20, 2019
@artidoro artidoro self-assigned this Mar 20, 2019
@artidoro artidoro force-pushed the pixelextractoptions branch from 48c0f33 to 1d790bc Compare March 20, 2019 02:35
@codecov

codecov bot commented Mar 20, 2019

Codecov Report

Merging #3033 into master will increase coverage by <.01%.
The diff coverage is 75.55%.

@@            Coverage Diff             @@
##           master    #3033      +/-   ##
==========================================
+ Coverage   72.41%   72.42%   +<.01%     
==========================================
  Files         803      803              
  Lines      143851   143863      +12     
  Branches    16173    16175       +2     
==========================================
+ Hits       104171   104188      +17     
+ Misses      35258    35256       -2     
+ Partials     4422     4419       -3
| Flag | Coverage Δ | |
|---|---|---|
| #Debug | `72.42% <75.55%> (ø)` | ⬆️ |
| #production | `68.09% <62.58%> (ø)` | ⬆️ |
| #test | `88.62% <100%> (ø)` | ⬆️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| ...ft.ML.ImageAnalytics/EntryPoints/ImageAnalytics.cs | `0% <ø> (ø)` | ⬆️ |
| ...osoft.ML.Transforms/Text/TokenizingByCharacters.cs | `95.32% <0%> (ø)` | ⬆️ |
| ...rc/Microsoft.ML.Transforms/HashJoiningTransform.cs | `85.75% <0%> (ø)` | ⬆️ |
| src/Microsoft.ML.Transforms/Text/WordTokenizing.cs | `79.15% <0%> (ø)` | ⬆️ |
| src/Microsoft.ML.ImageAnalytics/ImageLoader.cs | `84.55% <0%> (ø)` | ⬆️ |
| src/Microsoft.ML.Transforms/KeyToVectorMapping.cs | `91.24% <0%> (ø)` | ⬆️ |
| ...rosoft.ML.Data/Transforms/LabelConvertTransform.cs | `15.51% <0%> (ø)` | ⬆️ |
| src/Microsoft.ML.ImageAnalytics/ImageResizer.cs | `84.64% <0%> (ø)` | ⬆️ |
| src/Microsoft.ML.Data/Transforms/NormalizeUtils.cs | `43.63% <0%> (ø)` | ⬆️ |
| ...oft.ML.Transforms/Text/TextFeaturizingEstimator.cs | `83.2% <0%> (ø)` | ⬆️ |

... and 48 more

- if (!inputSchema.TryGetColumnIndex(item.Source, out colSrc))
-     throw host.ExceptUserArg(nameof(OneToOneColumn.Source), "Source column '{0}' not found", item.Source);
+ if (!inputSchema.TryGetColumnIndex(item.InputColumnName, out colSrc))
+     throw host.ExceptUserArg(nameof(OneToOneColumn.InputColumnName), "Source column '{0}' not found", item.InputColumnName);
Member

Source [](start = 90, length = 6)

Input column

Member

@sfilipi sfilipi left a comment

:shipit:

@@ -290,8 +290,8 @@ internal static IDataTransform Create(IHostEnvironment env, MinMaxArguments args

  var columns = args.Columns
      .Select(col => new NormalizingEstimator.MinMaxColumnOptions(
-         col.Name,
-         col.Source ?? col.Name,
+         col.OutputColumnName,
Contributor

OutputColumnName [](start = 24, length = 16)

I don't think this is necessarily important, but in prior discussions we've noted that, for column options specifically, it may be wise to use InputName rather than InputColumnName, since it's obvious we're talking about columns here.

/// <summary>
/// Specifies input and output column names for a transformation
/// </summary>
public class OneToOneColumn
Member

OneToOneColumn [](start = 17, length = 14)

Is OneToOneColumn a good name? Usually one-to-one describes a mapping or a function, which doesn't sound like a Column. Maybe we could try OneToOneBind?

/// <summary>
/// Instantiates a <see cref="ColumnOptions"/> from a tuple of input and output column names.
/// </summary>
public static implicit operator OneToOneColumn((string outputColumnName, string inputColumnName) value)
Member

(string outputColumnName, string inputColumnName) value [](start = 55, length = 55)

I have the impression that tuples are not very welcome.

public string OutputColumnName;

/// <summary>Name of the column to transform. If set to <see langword="null"/>, the value of the <see cref= "OutputColumnName"/> will be used as source.</summary>
[Argument(ArgumentType.AtMostOnce, HelpText = "Name of the source column", ShortName = "source, src")]
Contributor

@TomFinley TomFinley Mar 20, 2019

ShortName [](start = 83, length = 9)

Note that you could have used the Name field here. That might have been preferable if we care about not changing the behavior of command line and entry-points, at least not yet. (We can always adjust later if we really want to.) #Resolved

Contributor Author

Ok will use Name = .... to prevent further changes


In reply to: 267542069 [](ancestors = 267542069)


/// <summary>Name of the column to transform. If set to <see langword="null"/>, the value of the <see cref= "OutputColumnName"/> will be used as source.</summary>
[Argument(ArgumentType.AtMostOnce, HelpText = "Name of the source column", ShortName = "source, src")]
public string InputColumnName = null;
Contributor

@TomFinley TomFinley Mar 20, 2019

null [](start = 40, length = 4)

This null declaration serves no purpose. The default value of the field is null if unassigned, as we see above with OutputColumnName. #Resolved

@@ -14,16 +14,40 @@

namespace Microsoft.ML.Data
{
- internal abstract class SourceNameColumnBase
+ public class OneToOneColumn
Contributor

@TomFinley TomFinley Mar 20, 2019

OneToOneColumn [](start = 17, length = 14)

Do you mean for this class to be used and directly instantiable? I would really not like that. I would prefer that our API be built on sealed classes; otherwise we'll have weird things like it being possible to pass mismatched derived classes to APIs that take OneToOneColumn.

As a general policy, we prefer all classes to be either abstract or sealed. We have generally followed this advice. #Resolved
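
The abstract-or-sealed policy above can be sketched roughly as follows. Every type and member name in this block is hypothetical and chosen only for illustration; none of it is taken from the ML.NET source:

```csharp
// Hypothetical names throughout; this is not the ML.NET source.

// The shared base is abstract, so it cannot be instantiated on its own.
public abstract class OneToOneColumnBase
{
    public string OutputColumnName;
    public string InputColumnName;
}

// Each transform gets its own sealed column type.
public sealed class PixelExtractingColumn : OneToOneColumnBase
{
    public bool UseAlpha;
}

public sealed class HashingColumn : OneToOneColumnBase
{
    public int NumberOfBits = 31;
}

public static class PixelExtractingApi
{
    // Taking the sealed type means a HashingColumn cannot be passed here by mistake;
    // an API taking OneToOneColumnBase would accept either derived class.
    public static string Describe(PixelExtractingColumn column)
        => $"{column.InputColumnName ?? column.OutputColumnName} -> {column.OutputColumnName}";
}
```

In this arrangement, a call such as `PixelExtractingApi.Describe(new HashingColumn())` is rejected at compile time, which is one way to read the concern about mismatched derived classes.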

Contributor Author

I am making it abstract, thanks for pointing it out.


In reply to: 267543430 [](ancestors = 267543430)

/// <summary>
/// Instantiates a <see cref="ColumnOptions"/> from the column name.
/// </summary>
public static implicit operator OneToOneColumn(string outputColumnName)
Contributor

@TomFinley TomFinley Mar 20, 2019

implicit [](start = 22, length = 8)

I get a little nervous when I see implicit operators from common types, especially since in the same APIs that we're going to be using these sorts of types we're going to have string parameters being used. We could sort of get away with it with the value-tuples above -- we don't use value-tuples anywhere else -- but this makes me very nervous. #Resolved
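
One way to read this concern, sketched with hypothetical types (ColumnPair and HypotheticalCatalog are invented for illustration and are not part of ML.NET): when a string-taking overload sits next to a params overload that accepts strings through an implicit operator, adding one more string argument silently changes what the call means.

```csharp
// Hypothetical types; invented to illustrate the hazard, not taken from ML.NET.
public sealed class ColumnPair
{
    public readonly string Output;
    public readonly string Input;

    public ColumnPair(string output, string input)
    {
        Output = output;
        Input = input;
    }

    // The implicit operator under discussion: a bare name maps a column to itself.
    public static implicit operator ColumnPair(string name) => new ColumnPair(name, name);
}

public static class HypotheticalCatalog
{
    // Conventional overload: (output, input) describes ONE column mapping.
    public static ColumnPair[] ExtractPixels(string outputColumnName, string inputColumnName = null)
        => new[] { new ColumnPair(outputColumnName, inputColumnName ?? outputColumnName) };

    // Multi-column overload: every string becomes its OWN self-mapped column.
    public static ColumnPair[] ExtractPixels(params ColumnPair[] columns) => columns;
}
```

Here `ExtractPixels("pixels", "image")` binds to the first overload and yields one mapping from "image" to "pixels", while `ExtractPixels("pixels", "image", "extra")` binds to the params overload and yields three self-mapped columns, with no compiler warning about the change in meaning.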

Contributor Author

Ok will remove the implicit operators, and simply have a constructor in the options class that will take tuples of strings.


In reply to: 267545690 [](ancestors = 267545690)

return new OneToOneColumn(outputColumnName);
}

public OneToOneColumn(string outputColumnName, string inputColumnName = null)
Contributor

@TomFinley TomFinley Mar 20, 2019

OneToOneColumn [](start = 15, length = 14)

This constructor is public. Note that we have a non-sealed directly instantiable class... #Resolved

Contributor Author

Ok will remove the constructor


In reply to: 267546107 [](ancestors = 267546107)

Contributor

@TomFinley TomFinley left a comment

Let's start with the obvious: we need the class structure to change. But this PR makes me very nervous. We're undertaking an expansion of our public API in a novel direction when we have, at best, four working days to detect any issues, and no clear compelling scenario to directly inform whether we're making the right choices or not. That just seems rather unwise to me, even if this PR had no problems.

@glebuk
Contributor

glebuk commented Mar 20, 2019

The concern with not taking this PR is that users will have to specify a transform per column. This can result in an N*M slowdown, where N is the number of columns and M is the number of trainable transforms per column. The most common example is Categorical: we would strongly prefer to fit all the categories for all columns in a single pass.
If we ship v1.0 with no mechanism to specify multiple columns per transform, we would leave >10x perf on the table and encourage a large set of users to use suboptimal patterns.
A secondary benefit would be to expose the common surface for the C# API, Nimbus, and other entrypoint-based surface areas of ML.NET.

If there is a simple way to expose just the multi-column support, but not the rest of the arg object, that would be a fine solution at this time.
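
The data-pass argument above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: the dataset is an in-memory array of string rows, "fitting" means collecting the distinct values of a column, and PassCountingSketch is a made-up class unrelated to the ML.NET API.

```csharp
using System.Collections.Generic;

// Toy model of trainable transforms; not the ML.NET API.
public static class PassCountingSketch
{
    // Number of times the whole dataset has been enumerated.
    public static int Passes;

    // Enumerates the rows and records one pass each time enumeration starts.
    static IEnumerable<string[]> Scan(string[][] rows)
    {
        Passes++;
        foreach (var row in rows)
            yield return row;
    }

    // One "transform" per column: every fit re-reads the data.
    public static HashSet<string>[] FitPerColumn(string[][] rows, int columnCount)
    {
        var vocab = new HashSet<string>[columnCount];
        for (int c = 0; c < columnCount; c++)
        {
            vocab[c] = new HashSet<string>();
            foreach (var row in Scan(rows))
                vocab[c].Add(row[c]);
        }
        return vocab;
    }

    // One multi-column "transform": all vocabularies are built in a single read.
    public static HashSet<string>[] FitAllColumns(string[][] rows, int columnCount)
    {
        var vocab = new HashSet<string>[columnCount];
        for (int c = 0; c < columnCount; c++)
            vocab[c] = new HashSet<string>();
        foreach (var row in Scan(rows))
            for (int c = 0; c < columnCount; c++)
                vocab[c].Add(row[c]);
        return vocab;
    }
}
```

In this model FitPerColumn enumerates the data once per column while FitAllColumns enumerates it once in total; both produce the same per-column value sets.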

@artidoro
Contributor Author

Closing in favor of a simpler approach prior to v1.

@artidoro artidoro closed this Mar 20, 2019
@sfilipi
Member

sfilipi commented Mar 21, 2019

Been thinking that the multicolumn case might be more common than we think, for a few transforms, after seeing this PR: https://github.com/dotnet/machinelearning-samples/pull/321/files#diff-5f3b1d72514e8cdce1dcac64aaf8a097


In reply to: 475059668 [](ancestors = 475059668)

@ghost ghost locked as resolved and limited conversation to collaborators Mar 23, 2022
Labels
API Issues pertaining the friendly API
5 participants