-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Refactoring of Options for ImagePixelExtractingEstimator #3033
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
48c0f33
to
1d790bc
Compare
Codecov Report
@@ Coverage Diff @@
## master #3033 +/- ##
==========================================
+ Coverage 72.41% 72.42% +<.01%
==========================================
Files 803 803
Lines 143851 143863 +12
Branches 16173 16175 +2
==========================================
+ Hits 104171 104188 +17
+ Misses 35258 35256 -2
+ Partials 4422 4419 -3
|
if (!inputSchema.TryGetColumnIndex(item.Source, out colSrc)) | ||
throw host.ExceptUserArg(nameof(OneToOneColumn.Source), "Source column '{0}' not found", item.Source); | ||
if (!inputSchema.TryGetColumnIndex(item.InputColumnName, out colSrc)) | ||
throw host.ExceptUserArg(nameof(OneToOneColumn.InputColumnName), "Source column '{0}' not found", item.InputColumnName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Source [](start = 90, length = 6)
Input column
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -290,8 +290,8 @@ internal static IDataTransform Create(IHostEnvironment env, MinMaxArguments args | |||
|
|||
var columns = args.Columns | |||
.Select(col => new NormalizingEstimator.MinMaxColumnOptions( | |||
col.Name, | |||
col.Source ?? col.Name, | |||
col.OutputColumnName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OutputColumnName [](start = 24, length = 16)
I don't think that this is necessarily important, but in prior discussions we've discussed that in the case of a column options specifically, given that it's clear this is talking about columns we have discussed that it may be wise to name it InputName
as opposed to InputColumnName
, since it's obvious we're talking about columns here.
/// <summary> | ||
/// Specifies input and output column names for a transformation | ||
/// </summary> | ||
public class OneToOneColumn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OneToOneColumn [](start = 17, length = 14)
Is OneToOneColumn a good name? Usually one-to-one
describes a mapping or a function, which doesn't sound a Column
. Maybe we could try OneToOneBind
?
/// <summary> | ||
/// Instantiates a <see cref="ColumnOptions"/> from a tuple of input and output column names. | ||
/// </summary> | ||
public static implicit operator OneToOneColumn((string outputColumnName, string inputColumnName) value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(string outputColumnName, string inputColumnName) value [](start = 55, length = 55)
I have an impression that tuples are not very welcomed.
public string OutputColumnName; | ||
|
||
/// <summary>Name of the column to transform. If set to <see langword="null"/>, the value of the <see cref= "OutputColumnName"/> will be used as source.</summary> | ||
[Argument(ArgumentType.AtMostOnce, HelpText = "Name of the source column", ShortName = "source, src")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ShortName [](start = 83, length = 9)
Note that you could have used the Name
field here. That might have been preferable if we care about not changing the behavior of command line and entry-points, at least not yet. (We can always adjust later if we really want to.) #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
/// <summary>Name of the column to transform. If set to <see langword="null"/>, the value of the <see cref= "OutputColumnName"/> will be used as source.</summary> | ||
[Argument(ArgumentType.AtMostOnce, HelpText = "Name of the source column", ShortName = "source, src")] | ||
public string InputColumnName = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null [](start = 40, length = 4)
This null
declaration serves no purpose. The default value of the field is null if unassigned, as we see above with OutputColumnName
. #Resolved
@@ -14,16 +14,40 @@ | |||
|
|||
namespace Microsoft.ML.Data | |||
{ | |||
internal abstract class SourceNameColumnBase | |||
public class OneToOneColumn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OneToOneColumn [](start = 17, length = 14)
Do you mean this class to be used and directly instantiable? I would really not like that. I would prefer that our API be over sealed classes -- otherwise we'll have weird things like it being possible to assign OneToOneColumn
taking APIs to mismatching descending classes.
As a general policy, we prefer all classes to be either abstract or sealed. We have generally followed this advice. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am making it abstract, thanks for pointing it out.
In reply to: 267543430 [](ancestors = 267543430)
/// <summary> | ||
/// Instantiates a <see cref="ColumnOptions"/> from the column name. | ||
/// </summary> | ||
public static implicit operator OneToOneColumn(string outputColumnName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
implicit [](start = 22, length = 8)
I get a little nervous when I see implicit operators from common types, especially since in the same APIs that we're going to be using these sorts of types we're going to have string
parameters being used. We could sort of get away with it with the value-tuples above -- we don't use value-tuples anywhere else -- but this makes me very nervous. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok will remove the implicit operators, and simply have a constructor in the options class that will take tuples of strings.
In reply to: 267545690 [](ancestors = 267545690)
return new OneToOneColumn(outputColumnName); | ||
} | ||
|
||
public OneToOneColumn(string outputColumnName, string inputColumnName = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OneToOneColumn [](start = 15, length = 14)
This constructor is public. Note that we have a non-sealed directly instantiable class... #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's start with the obvious, we need the class structure to change. But, this PR makes me very nervous. We're undertaking an expansion of our public API in a novel direction when we have, at best, four working days to detect any issues, and no clear compelling scenario to directly inform whether we're making the right choices or not. That just seems rather unwise to me, even if this PR had no problems.
The concern of not taking this PR is that users will have to specify a transform per column. This can result in N*M slowdown where N is number of columns and M is the number of trainable transforms per column. The most common example is Categorical, We would strongly prefer to fit all the categories for all columns in a single pass. If there is a simple way to expose just the multi-column support, but not the rest of the arg object, that would be a fine solution at this time. |
Closing in favor of a simpler approach prior to v1. |
Been thinking that the multicolumn case might be more common than we think, for a few transforms, after seeing this PR: https://github.com/dotnet/machinelearning-samples/pull/321/files#diff-5f3b1d72514e8cdce1dcac64aaf8a097 In reply to: 475059668 [](ancestors = 475059668) |
This PR is an example solution for #2884. Once I receive feedback on this, I will continue with the rest of the transforms.
The purpose of this PR is twofold:
Options
in other transforms (commit 1)OneToOneColumn
public, removed empty className
andSource
toOutputColumnName
andInputColumnName
OneToOneColumn
to simplify the multicolumn mapping scenarioOptions
class forImagePixelExtractingEstimator
(commit 2)ColumnOptions
to transformer and renamedColumnInfo
Options
andColumn
to estimatorThe third commit fixes tests and entrypoint catalog.
Note that with the combination of the implicit operators on
OneToOneColumn
and the constructor takingOneToOneColumn
in theOptions
class, it is easier to define the multicolumn mapping scenario where the columns don't require column specific settings. For an example of the behavior see the test TestImagePixelExtractOptions in ImagesTests.cs: