-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Added convenience constructor for set of transforms (#380). #405
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Column = cols | ||
}; | ||
Host.CheckValue(args, nameof(args)); | ||
Host.CheckUserArg(Utils.Size(args.Column) > 0, nameof(args.Column)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand why we do it other constructor, because we get them as parameters, and you need to validate them, but here, you create args and it content specifically.
I would rather have validation of outputColumn, inputColumns, rather than copypaste validation from other constructor. (input get checked in base constructor) #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sense! #Resolved
@@ -76,6 +76,12 @@ public BootstrapSampleTransform(IHostEnvironment env, Arguments args, IDataView | |||
_poolSize = args.PoolSize; | |||
} | |||
|
|||
public BootstrapSampleTransform(IHostEnvironment env, IDataView input, bool complement = false, uint? seed = null, bool shuffleInput = true, int poolSize = 1000) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool complement = false, uint? seed = null, bool shuffleInput = true, int poolSize = 1000 [](start = 79, length = 89)
out of curiosity, why argument object is bad solution? why we want to flatten that object in set of parameters?
I'm also concern what we will have to support this defaults in two places, and it really error prone. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so that transform can be neatly instantiated on one line. Otherwise for every transform a relevant "Argument" object is needed to be created first which is also another option.
In reply to: 197909492 [](ancestors = 197909492)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see why it can be helpful in case if you have only Column object inside argument class, you add one more constructor which is nice to use. But in this particular case, i don't see any advantages of using this constructor, and it makes code maintainability much harder.
Also compare this constructor with Cathash constructor.
In cathash you hide all other parameters from argument class other than columns, and here you expose them.
Can we be consistent?
I don't mind to have constructor which will take only Columns information and will use default parameters for internal argument class, but make sure to proper comment which indicates what if user needs more, it need to use Arguments class
In reply to: 197912065 [](ancestors = 197912065,197909492)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but again, feel free to bring heavy artillery (@TomFinley)
In reply to: 197921908 [](ancestors = 197921908,197912065,197909492)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this case, definitely constructor can be removed. But I have a concern regarding if user has to instantiate an Argument
object for a very simple case e.g. for all default values.
Another option could be
public BootstrapSampleTransform(IHostEnvironment env, IDataView input)
: this(env, new Arguments() { }, input)
{
}
Yes, for CatHash
more parameters can be added. Waiting for more comments on this. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH I actually rather liked the constructor as it stands -- I do not consider the proposed replacement BootstrapSampleTransform(IHostEnvironment env, IDataView input)
acceptable because the bootstrap sampler I expect is one of those things that, when it is used, will be used in a pair: first to get the in-bag sample (for training), then to get the out of bag sample (for validation and testing). Whether we expose the other things as parameters or not, I am less certain about the utility of these.
@Ivanidzo4ka , is the concern with having a convenience constructor with default arguments, that the defaults could diverge? If that is the case, would perhaps prefer constants (either directly as private const
in the transform class itself, or perhaps more ideally as public const
in a nested private static
class named Defaults
or something).
If the point is rather that there should be exactly one way to do this, I'm not entirely certain I agree -- the arguments object may need to exist for the command line parser and entry-points which (being reflection driven) require total uniformity in how things are instantiated, but this bears very little resemblance to how someone, if trying to write a .NET API, would actually write code. If it is a small "shim" as it is now over the arguments, I'm not sure I see the harm, and I see a considerable benefit. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just don't see much reason to have constructor which duplicate other constructor with one difference: instead of one object we accept all the fields of this object.
I just don't understand why it's better and what convenience it brings.
I'm perfectly fine with something like
class(main_option)
class(main_option, secondary_parameter)
i just don't understand why you want to have
class(A,B,C,D,E,F)
and
class(argument_object) where argument_object has fields A,B,C,D,E,F
In reply to: 197983331 [](ancestors = 197983331)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree. Where they can be applied, the new idioms are clearly easier to use and understand.
In reply to: 198273557 [](ancestors = 198273557,197983331)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is based on the convenience you get in calling constructor. In my point of view, calling
new BootstrapSampleTransform(env, input, Complement, Seed, ShuffleInput, PoolSize);
is more convenient and elegant than calling
new BootstrapSampleTransform(env, input, new Arguments() { Complement = complement, Seed = seed, ShuffleInput = shuffleInput, PoolSize = poolSize });
In reply to: 198273557 [](ancestors = 198273557,197983331)
Column = cols | ||
}; | ||
|
||
_bindings = new Bindings(args.Column, null, Source.Schema); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
args.Column [](start = 37, length = 11)
what's the point to create args, if you only use it property? #Resolved
XML docs for constructor/create methods will be added in next commit. #Closed |
@@ -120,6 +120,41 @@ public sealed class Arguments : TransformInputBase | |||
|
|||
public const string UserName = "Categorical Hash Transform"; | |||
|
|||
public static IDataTransform Create(IHostEnvironment env, IDataView input, int hashBits = 16, uint seed = 314489979, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For managing these default values, const can added and used wherever the defaults needed. #Closed
{ | ||
inputOutputColumns[i].inputColumn = inputOutputColumns[i].outputColumn = inputColumns[i]; | ||
} | ||
return Create(env, input, inputOutputColumns); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only practical use of the copy transform is to, effectively, rename columns. Therefore, handling the case where name of the output is identical to that of the input seems to serve no purpose. #Closed
|
||
public static IDataTransform Create(IHostEnvironment env, IDataView input, int hashBits = 16, uint seed = 314489979, | ||
bool ordered = true, int invertHash = 0, CategoricalTransform.OutputKind outputKind = CategoricalTransform.OutputKind.Bag, | ||
params(string inputColumn, string outputColumn)[] inputOutputColumns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
params(string inputColumn, string outputColumn)[] inputOutputColumns [](start = 12, length = 68)
Let's try not to do this "multi-column" thing where reasonably we cannot. I'd like to have each such transform like this handle the single mapping case easily -- the multi-column thing, if necessary, I'm generally happy to tell people to revert to the Arguments
scheme.
By insisting on handling that, and doing this params
thing, you have introduced something incredibly unfortunate: by far the most important argument is the name of the output (and, if different, the source), but since it is a params
it must go last. That's really too bad!
I propose that all constructor signatures for all one-to-one transforms follow this scheme:
public FooTransform(IHostEnvironment env, IDataView input, string name, string source = null, ...)
Where of course, ...
refers to any additional parameters. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make the point a bit more broad than this "names" thing, it is completely fine and expected that the Arguments
case will fulfill some cases that the convenience constructor cannot. The convenience constructor should handle the most common cases of the transform successfully, and with as much brevity as possible. (Indeed, in the case of the GUI, it cannot handle multiple columns in one go, yet most users seem perfectly happy with this. I expect the same will be true of this API.) The constructor certainly needn't cover the total, full breadth of everything in the Arguments
. For example, I can definitely see exposing hashBits
and invertHash
, but I'm not sure I've seen too many people set seed
or ordered
.
In reply to: 197980521 [](ancestors = 197980521)
public static IDataTransform CreateGlobalContrastNormalizer(IHostEnvironment env, IDataView input, GcnArguments args) | ||
{ | ||
return new LpNormNormalizerTransform(env, args, input); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not certain I see the utility in this specific constructor. #Closed
public static IDataTransform CreateLpNormNormalizer(IHostEnvironment env, IDataView input, Arguments args) | ||
{ | ||
return new LpNormNormalizerTransform(env, args, input); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also seems to be of limited utility. #Closed
public static DropColumnsTransform CreateColumnSelector(IHostEnvironment env, IDataView input, params string[] columnsToKeep) | ||
{ | ||
return new DropColumnsTransform(env, new KeepArguments() { Column = columnsToKeep }, input); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to keep columns, we do DropColumnsTransform.CreateColumnSelector
. :) The way that we resolve this on the command line, entry-points, the GUI, and whatnot, is we have this "KeepColumnsTransform" loadname declared, to make it look like there is another transform.
The fact that we do so in the underlying C# code is just because previously this little bit of confusion didn't matter, since we weren't advocating that people use this directly. However since we now are, we ought to adjust accordingly.
For that reason, I wonder if we can make the relationship explicit...
public static class KeepColumnsTransform
{
public IDataTransform Create(IHostEnvironment env, IDataView input, params string[] columnsToKeep)
=> new DropColumnsTransform(env, new KeepArguments() { Column = columnsToKeep }, input);
}
or something along these lines.
Once that ambiguity is resolved, we can change "drop" creator back to a plain old constructor. #Closed
@@ -527,6 +527,19 @@ private static VersionInfo GetVersionInfo() | |||
|
|||
public override ISchema Schema => _bindings; | |||
|
|||
public ConcatTransform(IHostEnvironment env, IDataView input, string outputColumn, params string[] inputColumns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string outputColumn, params string[] inputColumns [](start = 70, length = 49)
This usage is inconsistent with your usage elsewhere, where the source is first, and the names of the output is second.
I would recommend actually making name of output first, and name of inputs second (as you have done here)... since I think the situation where you have multiple params
to indicate input columns is more common. Indeed I cannot think of a single case where a params
describes output columns, but I can think of instances where it describes input columns. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have renamed to "name" and "source" as well to be consistent with their naming in Argument object.
In reply to: 197985474 [](ancestors = 197985474)
Source = inputColumns | ||
}; | ||
|
||
_bindings = new Bindings(cols, null, Source.Schema); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_bindings = new Bindings(cols, null, Source.Schema); [](start = 12, length = 52)
I'd prefer that you construct an arguments object, then pass it via : this(env, args, input)
. The reason is, you have missed some validation that happens -- e.g., you have missed the checks that currently appear in the alternate constructor on lines 551-552... I do not suggest duplicating them, but calling them.
Since you cannot call a constructor beyond that call AFAIK, this implies that you will probably need some sort of private static Arguments
utility method to create that Arguments
object, so you can pass it in. Or, perhaps even better, add a convenience constructor on the Arguments
object itself, to hold this logic.
Generally I feel like if the constructors here do anything at all, the scheme is probably not well designed, since that means that there is probably some code and logic duplication going on, (e.g., what we see between lines 540 and 554), and which also invites inconsistency (e.g., what we missed on lines 551-552). #Closed
/// <param name="source">Name of the column to be transformed. If this is null '<paramref name="name"/>' will be used.</param> | ||
/// <param name="hashBits">Number of bits to hash into. Must be between 1 and 30, inclusive.</param> | ||
/// <param name="invertHash">Limit the number of keys used to generate the slot name to this many. 0 means no invert hashing, -1 means no limit.</param> | ||
/// <param name="outputKind">Output kind: Bag (multi-set vector), Ind (indicator vector), or Key (index).</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bag (multi-set vector), Ind (indicator vector), or Key (index [](start = 50, length = 61)
This would be better handled by XML comments on the enum itself, I'd think. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comments to enum directly and updated comments here.
In reply to: 198231959 [](ancestors = 198231959)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -282,6 +369,22 @@ public static NormalizeTransform Create(IHostEnvironment env, BinArguments args, | |||
return func; | |||
} | |||
|
|||
public static NormalizeTransform CreateBinningNormalizer(IHostEnvironment env, IDataView input, string labelColumn, string name, string source = null, int numBins = Defaults.NumBins, int minBinSize = Defaults.MinBinSize) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public static NormalizeTransform CreateBinningNormalizer(IHostEnvironment env, IDataView input, string labelColumn, string name, string source = null, int numBins = Defaults.NumBins, int minBinSize = Defaults.MinBinSize) [](start = 8, length = 220)
so long #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// <param name="subMean">Subtract mean from each value before normalizing.</param> | ||
/// <param name="useStdDev">Normalize by standard deviation rather than L2 norm.</param> | ||
/// <param name="scale">Scale features by this value.</param> | ||
public static IDataTransform CreateGlobalContrastNormalizer(IHostEnvironment env, IDataView input, string name, string source = null, bool subMean = Defaults.GcnSubMean, bool useStdDev = Defaults.UseStdDev, Float scale = Defaults.Scale) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public static IDataTransform CreateGlobalContrastNormalizer(IHostEnvironment env, IDataView input, string name, string source = null, bool subMean = Defaults.GcnSubMean, bool useStdDev = Defaults.UseStdDev, Float scale = Defaults.Scale) [](start = 8, length = 236)
so so long #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// <param name="hashBits">Number of bits to hash into. Must be between 1 and 30, inclusive.</param> | ||
/// <param name="invertHash">Limit the number of keys used to generate the slot name to this many. 0 means no invert hashing, -1 means no limit.</param> | ||
/// <param name="outputKind">The type of output expected.</param> | ||
public static IDataTransform Create(IHostEnvironment env, IDataView input, string name, string source =null, int hashBits = Defaults.HashBits, int invertHash = Defaults.InvertHash, CategoricalTransform.OutputKind outputKind = Defaults.OutputKind) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so so long #Resolved
: this(env, new Arguments(){ Column = new[] { new Column() { Source = source, Name = name }}}, input) | ||
{ | ||
|
||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Blank lines with trailing whitespace on them is evil. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks all for the valuable feedback! |
Please don't merge this yet. |
* Added convenience constructor for set of transforms (dotnet#371). * Removed useless validation from Concate transform. * Added more parameters to some transforms. * Addressed reviewers' comments. * XML Comments added to constructors/helper methods. * Created private static class for managing default values. * Addressed reviewers' comments. * Resolved some formatting issues.
This PR fixes #380.
Added convenience constructor (or create method) for following set of transforms