-
Notifications
You must be signed in to change notification settings - Fork 1.9k
KMeans and Implicit weight cleanup #2158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
/// <summary> | ||
/// Fits the scored <see cref="IDataView"/> creating a <see cref="CalibratorTransformer{TICalibrator}"/>. | ||
/// </summary> | ||
/// <param name="input"></param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[](start = 12, length = 28)
with "Label" and "Features" columns.
From what I see, we actually don't use Score column at all, we just run predictor on top of feature column to produce "Score" column. So i'm not sure how necessary is to have that Score role in roleMappedData. #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
During transform stage. During Fit we train calibrator on combination of Predictor + FeatureColumn and LabelColumn.
But it's just an observation. You don't have to do anything about that.
In reply to: 248152774 [](ancestors = 248152774,248115187)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But you can still put something into empty param, tho
In reply to: 248410187 [](ancestors = 248410187,248152774,248115187)
@@ -6,6 +6,7 @@ | |||
using System.Collections.Generic; | |||
using Microsoft.ML.Core.Data; | |||
using Microsoft.ML.Data; | |||
using Microsoft.ML.EntryPoints; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look right. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This namespace is in Microsoft.ML.Core. There is no circular dependancy
In reply to: 248115282 [](ancestors = 248115282)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It just strange what we have EntryPoints as dependency. But I guess it's related to fact what Arguments class shared with entrypoints.
In reply to: 248152842 [](ancestors = 248152842,248115282)
/// <summary> | ||
/// Train a KMeans++ clustering algorithm. | ||
/// </summary> | ||
/// <param name="ctx">The regression context trainer object.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regression [](start = 34, length = 10)
clustering. #Resolved
settings.K = 4; | ||
settings.NumThreads = 1; | ||
settings.InitAlgorithm = Trainers.KMeans.KMeansPlusPlusTrainer.InitAlgorithm.Random; | ||
FeatureColumn = "Features", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Features" [](start = 37, length = 9)
DefaultColumnNames.Features #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should told me what I'm wrong and it should be nameof(BreastCancerFeatureVector.Features). or just "Features" since we use same string above. My bad.
In reply to: 248115971 [](ancestors = 248115971)
{ | ||
Contracts.CheckValue(ctx, nameof(ctx)); | ||
var env = CatalogUtils.GetEnvironment(ctx); | ||
return new KMeansPlusPlusTrainer(env, options); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
options [](start = 50, length = 7)
Host.CheckValue(options, nameof(options));
You have this in internal KMeansPlusPlusTrainer(IHostEnvironment env, Options options)
if no one will specify options that check will throw.
Is it works? Do we somehow magically create Options somewhere? #Closed
/// </summary> | ||
/// <param name="ctx">The regression context trainer object.</param> | ||
/// <param name="options">Algorithm advanced settings.</param> | ||
public static KMeansPlusPlusTrainer KMeans(this ClusteringContext.ClusteringTrainers ctx, KMeansPlusPlusTrainer.Options options = null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
= null [](start = 135, length = 7)
this should be a required parameter i believe #Closed
@@ -106,7 +106,7 @@ internal OlsLinearRegressionTrainer(IHostEnvironment env, Arguments args) | |||
advancedSettings?.Invoke(args); | |||
args.FeatureColumn = featureColumn; | |||
args.LabelColumn = labelColumn; | |||
args.WeightColumn = weightColumn; | |||
args.WeightColumn = weightColumn != null ? Optional<string>.Explicit(weightColumn) : Optional<string>.Implicit(DefaultColumnNames.Weight); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weightColumn != null [](start = 32, length = 21)
maybe we should have a separate PR just to fix the weights bug across the entire codebase ?
that way we isolate the API changes from the bug related to Weight. Also, less dependency between PRs related to learner API changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
than we'd need a separate PR to fix the ctor in KMeans .. and this PR does take care of both..
In reply to: 248117752 [](ancestors = 248117752)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my concern is by coupling the weights changes with API changes, we are making ourselves susceptible to inconsistency across PRs etc.
Would be much more convenient if we have consistent story for weights, get that reviewed + checked in and then move on with API fixing .
So I am not sure at this point if I need to make similar changes for the weights bug in my existing PRs or not ? I see we are adding overloads for MakeR4ScalarWeightColumn
.
In reply to: 248158333 [](ancestors = 248158333,248117752)
var rec = new TrainerEstimatorReconciler.Clustering( | ||
(env, featuresName, weightsName) => | ||
{ | ||
options.FeatureColumn = featuresName; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
options [](start = 16, length = 7)
you need to check are they null, and create new one if they are. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's what happen when you are in a hurry to press publish..
In reply to: 248119409 [](ancestors = 248119409)
/// <example> | ||
/// <format type="text/markdown"> | ||
/// <] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SDCA [](start = 27, length = 4)
hmm #Closed
/// <example> | ||
/// <format type="text/markdown"> | ||
/// <] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KMeans_example [](start = 93, length = 14)
name of files is actually KMeans.cs #Closed
/// <returns>The predicted output.</returns> | ||
public static (Vector<float> score, Key<uint> predictedLabel) KMeans(this ClusteringContext.ClusteringTrainers ctx, | ||
Vector<float> features, Scalar<float> weights = null, | ||
KMeansPlusPlusTrainer.Options options = null, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the API pattern I am following is that both weights
and options
are required parameters for the Options
API
(based on Tom's example in issue description) #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a different question, why do we want to the features, and the weight in this second extension; can we just have it be options, and provide everything in there?
If we provide the weights and the features in two places, than we have to deal with explaining who overrides what. Can we have this second extension have just the Options?
In reply to: 248121755 [](ancestors = 248121755)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, quoting from: #1798
"It is good to have the convenience for the simple arguments, however, if we have both simple and advanced settings, we should not mix them but have instead two distinct constructors/extension methods. (E.g., in the above, we would have two methods, one that took the advanced options.) To do otherwise is to invite confusion about which "wins" if we have the setting set in both.
Note that phase setting "set in both," which suggests that these settings object should retain the "simpler" settings in them. This reinforces feedback elsewhere as seen here."
Seems like the second extension should not have the label, features, weights in the signature; but just Options.
In reply to: 248157402 [](ancestors = 248157402,248121755)
@@ -151,7 +151,7 @@ public abstract class FastTreeTrainerBase<TArgs, TTransformer, TModel> : | |||
/// Constructor that is used when invoking the classes deriving from this, through maml. | |||
/// </summary> | |||
private protected FastTreeTrainerBase(IHostEnvironment env, TArgs args, SchemaShape.Column label) | |||
: base(Contracts.CheckRef(env, nameof(env)).Register(RegisterName), TrainerUtils.MakeR4VecFeature(args.FeatureColumn), label, TrainerUtils.MakeR4ScalarWeightColumn(args.WeightColumn, args.WeightColumn.IsExplicit)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is/was args.WeightColumn.IsExplicit
used for? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is something that got introduced with EntryPoints. If the user doesn't specify a name for the optional columns, an Implict argument get created for them... not sure of the original thoughts around why not just leave it null. Maybe to avoid dealing with its serialization across languages.. idk.
In reply to: 248127604 [](ancestors = 248127604)
@@ -71,8 +71,11 @@ public void KMeansEstimator() | |||
|
|||
|
|||
// Pipeline. | |||
var pipeline = new KMeansPlusPlusTrainer(Env, featureColumn, weights: weights, | |||
advancedSettings: s => { s.InitAlgorithm = KMeansPlusPlusTrainer.InitAlgorithm.KMeansParallel; }); | |||
var pipeline = new KMeansPlusPlusTrainer(Env, new KMeansPlusPlusTrainer.Options { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
KMeansPlusPlusTrainer [](start = 31, length = 21)
am surprised... is this invoking the constructor ? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nothing wrong with it.. the extensions return the same object too, an IEstimator.
In reply to: 248377514 [](ancestors = 248377514)
/// <param name="ctx">The regression context trainer object.</param> | ||
/// <param name="features">The features, or independent variables.</param> | ||
/// <param name="ctx">The clustering context trainer object.</param> | ||
/// <param name="featureColumn">The features, or independent variables.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe featureColumnName
? In ML.NET, we mix the meanings of column name, column index, and column itself. I feel our naming can be more specific and therefore less ambiguous. #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will leave it to featureColumn, because we have had a loooonnnggg discussion about it, and a PR to standartize those.
In reply to: 248383325 [](ancestors = 248383325)
/// </summary> | ||
/// <param name="input"></param> | ||
/// <returns>A trained <see cref="CalibratorTransformer{TICalibrator}"/> that will transforms the data by adding the | ||
/// Probability column.</returns> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a constant string for the name of a probability column? If yes, it'd be nice to make a reference here (<see cref="">
). #Resolved
@@ -122,6 +122,12 @@ SchemaShape IEstimator<CalibratorTransformer<TICalibrator>>.GetOutputSchema(Sche | |||
return new SchemaShape(outColumns.Values); | |||
} | |||
|
|||
/// <summary> | |||
/// Fits the scored <see cref="IDataView"/> creating a <see cref="CalibratorTransformer{TICalibrator}"/>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// Fits the scored <see cref="IDataView"/> creating a <see cref="CalibratorTransformer{TICalibrator}"/>. | |
/// Fits the scored <see cref="IDataView"/> creating a <see cref="CalibratorTransformer{TICalibrator}"/> which may transform the score column to a probability column. |
In addition, do we have specific names for score and probability columns? If yes, we can mention those constants here. #Resolved
[Argument(ArgumentType.AtMostOnce, HelpText = "Tolerance parameter for trainer convergence. Low = slower, more accurate", | ||
ShortName = "ot")] | ||
Name = "OptTol", ShortName = "ot")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"ot" [](start = 45, length = 4)
you can enumerate ShortNames via comma "opttol,ot" #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently name is for maml, and i set it to the previous name for backwards compatibility.
In reply to: 248405346 [](ancestors = 248405346)
/// <summary> | ||
/// The number of clusters. | ||
/// </summary> | ||
[Argument(ArgumentType.AtMostOnce, HelpText = "The number of clusters", SortOrder = 50, Name = "K")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name [](start = 100, length = 4)
ShortName maybe? it's a "K" is quite short, right? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name is for backwards compatibility. This is what maml uses.
In reply to: 248405583 [](ancestors = 248405583)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ShortNames for same thing, and as @najeeb-kazmi discover yesterday, Name is not always work, but ShortName do (or maybe combination of ShortName + Name).
In reply to: 248405874 [](ancestors = 248405874,248405583)
/// Memory budget (in MBs) to use for KMeans acceleration. | ||
/// </summary> | ||
[Argument(ArgumentType.AtMostOnce, HelpText = "Memory budget (in MBs) to use for KMeans acceleration", | ||
Name = "AccelMemBudgetMb", ShortName = "accelMemBudgetMb")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name = "AccelMemBudgetMb" [](start = 16, length = 25)
I think we lowercase shortnames and name anyway, so I don't think you need to add this Name
parameter #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep it to clearly documents the previous name? Najeeb has been doing the same on his PR of pluralizing some args names.
In reply to: 248405947 [](ancestors = 248405947)
|
||
internal KMeansPlusPlusTrainer(IHostEnvironment env, Arguments args) | ||
: this(env, args, null) | ||
/// <param name="options">The advanced arguments of the algorithm.</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// The advanced arguments of the algorithm. [](start = 7, length = 75)
from KMeansCatalog.cs
/// <param name="options">Algorithm advanced settings.</param>
I like mix of arguments, options and settings. Anyone who is favor of one of them get something :) #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} | ||
|
||
public class Arguments : UnsupervisedLearnerInputBaseWithWeight | ||
public class Options : UnsupervisedLearnerInputBaseWithWeight |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Options [](start = 21, length = 7)
Just an observation.
I found it's funny what we diverge during ITransformer/IEstimator conversion and for what used to be transforms we come up with ColumnInfo and we no longer rely on arguments. (We still have conversion from arguments to columnInfo for EntryPoints)
Eventually half of our entrypoints would become options, and other half remain arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point... and a bummer. Does it make sense to have a plan to reconcile?
In reply to: 248412000 [](ancestors = 248412000)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ | ||
FeatureColumn = featuresName, | ||
ClustersCount = clustersCount, | ||
WeightColumn = weightsName != null ? Optional<string>.Explicit(weightsName) : Optional<string>.Implicit(DefaultColumnNames.Weight) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weightsName != null ? Optional.Explicit(weightsName) : Optional.Implicit(DefaultColumnNames.Weight) [](start = 35, length = 115)
do we need to add this ?
Towards #1798
Renaming Args ->Options, internalizing ctors, and fixing the issue with the weigh column being initialized as Implicit or Explicit.