Skip to content

KMeans and Implicit weight cleanup #2158

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jan 16, 2019

Conversation

sfilipi
Copy link
Member

@sfilipi sfilipi commented Jan 16, 2019

Towards #1798

Renaming Args ->Options, internalizing ctors, and fixing the issue with the weigh column being initialized as Implicit or Explicit.

/// <summary>
/// Fits the scored <see cref="IDataView"/> creating a <see cref="CalibratorTransformer{TICalibrator}"/>.
/// </summary>
/// <param name="input"></param>
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[](start = 12, length = 28)

with "Label" and "Features" columns.
From what I see, we actually don't use Score column at all, we just run predictor on top of feature column to produce "Score" column. So i'm not sure how necessary is to have that Score role in roleMappedData. #Pending

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see line 262. We do use the score column.


In reply to: 248115187 [](ancestors = 248115187)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During transform stage. During Fit we train calibrator on combination of Predictor + FeatureColumn and LabelColumn.
But it's just an observation. You don't have to do anything about that.


In reply to: 248152774 [](ancestors = 248152774,248115187)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you can still put something into empty param, tho


In reply to: 248410187 [](ancestors = 248410187,248152774,248115187)

@@ -6,6 +6,7 @@
using System.Collections.Generic;
using Microsoft.ML.Core.Data;
using Microsoft.ML.Data;
using Microsoft.ML.EntryPoints;
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look right. #Closed

Copy link
Member Author

@sfilipi sfilipi Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This namespace is in Microsoft.ML.Core. There is no circular dependancy


In reply to: 248115282 [](ancestors = 248115282)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It just strange what we have EntryPoints as dependency. But I guess it's related to fact what Arguments class shared with entrypoints.


In reply to: 248152842 [](ancestors = 248152842,248115282)

/// <summary>
/// Train a KMeans++ clustering algorithm.
/// </summary>
/// <param name="ctx">The regression context trainer object.</param>
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regression [](start = 34, length = 10)

clustering. #Resolved

settings.K = 4;
settings.NumThreads = 1;
settings.InitAlgorithm = Trainers.KMeans.KMeansPlusPlusTrainer.InitAlgorithm.Random;
FeatureColumn = "Features",
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Features" [](start = 37, length = 9)

DefaultColumnNames.Features #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should told me what I'm wrong and it should be nameof(BreastCancerFeatureVector.Features). or just "Features" since we use same string above. My bad.


In reply to: 248115971 [](ancestors = 248115971)

{
Contracts.CheckValue(ctx, nameof(ctx));
var env = CatalogUtils.GetEnvironment(ctx);
return new KMeansPlusPlusTrainer(env, options);
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

options [](start = 50, length = 7)

Host.CheckValue(options, nameof(options));
You have this in internal KMeansPlusPlusTrainer(IHostEnvironment env, Options options)
if no one will specify options that check will throw.
Is it works? Do we somehow magically create Options somewhere? #Closed

/// </summary>
/// <param name="ctx">The regression context trainer object.</param>
/// <param name="options">Algorithm advanced settings.</param>
public static KMeansPlusPlusTrainer KMeans(this ClusteringContext.ClusteringTrainers ctx, KMeansPlusPlusTrainer.Options options = null)
Copy link
Member

@abgoswam abgoswam Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

= null [](start = 135, length = 7)

this should be a required parameter i believe #Closed

@@ -106,7 +106,7 @@ internal OlsLinearRegressionTrainer(IHostEnvironment env, Arguments args)
advancedSettings?.Invoke(args);
args.FeatureColumn = featureColumn;
args.LabelColumn = labelColumn;
args.WeightColumn = weightColumn;
args.WeightColumn = weightColumn != null ? Optional<string>.Explicit(weightColumn) : Optional<string>.Implicit(DefaultColumnNames.Weight);
Copy link
Member

@abgoswam abgoswam Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weightColumn != null [](start = 32, length = 21)

maybe we should have a separate PR just to fix the weights bug across the entire codebase ?

that way we isolate the API changes from the bug related to Weight. Also, less dependency between PRs related to learner API changes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

than we'd need a separate PR to fix the ctor in KMeans .. and this PR does take care of both..


In reply to: 248117752 [](ancestors = 248117752)

Copy link
Member

@abgoswam abgoswam Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my concern is by coupling the weights changes with API changes, we are making ourselves susceptible to inconsistency across PRs etc.

Would be much more convenient if we have consistent story for weights, get that reviewed + checked in and then move on with API fixing .

So I am not sure at this point if I need to make similar changes for the weights bug in my existing PRs or not ? I see we are adding overloads for MakeR4ScalarWeightColumn.


In reply to: 248158333 [](ancestors = 248158333,248117752)

var rec = new TrainerEstimatorReconciler.Clustering(
(env, featuresName, weightsName) =>
{
options.FeatureColumn = featuresName;
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

options [](start = 16, length = 7)

you need to check are they null, and create new one if they are. #Closed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's what happen when you are in a hurry to press publish..


In reply to: 248119409 [](ancestors = 248119409)

/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[SDCA](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/KMeans_example.cs)]
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SDCA [](start = 27, length = 4)

hmm #Closed

/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[SDCA](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/KMeans_example.cs)]
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KMeans_example [](start = 93, length = 14)

name of files is actually KMeans.cs #Closed

/// <returns>The predicted output.</returns>
public static (Vector<float> score, Key<uint> predictedLabel) KMeans(this ClusteringContext.ClusteringTrainers ctx,
Vector<float> features, Scalar<float> weights = null,
KMeansPlusPlusTrainer.Options options = null,
Copy link
Member

@abgoswam abgoswam Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the API pattern I am following is that both weights and options are required parameters for the Options API

(based on Tom's example in issue description) #Pending

Copy link
Member Author

@sfilipi sfilipi Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a different question, why do we want to the features, and the weight in this second extension; can we just have it be options, and provide everything in there?
If we provide the weights and the features in two places, than we have to deal with explaining who overrides what. Can we have this second extension have just the Options?

@TomFinley @glebuk


In reply to: 248121755 [](ancestors = 248121755)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, quoting from: #1798

"It is good to have the convenience for the simple arguments, however, if we have both simple and advanced settings, we should not mix them but have instead two distinct constructors/extension methods. (E.g., in the above, we would have two methods, one that took the advanced options.) To do otherwise is to invite confusion about which "wins" if we have the setting set in both.

Note that phase setting "set in both," which suggests that these settings object should retain the "simpler" settings in them. This reinforces feedback elsewhere as seen here."

Seems like the second extension should not have the label, features, weights in the signature; but just Options.


In reply to: 248157402 [](ancestors = 248157402,248121755)

@@ -151,7 +151,7 @@ public abstract class FastTreeTrainerBase<TArgs, TTransformer, TModel> :
/// Constructor that is used when invoking the classes deriving from this, through maml.
/// </summary>
private protected FastTreeTrainerBase(IHostEnvironment env, TArgs args, SchemaShape.Column label)
: base(Contracts.CheckRef(env, nameof(env)).Register(RegisterName), TrainerUtils.MakeR4VecFeature(args.FeatureColumn), label, TrainerUtils.MakeR4ScalarWeightColumn(args.WeightColumn, args.WeightColumn.IsExplicit))
Copy link
Contributor

@justinormont justinormont Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is/was args.WeightColumn.IsExplicit used for? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is something that got introduced with EntryPoints. If the user doesn't specify a name for the optional columns, an Implict argument get created for them... not sure of the original thoughts around why not just leave it null. Maybe to avoid dealing with its serialization across languages.. idk.


In reply to: 248127604 [](ancestors = 248127604)

@dotnet dotnet deleted a comment from abgoswam Jan 16, 2019
@dotnet dotnet deleted a comment from abgoswam Jan 16, 2019
@@ -71,8 +71,11 @@ public void KMeansEstimator()


// Pipeline.
var pipeline = new KMeansPlusPlusTrainer(Env, featureColumn, weights: weights,
advancedSettings: s => { s.InitAlgorithm = KMeansPlusPlusTrainer.InitAlgorithm.KMeansParallel; });
var pipeline = new KMeansPlusPlusTrainer(Env, new KMeansPlusPlusTrainer.Options {
Copy link
Member

@abgoswam abgoswam Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KMeansPlusPlusTrainer [](start = 31, length = 21)

am surprised... is this invoking the constructor ? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing wrong with it.. the extensions return the same object too, an IEstimator.


In reply to: 248377514 [](ancestors = 248377514)

/// <param name="ctx">The regression context trainer object.</param>
/// <param name="features">The features, or independent variables.</param>
/// <param name="ctx">The clustering context trainer object.</param>
/// <param name="featureColumn">The features, or independent variables.</param>
Copy link
Member

@wschin wschin Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe featureColumnName? In ML.NET, we mix the meanings of column name, column index, and column itself. I feel our naming can be more specific and therefore less ambiguous. #Pending

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will leave it to featureColumn, because we have had a loooonnnggg discussion about it, and a PR to standartize those.

#1524 (comment)


In reply to: 248383325 [](ancestors = 248383325)

/// </summary>
/// <param name="input"></param>
/// <returns>A trained <see cref="CalibratorTransformer{TICalibrator}"/> that will transforms the data by adding the
/// Probability column.</returns>
Copy link
Member

@wschin wschin Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a constant string for the name of a probability column? If yes, it'd be nice to make a reference here (<see cref="">). #Resolved

@@ -122,6 +122,12 @@ SchemaShape IEstimator<CalibratorTransformer<TICalibrator>>.GetOutputSchema(Sche
return new SchemaShape(outColumns.Values);
}

/// <summary>
/// Fits the scored <see cref="IDataView"/> creating a <see cref="CalibratorTransformer{TICalibrator}"/>.
Copy link
Member

@wschin wschin Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Fits the scored <see cref="IDataView"/> creating a <see cref="CalibratorTransformer{TICalibrator}"/>.
/// Fits the scored <see cref="IDataView"/> creating a <see cref="CalibratorTransformer{TICalibrator}"/> which may transform the score column to a probability column.

In addition, do we have specific names for score and probability columns? If yes, we can mention those constants here. #Resolved

[Argument(ArgumentType.AtMostOnce, HelpText = "Tolerance parameter for trainer convergence. Low = slower, more accurate",
ShortName = "ot")]
Name = "OptTol", ShortName = "ot")]
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"ot" [](start = 45, length = 4)

you can enumerate ShortNames via comma "opttol,ot" #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apparently name is for maml, and i set it to the previous name for backwards compatibility.


In reply to: 248405346 [](ancestors = 248405346)

/// <summary>
/// The number of clusters.
/// </summary>
[Argument(ArgumentType.AtMostOnce, HelpText = "The number of clusters", SortOrder = 50, Name = "K")]
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name [](start = 100, length = 4)

ShortName maybe? it's a "K" is quite short, right? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name is for backwards compatibility. This is what maml uses.


In reply to: 248405583 [](ancestors = 248405583)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ShortNames for same thing, and as @najeeb-kazmi discover yesterday, Name is not always work, but ShortName do (or maybe combination of ShortName + Name).


In reply to: 248405874 [](ancestors = 248405874,248405583)

/// Memory budget (in MBs) to use for KMeans acceleration.
/// </summary>
[Argument(ArgumentType.AtMostOnce, HelpText = "Memory budget (in MBs) to use for KMeans acceleration",
Name = "AccelMemBudgetMb", ShortName = "accelMemBudgetMb")]
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name = "AccelMemBudgetMb" [](start = 16, length = 25)

I think we lowercase shortnames and name anyway, so I don't think you need to add this Name parameter #Pending

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep it to clearly documents the previous name? Najeeb has been doing the same on his PR of pluralizing some args names.


In reply to: 248405947 [](ancestors = 248405947)


internal KMeansPlusPlusTrainer(IHostEnvironment env, Arguments args)
: this(env, args, null)
/// <param name="options">The advanced arguments of the algorithm.</param>
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Jan 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// The advanced arguments of the algorithm. [](start = 7, length = 75)

from KMeansCatalog.cs
/// <param name="options">Algorithm advanced settings.</param>
I like mix of arguments, options and settings. Anyone who is favor of one of them get something :) #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inclusive !


In reply to: 248406901 [](ancestors = 248406901)

}

public class Arguments : UnsupervisedLearnerInputBaseWithWeight
public class Options : UnsupervisedLearnerInputBaseWithWeight
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Options [](start = 21, length = 7)

Just an observation.
I found it's funny what we diverge during ITransformer/IEstimator conversion and for what used to be transforms we come up with ColumnInfo and we no longer rely on arguments. (We still have conversion from arguments to columnInfo for EntryPoints)

Eventually half of our entrypoints would become options, and other half remain arguments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point... and a bummer. Does it make sense to have a plan to reconcile?


In reply to: 248412000 [](ancestors = 248412000)

Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@sfilipi sfilipi merged commit b89ce70 into dotnet:master Jan 16, 2019
@sfilipi sfilipi deleted the 1798KMeansHalLearners branch January 16, 2019 20:33
{
FeatureColumn = featuresName,
ClustersCount = clustersCount,
WeightColumn = weightsName != null ? Optional<string>.Explicit(weightsName) : Optional<string>.Implicit(DefaultColumnNames.Weight)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weightsName != null ? Optional.Explicit(weightsName) : Optional.Implicit(DefaultColumnNames.Weight) [](start = 35, length = 115)

do we need to add this ?

@ghost ghost locked as resolved and limited conversation to collaborators Mar 25, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants