KMeans and Implicit weight cleanup #2158

sfilipi · 2019-01-16T00:26:43Z

Towards #1798

Renaming Args -> Options, internalizing ctors, and fixing the issue with the weight column being initialized as Implicit or Explicit.

Ivanidzo4ka · 2019-01-16T00:34:51Z

src/Microsoft.ML.Data/Prediction/CalibratorCatalog.cs

+        /// <summary>
+        /// Fits the scored <see cref="IDataView"/> creating a <see cref="CalibratorTransformer{TICalibrator}"/>.
+        /// </summary>
+        /// <param name="input"></param>


<param name="input"></param> [](start = 12, length = 28)

with "Label" and "Features" columns.
From what I see, we don't actually use the Score column at all; we just run the predictor on top of the feature column to produce the "Score" column. So I'm not sure how necessary it is to have that Score role in roleMappedData. #Pending

see line 262. We do use the score column.

In reply to: 248115187 [](ancestors = 248115187)

That's during the transform stage. During Fit we train the calibrator on the combination of Predictor + FeatureColumn and LabelColumn.
But it's just an observation. You don't have to do anything about that.

In reply to: 248152774 [](ancestors = 248152774,248115187)

But you can still put something into the empty param, though.

In reply to: 248410187 [](ancestors = 248410187,248152774,248115187)

Ivanidzo4ka · 2019-01-16T00:35:21Z

src/Microsoft.ML.Data/Training/TrainerUtils.cs

@@ -6,6 +6,7 @@
 using System.Collections.Generic;
 using Microsoft.ML.Core.Data;
 using Microsoft.ML.Data;
+using Microsoft.ML.EntryPoints;


This doesn't look right. #Closed

This namespace is in Microsoft.ML.Core. There is no circular dependency.

In reply to: 248115282 [](ancestors = 248115282)

It's just strange that we have EntryPoints as a dependency. But I guess it's related to the fact that the Arguments class is shared with entry points.

In reply to: 248152842 [](ancestors = 248152842,248115282)

Ivanidzo4ka · 2019-01-16T00:36:48Z

src/Microsoft.ML.KMeansClustering/KMeansCatalog.cs

+        /// <summary>
+        /// Train a KMeans++ clustering algorithm.
+        /// </summary>
+        /// <param name="ctx">The regression context trainer object.</param>


regression [](start = 34, length = 10)

clustering. #Resolved

Ivanidzo4ka · 2019-01-16T00:39:18Z

test/Microsoft.ML.Tests/OnnxConversionTest.cs

-                    settings.K = 4;
-                    settings.NumThreads = 1;
-                    settings.InitAlgorithm = Trainers.KMeans.KMeansPlusPlusTrainer.InitAlgorithm.Random;
+                    FeatureColumn = "Features",


Features" [](start = 37, length = 9)

DefaultColumnNames.Features #Resolved

You should have told me that I'm wrong and that it should be nameof(BreastCancerFeatureVector.Features), or just "Features" since we use the same string above. My bad.

In reply to: 248115971 [](ancestors = 248115971)

Ivanidzo4ka · 2019-01-16T00:44:34Z

src/Microsoft.ML.KMeansClustering/KMeansCatalog.cs

+        {
+            Contracts.CheckValue(ctx, nameof(ctx));
+            var env = CatalogUtils.GetEnvironment(ctx);
+            return new KMeansPlusPlusTrainer(env, options);


options [](start = 50, length = 7)

Host.CheckValue(options, nameof(options));
You have this in internal KMeansPlusPlusTrainer(IHostEnvironment env, Options options).
If no one specifies options, that check will throw.
Does this work? Do we somehow magically create Options somewhere? #Closed

abgoswam · 2019-01-16T00:48:06Z

src/Microsoft.ML.KMeansClustering/KMeansCatalog.cs

+        /// </summary>
+        /// <param name="ctx">The regression context trainer object.</param>
+        /// <param name="options">Algorithm advanced settings.</param>
+        public static KMeansPlusPlusTrainer KMeans(this ClusteringContext.ClusteringTrainers ctx, KMeansPlusPlusTrainer.Options options = null)


= null [](start = 135, length = 7)

This should be a required parameter, I believe. #Closed

abgoswam · 2019-01-16T00:48:52Z

src/Microsoft.ML.HalLearners/OlsLinearRegression.cs

@@ -106,7 +106,7 @@ internal OlsLinearRegressionTrainer(IHostEnvironment env, Arguments args)
            advancedSettings?.Invoke(args);
            args.FeatureColumn = featureColumn;
            args.LabelColumn = labelColumn;
-            args.WeightColumn = weightColumn;
+            args.WeightColumn = weightColumn != null ? Optional<string>.Explicit(weightColumn) : Optional<string>.Implicit(DefaultColumnNames.Weight);


weightColumn != null [](start = 32, length = 21)

Maybe we should have a separate PR just to fix the weights bug across the entire codebase?

That way we isolate the API changes from the bug related to Weight. It also means less dependency between PRs related to learner API changes.

Then we'd need a separate PR to fix the ctor in KMeans, and this PR takes care of both.

In reply to: 248117752 [](ancestors = 248117752)

My concern is that by coupling the weights changes with the API changes, we make ourselves susceptible to inconsistency across PRs.

It would be much more convenient to have a consistent story for weights, get that reviewed and checked in, and then move on with the API fixes.

So I am not sure at this point whether I need to make similar changes for the weights bug in my existing PRs. I see we are adding overloads for MakeR4ScalarWeightColumn.

In reply to: 248158333 [](ancestors = 248158333,248117752)

Ivanidzo4ka · 2019-01-16T00:58:38Z

src/Microsoft.ML.StaticPipe/KMeansStatic.cs

+            var rec = new TrainerEstimatorReconciler.Clustering(
+            (env, featuresName, weightsName) =>
+            {
+                options.FeatureColumn = featuresName;


options [](start = 16, length = 7)

You need to check whether they are null, and create a new one if they are. #Closed

That's what happens when you are in a hurry to press publish.

In reply to: 248119409 [](ancestors = 248119409)
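For readers following the null-options point above, here is a minimal sketch of the fallback being asked for. `OptionsSketch` and `StaticPipeSketch` are hypothetical stand-ins, not the types in KMeansStatic.cs, and the default values are arbitrary assumptions.

```csharp
using System;

// Hypothetical options type standing in for KMeansPlusPlusTrainer.Options.
public sealed class OptionsSketch
{
    public string FeatureColumn { get; set; } = "Features";
    public int ClustersCount { get; set; } = 5; // arbitrary default for illustration
}

public static class StaticPipeSketch
{
    // Returns a delegate that is called later with the resolved feature column name,
    // similar in spirit to the lambda handed to the reconciler in the diff.
    public static Func<string, OptionsSketch> MakeFactory(OptionsSketch options = null)
    {
        return featuresName =>
        {
            // Fall back to a fresh instance when the caller passed no options,
            // so the assignment below cannot hit a null reference.
            var effective = options ?? new OptionsSketch();
            effective.FeatureColumn = featuresName;
            return effective;
        };
    }
}
```

With this shape, calling `MakeFactory()` without arguments still yields usable options, while a caller-supplied instance is reused and only its feature column is overwritten.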

Ivanidzo4ka · 2019-01-16T01:01:47Z

src/Microsoft.ML.KMeansClustering/KMeansCatalog.cs

+        /// <example>
+        /// <format type="text/markdown">
+        /// <![CDATA[
+        ///  [!code-csharp[SDCA](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/KMeans_example.cs)]


SDCA [](start = 27, length = 4)

hmm #Closed

Ivanidzo4ka · 2019-01-16T01:02:46Z

src/Microsoft.ML.KMeansClustering/KMeansCatalog.cs

+        /// <example>
+        /// <format type="text/markdown">
+        /// <![CDATA[
+        ///  [!code-csharp[SDCA](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/KMeans_example.cs)]


KMeans_example [](start = 93, length = 14)

The name of the file is actually KMeans.cs. #Closed

abgoswam · 2019-01-16T01:13:05Z

src/Microsoft.ML.StaticPipe/KMeansStatic.cs

+        /// <returns>The predicted output.</returns>
+        public static (Vector<float> score, Key<uint> predictedLabel) KMeans(this ClusteringContext.ClusteringTrainers ctx,
+           Vector<float> features, Scalar<float> weights = null,
+           KMeansPlusPlusTrainer.Options options = null,


The API pattern I am following is that both weights and options are required parameters for the Options API

(based on Tom's example in the issue description). #Pending

I have a different question: why do we want the features and the weight in this second extension? Can we just have it take options, and provide everything in there?
If we provide the weights and the features in two places, then we have to explain which overrides which. Can this second extension take just the Options?

@TomFinley @glebuk

In reply to: 248121755 [](ancestors = 248121755)

Actually, quoting from: #1798

"It is good to have the convenience for the simple arguments, however, if we have both simple and advanced settings, we should not mix them but have instead two distinct constructors/extension methods. (E.g., in the above, we would have two methods, one that took the advanced options.) To do otherwise is to invite confusion about which "wins" if we have the setting set in both.

Note that phase setting "set in both," which suggests that these settings object should retain the "simpler" settings in them. This reinforces feedback elsewhere as seen here."

Seems like the second extension should not have the label, features, weights in the signature; but just Options.

In reply to: 248157402 [](ancestors = 248157402,248121755)
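A minimal sketch of the "two distinct extension methods" shape quoted from #1798, assuming hypothetical stand-in types (`KMeansOptionsSketch`, `KMeansTrainerSketch`, `ClusteringTrainersSketch`) rather than the real ML.NET ones. The parameter names and defaults are illustrative only.

```csharp
using System;

// Hypothetical options bag; the real type is KMeansPlusPlusTrainer.Options.
public sealed class KMeansOptionsSketch
{
    public string FeatureColumn { get; set; } = "Features";
    public string WeightColumn { get; set; } // null means "no weights"
    public int ClustersCount { get; set; } = 5;
}

// Hypothetical trainer that only ever sees an options object.
public sealed class KMeansTrainerSketch
{
    public KMeansOptionsSketch Options { get; }

    public KMeansTrainerSketch(KMeansOptionsSketch options)
    {
        Options = options ?? throw new ArgumentNullException(nameof(options));
    }
}

// Hypothetical stand-in for ClusteringContext.ClusteringTrainers.
public sealed class ClusteringTrainersSketch { }

public static class KMeansCatalogSketch
{
    // Simple overload: common settings only, no options object in the signature.
    public static KMeansTrainerSketch KMeans(this ClusteringTrainersSketch ctx,
        string featureColumn = "Features", string weights = null, int clustersCount = 5)
        => new KMeansTrainerSketch(new KMeansOptionsSketch
        {
            FeatureColumn = featureColumn,
            WeightColumn = weights,
            ClustersCount = clustersCount
        });

    // Advanced overload: options is required and is the only source of settings,
    // so there is no question of which value "wins".
    public static KMeansTrainerSketch KMeans(this ClusteringTrainersSketch ctx,
        KMeansOptionsSketch options)
        => new KMeansTrainerSketch(options);
}
```

Because the advanced overload carries the column names inside the options object, a setting can never be supplied in two places within one call.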

justinormont · 2019-01-16T01:48:58Z

src/Microsoft.ML.FastTree/FastTree.cs

@@ -151,7 +151,7 @@ public abstract class FastTreeTrainerBase<TArgs, TTransformer, TModel> :
        /// Constructor that is used when invoking the classes deriving from this, through maml.
        /// </summary>
        private protected FastTreeTrainerBase(IHostEnvironment env, TArgs args, SchemaShape.Column label)
-            : base(Contracts.CheckRef(env, nameof(env)).Register(RegisterName), TrainerUtils.MakeR4VecFeature(args.FeatureColumn), label, TrainerUtils.MakeR4ScalarWeightColumn(args.WeightColumn, args.WeightColumn.IsExplicit))


What is/was args.WeightColumn.IsExplicit used for? #Resolved

It is something that got introduced with EntryPoints. If the user doesn't specify a name for the optional columns, an Implicit argument gets created for them. I'm not sure of the original reasoning for not just leaving it null; maybe it was to avoid dealing with its serialization across languages.

In reply to: 248127604 [](ancestors = 248127604)
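To make the Explicit/Implicit distinction concrete, here is a rough sketch. `OptionalSketch<T>` is a hypothetical stand-in written for illustration, not ML.NET's `Optional<T>`, and the `"Weight"` default is an assumed placeholder for `DefaultColumnNames.Weight`.

```csharp
using System;

// Hypothetical stand-in for the Optional<T> type referenced in the diff.
public sealed class OptionalSketch<T>
{
    public T Value { get; }

    // True when the caller supplied the value; false when it is only a default.
    public bool IsExplicit { get; }

    private OptionalSketch(T value, bool isExplicit)
    {
        Value = value;
        IsExplicit = isExplicit;
    }

    public static OptionalSketch<T> Explicit(T value) => new OptionalSketch<T>(value, true);
    public static OptionalSketch<T> Implicit(T value) => new OptionalSketch<T>(value, false);
}

public static class WeightColumnSketch
{
    // Assumed default name, standing in for DefaultColumnNames.Weight.
    public const string DefaultWeight = "Weight";

    // Mirrors the ternary in the diff: a null name becomes an implicit default,
    // while a supplied name is marked as explicit.
    public static OptionalSketch<string> FromName(string weightColumn) =>
        weightColumn != null
            ? OptionalSketch<string>.Explicit(weightColumn)
            : OptionalSketch<string>.Implicit(DefaultWeight);
}
```

A consumer can then check `IsExplicit` to decide whether a missing column is an error (the user asked for it) or can be silently ignored (it was only a default).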

abgoswam · 2019-01-16T17:26:55Z

test/Microsoft.ML.Tests/TrainerEstimators/TrainerEstimators.cs

@@ -71,8 +71,11 @@ public void KMeansEstimator()


            // Pipeline.
-            var pipeline = new KMeansPlusPlusTrainer(Env, featureColumn, weights: weights,
-                            advancedSettings: s => { s.InitAlgorithm = KMeansPlusPlusTrainer.InitAlgorithm.KMeansParallel; });
+            var pipeline = new KMeansPlusPlusTrainer(Env, new KMeansPlusPlusTrainer.Options {


KMeansPlusPlusTrainer [](start = 31, length = 21)

I am surprised... is this invoking the constructor? #Resolved

Nothing wrong with it; the extensions return the same object too, an IEstimator.

In reply to: 248377514 [](ancestors = 248377514)

wschin · 2019-01-16T17:42:50Z

src/Microsoft.ML.KMeansClustering/KMeansCatalog.cs

-        /// <param name="ctx">The regression context trainer object.</param>
-        /// <param name="features">The features, or independent variables.</param>
+        /// <param name="ctx">The clustering context trainer object.</param>
+        /// <param name="featureColumn">The features, or independent variables.</param>


Maybe featureColumnName? In ML.NET, we mix the meanings of column name, column index, and column itself. I feel our naming can be more specific and therefore less ambiguous. #Pending

I will leave it as featureColumn, because we have had a loooonnnggg discussion about it, and a PR to standardize those.

#1524 (comment)

In reply to: 248383325 [](ancestors = 248383325)

wschin · 2019-01-16T17:46:29Z

src/Microsoft.ML.Data/Prediction/CalibratorCatalog.cs

+        /// </summary>
+        /// <param name="input"></param>
+        /// <returns>A trained <see cref="CalibratorTransformer{TICalibrator}"/> that will transforms the data by adding the
+        /// Probability column.</returns>


Do we have a constant string for the name of a probability column? If yes, it'd be nice to make a reference here (<see cref="">). #Resolved

wschin · 2019-01-16T17:48:07Z

src/Microsoft.ML.Data/Prediction/CalibratorCatalog.cs

@@ -122,6 +122,12 @@ SchemaShape IEstimator<CalibratorTransformer<TICalibrator>>.GetOutputSchema(Sche
            return new SchemaShape(outColumns.Values);
        }

+        /// <summary>
+        /// Fits the scored <see cref="IDataView"/> creating a <see cref="CalibratorTransformer{TICalibrator}"/>.


Suggested change

/// Fits the scored <see cref="IDataView"/> creating a <see cref="CalibratorTransformer{TICalibrator}"/>.

/// Fits the scored <see cref="IDataView"/> creating a <see cref="CalibratorTransformer{TICalibrator}"/> which may transform the score column to a probability column.

In addition, do we have specific names for score and probability columns? If yes, we can mention those constants here. #Resolved

Ivanidzo4ka · 2019-01-16T18:45:02Z

src/Microsoft.ML.KMeansClustering/KMeansPlusPlusTrainer.cs

            [Argument(ArgumentType.AtMostOnce, HelpText = "Tolerance parameter for trainer convergence. Low = slower, more accurate",
-                ShortName = "ot")]
+                Name = "OptTol", ShortName = "ot")]


"ot" [](start = 45, length = 4)

You can list several ShortNames separated by commas: "opttol,ot". #Resolved

Apparently Name is for maml, and I set it to the previous name for backwards compatibility.

In reply to: 248405346 [](ancestors = 248405346)

Ivanidzo4ka · 2019-01-16T18:45:41Z

src/Microsoft.ML.KMeansClustering/KMeansPlusPlusTrainer.cs

+            /// <summary>
+            /// The number of clusters.
+            /// </summary>
+            [Argument(ArgumentType.AtMostOnce, HelpText = "The number of clusters", SortOrder = 50, Name = "K")]


Name [](start = 100, length = 4)

ShortName maybe? "K" is quite short, right? #Resolved

Name is for backwards compatibility. This is what maml uses.

In reply to: 248405583 [](ancestors = 248405583)

ShortNames are for the same thing, and as @najeeb-kazmi discovered yesterday, Name does not always work, but ShortName does (or maybe the combination of ShortName + Name).

In reply to: 248405874 [](ancestors = 248405874,248405583)

Ivanidzo4ka · 2019-01-16T18:46:47Z

src/Microsoft.ML.KMeansClustering/KMeansPlusPlusTrainer.cs

+            /// Memory budget (in MBs) to use for KMeans acceleration.
+            /// </summary>
+            [Argument(ArgumentType.AtMostOnce, HelpText = "Memory budget (in MBs) to use for KMeans acceleration",
+                Name = "AccelMemBudgetMb", ShortName = "accelMemBudgetMb")]


Name = "AccelMemBudgetMb" [](start = 16, length = 25)

I think we lowercase shortnames and name anyway, so I don't think you need to add this Name parameter #Pending

Let's keep it to clearly document the previous name? Najeeb has been doing the same on his PR pluralizing some args names.

In reply to: 248405947 [](ancestors = 248405947)

Ivanidzo4ka · 2019-01-16T18:49:30Z

src/Microsoft.ML.KMeansClustering/KMeansPlusPlusTrainer.cs

-
-        internal KMeansPlusPlusTrainer(IHostEnvironment env, Arguments args)
-            : this(env, args, null)
+        /// <param name="options">The advanced arguments of the algorithm.</param>


/// <param name="options">The advanced arguments of the algorithm.</param> [](start = 7, length = 75)

from KMeansCatalog.cs
/// <param name="options">Algorithm advanced settings.</param>
I like the mix of arguments, options and settings. Anyone who is in favor of one of them gets something :) #Resolved

inclusive !

In reply to: 248406901 [](ancestors = 248406901)

Ivanidzo4ka · 2019-01-16T19:04:07Z

src/Microsoft.ML.KMeansClustering/KMeansPlusPlusTrainer.cs

        }

-        public class Arguments : UnsupervisedLearnerInputBaseWithWeight
+        public class Options : UnsupervisedLearnerInputBaseWithWeight


Options [](start = 21, length = 7)

Just an observation.
I find it funny that we diverged during the ITransformer/IEstimator conversion: for what used to be transforms we came up with ColumnInfo and no longer rely on arguments. (We still have a conversion from arguments to ColumnInfo for EntryPoints.)

Eventually half of our entry points will use options, and the other half will remain arguments.

good point... and a bummer. Does it make sense to have a plan to reconcile?

In reply to: 248412000 [](ancestors = 248412000)


abgoswam · 2019-01-16T21:23:20Z

src/Microsoft.ML.StaticPipe/KMeansStatic.cs

+                {
+                    FeatureColumn = featuresName,
+                    ClustersCount = clustersCount,
+                    WeightColumn = weightsName != null ? Optional<string>.Explicit(weightsName) : Optional<string>.Implicit(DefaultColumnNames.Weight)


weightsName != null ? Optional<string>.Explicit(weightsName) : Optional<string>.Implicit(DefaultColumnNames.Weight) [](start = 35, length = 115)

Do we need to add this?

KMeans and Implicit weight cleanup

50db1a5

sfilipi requested review from TomFinley, wschin, glebuk, abgoswam, Ivanidzo4ka and yaeldekel January 16, 2019 00:26

Ivanidzo4ka reviewed Jan 16, 2019


abgoswam reviewed Jan 16, 2019


Ivanidzo4ka reviewed Jan 16, 2019


abgoswam reviewed Jan 16, 2019


justinormont reviewed Jan 16, 2019


sfilipi added 2 commits January 15, 2019 23:47

addressing PR comments and fixing failing tests

a90f1c6

small fix

6c76165

dotnet deleted a comment from abgoswam Jan 16, 2019

abgoswam reviewed Jan 16, 2019


wschin reviewed Jan 16, 2019


Ivanidzo4ka reviewed Jan 16, 2019


Ivanidzo4ka approved these changes Jan 16, 2019


wordsmithing per request.

87149c6

abgoswam approved these changes Jan 16, 2019


sfilipi merged commit b89ce70 into dotnet:master Jan 16, 2019

sfilipi deleted the 1798KMeansHalLearners branch January 16, 2019 20:33

abgoswam reviewed Jan 16, 2019


ghost locked as resolved and limited conversation to collaborators Mar 25, 2022
