Skip to content

Scrub Fast Tree Family #2753

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Mar 5, 2019
Merged

Scrub Fast Tree Family #2753

merged 12 commits into from
Mar 5, 2019

Conversation

wschin
Copy link
Member

@wschin wschin commented Feb 27, 2019

This PR fixes #2617 (also fixes #2375, fixes #2857) and is also a part of #2613.

Cleaned items:

  • FastTree.cs
  • FastTreeArguments.cs
  • Derived fast tree classes
  • Gam family

@wschin wschin added the API Issues pertaining the friendly API label Feb 27, 2019
@wschin wschin self-assigned this Feb 27, 2019
// Use the path object to pass to GetLeaf, which will populate path with the IDs of th nodes from root to leaf.
List<int> path = default;
// Get the ID of the leaf this example ends up in tree 0.
var leafID = modelParams.GetLeaf(0, in testRow, ref path);
Copy link
Member Author

@wschin wschin Feb 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed because our public APIs doesn't provide those functions. #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But you can still traverse through trees with new structure, right?


In reply to: 260562343 [](ancestors = 260562343)

Copy link
Member Author

@wschin wschin Feb 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we cannot but this definitely should be hidden here. @[email protected], do you think this feature is required in tree's public APIs?


In reply to: 261011054 [](ancestors = 261011054,260562343)

Copy link
Contributor

@TomFinley TomFinley Mar 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the trees should enable someone to explain what the tree is. I wouldn't make evaluation code part of the API. Any more than we insist that linear model parameters expose "do dot product" as a method on the model parameters. Both things are potentially useful of course, but we don't need it as part of our public API (at least at first).


In reply to: 261417026 [](ancestors = 261417026,261011054,260562343)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I feel evaluation module should be independent to the expression of the trained objects. Thanks you!


In reply to: 261734289 [](ancestors = 261734289,261417026,261011054,260562343)

[Argument(ArgumentType.LastOccurenceWins, HelpText = "The seed of the random number generator", ShortName = "r1")]
public int RngSeed = 123;
[Argument(ArgumentType.LastOccurenceWins, HelpText = "The seed of the random number generator", ShortName = "r1,RngSeed")]
public int RandomSeed = 123;
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RandomSeed [](start = 19, length = 10)

I think everywhere else we call it just seed. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.


In reply to: 260562756 [](ancestors = 260562756)

[TGUI(NotGui = true)]
public int FeatureSelectSeed = 123;
public int FeatureSelectionRandomSeed = 123;
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FeatureSelectionRandomSeed [](start = 19, length = 26)

Do we really need two of them? #Resolved

Copy link
Contributor

@TomFinley TomFinley Feb 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kind of. There are two degrees of control that internal people wanted to independently vary or not vary: selection of features and selection of data. I have not kept touch with this set of people over the years but the reasoning at one point was definitely considered important. Let us not change it. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes sir.


In reply to: 260863640 [](ancestors = 260863640)

[Argument(ArgumentType.LastOccurenceWins, HelpText = "Minimum categorical docs percentage in a bin to consider for a split.", ShortName = "mdop")]
public double MinDocsPercentageForCategoricalSplit = 0.001;
[Argument(ArgumentType.LastOccurenceWins, HelpText = "Minimum categorical docs percentage in a bin to consider for a split.", ShortName = "mdop,MinDocsPercentageForCategoricalSplit")]
public double MinExamplePercentageForCategoricalSplit = 0.001;
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Min [](start = 22, length = 3)

Minimum #Resolved

[Argument(ArgumentType.LastOccurenceWins, HelpText = "Minimum categorical docs percentage in a bin to consider for a split.", ShortName = "mdop")]
public double MinDocsPercentageForCategoricalSplit = 0.001;
[Argument(ArgumentType.LastOccurenceWins, HelpText = "Minimum categorical docs percentage in a bin to consider for a split.", ShortName = "mdop,MinDocsPercentageForCategoricalSplit")]
public double MinExamplePercentageForCategoricalSplit = 0.001;
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.001 [](start = 64, length = 5)

is it 0.001% or 0.1% ? #Resolved

Copy link
Member Author

@wschin wschin Feb 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.1%. Docs are updated to all Percentage and Fraction arguments.


In reply to: 260563129 [](ancestors = 260563129)

/// <summary>
/// Copy the weights of all training features to <paramref name="weights"/>.
/// </summary>
public void GetFeatureWeights(ref VBuffer<float> weights)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this should return IReadOnlyList<float>.

[Argument(ArgumentType.LastOccurenceWins, HelpText = "Maximum number of distinct values (bins) per feature", ShortName = "mb")]
public int MaxBins = 255; // save one for undefs
[Argument(ArgumentType.LastOccurenceWins, HelpText = "Maximum number of distinct values (bins) per feature", ShortName = "mb,MaxBins")]
public int MaxBinCountPerFeature = 255; // save one for undefs
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Max [](start = 19, length = 3)

Maximum #Resolved

@Ivanidzo4ka
Copy link
Contributor

Ivanidzo4ka commented Feb 27, 2019

    /// Print execution time breakdown to stdout.

are we really print into stdout or we use our logging mechanism? #Resolved


Refers to: src/Microsoft.ML.FastTree/FastTreeArguments.cs:301 in 3257470. [](commit_id = 3257470, deletion_comment = False)

[Argument(ArgumentType.AtMostOnce, HelpText = "The fraction of features (chosen randomly) to use on each split", ShortName = "sf")]
public Double SplitFraction = 1;
[Argument(ArgumentType.AtMostOnce, HelpText = "The fraction of features (chosen randomly) to use on each split", ShortName = "sf,SplitFraction")]
public Double FeatureFractionPerSplit = 1;

/// <summary>
/// Smoothing paramter for tree regularization.
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Feb 27, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

paramter [](start = 22, length = 8)

parameter #Resolved

@Ivanidzo4ka
Copy link
Contributor

Ivanidzo4ka commented Feb 27, 2019

    public bool PrintTrainValidGraph;

is this remnants of TLC TV ?
Do we need to expose them? #Resolved


Refers to: src/Microsoft.ML.FastTree/FastTreeArguments.cs:410 in d64a4ae. [](commit_id = d64a4ae, deletion_comment = False)

@TomFinley
Copy link
Contributor

TomFinley commented Feb 27, 2019

    public bool PrintTrainValidGraph;

is this remnants of TLC TV ?
Do we need to expose them?

Refers to: src/Microsoft.ML.FastTree/FastTreeArguments.cs:410 in d64a4ae. [](commit_id = d64a4ae, deletion_comment = False)

It has nothing to do with TLC TV. It is however important for command line applications, less so for API. Let us make internal for now. #Resolved

@Ivanidzo4ka
Copy link
Contributor

Ivanidzo4ka commented Feb 28, 2019

    public RegressionTreeEnsemble TrainedTreeEnsemble { get; }

I've looked into RegressionTree.cs and some things like:
NumLeaves, NumNodes, GtChild, LteChild doesn't sound like great names
Shall we address that in this PR as well? #Resolved


Refers to: src/Microsoft.ML.FastTree/FastTree.cs:3393 in 3257470. [](commit_id = 3257470, deletion_comment = False)

@wschin
Copy link
Member Author

wschin commented Feb 28, 2019

    public RegressionTreeEnsemble TrainedTreeEnsemble { get; }

Fixed as well. Feel strange if they are in two PRs.


In reply to: 468095313 [](ancestors = 468095313)


Refers to: src/Microsoft.ML.FastTree/FastTree.cs:3393 in 3257470. [](commit_id = 3257470, deletion_comment = False)

@wschin
Copy link
Member Author

wschin commented Feb 28, 2019

    public bool PrintTrainValidGraph;

I will do internal-best-friend.


In reply to: 467689803 [](ancestors = 467689803)


Refers to: src/Microsoft.ML.FastTree/FastTreeArguments.cs:410 in d64a4ae. [](commit_id = d64a4ae, deletion_comment = False)

@wschin
Copy link
Member Author

wschin commented Feb 28, 2019

Sure.


In reply to: 467958748 [](ancestors = 467958748)

@wschin
Copy link
Member Author

wschin commented Feb 28, 2019

    /// Print execution time breakdown to stdout.

We use ch defined in

           using (var ch = Host.Start("Training"))
            {
                ch.CheckValue(trainData, nameof(trainData));
                trainData.CheckBinaryLabel();
                trainData.CheckFeatureFloatVector();
                trainData.CheckOptFloatWeight();
                FeatureCount = trainData.Schema.Feature.Value.Type.GetValueCount();
                ConvertData(trainData);
                TrainCore(ch);
            }

In reply to: 467689294 [](ancestors = 467689294)


Refers to: src/Microsoft.ML.FastTree/FastTreeArguments.cs:301 in d64a4ae. [](commit_id = d64a4ae, deletion_comment = False)

@wschin wschin changed the title [WIP] Scrub Fast Tree Family Scrub Fast Tree Family Feb 28, 2019
@@ -14,7 +14,6 @@
using Microsoft.ML.CommandLine;
using Microsoft.ML.Data;
using Microsoft.ML.EntryPoints;
using Microsoft.ML.Internal.Internallearn;
Copy link
Member

@singlis singlis Mar 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you merged with master? I thought that InternalLearn was removed in a prior change. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, maybe rebase did something.


In reply to: 261456781 [](ancestors = 261456781)

@singlis
Copy link
Member

singlis commented Mar 1, 2019

            bestIteration = Ensemble.NumTrees;

Changing this to NumberOfTrees is probably out of scope for this PR right? #Resolved


Refers to: src/Microsoft.ML.FastTree/BoostingFastTree.cs:143 in 3257470. [](commit_id = 3257470, deletion_comment = False)

@wschin
Copy link
Member Author

wschin commented Mar 1, 2019

            bestIteration = Ensemble.NumTrees;

Let me double-check. Thanks!


In reply to: 468512482 [](ancestors = 468512482)


Refers to: src/Microsoft.ML.FastTree/BoostingFastTree.cs:143 in 3257470. [](commit_id = 3257470, deletion_comment = False)

@singlis
Copy link
Member

singlis commented Mar 1, 2019

        string featureColumn,

suggestion:
I know this constructor is not exposed publicly, but I renamed these parameters (featureColumn->featureColumnName, etc) in lightgbm so that they aligned with the parameters that are used in the catalog function. And technically they will align with the parameters in the Options class. #Resolved


Refers to: src/Microsoft.ML.FastTree/FastTree.cs:104 in 3257470. [](commit_id = 3257470, deletion_comment = False)

@@ -3230,6 +3231,9 @@ private void ToCSharp(InternalRegressionTree tree, TextWriter writer, int node,
}
}

/// <summary>
/// Copy the weights of all training features to <paramref name="weights"/>.
/// </summary>
Copy link
Member

@singlis singlis Mar 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

</ [](start = 12, length = 2)

Missing pararm for weights #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I will add

        /// <param name="weights">a <see cref="VBuffer{T}"/> where feature weights would be assigned to.
        /// The i-th element in <paramref name="weights"/> stores the weight of the i-th feature.</param>

In reply to: 261459910 [](ancestors = 261459910)

int numTrees = Defaults.NumTrees,
int minDatapointsInLeaves = Defaults.MinDocumentsInLeaves,
double learningRate = Defaults.LearningRates)
int numLeaves = Defaults.NumberOfLeaves,
Copy link
Member

@singlis singlis Mar 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numLeaves [](start = 16, length = 9)

same suggestion here to use the full names for the arguments. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.


In reply to: 261461521 [](ancestors = 261461521)

@singlis
Copy link
Member

singlis commented Mar 1, 2019

    internal BinaryClassificationGamTrainer(IHostEnvironment env, Options options)

Should this be renamed to GamBinaryClassificationTrainer? I know there is an issue that is tracking consistent naming with our algorithms, so probably not worth doing in this pr.
#WontFix


Refers to: src/Microsoft.ML.FastTree/GamClassification.cs:52 in 3257470. [](commit_id = 3257470, deletion_comment = False)

@singlis
Copy link
Member

singlis commented Mar 1, 2019

        /// wrong, this method will throw.

I dont think this comment is right - this doesnt look like it will throw. #Resolved


Refers to: src/Microsoft.ML.FastTree/GamModelParameters.cs:865 in 3257470. [](commit_id = 3257470, deletion_comment = False)

int numTrees = Defaults.NumTrees,
int minDatapointsInLeaves = Defaults.MinDocumentsInLeaves,
double learningRate = Defaults.LearningRates,
int numLeaves = Defaults.NumberOfLeaves,
Copy link
Member

@singlis singlis Mar 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numLeaves [](start = 16, length = 9)

Shouldnt these parameters also be fully named? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I didn't realize they are FastTree static APIs.


In reply to: 261464899 [](ancestors = 261464899)

@@ -63,14 +63,14 @@ public abstract class RegressionTreeBase
/// bitwise complement operator in C#; for details, see
/// https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/operators/bitwise-complement-operator.
/// </summary>
public IReadOnlyList<int> LteChild => _lteChild;
public IReadOnlyList<int> LessThanOrEqualToThresholdChildren => _lteChild;
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Mar 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LessThanOrEqualToThresholdChildren [](start = 34, length = 34)

It looks like across lightgbm, catboost and whatever they use in scikit this one called: leftChildrends, or leftChild, another one is rightChild. #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do leftChild for this.


In reply to: 261814959 [](ancestors = 261814959)

@wschin
Copy link
Member Author

wschin commented Mar 4, 2019

        public int LambdaMartMaxTruncation = 100;

I think so. Can we remove "k" in top-k accuracy? No, we can't.


In reply to: 468890315 [](ancestors = 468890315)


Refers to: src/Microsoft.ML.FastTree/FastTreeArguments.cs:93 in ee67d50. [](commit_id = ee67d50, deletion_comment = False)

@wschin
Copy link
Member Author

wschin commented Mar 4, 2019

        // REVIEW: Hiding sorting for now. Should be an enum or component factory.

Will make it internal.


In reply to: 468890113 [](ancestors = 468890113)


Refers to: src/Microsoft.ML.FastTree/FastTreeArguments.cs:83 in ee67d50. [](commit_id = ee67d50, deletion_comment = False)

@wschin
Copy link
Member Author

wschin commented Mar 4, 2019

        public bool DistanceWeight2;

As suggested by @[email protected], this will be internal.


In reply to: 468890167 [](ancestors = 468890167)


Refers to: src/Microsoft.ML.FastTree/FastTreeArguments.cs:105 in ee67d50. [](commit_id = ee67d50, deletion_comment = False)

@wschin
Copy link
Member Author

wschin commented Mar 4, 2019

        public string CustomGains = "0,3,7,15,31";

Will be included in the next iteration. (iter 9)


In reply to: 468890070 [](ancestors = 468890070)


Refers to: src/Microsoft.ML.FastTree/FastTreeArguments.cs:78 in ee67d50. [](commit_id = ee67d50, deletion_comment = False)

@@ -129,12 +273,12 @@ internal override void Check(IExceptionContext ectx)
#if OLD_DATALOAD
ectx.CheckUserArg(0 <= secondaryMetricShare && secondaryMetricShare <= 1, "secondaryMetricShare", "secondaryMetricShare must be between 0 and 1.");
#endif
ectx.CheckUserArg(0 < LambdaMartMaxTruncation, nameof(LambdaMartMaxTruncation), "lambdaMartMaxTruncation must be positive.");
ectx.CheckUserArg(0 < NdcgTruncationLevel, nameof(NdcgTruncationLevel), "lambdaMartMaxTruncation must be positive.");
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Mar 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lambdaMartMaxTruncation [](start = 89, length = 23)

no need for this one. you already use nameof #Resolved

/// <summary>
/// Early stopping metrics.
/// </summary>
public EarlyStoppingMetric EarlyStoppingMetric
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Mar 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EarlyStoppingMetric [](start = 19, length = 19)

From comments for previous option it looks like it should be applicable only for regression and ranking problem. Does it actually works for binary classification? #Pending

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't exactly know but conceptually yes. It can be used to measure gradient norm.


In reply to: 262295829 [](ancestors = 262295829)

@Ivanidzo4ka
Copy link
Contributor

Ivanidzo4ka commented Mar 5, 2019

    public int EarlyStoppingMetrics;

internal. otherwise i would see two places in which I can set this up. this one and EarlyStoppingMetric #Resolved


Refers to: src/Microsoft.ML.FastTree/FastTreeArguments.cs:633 in 157aedc. [](commit_id = 157aedc, deletion_comment = False)

@Ivanidzo4ka
Copy link
Contributor

Ivanidzo4ka commented Mar 5, 2019

    public string BaselineScoresFormula;

make this one internal and next two below it.
In order to work with FreeForms we need to ship "NeuralTreeEvaluator.dll" which we can't expose. #Resolved


Refers to: src/Microsoft.ML.FastTree/FastTreeArguments.cs:737 in 157aedc. [](commit_id = 157aedc, deletion_comment = False)

@@ -9096,18 +9057,6 @@
"IsNullable": false,
"Default": false
},
{
"Name": "MaxTreesAfterCompression",
"Type": "Int",
Copy link
Member

@ganik ganik Mar 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this got deleted? #Resolved

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not used at all.


In reply to: 262674075 [](ancestors = 262674075)

Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@wschin wschin merged commit 95019e5 into dotnet:master Mar 5, 2019
@wschin wschin deleted the scrub-tree branch March 5, 2019 22:18
@ghost ghost locked as resolved and limited conversation to collaborators Mar 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
API Issues pertaining the friendly API
Projects
None yet
Development

Successfully merging this pull request may close these issues.

GAM parameter names are non-standard Scrubbing FastTree learners Parameter name Consistency in FastTree
5 participants