-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Scrub Fast Tree Family #2753
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Scrub Fast Tree Family #2753
Conversation
// Use the path object to pass to GetLeaf, which will populate path with the IDs of th nodes from root to leaf. | ||
List<int> path = default; | ||
// Get the ID of the leaf this example ends up in tree 0. | ||
var leafID = modelParams.GetLeaf(0, in testRow, ref path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed because our public APIs doesn't provide those functions. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But you can still traverse through trees with new structure, right?
In reply to: 260562343 [](ancestors = 260562343)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we cannot but this definitely should be hidden here. @[email protected], do you think this feature is required in tree's public APIs?
In reply to: 261011054 [](ancestors = 261011054,260562343)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the trees should enable someone to explain what the tree is. I wouldn't make evaluation code part of the API. Any more than we insist that linear model parameters expose "do dot product" as a method on the model parameters. Both things are potentially useful of course, but we don't need it as part of our public API (at least at first).
In reply to: 261417026 [](ancestors = 261417026,261011054,260562343)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. I feel evaluation module should be independent to the expression of the trained objects. Thanks you!
In reply to: 261734289 [](ancestors = 261734289,261417026,261011054,260562343)
[Argument(ArgumentType.LastOccurenceWins, HelpText = "The seed of the random number generator", ShortName = "r1")] | ||
public int RngSeed = 123; | ||
[Argument(ArgumentType.LastOccurenceWins, HelpText = "The seed of the random number generator", ShortName = "r1,RngSeed")] | ||
public int RandomSeed = 123; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RandomSeed [](start = 19, length = 10)
I think everywhere else we call it just seed
. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[TGUI(NotGui = true)] | ||
public int FeatureSelectSeed = 123; | ||
public int FeatureSelectionRandomSeed = 123; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FeatureSelectionRandomSeed [](start = 19, length = 26)
Do we really need two of them? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kind of. There are two degrees of control that internal people wanted to independently vary or not vary: selection of features and selection of data. I have not kept touch with this set of people over the years but the reasoning at one point was definitely considered important. Let us not change it. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Argument(ArgumentType.LastOccurenceWins, HelpText = "Minimum categorical docs percentage in a bin to consider for a split.", ShortName = "mdop")] | ||
public double MinDocsPercentageForCategoricalSplit = 0.001; | ||
[Argument(ArgumentType.LastOccurenceWins, HelpText = "Minimum categorical docs percentage in a bin to consider for a split.", ShortName = "mdop,MinDocsPercentageForCategoricalSplit")] | ||
public double MinExamplePercentageForCategoricalSplit = 0.001; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Min [](start = 22, length = 3)
Minimum #Resolved
[Argument(ArgumentType.LastOccurenceWins, HelpText = "Minimum categorical docs percentage in a bin to consider for a split.", ShortName = "mdop")] | ||
public double MinDocsPercentageForCategoricalSplit = 0.001; | ||
[Argument(ArgumentType.LastOccurenceWins, HelpText = "Minimum categorical docs percentage in a bin to consider for a split.", ShortName = "mdop,MinDocsPercentageForCategoricalSplit")] | ||
public double MinExamplePercentageForCategoricalSplit = 0.001; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0.001 [](start = 64, length = 5)
is it 0.001% or 0.1% ? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0.1%. Docs are updated to all Percentage and Fraction arguments.
In reply to: 260563129 [](ancestors = 260563129)
/// <summary> | ||
/// Copy the weights of all training features to <paramref name="weights"/>. | ||
/// </summary> | ||
public void GetFeatureWeights(ref VBuffer<float> weights) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this should return IReadOnlyList<float>
.
[Argument(ArgumentType.LastOccurenceWins, HelpText = "Maximum number of distinct values (bins) per feature", ShortName = "mb")] | ||
public int MaxBins = 255; // save one for undefs | ||
[Argument(ArgumentType.LastOccurenceWins, HelpText = "Maximum number of distinct values (bins) per feature", ShortName = "mb,MaxBins")] | ||
public int MaxBinCountPerFeature = 255; // save one for undefs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Max [](start = 19, length = 3)
Maximum #Resolved
[Argument(ArgumentType.AtMostOnce, HelpText = "The fraction of features (chosen randomly) to use on each split", ShortName = "sf")] | ||
public Double SplitFraction = 1; | ||
[Argument(ArgumentType.AtMostOnce, HelpText = "The fraction of features (chosen randomly) to use on each split", ShortName = "sf,SplitFraction")] | ||
public Double FeatureFractionPerSplit = 1; | ||
|
||
/// <summary> | ||
/// Smoothing paramter for tree regularization. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
paramter [](start = 22, length = 8)
parameter #Resolved
It has nothing to do with TLC TV. It is however important for command line applications, less so for API. Let us make internal for now. #Resolved |
I've looked into RegressionTree.cs and some things like: Refers to: src/Microsoft.ML.FastTree/FastTree.cs:3393 in 3257470. [](commit_id = 3257470, deletion_comment = False) |
Sure. In reply to: 467958748 [](ancestors = 467958748) |
We use using (var ch = Host.Start("Training"))
{
ch.CheckValue(trainData, nameof(trainData));
trainData.CheckBinaryLabel();
trainData.CheckFeatureFloatVector();
trainData.CheckOptFloatWeight();
FeatureCount = trainData.Schema.Feature.Value.Type.GetValueCount();
ConvertData(trainData);
TrainCore(ch);
} In reply to: 467689294 [](ancestors = 467689294) Refers to: src/Microsoft.ML.FastTree/FastTreeArguments.cs:301 in d64a4ae. [](commit_id = d64a4ae, deletion_comment = False) |
@@ -14,7 +14,6 @@ | |||
using Microsoft.ML.CommandLine; | |||
using Microsoft.ML.Data; | |||
using Microsoft.ML.EntryPoints; | |||
using Microsoft.ML.Internal.Internallearn; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you merged with master? I thought that InternalLearn was removed in a prior change. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Refers to: src/Microsoft.ML.FastTree/FastTree.cs:104 in 3257470. [](commit_id = 3257470, deletion_comment = False) |
@@ -3230,6 +3231,9 @@ private void ToCSharp(InternalRegressionTree tree, TextWriter writer, int node, | |||
} | |||
} | |||
|
|||
/// <summary> | |||
/// Copy the weights of all training features to <paramref name="weights"/>. | |||
/// </summary> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
</ [](start = 12, length = 2)
Missing pararm for weights #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I will add
/// <param name="weights">a <see cref="VBuffer{T}"/> where feature weights would be assigned to.
/// The i-th element in <paramref name="weights"/> stores the weight of the i-th feature.</param>
In reply to: 261459910 [](ancestors = 261459910)
int numTrees = Defaults.NumTrees, | ||
int minDatapointsInLeaves = Defaults.MinDocumentsInLeaves, | ||
double learningRate = Defaults.LearningRates) | ||
int numLeaves = Defaults.NumberOfLeaves, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numLeaves [](start = 16, length = 9)
same suggestion here to use the full names for the arguments. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be renamed to GamBinaryClassificationTrainer? I know there is an issue that is tracking consistent naming with our algorithms, so probably not worth doing in this pr. Refers to: src/Microsoft.ML.FastTree/GamClassification.cs:52 in 3257470. [](commit_id = 3257470, deletion_comment = False) |
int numTrees = Defaults.NumTrees, | ||
int minDatapointsInLeaves = Defaults.MinDocumentsInLeaves, | ||
double learningRate = Defaults.LearningRates, | ||
int numLeaves = Defaults.NumberOfLeaves, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numLeaves [](start = 16, length = 9)
Shouldnt these parameters also be fully named? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I didn't realize they are FastTree static APIs.
In reply to: 261464899 [](ancestors = 261464899)
@@ -63,14 +63,14 @@ public abstract class RegressionTreeBase | |||
/// bitwise complement operator in C#; for details, see | |||
/// https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/operators/bitwise-complement-operator. | |||
/// </summary> | |||
public IReadOnlyList<int> LteChild => _lteChild; | |||
public IReadOnlyList<int> LessThanOrEqualToThresholdChildren => _lteChild; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LessThanOrEqualToThresholdChildren [](start = 34, length = 34)
It looks like across lightgbm, catboost and whatever they use in scikit this one called: leftChildrends, or leftChild, another one is rightChild. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As suggested by @[email protected], this will be internal. In reply to: 468890167 [](ancestors = 468890167) Refers to: src/Microsoft.ML.FastTree/FastTreeArguments.cs:105 in ee67d50. [](commit_id = ee67d50, deletion_comment = False) |
@@ -129,12 +273,12 @@ internal override void Check(IExceptionContext ectx) | |||
#if OLD_DATALOAD | |||
ectx.CheckUserArg(0 <= secondaryMetricShare && secondaryMetricShare <= 1, "secondaryMetricShare", "secondaryMetricShare must be between 0 and 1."); | |||
#endif | |||
ectx.CheckUserArg(0 < LambdaMartMaxTruncation, nameof(LambdaMartMaxTruncation), "lambdaMartMaxTruncation must be positive."); | |||
ectx.CheckUserArg(0 < NdcgTruncationLevel, nameof(NdcgTruncationLevel), "lambdaMartMaxTruncation must be positive."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lambdaMartMaxTruncation [](start = 89, length = 23)
no need for this one. you already use nameof #Resolved
/// <summary> | ||
/// Early stopping metrics. | ||
/// </summary> | ||
public EarlyStoppingMetric EarlyStoppingMetric |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EarlyStoppingMetric [](start = 19, length = 19)
From comments for previous option it looks like it should be applicable only for regression and ranking problem. Does it actually works for binary classification? #Pending
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't exactly know but conceptually yes. It can be used to measure gradient norm.
In reply to: 262295829 [](ancestors = 262295829)
make this one internal and next two below it. Refers to: src/Microsoft.ML.FastTree/FastTreeArguments.cs:737 in 157aedc. [](commit_id = 157aedc, deletion_comment = False) |
@@ -9096,18 +9057,6 @@ | |||
"IsNullable": false, | |||
"Default": false | |||
}, | |||
{ | |||
"Name": "MaxTreesAfterCompression", | |||
"Type": "Int", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this got deleted? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR fixes #2617 (also fixes #2375, fixes #2857) and is also a part of #2613.
Cleaned items: