Changed default NGrams for FeaturizerText from 1 to 2 #5243


Closed
wants to merge 5 commits
8 changes: 7 additions & 1 deletion src/Microsoft.ML.FastTree/FastTreeArguments.cs
@@ -270,12 +270,18 @@ public sealed class Options : BoostedTreeOptions, IFastTreeTrainerFactory
[TGUI(NotGui = true)]
internal bool NormalizeQueryLambdas;

/// <summary>
/// Column to use for example groupId.
/// </summary>
[Argument(ArgumentType.AtMostOnce, HelpText = "Column to use for example groupId", ShortName = "groupId", SortOrder = 5, Visibility = ArgumentAttribute.VisibilityType.EntryPointsOnly)]
public new string RowGroupColumnName = "GroupId";

/// <summary>
/// Internal state of <see cref="EarlyStoppingMetric"/>. It should always be kept in sync with
/// <see cref="BoostedTreeOptions.EarlyStoppingMetrics"/>.
/// </summary>
// Disable 649 because Visual Studio can't detect its assignment via property.
#pragma warning disable 649
private EarlyStoppingRankingMetric _earlyStoppingMetric;
#pragma warning restore 649

6 changes: 6 additions & 0 deletions src/Microsoft.ML.LightGbm/LightGbmRankingTrainer.cs
@@ -146,6 +146,12 @@ public enum EvaluateMetricType
ShortName = "em")]
public EvaluateMetricType EvaluationMetric = EvaluateMetricType.NormalizedDiscountedCumulativeGain;

/// <summary>
/// Column to use for example groupId.
/// </summary>
[Argument(ArgumentType.AtMostOnce, HelpText = "Column to use for example groupId", ShortName = "groupId", SortOrder = 5, Visibility = ArgumentAttribute.VisibilityType.EntryPointsOnly)]
public new string RowGroupColumnName = "GroupId";

static Options()
{
NameMapping.Add(nameof(CustomGains), "label_gain");
@@ -61,7 +61,7 @@ public abstract class OnlineLinearOptions : TrainerInputBaseWithLabel
[BestFriend]
internal class OnlineDefault
{
public const int NumberOfIterations = 1;
public const int NumberOfIterations = 10;
Contributor

Two smaller PRs will be easier to review.

Together, this is modifying 105 files by changing both the default ngram length and AveragedPerceptron's default number of iterations.

For instance, the AveragedPerceptron change should only affect baseline outputs where AveragedPerceptron is used (including in ensembles). But I see changes in some SVM files, which shouldn't be affected by the AP change but might have a TextTransform in the pipeline. It is just hard to disentangle.

Contributor Author

Yeah, I am going to break this up. We seem to have a memory issue with the NGram change, so I will split these out to get them in while I work on that.

You said the SVM files shouldn't change, but might have a TextTransform in the pipeline. If they do have a text transform, should they change? Or should we only see changes where AveragedPerceptron is used and nowhere else? Or is this change a bad thing for SVM as well?

Contributor

Expected output changes

This PR is changing both AveragedPerceptron and the text transform. If the pipeline has either of these components, I would expect the pipeline's output to change.

Any pipeline with neither AveragedPerceptron nor the text transform should have the same output as before. Modifying the other trainers' iteration counts is not benchmarked and, by default, I would assume it will not be beneficial.

Memory footprint

The memory footprint will increase and is absolutely worth it for the accuracy gains of our default pipeline. The pipeline being implemented in this PR, bigrams + tri-character-grams with AveragedPerceptron{iter=10}, is the default in TLC and ML.NET AutoML for text and is highly tested/benchmarked. It is the result of around 50-150k runs on the combinatorics of a variety of datasets, primarily testing ngram lengths, trainers, and hyperparameters. This pipeline was found to be an optimal default. Other featurization options help on specific datasets (see: list).
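For reference, here is a minimal sketch (not part of this PR) of configuring that default pipeline explicitly through the ML.NET API; the column names, seed, and exact option values are illustrative assumptions:

```csharp
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

var mlContext = new MLContext(seed: 1);

// Text featurization with word bigrams and character trigrams,
// followed by AveragedPerceptron trained for 10 iterations.
var pipeline = mlContext.Transforms.Text.FeaturizeText(
        outputColumnName: "Features",
        options: new TextFeaturizingEstimator.Options
        {
            WordFeatureExtractor = new WordBagEstimator.Options { NgramLength = 2 },
            CharFeatureExtractor = new WordBagEstimator.Options { NgramLength = 3 },
        },
        "Text")
    .Append(mlContext.BinaryClassification.Trainers.AveragedPerceptron(
        labelColumnName: "Label",
        featureColumnName: "Features",
        numberOfIterations: 10));
```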

The memory increase will occur in a few places: (1) in the text transform's dictionary, where it stores all seen ngrams; (2) in the IDataView cache, if used; (3) in tree trainers at the dataset transpose step from row-wise to column-wise; and (4) in multi-class linear trainers with many classes, which store a weight for each feature times each class.

In the disentangled PR handling the ngram change, we should dig into any memory issues that arise. It may be a limitation of our build/testing platform.

Perhaps in the future we should write a doc on tips for low-memory model training and serving.

Contributor Author

Thanks for the detailed responses @justinormont.

I will check the files that were changed again and make sure that they all have either AveragedPerceptron or a text transform. If I find any that don't, I will change the location where I am overwriting that value.

I think I have the memory increase issue figured out, but I am still looking into it. Yes, the memory should go up, but it was exploding way faster than it should. A 70 KB file (250 lines) was ballooning to 1.7 GB per ngram length during training, so when the ngram length was 2, it was using 3.4 GB for that small file. It appears to be because the implementation allocates all of the possible memory up front, whether it is used or not. I have changed that and the memory usage is much smaller now. Still verifying that my changes had no impact on any other tests, though.
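Purely as a hypothetical illustration (not the actual ML.NET code), the difference is roughly: pre-sizing a count buffer for every possible ngram id pays the full cost regardless of the data, while a lazily grown dictionary only pays for ngrams actually observed.

```csharp
using System.Collections.Generic;

// Hypothetical sketch of eager vs. lazy allocation; sizes are made up.
int possibleNgramIds = 1_000_000;              // e.g., hash-space size per ngram length
var eagerCounts = new int[possibleNgramIds];   // ~4 MB allocated even if few ngrams occur

var lazyCounts = new Dictionary<string, int>();    // grows only with observed ngrams
foreach (var ngram in new[] { "the quick", "quick brown", "the quick" })
    lazyCounts[ngram] = lazyCounts.TryGetValue(ngram, out var count) ? count + 1 : 1;
```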

Contributor

Modifying OnlineLinear.cs seems to affect multiple trainers besides AveragedPerceptron. You may need to find a way to target just AveragedPerceptron.

Contributor Author

I can do that. Question though: would it be bad to update it for them all? Or is it just that we know 10 is better for AveragedPerceptron, but that's not true for the others?

Contributor

For the other trainers, it's unknown if 10 iterations will be better or worse. For most trainers, NumberOfIterations/NumberOfTrees and LearningRate are the primary hyperparameters; changing them has a large impact.

Since the existing defaults in the other trainers are backed by some level of benchmarking from when they were created, I would keep the existing values. Moving from 1 to 10 iterations, we know the runtime will increase, and we should demonstrate the value (accuracy gain) of this extra time spent.

For AveragedPerceptron, we did benchmarks showing the gain in having 10 iterations as the new default.

A subset of the AveragedPerceptron benchmarks is in the original issue: #2305.
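One way to scope this (a hypothetical sketch, not this PR's actual diff) is to leave OnlineDefault.NumberOfIterations at 1 and override the value only in AveragedPerceptron's own options:

```csharp
// Illustrative only: keep the shared OnlineLinear default at 1 iteration and set the
// benchmarked default of 10 just for AveragedPerceptron. In ML.NET this Options class
// is nested inside AveragedPerceptronTrainer.
public sealed class Options : AveragedLinearOptions
{
    public Options()
    {
        // AveragedPerceptron-specific default, per the benchmarks in #2305.
        NumberOfIterations = 10;
    }
}
```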

}
}

@@ -79,7 +79,7 @@ public class Options

public Options()
{
NgramLength = 1;
NgramLength = 2;
SkipLength = NgramExtractingEstimator.Defaults.SkipLength;
UseAllLengths = NgramExtractingEstimator.Defaults.UseAllLengths;
MaximumNgramsCount = new int[] { NgramExtractingEstimator.Defaults.MaximumNgramsCount };
@@ -1,10 +1,10 @@
maml.exe CV tr=AveragedPerceptron threads=- cali=PAV dout=%Output% data=%Data% seed=1
Automatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.
Warning: Skipped 8 instances with missing features during training (over 1 iterations; 8 inst/iter)
Warning: Skipped 80 instances with missing features during training (over 10 iterations; 8 inst/iter)
Training calibrator.
PAV calibrator: piecewise function approximation has 6 components.
PAV calibrator: piecewise function approximation has 5 components.
Automatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.
Warning: Skipped 8 instances with missing features during training (over 1 iterations; 8 inst/iter)
Warning: Skipped 80 instances with missing features during training (over 10 iterations; 8 inst/iter)
Training calibrator.
PAV calibrator: piecewise function approximation has 6 components.
Warning: The predictor produced non-finite prediction values on 8 instances during testing. Possible causes: abnormal data or the predictor is numerically unstable.
@@ -13,43 +13,43 @@ Confusion table
||======================
PREDICTED || positive | negative | Recall
TRUTH ||======================
positive || 132 | 2 | 0.9851
positive || 133 | 1 | 0.9925
negative || 9 | 211 | 0.9591
||======================
Precision || 0.9362 | 0.9906 |
OVERALL 0/1 ACCURACY: 0.968927
Precision || 0.9366 | 0.9953 |
OVERALL 0/1 ACCURACY: 0.971751
LOG LOSS/instance: Infinity
Test-set entropy (prior Log-Loss/instance): 0.956998
LOG-LOSS REDUCTION (RIG): -Infinity
AUC: 0.992809
AUC: 0.994403
Warning: The predictor produced non-finite prediction values on 8 instances during testing. Possible causes: abnormal data or the predictor is numerically unstable.
TEST POSITIVE RATIO: 0.3191 (105.0/(105.0+224.0))
Confusion table
||======================
PREDICTED || positive | negative | Recall
TRUTH ||======================
positive || 102 | 3 | 0.9714
negative || 4 | 220 | 0.9821
positive || 100 | 5 | 0.9524
negative || 3 | 221 | 0.9866
||======================
Precision || 0.9623 | 0.9865 |
OVERALL 0/1 ACCURACY: 0.978723
LOG LOSS/instance: 0.239330
Precision || 0.9709 | 0.9779 |
OVERALL 0/1 ACCURACY: 0.975684
LOG LOSS/instance: 0.227705
Test-set entropy (prior Log-Loss/instance): 0.903454
LOG-LOSS REDUCTION (RIG): 0.735095
AUC: 0.997279
LOG-LOSS REDUCTION (RIG): 0.747961
AUC: 0.997619

OVERALL RESULTS
---------------------------------------
AUC: 0.995044 (0.0022)
Accuracy: 0.973825 (0.0049)
Positive precision: 0.949217 (0.0130)
Positive recall: 0.978252 (0.0068)
Negative precision: 0.988579 (0.0020)
Negative recall: 0.970617 (0.0115)
AUC: 0.996011 (0.0016)
Accuracy: 0.973718 (0.0020)
Positive precision: 0.953747 (0.0171)
Positive recall: 0.972459 (0.0201)
Negative precision: 0.986580 (0.0087)
Negative recall: 0.972849 (0.0138)
Log-loss: Infinity (NaN)
Log-loss reduction: -Infinity (NaN)
F1 Score: 0.963412 (0.0034)
AUPRC: 0.990172 (0.0037)
F1 Score: 0.962653 (0.0011)
AUPRC: 0.992269 (0.0025)

---------------------------------------
Physical memory usage(MB): %Number%
@@ -1,4 +1,4 @@
AveragedPerceptron
AUC Accuracy Positive precision Positive recall Negative precision Negative recall Log-loss Log-loss reduction F1 Score AUPRC Learner Name Train Dataset Test Dataset Results File Run Time Physical Memory Virtual Memory Command Line Settings
0.995044 0.973825 0.949217 0.978252 0.988579 0.970617 Infinity -Infinity 0.963412 0.990172 AveragedPerceptron %Data% %Output% 99 0 0 maml.exe CV tr=AveragedPerceptron threads=- cali=PAV dout=%Output% data=%Data% seed=1
0.996011 0.973718 0.953747 0.972459 0.98658 0.972849 Infinity -Infinity 0.962653 0.992269 AveragedPerceptron %Data% %Output% 99 0 0 maml.exe CV tr=AveragedPerceptron threads=- cali=PAV dout=%Output% data=%Data% seed=1
