Changed default NGrams for FeaturizerText from 1 to 2 #5243
Changes from all commits: 689ada1, f4161f2, 6f41bf8, a23af4f, f4396a1
```diff
@@ -61,7 +61,7 @@ public abstract class OnlineLinearOptions : TrainerInputBaseWithLabel
     [BestFriend]
     internal class OnlineDefault
     {
-        public const int NumberOfIterations = 1;
+        public const int NumberOfIterations = 10;
     }
 }
```

Review comment: Modifying `NumberOfIterations` within `OnlineDefault` changes the default for all of the online linear trainers, not just AveragedPerceptron.

Reply: I can do that. Question though: would it be bad to update it for them all? Or is it just that we know 10 is better for AveragedPerceptron, but that's not true for the others?

Review comment: For the other trainers, it's unknown if 10 iterations will be better or worse. Since the existing defaults in the other trainers are backed by some level of benchmarking from when they were created, I would keep the existing value. Moving from 1 to 10 iterations, we know the runtime will increase, and we should demonstrate the value (accuracy gain) of this extra time spent. For AveragedPerceptron, we did benchmarks showing the gain in having 10 iterations as the new default. A subset of the AveragedPerceptron benchmarks is in the original issue: #2305.
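For context: `OnlineDefault.NumberOfIterations` only supplies the default, so existing callers can keep the old single-pass behavior by setting the parameter explicitly. A minimal sketch, not part of this PR (column names are illustrative):

```csharp
// Sketch: pinning the iteration count explicitly so a pipeline keeps the
// old single-pass behavior regardless of the new default.
using Microsoft.ML;

var mlContext = new MLContext(seed: 1);

// numberOfIterations falls back to OnlineDefault.NumberOfIterations when
// omitted; passing it explicitly opts out of the default entirely.
var trainer = mlContext.BinaryClassification.Trainers.AveragedPerceptron(
    labelColumnName: "Label",
    featureColumnName: "Features",
    numberOfIterations: 1);
```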
The AveragedPerceptron cross-validation baseline file is updated with the new metrics:

```text
@@ -1,4 +1,4 @@
AveragedPerceptron
AUC Accuracy Positive precision Positive recall Negative precision Negative recall Log-loss Log-loss reduction F1 Score AUPRC Learner Name Train Dataset Test Dataset Results File Run Time Physical Memory Virtual Memory Command Line Settings
0.995044 0.973825 0.949217 0.978252 0.988579 0.970617 Infinity -Infinity 0.963412 0.990172 AveragedPerceptron %Data% %Output% 99 0 0 maml.exe CV tr=AveragedPerceptron threads=- cali=PAV dout=%Output% data=%Data% seed=1
0.996011 0.973718 0.953747 0.972459 0.98658 0.972849 Infinity -Infinity 0.962653 0.992269 AveragedPerceptron %Data% %Output% 99 0 0 maml.exe CV tr=AveragedPerceptron threads=- cali=PAV dout=%Output% data=%Data% seed=1
```
Review comment: Two smaller PRs will be easier to review. Together, this is modifying 105 files by changing both the default ngram length and AveragedPerceptron's default iteration count.

For instance, the AveragedPerceptron change should only affect baseline outputs where AveragedPerceptron is used (including in ensembles). But I see some changes in SVM files, which shouldn't change due to the AP change but might have a TextTransform in the pipeline. It is just hard to disentangle.
Reply: Yeah, I am going to break this up. We seem to have a memory issue with the NGram stuff, so I will split these out to get them in while I work on that.

You said the SVM files shouldn't change but might have a TextTransform in the pipeline. If they do have a text transform, should they change? Or should we only see changes where AveragedPerceptron is used and nowhere else? Or is this change a bad thing for SVM as well?
Review comment (@justinormont):

Expected output changes

This PR is changing both AveragedPerceptron and the text transform. If a pipeline has either of these components, I would expect the pipeline's output to change. Any pipeline with neither AveragedPerceptron nor the text transform should have the same output as before. Modifying the other trainers' iteration counts is not benchmarked and, by default, I would assume it will not be beneficial.
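To make that expectation concrete, a hypothetical sketch (the trainers and column names are illustrative, not taken from the PR's test files). The first pipeline contains a text transform, so its baseline should shift with the new ngram default; the second contains neither changed component, so its output should be identical before and after:

```csharp
// Hypothetical sketch of which baselines are expected to change.
using Microsoft.ML;

var mlContext = new MLContext(seed: 1);

// SVM with a text transform: output changes, because FeaturizeText picks
// up the new default ngram length.
var textSvm = mlContext.Transforms.Text
    .FeaturizeText("Features", "Text")
    .Append(mlContext.BinaryClassification.Trainers.LinearSvm());

// SVM over pre-built numeric features: neither AveragedPerceptron nor the
// text transform is present, so output should be unchanged.
var numericSvm = mlContext.Transforms
    .Concatenate("Features", "F1", "F2")
    .Append(mlContext.BinaryClassification.Trainers.LinearSvm());
```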
Memory footprint

The memory footprint will increase, and it is absolutely worth it for the accuracy gains of our default pipeline. The pipeline being implemented in this PR, bigrams + tri-character-grams with AveragedPerceptron{iter=10}, is the default in TLC and ML.NET AutoML for text and is highly tested/benchmarked. It is the result of around 50-150k runs on the combinatorics of a variety of datasets, primarily testing ngram lengths, trainers, and hyperparameters. This pipeline was found to be an optimal default; other featurization options help on specific datasets (see: list).
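A sketch of that default pipeline expressed through the ML.NET API (the column names and the explicit options object are illustrative; this is not code from the PR):

```csharp
// Sketch of "bigrams + tri-character-grams w/ AveragedPerceptron{iter=10}".
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

var mlContext = new MLContext(seed: 1);

var textOptions = new TextFeaturizingEstimator.Options
{
    // Word-level bigrams.
    WordFeatureExtractor = new WordBagEstimator.Options { NgramLength = 2 },
    // Character-level trigrams.
    CharFeatureExtractor = new WordBagEstimator.Options { NgramLength = 3 },
};

var pipeline = mlContext.Transforms.Text
    .FeaturizeText("Features", textOptions, "Text")
    .Append(mlContext.BinaryClassification.Trainers.AveragedPerceptron(
        numberOfIterations: 10));
```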
The memory increase will occur in a few places: (1) in the text transform's dictionary, which stores every ngram seen; (2) in the IDataView cache, if used; (3) in tree trainers, at the step that transposes the dataset from row-wise to column-wise; and (4) in multi-class linear trainers with many classes, which store one weight per feature per class.
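For item (1), the ngram extractors already expose a cap on the dictionary size; a hedged sketch of bounding it follows (the 20_000 cap is an arbitrary illustrative value, not a recommendation from this PR):

```csharp
// Sketch: capping the ngram dictionary to bound the text transform's memory.
using Microsoft.ML;

var mlContext = new MLContext(seed: 1);

var ngramPipeline = mlContext.Transforms.Text
    .TokenizeIntoWords("Tokens", "Text")
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
    .Append(mlContext.Transforms.Text.ProduceNgrams(
        "Features", "Tokens",
        ngramLength: 2,
        maximumNgramsCount: 20_000)); // keep only the most frequent ngrams
```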
In the disentangled PR handling the ngram change, we should dig into any memory issues that arise; it may be a limitation of our build/testing platform.

Perhaps in the future we should write a doc with tips for low-memory model training and serving.
Reply: Thanks for the detailed responses @justinormont.

I will check the changed files again and make sure they all have either AveragedPerceptron or a text transform. If I find any that don't, I will change where I am overwriting that value.

I think I have the memory increase issue figured out, but I am still looking into it. Yes, the memory should go up, but it was exploding far faster than it should: a 70 KB file (250 lines) was ballooning to 1.7 GB per ngram length during training, so with an ngram length of 2 it used 3.4 GB for that small file. It appears to be because the implementation allocates all the possible memory upfront, whether it is used or not. I have changed that, and the memory usage is much smaller now; I am still verifying that my changes had no impact on any other tests.
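A hypothetical illustration of the allocation pattern being described (this is not the PR's actual code; it only contrasts eager pre-allocation with on-demand growth):

```csharp
// Eager: reserving a slot for every possible ngram id up front makes even a
// tiny training file pay for the entire id space (~2 GB for the size below).
using System;
using System.Collections.Generic;

const int maxPossibleNgramIds = 500_000_000;
// float[] eagerCounts = new float[maxPossibleNgramIds]; // would allocate ~2 GB

// Lazy: a dictionary grown on demand only pays for ngrams actually observed.
var lazyCounts = new Dictionary<int, float>();
foreach (int ngramId in new[] { 17, 42, 17 })
{
    lazyCounts.TryGetValue(ngramId, out float count);
    lazyCounts[ngramId] = count + 1;
}
Console.WriteLine(lazyCounts.Count); // 2 distinct ngram ids stored
```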