Perf benchmarks for optimization #711

Closed
Zruty0 opened this issue Aug 22, 2018 · 2 comments
Zruty0 (Contributor) commented Aug 22, 2018

We want a couple of ML scenarios to use for tracking performance. We can use them to detect regressions from build to build, as well as to guide performance improvements.

@Zruty0 added the perf (Performance and Benchmarking related) label Aug 22, 2018
@sfilipi self-assigned this Aug 30, 2018
@shauheen added this to the 0918 milestone Aug 31, 2018
justinormont (Contributor) commented Sep 12, 2018

I defined a first round of benchmarks; @Anipik is implementing them.

Tests in RSP form (a sketch of how these commands could be driven from a benchmark harness follows the list):

  • Text:
    • Bigram+Trichar:
      • OVA-AP: (Complete - Added Benchmark performance tests for wikidetoxData #820)
        maml.exe CV k=5 data=toxicity_annotated_comments.merged.shuf.cleaned-68MB,_160k-rows.tsv loader=TextLoader{quote=- sparse=- col=Label:R4:0 col=rev_id:TX:1 col=comment:TX:2 col=logged_in:BL:4 col=ns:TX:5 col=sample:TX:6 col=split:TX:7 col=year:R4:3 header=+} xf=Convert{col=logged_in type=R4} xf=CategoricalTransform{col=ns} xf=TextTransform{col=FeaturesText:comment wordExtractor=NGramExtractorTransform{ngram=2}} xf=Concat{col=Features:FeaturesText,logged_in,ns} tr=OVA{p=AveragedPerceptron{iter=10}} out={0.model.zip}
      • LightGBM: (Complete - Added Benchmark performance tests for wikidetoxData #820)
        maml.exe CV k=5 data=toxicity_annotated_comments.merged.shuf.cleaned-68MB,_160k-rows.tsv loader=TextLoader{quote=- sparse=- col=Label:R4:0 col=rev_id:TX:1 col=comment:TX:2 col=logged_in:BL:4 col=ns:TX:5 col=sample:TX:6 col=split:TX:7 col=year:R4:3 header=+} xf=Convert{col=logged_in type=R4} xf=CategoricalTransform{col=ns} xf=TextTransform{col=FeaturesText:comment wordExtractor=NGramExtractorTransform{ngram=2}} xf=Concat{col=Features:FeaturesText,logged_in,ns} tr=LightGBMMulticlass{} out={1.model.zip}
    • Word Embeddings: (this should be done in a Stacked Model using a TrainScore transform but this isn't available in ML.NET currently)
      • OVA-AP: (Complete - WordEmbedding Tests added plus added dimension check for the first row #880)
        maml.exe CV tr=OVA{p=AveragedPerceptron{iter=10}} k=5 loader=TextLoader{quote=- sparse=- col=Label:R4:0 col=rev_id:TX:1 col=comment:TX:2 col=logged_in:BL:4 col=ns:TX:5 col=sample:TX:6 col=split:TX:7 col=year:R4:3 header=+} data=toxicity_annotated_comments.merged.shuf.cleaned-68MB,_160k-rows.tsv out=0.model.zip xf=Convert{col=logged_in type=R4} xf=CategoricalTransform{col=ns} xf=TextTransform{col=FeaturesText:comment tokens=+ wordExtractor=NGramExtractorTransform{ngram=2}} xf=WordEmbeddingsTransform{col=FeaturesWordEmbedding:FeaturesText_TransformedText model=FastTextWikipedia300D} xf=Concat{col=Features:FeaturesText,FeaturesWordEmbedding,logged_in,ns}
      • SDCA: (Complete - WordEmbedding Tests added plus added dimension check for the first row #880)
        maml.exe CV tr=SDCAMC k=5 loader=TextLoader{quote=- sparse=- col=Label:R4:0 col=rev_id:TX:1 col=comment:TX:2 col=logged_in:BL:4 col=ns:TX:5 col=sample:TX:6 col=split:TX:7 col=year:R4:3 header=+} data=toxicity_annotated_comments.merged.shuf.cleaned-68MB,_160k-rows.tsv out=2.model.zip xf=Convert{col=logged_in type=R4} xf=CategoricalTransform{col=ns} xf=TextTransform{col=FeaturesText:comment tokens=+ wordExtractor={} charExtractor={}} xf=WordEmbeddingsTransform{col=FeaturesWordEmbedding:FeaturesText_TransformedText model=FastTextWikipedia300D} xf=Concat{col=Features:FeaturesWordEmbedding,logged_in,ns}
  • Ranking on Numeric:
    • FastTree: (Complete - Added numeric ranking Performance Tests #888)
      maml.exe TrainTest test=MSLR-WEB10K.VALIDATE.820MB_1.2M-rows.tsv eval=RankingEvaluator{t=10} data=MSLR-WEB10K.TRAIN.2.4GB_3.6M-rows.tsv loader=TextLoader{col=Label:R4:0 col=GroupId:TX:1 col=Features:R4:2-138} xf=HashTransform{col=GroupId} xf=NAHandleTransform{col=Features} tr=FastTreeRanking{} out={1.model.zip}
    • LightGBM: (Complete - Added numeric ranking Performance Tests #888)
      maml.exe TrainTest test=MSLR-WEB10K.VALIDATE.820MB_1.2M-rows.tsv eval=RankingEvaluator{t=10} data=MSLR-WEB10K.TRAIN.2.4GB_3.6M-rows.tsv loader=TextLoader{col=Label:R4:0 col=GroupId:TX:1 col=Features:R4:2-138} xf=HashTransform{col=GroupId} xf=NAHandleTransform{col=Features} tr=LightGBMRanking{} out={3.model.zip}
  • Scoring Speed:
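
To make these RSP commands trackable build-over-build, each one can be wrapped as a timed benchmark method. The sketch below is illustrative only, not the implementation from #820: it assumes BenchmarkDotNet as the harness and a hypothetical RunMaml helper that shells out to maml.exe, and the command string is abbreviated.

    using System.Diagnostics;
    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;

    public class TrainingSpeedBench
    {
        // Hypothetical helper: launches maml.exe with the given RSP-style
        // arguments and blocks until training finishes.
        internal static void RunMaml(string args)
        {
            var psi = new ProcessStartInfo("maml.exe", args)
            {
                UseShellExecute = false,
                RedirectStandardOutput = true,
            };
            using (var proc = Process.Start(psi))
            {
                proc.StandardOutput.ReadToEnd(); // drain output so the process can exit
                proc.WaitForExit();
            }
        }

        [Benchmark]
        public void WikiDetox_BigramTrichar_OVA_AP()
        {
            // The first RSP command above; "..." elides the loader and
            // transform arguments shown in full in the list.
            RunMaml("CV k=5 data=toxicity_annotated_comments.merged.shuf.cleaned-68MB,_160k-rows.tsv "
                  + "... tr=OVA{p=AveragedPerceptron{iter=10}} out={0.model.zip}");
        }
    }

    // Entry point: BenchmarkDotNet measures each [Benchmark] method's wall-clock time.
    public static class Program
    {
        public static void Main() => BenchmarkRunner.Run<TrainingSpeedBench>();
    }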

justinormont (Contributor) commented
The first round of performance (speed) benchmarks is complete. Thanks to @Anipik for implementing the benchmarks. cc: @Zruty0, @GalOshri

The benchmarks are chosen to be representative of user tasks and datasets. The components to test were chosen based on component usage numbers from internal Microsoft users, what folks should be using, and commonly performed tasks.

Components covered by these tests:

  • Learners
    • LightGBM Multiclass & LightGBM Ranking
    • OVA w/ AveragedPerceptron
    • SDCA Multiclass
    • FastTree Ranking
  • Transforms
    • Convert
    • CategoricalTransform
    • TextTransform, and its sub-transforms:
      • WordTokenizeTransform
      • TextNormalizerTransform
      • TermTransform
      • NgramTransform
      • CharTokenize
      • GcnTransform
      • DropColumnsTransform
    • Concat
    • WordEmbeddingsTransform
    • HashTransform
    • NAHandleTransform

Besides model training speed, bulk prediction/scoring speed is benchmarked for a text dataset w/ OVA-AveragedPerceptron, and a numeric ranking dataset w/ FastTree.
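
Structurally, the scoring-speed benchmarks differ from the training ones in that the model must be trained once outside the measured region. A minimal sketch of that shape (again an illustration, not the #820 code; it reuses the hypothetical RunMaml helper from the earlier sketch, and the exact arguments of a maml.exe scoring invocation are an assumption):

    using BenchmarkDotNet.Attributes;

    public class BulkScoringBench
    {
        [GlobalSetup]
        public void TrainOnce()
        {
            // Train outside the measured region so only scoring is timed below;
            // produces 0.model.zip ("..." elides the full RSP arguments).
            TrainingSpeedBench.RunMaml("CV k=5 ... tr=OVA{p=AveragedPerceptron{iter=10}} out={0.model.zip}");
        }

        [Benchmark]
        public void BulkScore_WikiDetox()
        {
            // Assumed invocation: apply the saved model to the dataset and
            // measure only the bulk prediction pass.
            TrainingSpeedBench.RunMaml("score in=0.model.zip "
                + "data=toxicity_annotated_comments.merged.shuf.cleaned-68MB,_160k-rows.tsv");
        }
    }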

Notably absent components: (reasonable choices for the next round of benchmarks)

  • User scenarios
    • Single prediction (one instance at a time) latency & throughput, and time to first prediction (see the latency sketch after this list)
    • Component-wise tests to let devs focus on improving the speed of specific components
    • Very small datasets (where the overhead becomes apparent)
    • Scalability to a high number of classes in multiclass classification (>=2000 classes)
    • Disk read & parse perf using a large dataset and disabled caching
    • Very complex models w/ heavy featurization and stacked models
    • Regression problems
    • Categorical datasets
  • Learners
    • BinaryClassificationGamTrainer
    • SymSGD
    • K-means
    • (wide variety as they hit this external repo)
  • Transforms
    • TrainScore for model stacking
    • PCATransform
    • WordHashBagTransform
    • CategoricalHashTransform
    • KeyToVector
    • BinNormalizer
    • CountFeatureSelection/LearnerFeatureSelectionTransform/MutualInformationFeatureSelection
    • TreeFeaturizationTransform
    • MinMaxNormalizer/MeanVarNormalizer
    • LdaTransform
    • (others as they hit this external repo)
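
For the single-prediction latency item above, the measurement shape is different from the bulk benchmarks: per-call latency percentiles and time-to-first-prediction matter more than throughput. A generic sketch of that measurement, independent of any particular ML.NET API (the predict delegate stands in for whatever single-instance prediction call ends up being benchmarked):

    using System;
    using System.Diagnostics;

    public static class SinglePredictionLatency
    {
        // Measures time-to-first-prediction plus per-call latency percentiles
        // for any single-instance predict function.
        public static void Measure(Func<string, float> predict, string[] instances)
        {
            var sw = Stopwatch.StartNew();
            predict(instances[0]); // first call includes any lazy initialization
            Console.WriteLine($"time to first prediction: {sw.Elapsed.TotalMilliseconds:F2} ms");

            var latenciesMs = new double[instances.Length];
            for (int i = 0; i < instances.Length; i++)
            {
                sw.Restart();
                predict(instances[i]);
                latenciesMs[i] = sw.Elapsed.TotalMilliseconds;
            }

            Array.Sort(latenciesMs);
            Console.WriteLine($"p50: {latenciesMs[latenciesMs.Length / 2]:F3} ms, " +
                              $"p99: {latenciesMs[(int)(latenciesMs.Length * 0.99)]:F3} ms");
        }
    }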

The focus of these benchmarks was on speed; we can also use them to track ML metrics like accuracy across builds. This would require a much larger set of datasets, so as not to over-fit our improvements to a specific dataset.
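
If the same runs are reused for metric tracking, the only addition is capturing and parsing maml.exe's output instead of discarding it. A sketch (the exact format of the metric line printed by maml.exe is an assumption here):

    using System.Diagnostics;
    using System.Globalization;
    using System.Text.RegularExpressions;

    public static class MetricTracking
    {
        // Runs maml.exe, captures stdout, and extracts a metric so accuracy
        // can be recorded alongside the timing numbers for each build.
        public static double? GetMicroAccuracy(string args)
        {
            var psi = new ProcessStartInfo("maml.exe", args)
            {
                UseShellExecute = false,
                RedirectStandardOutput = true,
            };
            string output;
            using (var proc = Process.Start(psi))
            {
                output = proc.StandardOutput.ReadToEnd();
                proc.WaitForExit();
            }

            // Assumed output format: a line such as "Accuracy(micro-avg): 0.912345".
            var m = Regex.Match(output, @"Accuracy\(micro-avg\):\s*([\d.]+)");
            return m.Success
                ? double.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture)
                : (double?)null;
        }
    }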

@ghost locked as resolved and limited conversation to collaborators Mar 29, 2022