Perf benchmarks for optimization #711

Closed
Zruty0 opened this issue Aug 22, 2018 · 2 comments
Zruty0 (Contributor) commented Aug 22, 2018

We want a couple of ML scenarios to use for tracking performance. We can use them to detect regressions from build to build, as well as to guide performance improvements.

@Zruty0 added the perf (Performance and Benchmarking related) label Aug 22, 2018
@sfilipi self-assigned this Aug 30, 2018
@shauheen added this to the 0918 milestone Aug 31, 2018
justinormont (Contributor) commented Sep 12, 2018

I defined a first round of benchmarks; @Anipik is implementing them.

Tests in RSP form (a sketch of how these commands could be driven from a benchmark harness follows the list):

  • Text:
    • Bigram+Trichar:
      • OVA-AP: (Complete - Added Benchmark performance tests for wikidetoxData #820)
        maml.exe CV k=5 data=toxicity_annotated_comments.merged.shuf.cleaned-68MB,_160k-rows.tsv loader=TextLoader{quote=- sparse=- col=Label:R4:0 col=rev_id:TX:1 col=comment:TX:2 col=logged_in:BL:4 col=ns:TX:5 col=sample:TX:6 col=split:TX:7 col=year:R4:3 header=+} xf=Convert{col=logged_in type=R4} xf=CategoricalTransform{col=ns} xf=TextTransform{col=FeaturesText:comment wordExtractor=NGramExtractorTransform{ngram=2}} xf=Concat{col=Features:FeaturesText,logged_in,ns} tr=OVA{p=AveragedPerceptron{iter=10}} out={0.model.zip}
      • LightGBM: (Complete - Added Benchmark performance tests for wikidetoxData #820)
        maml.exe CV k=5 data=toxicity_annotated_comments.merged.shuf.cleaned-68MB,_160k-rows.tsv loader=TextLoader{quote=- sparse=- col=Label:R4:0 col=rev_id:TX:1 col=comment:TX:2 col=logged_in:BL:4 col=ns:TX:5 col=sample:TX:6 col=split:TX:7 col=year:R4:3 header=+} xf=Convert{col=logged_in type=R4} xf=CategoricalTransform{col=ns} xf=TextTransform{col=FeaturesText:comment wordExtractor=NGramExtractorTransform{ngram=2}} xf=Concat{col=Features:FeaturesText,logged_in,ns} tr=LightGBMMulticlass{} out={1.model.zip}
    • Word Embeddings: (this should be done in a Stacked Model using a TrainScore transform but this isn't available in ML.NET currently)
      • OVA-AP: (Complete - WordEmbedding Tests added plus added dimension check for the first row #880)
        maml.exe CV tr=OVA{p=AveragedPerceptron{iter=10}} k=5 loader=TextLoader{quote=- sparse=- col=Label:R4:0 col=rev_id:TX:1 col=comment:TX:2 col=logged_in:BL:4 col=ns:TX:5 col=sample:TX:6 col=split:TX:7 col=year:R4:3 header=+} data=toxicity_annotated_comments.merged.shuf.cleaned-68MB,_160k-rows.tsv out=0.model.zip xf=Convert{col=logged_in type=R4} xf=CategoricalTransform{col=ns} xf=TextTransform{col=FeaturesText:comment tokens=+ wordExtractor=NGramExtractorTransform{ngram=2}} xf=WordEmbeddingsTransform{col=FeaturesWordEmbedding:FeaturesText_TransformedText model=FastTextWikipedia300D} xf=Concat{col=Features:FeaturesText,FeaturesWordEmbedding,logged_in,ns}
      • SDCA: (Complete - WordEmbedding Tests added plus added dimension check for the first row #880)
        maml.exe CV tr=SDCAMC k=5 loader=TextLoader{quote=- sparse=- col=Label:R4:0 col=rev_id:TX:1 col=comment:TX:2 col=logged_in:BL:4 col=ns:TX:5 col=sample:TX:6 col=split:TX:7 col=year:R4:3 header=+} data=toxicity_annotated_comments.merged.shuf.cleaned-68MB,_160k-rows.tsv out=2.model.zip xf=Convert{col=logged_in type=R4} xf=CategoricalTransform{col=ns} xf=TextTransform{col=FeaturesText:comment tokens=+ wordExtractor={} charExtractor={}} xf=WordEmbeddingsTransform{col=FeaturesWordEmbedding:FeaturesText_TransformedText model=FastTextWikipedia300D} xf=Concat{col=Features:FeaturesWordEmbedding,logged_in,ns}
  • Ranking on Numeric:
    • FastTree: (Complete - Added numeric ranking Performance Tests #888)
      maml.exe TrainTest test=MSLR-WEB10K.VALIDATE.820MB_1.2M-rows.tsv eval=RankingEvaluator{t=10} data=MSLR-WEB10K.TRAIN.2.4GB_3.6M-rows.tsv loader=TextLoader{col=Label:R4:0 col=GroupId:TX:1 col=Features:R4:2-138} xf=HashTransform{col=GroupId} xf=NAHandleTransform{col=Features} tr=FastTreeRanking{} out={1.model.zip}
    • LightGBM: (Complete - Added numeric ranking Performance Tests #888)
      maml.exe TrainTest test=MSLR-WEB10K.VALIDATE.820MB_1.2M-rows.tsv eval=RankingEvaluator{t=10} data=MSLR-WEB10K.TRAIN.2.4GB_3.6M-rows.tsv loader=TextLoader{col=Label:R4:0 col=GroupId:TX:1 col=Features:R4:2-138} xf=HashTransform{col=GroupId} xf=NAHandleTransform{col=Features} tr=LightGBMRanking{} out={3.model.zip}
  • Scoring Speed:
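
To make these RSP commands trackable build-over-build, each one can be wrapped as a timed benchmark method. The sketch below is illustrative only, not the implementation from #820: it assumes BenchmarkDotNet as the harness and a hypothetical RunMaml helper that shells out to maml.exe, and the command string is abbreviated.

    using System.Diagnostics;
    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;

    public class TrainingSpeedBench
    {
        // Hypothetical helper: launches maml.exe with the given RSP-style
        // arguments and blocks until training finishes.
        internal static void RunMaml(string args)
        {
            var psi = new ProcessStartInfo("maml.exe", args)
            {
                UseShellExecute = false,
                RedirectStandardOutput = true,
            };
            using (var proc = Process.Start(psi))
            {
                proc.StandardOutput.ReadToEnd(); // drain output so the process can exit
                proc.WaitForExit();
            }
        }

        [Benchmark]
        public void WikiDetox_BigramTrichar_OVA_AP()
        {
            // The first RSP command above; "..." elides the loader and
            // transform arguments shown in full in the list.
            RunMaml("CV k=5 data=toxicity_annotated_comments.merged.shuf.cleaned-68MB,_160k-rows.tsv "
                  + "... tr=OVA{p=AveragedPerceptron{iter=10}} out={0.model.zip}");
        }
    }

    // Entry point: BenchmarkDotNet measures each [Benchmark] method's wall-clock time.
    public static class Program
    {
        public static void Main() => BenchmarkRunner.Run<TrainingSpeedBench>();
    }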

justinormont (Contributor) commented
The first round of performance (speed) benchmarks is complete. Thanks to @Anipik for implementing the benchmarks. cc: @Zruty0, @GalOshri

The benchmarks are chosen to be representative of user tasks and datasets. The components to test were chosen based on component usage numbers from internal Microsoft users, what folks should be using, and commonly performed tasks.

Components covered by these tests:

  • Learners
    • LightGBM Multiclass & LightGBM Ranking
    • OVA w/ AveragedPerceptron
    • SDCA Multiclass
    • FastTree Ranking
  • Transforms
    • Convert
    • CategoricalTransform
    • TextTransform, and its sub-transforms:
      • WordTokenizeTransform
      • TextNormalizerTransform
      • TermTransform
      • NgramTransform
      • CharTokenize
      • GcnTransform
      • DropColumnsTransform
    • Concat
    • WordEmbeddingsTransform
    • HashTransform
    • NAHandleTransform

Besides model training speed, bulk prediction/scoring speed is benchmarked for a text dataset w/ OVA-AveragedPerceptron, and a numeric ranking dataset w/ FastTree.
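
Structurally, the scoring-speed benchmarks differ from the training ones in that the model must be trained once outside the measured region. A minimal sketch of that shape (again an illustration, not the #820 code; it reuses the hypothetical RunMaml helper from the earlier sketch, and the exact arguments of a maml.exe scoring invocation are an assumption):

    using BenchmarkDotNet.Attributes;

    public class BulkScoringBench
    {
        [GlobalSetup]
        public void TrainOnce()
        {
            // Train outside the measured region so only scoring is timed below;
            // produces 0.model.zip ("..." elides the full RSP arguments).
            TrainingSpeedBench.RunMaml("CV k=5 ... tr=OVA{p=AveragedPerceptron{iter=10}} out={0.model.zip}");
        }

        [Benchmark]
        public void BulkScore_WikiDetox()
        {
            // Assumed invocation: apply the saved model to the dataset and
            // measure only the bulk prediction pass.
            TrainingSpeedBench.RunMaml("score in=0.model.zip "
                + "data=toxicity_annotated_comments.merged.shuf.cleaned-68MB,_160k-rows.tsv");
        }
    }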

Notably absent components: (reasonable choices for the next round of benchmarks)

  • User scenarios
    • Single prediction (one instance at a time) latency & throughput, and time to first prediction (see the latency sketch after this list)
    • Component-wise tests to let devs focus on improving the speed of specific components
    • Very small datasets (where the overhead becomes apparent)
    • Scalability to a high number of classes in multiclass classification (>=2000 classes)
    • Disk read & parse perf using a large dataset and disabled caching
    • Very complex models w/ heavy featurization and stacked models
    • Regression problems
    • Categorical datasets
  • Learners
    • BinaryClassificationGamTrainer
    • SymSGD
    • K-means
    • (wide variety as they hit this external repo)
  • Transforms
    • TrainScore for model stacking
    • PCATransform
    • WordHashBagTransform
    • CategoricalHashTransform
    • KeyToVector
    • BinNormalizer
    • CountFeatureSelection/LearnerFeatureSelectionTransform/MutualInformationFeatureSelection
    • TreeFeaturizationTransform
    • MinMaxNormalizer/MeanVarNormalizer
    • LdaTransform
    • (others as they hit this external repo)
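
For the single-prediction latency item above, the measurement shape is different from the bulk benchmarks: per-call latency percentiles and time-to-first-prediction matter more than throughput. A generic sketch of that measurement, independent of any particular ML.NET API (the predict delegate stands in for whatever single-instance prediction call ends up being benchmarked):

    using System;
    using System.Diagnostics;

    public static class SinglePredictionLatency
    {
        // Measures time-to-first-prediction plus per-call latency percentiles
        // for any single-instance predict function.
        public static void Measure(Func<string, float> predict, string[] instances)
        {
            var sw = Stopwatch.StartNew();
            predict(instances[0]); // first call includes any lazy initialization
            Console.WriteLine($"time to first prediction: {sw.Elapsed.TotalMilliseconds:F2} ms");

            var latenciesMs = new double[instances.Length];
            for (int i = 0; i < instances.Length; i++)
            {
                sw.Restart();
                predict(instances[i]);
                latenciesMs[i] = sw.Elapsed.TotalMilliseconds;
            }

            Array.Sort(latenciesMs);
            Console.WriteLine($"p50: {latenciesMs[latenciesMs.Length / 2]:F3} ms, " +
                              $"p99: {latenciesMs[(int)(latenciesMs.Length * 0.99)]:F3} ms");
        }
    }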

The focus of these benchmarks was on speed; we can also use them to track ML metrics like accuracy across builds. This would require a much larger set of datasets, so as not to over-fit our improvements to a specific dataset.
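
If the same runs are reused for metric tracking, the only addition is capturing and parsing maml.exe's output instead of discarding it. A sketch (the exact format of the metric line printed by maml.exe is an assumption here):

    using System.Diagnostics;
    using System.Globalization;
    using System.Text.RegularExpressions;

    public static class MetricTracking
    {
        // Runs maml.exe, captures stdout, and extracts a metric so accuracy
        // can be recorded alongside the timing numbers for each build.
        public static double? GetMicroAccuracy(string args)
        {
            var psi = new ProcessStartInfo("maml.exe", args)
            {
                UseShellExecute = false,
                RedirectStandardOutput = true,
            };
            string output;
            using (var proc = Process.Start(psi))
            {
                output = proc.StandardOutput.ReadToEnd();
                proc.WaitForExit();
            }

            // Assumed output format: a line such as "Accuracy(micro-avg): 0.912345".
            var m = Regex.Match(output, @"Accuracy\(micro-avg\):\s*([\d.]+)");
            return m.Success
                ? double.Parse(m.Groups[1].Value, CultureInfo.InvariantCulture)
                : (double?)null;
        }
    }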

@ghost locked as resolved and limited conversation to collaborators Mar 29, 2022