Modify API for advanced settings (RandomizedPcaTrainer) #2390

Merged
merged 11 commits into from
Feb 13, 2019
38 changes: 38 additions & 0 deletions src/Microsoft.ML.Data/Evaluators/AnomalyDetectionEvaluator.cs
@@ -10,6 +10,7 @@
using Microsoft.ML;
using Microsoft.ML.CommandLine;
using Microsoft.ML.Data;
using Microsoft.ML.Data.Evaluators.Metrics;
using Microsoft.ML.EntryPoints;
using Microsoft.ML.Internal.Utilities;
using Microsoft.ML.Transforms;
@@ -576,6 +577,43 @@ public void Finish()
FinishOtherMetrics();
}
}

/// <summary>
/// Evaluates scored anomaly detection data.
/// </summary>
/// <param name="data">The scored data.</param>
/// <param name="label">The name of the label column in <paramref name="data"/>.</param>
/// <param name="score">The name of the score column in <paramref name="data"/>.</param>
/// <param name="predictedLabel">The name of the predicted label column in <paramref name="data"/>.</param>
/// <returns>The evaluation results for these outputs.</returns>
public AnomalyDetectionMetrics Evaluate(IDataView data, string label, string score, string predictedLabel)
Contributor

@rogancarr rogancarr Feb 11, 2019

Defaults for label, score, and predictedLabel? #Resolved

{
Host.CheckValue(data, nameof(data));
Host.CheckNonEmpty(label, nameof(label));
Host.CheckNonEmpty(score, nameof(score));
Host.CheckNonEmpty(predictedLabel, nameof(predictedLabel));

var roles = new RoleMappedData(data, opt: false,
RoleMappedSchema.ColumnRole.Label.Bind(label),
RoleMappedSchema.CreatePair(MetadataUtils.Const.ScoreValueKind.Score, score),
RoleMappedSchema.CreatePair(MetadataUtils.Const.ScoreValueKind.PredictedLabel, predictedLabel));

var resultDict = ((IEvaluator)this).Evaluate(roles);
Host.Assert(resultDict.ContainsKey(MetricKinds.OverallMetrics));
var overall = resultDict[MetricKinds.OverallMetrics];

AnomalyDetectionMetrics result;
using (var cursor = overall.GetRowCursorForAllColumns())
{
var moved = cursor.MoveNext();
Host.Assert(moved);
result = new AnomalyDetectionMetrics(Host, cursor);
moved = cursor.MoveNext();
Host.Assert(!moved);
}
return result;
}

}

[BestFriend]
@@ -0,0 +1,37 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

using System;
using Microsoft.Data.DataView;

namespace Microsoft.ML.Data.Evaluators.Metrics
{
public sealed class AnomalyDetectionMetrics
{
public double Auc { get; }
Contributor

@rogancarr rogancarr Feb 11, 2019

Summaries, Remarks, and links to relevant documentation. #Resolved

Member Author

Added basic summaries for now.

I also wanted to add the remarks from the TLC website, but the explanations there were not clear, especially for the detection rate metrics.


In reply to: 255583277 [](ancestors = 255583277)

Contributor

For these summaries, check in with @shmoradims; he's building a set of generic docs for things like AUC, F1, RMSE, etc.


In reply to: 255703503 [](ancestors = 255703503,255583277)

public double DrAtK { get; }
public double DrAtPFpr { get; }
public double DrAtNumPos { get; }
public double NumAnomalies { get; }
public double ThreshAtK { get; }
public double ThreshAtP { get; }
public double ThreshAtNumPos { get; }

internal AnomalyDetectionMetrics(IExceptionContext ectx, Row overallResult)
{
long FetchInt(string name) => RowCursorUtils.Fetch<long>(ectx, overallResult, name);
float FetchFloat(string name) => RowCursorUtils.Fetch<float>(ectx, overallResult, name);
double FetchDouble(string name) => RowCursorUtils.Fetch<double>(ectx, overallResult, name);

Auc = FetchDouble(BinaryClassifierEvaluator.Auc);
DrAtK = FetchDouble(AnomalyDetectionEvaluator.OverallMetrics.DrAtK);
DrAtPFpr = FetchDouble(AnomalyDetectionEvaluator.OverallMetrics.DrAtPFpr);
DrAtNumPos = FetchDouble(AnomalyDetectionEvaluator.OverallMetrics.DrAtNumPos);
NumAnomalies = FetchInt(AnomalyDetectionEvaluator.OverallMetrics.NumAnomalies);
ThreshAtK = FetchFloat(AnomalyDetectionEvaluator.OverallMetrics.ThreshAtK);
ThreshAtP = FetchFloat(AnomalyDetectionEvaluator.OverallMetrics.ThreshAtP);
ThreshAtNumPos = FetchFloat(AnomalyDetectionEvaluator.OverallMetrics.ThreshAtNumPos);
}
}
}
6 changes: 6 additions & 0 deletions src/Microsoft.ML.Data/MLContext.cs
@@ -40,6 +40,11 @@ public sealed class MLContext : IHostEnvironment
/// </summary>
public RankingCatalog Ranking { get; }

/// <summary>
/// Trainers and tasks specific to anomaly detection problems.
/// </summary>
public AnomalyDetectionCatalog AnomalyDetection { get; }

/// <summary>
/// Data processing operations.
/// </summary>
@@ -83,6 +88,7 @@ public MLContext(int? seed = null, int conc = 0)
Regression = new RegressionCatalog(_env);
Clustering = new ClusteringCatalog(_env);
Ranking = new RankingCatalog(_env);
AnomalyDetection = new AnomalyDetectionCatalog(_env);
Transforms = new TransformsCatalog(_env);
Model = new ModelOperationsCatalog(_env);
Data = new DataOperationsCatalog(_env);
48 changes: 48 additions & 0 deletions src/Microsoft.ML.Data/TrainCatalog.cs
@@ -8,6 +8,7 @@
using Microsoft.Data.DataView;
using Microsoft.ML.Core.Data;
using Microsoft.ML.Data;
using Microsoft.ML.Data.Evaluators.Metrics;
using Microsoft.ML.Transforms;
using Microsoft.ML.Transforms.Conversions;

@@ -564,4 +565,51 @@ public RankerMetrics Evaluate(IDataView data, string label, string groupId, stri
return eval.Evaluate(data, label, groupId, score);
}
}

/// <summary>
/// The central catalog for anomaly detection tasks and trainers.
/// </summary>
public sealed class AnomalyDetectionCatalog : TrainCatalogBase
{
/// <summary>
/// The list of trainers for anomaly detection.
/// </summary>
public AnomalyDetectionTrainers Trainers { get; }

internal AnomalyDetectionCatalog(IHostEnvironment env)
: base(env, nameof(AnomalyDetectionCatalog))
{
Trainers = new AnomalyDetectionTrainers(this);
}

public sealed class AnomalyDetectionTrainers : CatalogInstantiatorBase
{
internal AnomalyDetectionTrainers(AnomalyDetectionCatalog catalog)
: base(catalog)
{
}
}

/// <summary>
/// Evaluates scored anomaly detection data.
/// </summary>
/// <param name="data">The scored data.</param>
/// <param name="label">The name of the label column in <paramref name="data"/>.</param>
/// <param name="score">The name of the score column in <paramref name="data"/>.</param>
/// <param name="predictedLabel">The name of the predicted label column in <paramref name="data"/>.</param>
/// <returns>The evaluation results for these calibrated outputs.</returns>
Contributor

@artidoro artidoro Feb 11, 2019

calibrated [](start = 54, length = 10)

What is calibrated here? Could you explain in a few more words? #Resolved

Member Author

It was a copy-paste typo. Fixed.


In reply to: 255717912 [](ancestors = 255717912)

public AnomalyDetectionMetrics Evaluate(IDataView data, string label = DefaultColumnNames.Label, string score = DefaultColumnNames.Score,
string predictedLabel = DefaultColumnNames.PredictedLabel)
{
Host.CheckValue(data, nameof(data));
Host.CheckNonEmpty(label, nameof(label));
Host.CheckNonEmpty(score, nameof(score));
Host.CheckNonEmpty(predictedLabel, nameof(predictedLabel));

var args = new AnomalyDetectionEvaluator.Arguments();

var eval = new AnomalyDetectionEvaluator(Host, args);
return eval.Evaluate(data, label, score, predictedLabel);
}
}
}
23 changes: 22 additions & 1 deletion src/Microsoft.ML.PCA/PCACatalog.cs
@@ -3,13 +3,14 @@
// See the LICENSE file in the project root for more information.

using Microsoft.ML.Data;
using Microsoft.ML.Trainers.PCA;
using Microsoft.ML.Transforms.Projections;
using static Microsoft.ML.Trainers.PCA.RandomizedPcaTrainer;

namespace Microsoft.ML
{
public static class PcaCatalog
{

/// <summary>Initializes a new instance of <see cref="PrincipalComponentAnalysisEstimator"/>.</summary>
/// <param name="catalog">The transform's catalog.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param>
@@ -35,5 +36,25 @@ public static PrincipalComponentAnalysisEstimator ProjectToPrincipalComponents(t
/// <param name="columns">Input columns to apply PrincipalComponentAnalysis on.</param>
public static PrincipalComponentAnalysisEstimator ProjectToPrincipalComponents(this TransformsCatalog.ProjectionTransforms catalog, params PrincipalComponentAnalysisEstimator.ColumnInfo[] columns)
=> new PrincipalComponentAnalysisEstimator(CatalogUtils.GetEnvironment(catalog), columns);

public static RandomizedPcaTrainer RandomizedPca(this AnomalyDetectionCatalog.AnomalyDetectionTrainers catalog,
Contributor

@rogancarr rogancarr Feb 11, 2019

Needs XML docs with remarks and links to a sample, either here or as part of #1209. #Resolved

Member Author

Added XML docs.

Adding a sample can be part of the overall documentation effort in #1209.


In reply to: 255585449 [](ancestors = 255585449)

string featureColumn = DefaultColumnNames.Features,
string weights = null,
int rank = 20,
int oversampling = 20,
bool center = true,
int? seed = null)
{
Contracts.CheckValue(catalog, nameof(catalog));
var env = CatalogUtils.GetEnvironment(catalog);
return new RandomizedPcaTrainer(env, featureColumn, weights, rank, oversampling, center, seed);
}

public static RandomizedPcaTrainer RandomizedPca(this AnomalyDetectionCatalog.AnomalyDetectionTrainers catalog, Options options)
{
Contracts.CheckValue(catalog, nameof(catalog));
var env = CatalogUtils.GetEnvironment(catalog);
return new RandomizedPcaTrainer(env, options);
}
}
}
26 changes: 13 additions & 13 deletions src/Microsoft.ML.PCA/PcaTrainer.cs
@@ -19,7 +19,7 @@
using Microsoft.ML.Trainers.PCA;
using Microsoft.ML.Training;

[assembly: LoadableClass(RandomizedPcaTrainer.Summary, typeof(RandomizedPcaTrainer), typeof(RandomizedPcaTrainer.Arguments),
[assembly: LoadableClass(RandomizedPcaTrainer.Summary, typeof(RandomizedPcaTrainer), typeof(RandomizedPcaTrainer.Options),
new[] { typeof(SignatureAnomalyDetectorTrainer), typeof(SignatureTrainer) },
RandomizedPcaTrainer.UserNameValue,
RandomizedPcaTrainer.LoadNameValue,
@@ -49,7 +49,7 @@ public sealed class RandomizedPcaTrainer : TrainerEstimatorBase<AnomalyPredictio
internal const string Summary = "This algorithm trains an approximate PCA using Randomized SVD algorithm. "
+ "This PCA can be made into Kernel PCA by using Random Fourier Features transform.";

public class Arguments : UnsupervisedLearnerInputBaseWithWeight
public class Options : UnsupervisedLearnerInputBaseWithWeight
Member

@sfilipi sfilipi Feb 6, 2019

Options [](start = 21, length = 7)

Are the XML docs coming later? #Pending

Member Author

Yep.


In reply to: 254353429 [](ancestors = 254353429)

{
[Argument(ArgumentType.AtMostOnce, HelpText = "The number of components in the PCA", ShortName = "k", SortOrder = 50)]
[TGUI(SuggestedSweeps = "10,20,40,80")]
@@ -91,7 +91,7 @@ public class Arguments : UnsupervisedLearnerInputBaseWithWeight
/// <param name="oversampling">Oversampling parameter for randomized PCA training.</param>
/// <param name="center">If enabled, data is centered to be zero mean.</param>
/// <param name="seed">The seed for random number generation.</param>
public RandomizedPcaTrainer(IHostEnvironment env,
internal RandomizedPcaTrainer(IHostEnvironment env,
Contributor

@rogancarr rogancarr Feb 6, 2019

Shouldn't we just make the class Internal/BestFriend and keep this public? #Resolved

Member Author

Not really.

We want to expose these through MLContext, not via constructors.


In reply to: 254391018 [](ancestors = 254391018)

Contributor

@rogancarr rogancarr Feb 11, 2019

Resolved. I hadn't seen the pattern for trainable transforms where the class is public and methods are internal. #Resolved

Member Author

@abgoswam abgoswam Feb 11, 2019

I believe that, in ML.NET terms, this is a "trainer estimator" (for anomaly detection tasks).

Most other trainer estimators follow the same pattern, e.g. KMeansPlusPlusTrainer.


In reply to: 255587702 [](ancestors = 255587702)
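
The pattern discussed in this thread can be sketched in a few lines. This is a self-contained illustration, not ML.NET code: `ToyCatalog`, `ToyTrainer`, and `ToyCatalogExtensions` are hypothetical names invented for the example. The class and its nested `Options` stay public so callers can name the types, while the constructors are internal so that, from another assembly, instances can only be obtained through the catalog extension methods.

```csharp
using System;

// Hypothetical stand-in for a catalog such as AnomalyDetectionCatalog.AnomalyDetectionTrainers.
public sealed class ToyCatalog
{
}

// Public type: callers can name and hold it, but code in another assembly
// cannot construct it directly because both constructors are internal.
public sealed class ToyTrainer
{
    // Public options type, so the advanced-settings overload can accept it.
    public sealed class Options
    {
        public int Rank = 20;
        public int? Seed = null;
    }

    public int Rank { get; }
    public int Seed { get; }

    internal ToyTrainer(int rank, int? seed)
    {
        if (rank <= 0)
            throw new ArgumentOutOfRangeException(nameof(rank));
        Rank = rank;
        // Fall back to a fixed value when no seed is given (illustrative only).
        Seed = seed ?? 0;
    }

    internal ToyTrainer(Options options)
        : this(options.Rank, options.Seed)
    {
    }
}

// Extension methods are the public way to obtain a trainer.
public static class ToyCatalogExtensions
{
    // Simple overload: a handful of commonly used parameters with defaults.
    public static ToyTrainer Toy(this ToyCatalog catalog, int rank = 20, int? seed = null)
    {
        if (catalog == null)
            throw new ArgumentNullException(nameof(catalog));
        return new ToyTrainer(rank, seed);
    }

    // Advanced overload: everything is passed through an Options object.
    public static ToyTrainer Toy(this ToyCatalog catalog, ToyTrainer.Options options)
    {
        if (catalog == null)
            throw new ArgumentNullException(nameof(catalog));
        if (options == null)
            throw new ArgumentNullException(nameof(options));
        return new ToyTrainer(options);
    }
}

public static class Program
{
    public static void Main()
    {
        var catalog = new ToyCatalog();
        var simple = catalog.Toy(rank: 10);
        var advanced = catalog.Toy(new ToyTrainer.Options { Rank = 40, Seed = 7 });
        Console.WriteLine($"{simple.Rank} {simple.Seed} {advanced.Rank} {advanced.Seed}");
    }
}
```

Within a single assembly the internal constructors remain callable; the restriction only applies to consumers of the compiled library, which is why the PR also needs `InternalsVisibleTo` for its test project.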

string features,
string weights = null,
int rank = 20,
@@ -103,23 +103,23 @@ public RandomizedPcaTrainer(IHostEnvironment env,

}

internal RandomizedPcaTrainer(IHostEnvironment env, Arguments args)
:this(env, args, args.FeatureColumn, args.WeightColumn)
internal RandomizedPcaTrainer(IHostEnvironment env, Options options)
Contributor

@artidoro artidoro Feb 3, 2019

RandomizedPcaTrainer [](start = 17, length = 20)

It's strange... I noticed that renaming Arguments to Options did not modify anything in the mlContext catalog. #Resolved

Contributor

I looked it up, and I don't think there is an entry for this trainer in mlContext. Can you add it?


In reply to: 253319239 [](ancestors = 253319239)

Member Author

Yeah, I noticed a couple more components that do not have an MLContext extension.

Will add.


In reply to: 253319255 [](ancestors = 253319255,253319239)

Member Author

I added an MLContext extension for this, along with a test that exercises the Fit() and Transform() APIs.

The Evaluate() API is currently missing from anomaly detection. I will create a separate issue for that.


In reply to: 253584603 [](ancestors = 253584603,253319255,253319239)

:this(env, options, options.FeatureColumn, options.WeightColumn)
{

}

private RandomizedPcaTrainer(IHostEnvironment env, Arguments args, string featureColumn, string weightColumn,
private RandomizedPcaTrainer(IHostEnvironment env, Options options, string featureColumn, string weightColumn,
int rank = 20, int oversampling = 20, bool center = true, int? seed = null)
: base(Contracts.CheckRef(env, nameof(env)).Register(LoadNameValue), TrainerUtils.MakeR4VecFeature(featureColumn), default, TrainerUtils.MakeR4ScalarWeightColumn(weightColumn))
{
// if the args are not null, we got here from maml, and the internal ctor.
if (args != null)
if (options != null)
{
_rank = args.Rank;
_center = args.Center;
_oversampling = args.Oversampling;
_seed = args.Seed ?? Host.Rand.Next();
_rank = options.Rank;
_center = options.Center;
_oversampling = options.Oversampling;
_seed = options.Seed ?? Host.Rand.Next();
}
else
{
@@ -347,14 +347,14 @@ protected override AnomalyPredictionTransformer<PcaModelParameters> MakeTransfor
Desc = "Train an PCA Anomaly model.",
UserName = UserNameValue,
ShortName = ShortName)]
internal static CommonOutputs.AnomalyDetectionOutput TrainPcaAnomaly(IHostEnvironment env, Arguments input)
internal static CommonOutputs.AnomalyDetectionOutput TrainPcaAnomaly(IHostEnvironment env, Options input)
Contributor

@rogancarr rogancarr Feb 6, 2019

Same as above; these can be kept public and the whole class can be made internal/BestFriend. #Resolved

Member Author

The Options class above should be public, so we cannot make the entire class internal.


In reply to: 254391534 [](ancestors = 254391534)

{
Contracts.CheckValue(env, nameof(env));
var host = env.Register("TrainPCAAnomaly");
host.CheckValue(input, nameof(input));
EntryPointUtils.CheckInputArgs(host, input);

return LearnerEntryPointsUtils.Train<Arguments, CommonOutputs.AnomalyDetectionOutput>(host, input,
return LearnerEntryPointsUtils.Train<Options, CommonOutputs.AnomalyDetectionOutput>(host, input,
() => new RandomizedPcaTrainer(host, input),
getWeight: () => LearnerEntryPointsUtils.FindColumn(host, input.TrainingData.Schema, input.WeightColumn));
}
1 change: 1 addition & 0 deletions src/Microsoft.ML.PCA/Properties/AssemblyInfo.cs
@@ -5,6 +5,7 @@
using System.Runtime.CompilerServices;
using Microsoft.ML;

[assembly: InternalsVisibleTo(assemblyName: "Microsoft.ML.Tests" + PublicKey.TestValue)]
[assembly: InternalsVisibleTo(assemblyName: "Microsoft.ML.StaticPipe" + PublicKey.Value)]
[assembly: InternalsVisibleTo(assemblyName: "Microsoft.ML.Core.Tests" + PublicKey.TestValue)]

2 changes: 1 addition & 1 deletion test/BaselineOutput/Common/EntryPoints/core_ep-list.tsv
@@ -63,7 +63,7 @@ Trainers.LogisticRegressionClassifier Logistic Regression is a method in statist
Trainers.NaiveBayesClassifier Train a MultiClassNaiveBayesTrainer. Microsoft.ML.Trainers.MultiClassNaiveBayesTrainer TrainMultiClassNaiveBayesTrainer Microsoft.ML.Trainers.MultiClassNaiveBayesTrainer+Arguments Microsoft.ML.EntryPoints.CommonOutputs+MulticlassClassificationOutput
Trainers.OnlineGradientDescentRegressor Train a Online gradient descent perceptron. Microsoft.ML.Trainers.Online.OnlineGradientDescentTrainer TrainRegression Microsoft.ML.Trainers.Online.OnlineGradientDescentTrainer+Options Microsoft.ML.EntryPoints.CommonOutputs+RegressionOutput
Trainers.OrdinaryLeastSquaresRegressor Train an OLS regression model. Microsoft.ML.Trainers.HalLearners.OlsLinearRegressionTrainer TrainRegression Microsoft.ML.Trainers.HalLearners.OlsLinearRegressionTrainer+Options Microsoft.ML.EntryPoints.CommonOutputs+RegressionOutput
Trainers.PcaAnomalyDetector Train an PCA Anomaly model. Microsoft.ML.Trainers.PCA.RandomizedPcaTrainer TrainPcaAnomaly Microsoft.ML.Trainers.PCA.RandomizedPcaTrainer+Arguments Microsoft.ML.EntryPoints.CommonOutputs+AnomalyDetectionOutput
Trainers.PcaAnomalyDetector Train an PCA Anomaly model. Microsoft.ML.Trainers.PCA.RandomizedPcaTrainer TrainPcaAnomaly Microsoft.ML.Trainers.PCA.RandomizedPcaTrainer+Options Microsoft.ML.EntryPoints.CommonOutputs+AnomalyDetectionOutput
Trainers.PoissonRegressor Train an Poisson regression model. Microsoft.ML.Trainers.PoissonRegression TrainRegression Microsoft.ML.Trainers.PoissonRegression+Options Microsoft.ML.EntryPoints.CommonOutputs+RegressionOutput
Trainers.StochasticDualCoordinateAscentBinaryClassifier Train an SDCA binary model. Microsoft.ML.Trainers.Sdca TrainBinary Microsoft.ML.Trainers.SdcaBinaryTrainer+Options Microsoft.ML.EntryPoints.CommonOutputs+BinaryClassificationOutput
Trainers.StochasticDualCoordinateAscentClassifier The SDCA linear multi-class classification trainer. Microsoft.ML.Trainers.Sdca TrainMultiClass Microsoft.ML.Trainers.SdcaMultiClassTrainer+Options Microsoft.ML.EntryPoints.CommonOutputs+MulticlassClassificationOutput
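
The only change in this baseline hunk is the suffix `+Arguments` becoming `+Options` on the `Trainers.PcaAnomalyDetector` row. The `+` is how .NET reflection writes the full name of a nested type, so renaming the nested class changes the string recorded in the entry-point list. A minimal sketch, using a hypothetical `DemoTrainer` class in a hypothetical `Demo` namespace rather than the real trainer:

```csharp
using System;

namespace Demo
{
    public sealed class DemoTrainer
    {
        // Nested type, analogous to RandomizedPcaTrainer.Options.
        public sealed class Options
        {
        }
    }

    public static class Program
    {
        public static void Main()
        {
            // Type.FullName joins a nested type to its declaring type with '+'.
            Console.WriteLine(typeof(DemoTrainer.Options).FullName);
            // prints "Demo.DemoTrainer+Options"
        }
    }
}
```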
8 changes: 4 additions & 4 deletions test/BaselineOutput/SingleDebug/Rff/featurized.tsv
@@ -6,7 +6,7 @@
#@ col=RffVectorFloat:R4:9-14
#@ }
15 8:Label
5 1 1 1 2 1 3 1 0 -0.157029659 -0.555585265 0.490177631 -0.305056125 0.35670203 -0.453979075
5 4 4 5 7 10 3 2 0 -0.375955045 0.43816793 -0.5670244 0.108704455 -0.271485656 -0.5095379
3 1 1 1 2 2 3 1 0 -0.08380841 -0.571235061 0.4856296 -0.312245429 0.389987826 -0.4257262
6 8 8 1 3 4 3 7 0 -0.2813567 0.504154444 -0.266616732 -0.512102365 -0.5723418 -0.07588247
5 1 1 1 2 1 3 1 0 0.494028777 0.298778981 0.533874154 -0.219799235 0.5505202 -0.173956484
5 4 4 5 7 10 3 2 0 0.363161922 -0.4488282 0.5746647 -0.0556217246 0.276754946 0.5066952
3 1 1 1 2 2 3 1 0 0.5167663 0.257460445 0.5747438 -0.05479813 0.5746628 0.0556417964
6 8 8 1 3 4 3 7 0 -0.5738443 -0.0635295138 0.3556944 -0.454768956 -0.4006303 -0.415726721
2 changes: 1 addition & 1 deletion test/Microsoft.ML.Core.Tests/UnitTests/TestEntryPoints.cs
@@ -3416,7 +3416,7 @@ public void EntryPointPcaPredictorSummary()
InputFile = inputFile,
}).Data;

var pcaInput = new RandomizedPcaTrainer.Arguments
var pcaInput = new RandomizedPcaTrainer.Options
{
TrainingData = dataView,
};
67 changes: 67 additions & 0 deletions test/Microsoft.ML.Tests/AnomalyDetectionTests.cs
@@ -0,0 +1,67 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using System.Linq;
using Microsoft.Data.DataView;
using Microsoft.ML.Data;
using Microsoft.ML.ImageAnalytics;
using Microsoft.ML.Model;
using Microsoft.ML.RunTests;
using Xunit;
using Xunit.Abstractions;

namespace Microsoft.ML.Tests
{
public class AnomalyDetectionTests : TestDataPipeBase
{
public AnomalyDetectionTests(ITestOutputHelper output) : base(output)
{
}

/// <summary>
/// RandomizedPcaTrainer test
/// </summary>
[Fact]
public void RandomizedPcaTrainer()
Contributor

@rogancarr rogancarr Feb 13, 2019

RandomizedPcaTrainer [](start = 20, length = 20)

RandomizedPcaTrainerBaselineTest #Resolved

{
var mlContext = new MLContext(seed: 1, conc: 1);
string featureColumn = "NumericFeatures";

var reader = new TextLoader(Env, new TextLoader.Arguments()
{
HasHeader = true,
Separator = "\t",
Columns = new[]
{
new TextLoader.Column("Label", DataKind.R4, 0),
new TextLoader.Column(featureColumn, DataKind.R4, new [] { new TextLoader.Range(1, 784) })
}
});

var trainData = reader.Read(GetDataPath(TestDatasets.mnistOneClass.trainFilename));
var testData = reader.Read(GetDataPath(TestDatasets.mnistOneClass.testFilename));

var pipeline = ML.AnomalyDetection.Trainers.RandomizedPca(featureColumn);

var transformer = pipeline.Fit(trainData);
var transformedData = transformer.Transform(testData);

// Evaluate
var metrics = ML.AnomalyDetection.Evaluate(transformedData);

Assert.Equal(0.99, metrics.Auc, 2);
Contributor

@rogancarr rogancarr Feb 11, 2019

How do we know that these numbers are correct? #Resolved

Member Author

@abgoswam abgoswam Feb 11, 2019

I tried out the same dataset in TLC with the same trainer; the numbers are close, though not exact.

In general, in this PR I am only exposing the trainer and evaluators as they currently exist in the codebase. The PR does not make any algorithmic changes or changes to the evaluation metrics themselves.


In reply to: 255589211 [](ancestors = 255589211)

Contributor

@rogancarr rogancarr Feb 12, 2019

I guess the big question is, what do we want to test here?

  • Do we want a baseline test to make sure future versions don't change the numerics? (e.g. AUC is always 0.99 with 2 decimals of precision)
  • Do we want a functionality test, like we have in Functional.Tests? (e.g. check that 0 <= AUC <= 1)
  • Do we want correctness tests? (e.g. We know what the answer should be and we want to make sure this matches it.)

If we do want a baseline test, can we mark it as such, and check to further decimal places?

As an aside, are there correctness tests on these metrics that we can migrate from the internal repo? If so, can you file it as an issue to be done later?


In reply to: 255680814 [](ancestors = 255680814,255589211)

Member Author

Good point.

Per the classification above, these look like "baseline" tests, so I have increased the precision to 5 decimal places.

As for migrating tests from the internal repo, it seems we have already ported over the PcaAnomalyTest. That should suffice for correctness.


In reply to: 255796608 [](ancestors = 255796608,255680814,255589211)
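
The distinction drawn in this thread can be made concrete. The sketch below assumes that xUnit's `Assert.Equal(expected, actual, precision)` rounds both values to `precision` decimal places before comparing; `EqualAtPrecision` is a hypothetical helper that mimics that behaviour so the example runs without a test framework, and the AUC values are made up for illustration rather than taken from this PR.

```csharp
using System;

public static class Program
{
    // Hypothetical helper: compare two doubles after rounding both to
    // the given number of decimal places.
    public static bool EqualAtPrecision(double expected, double actual, int precision)
    {
        return Math.Round(expected, precision) == Math.Round(actual, precision);
    }

    public static void Main()
    {
        double auc = 0.99312;     // made-up metric value, not a real result
        double drifted = 0.98741; // made-up value after a hypothetical numeric change

        // Functional check: the metric lies in its valid range.
        Console.WriteLine(auc >= 0.0 && auc <= 1.0);

        // Loose baseline: at 2 decimal places both values round to 0.99,
        // so the drift goes unnoticed.
        Console.WriteLine(EqualAtPrecision(0.99, auc, 2));
        Console.WriteLine(EqualAtPrecision(0.99, drifted, 2));

        // Tight baseline: at 5 decimal places the drifted value no longer matches.
        Console.WriteLine(EqualAtPrecision(0.99312, auc, 5));
        Console.WriteLine(EqualAtPrecision(0.99312, drifted, 5));
    }
}
```

A baseline test pinned at 5 decimal places is therefore much more sensitive to numeric changes than one pinned at 2, which is the reason for the request to tighten the assertions.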

Assert.Equal(0.90, metrics.DrAtK, 2);
Contributor

@rogancarr rogancarr Feb 13, 2019

Assert.Equal(0.90, metrics.DrAtK, 2); [](start = 11, length = 38)

This one @ 5 places too, please :) #Resolved

Assert.Equal(0.90, metrics.DrAtPFpr, 2);
Assert.Equal(0.90, metrics.DrAtNumPos, 2);
Assert.Equal(10, metrics.NumAnomalies);
Assert.Equal(0.57, metrics.ThreshAtK, 2);
Assert.Equal(0.63, metrics.ThreshAtP, 2);
Assert.Equal(0.65, metrics.ThreshAtNumPos, 2);
}
}
}