
More Normalizer Scrubbing #2888


Merged
merged 13 commits on Mar 14, 2019
6 changes: 3 additions & 3 deletions docs/code/MlNetCookBook.md
@@ -595,7 +595,7 @@ As a general rule, *if you use a parametric learner, you need to make sure your

ML.NET offers several built-in scaling algorithms, or 'normalizers':
- MinMax normalizer: for each feature, we learn the minimum and maximum value of it, and then linearly rescale it so that the values fit between -1 and 1.
-- MeanVar normalizer: for each feature, compute the mean and variance, and then linearly rescale it to zero-mean, unit-variance.
+- MeanVariance normalizer: for each feature, compute the mean and variance, and then linearly rescale it to zero-mean, unit-variance.
@rogancarr (Contributor) commented on Mar 12, 2019:

> MeanVariance

In the field, we usually use the term Standardize to reflect this normalization technique. This is very "statistics-y", but it does seem to be standard. How would everyone feel about changing "MeanVariance" to "Standardize", or at least offering a "Standardize" alias? #Resolved

@wschin (Member, Author) commented on Mar 12, 2019:

I don't feel MVN is super bad because it's already an operator in neural networks (e.g., ONNX, Caffe, CoreML).
Just for reference:
https://apple.github.io/coremltools/coremlspecification/sections/NeuralNetwork.html#meanvariancenormalizelayerparams
https://github.com/onnx/onnx/blob/master/docs/Operators.md#MeanVarianceNormalization


In reply to: 264826182
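
For context on what that operator computes: both linked specs define MeanVarianceNormalization as, roughly,

$$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{E}\big[(x - \mathrm{E}[x])^2\big]}}$$

i.e., subtract the mean and divide by the standard deviation. Treat this as a summary of the linked definitions rather than an exact spec; frameworks differ on details such as which axes are reduced over and whether an epsilon is added for numerical stability.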

Contributor commented:

That's a good point. Plus it's more precise than Standardize. Let's keep it MVN. We can always add an alias for Standardize if people are up in arms.


In reply to: 264891367

- CDF normalizer: for each feature, compute the mean and variance, and then replace each value `x` with `Cdf(x)`, where `Cdf` is the cumulative distribution function of the normal distribution with that mean and variance.
- Binning normalizer: discretize the feature value into `N` 'buckets', and then replace each value with the index of the bucket, divided by `N-1`.
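
To make the list concrete, here is a small standalone C# sketch of the arithmetic behind three of these normalizers. This is an illustration only, not ML.NET's implementation: in ML.NET the statistics are learned during `Fit` and applied during `Transform`, and details such as MinMax's `fixZero` handling and the binning strategy differ from this simplified version.

```csharp
using System;
using System.Linq;

static class NormalizerSketch
{
    public static void Main()
    {
        double[] x = { 1.0, 2.0, 4.0, 8.0 };
        double lo = x.Min(), hi = x.Max();

        // MinMax (one simple variant): map the observed [min, max] linearly onto [-1, 1].
        // ML.NET's MinMax also offers fixZero, which preserves zero by scaling by max |value|.
        var minMax = x.Select(v => 2 * (v - lo) / (hi - lo) - 1).ToArray();

        // MeanVariance: subtract the learned mean, divide by the learned standard deviation.
        double mean = x.Average();
        double variance = x.Average(v => (v - mean) * (v - mean));
        var meanVar = x.Select(v => (v - mean) / Math.Sqrt(variance)).ToArray();

        // Binning: discretize into N buckets, then emit bucketIndex / (N - 1).
        // Equal-width buckets here purely for illustration; ML.NET learns bin boundaries from data.
        int n = 4;
        var binned = x.Select(v =>
        {
            int idx = (int)((v - lo) / (hi - lo) * n);
            if (idx == n) idx = n - 1; // the maximum value falls into the last bucket
            return (double)idx / (n - 1);
        }).ToArray();

        Console.WriteLine($"minMax:  {string.Join(", ", minMax)}");
        Console.WriteLine($"meanVar: {string.Join(", ", meanVar)}");
        Console.WriteLine($"binned:  {string.Join(", ", binned)}");
    }
}
```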

@@ -630,8 +630,8 @@ var trainData = mlContext.Data.LoadFromTextFile<IrisInputAllFeatures>(dataPath,
var pipeline =
mlContext.Transforms.Normalize(
new NormalizingEstimator.MinMaxColumnOptions("MinMaxNormalized", "Features", fixZero: true),
-new NormalizingEstimator.MeanVarColumnOptions("MeanVarNormalized", "Features", fixZero: true),
-new NormalizingEstimator.BinningColumnOptions("BinNormalized", "Features", numBins: 256));
+new NormalizingEstimator.MeanVarianceColumnOptions("MeanVarNormalized", "Features", fixZero: true),
+new NormalizingEstimator.BinningColumnOptions("BinNormalized", "Features", maximumBinCount: 256));

// Let's train our pipeline of normalizers, and then apply it to the same data.
var normalizedData = pipeline.Fit(trainData).Transform(trainData);
@@ -55,7 +55,7 @@ public static void Example()
//0.165 0.117 -0.547 0.014

// A pipeline to project Features column into L-p normalized vector.
-var lpNormalizePipeline = ml.Transforms.LpNormalize(nameof(SamplesUtils.DatasetUtils.SampleVectorOfNumbersData.Features), normKind: Transforms.LpNormalizingEstimatorBase.NormFunction.L1);
+var lpNormalizePipeline = ml.Transforms.LpNormalize(nameof(SamplesUtils.DatasetUtils.SampleVectorOfNumbersData.Features), norm: Transforms.LpNormalizingEstimatorBase.NormFunction.L1);
@rogancarr (Contributor) commented on Mar 12, 2019:

> norm

We should be using long-form names here. #Resolved

@wschin (Member, Author) commented:

Ok. LpNormalize will be replaced by LpNormNormalize.


In reply to: 264826528

// The transformed (projected) data.
transformedData = lpNormalizePipeline.Fit(trainData).Transform(trainData);
// Getting the data of the newly created column, so we can preview it.
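
For readers unfamiliar with L-p normalization: it rescales each feature vector by its L-p norm, so with `NormFunction.L1` (as in the sample above) the absolute values of each output vector sum to 1. A minimal standalone sketch of that arithmetic (illustration only, not ML.NET's implementation):

```csharp
using System;
using System.Linq;

static class LpNormSketch
{
    public static void Main()
    {
        double[] features = { 0.5, -1.5, 2.0 };

        // L1 norm: the sum of absolute values (here 4.0).
        double l1 = features.Sum(v => Math.Abs(v));

        // Dividing every component by the norm makes the absolute values sum to 1.
        var normalized = features.Select(v => v / l1).ToArray();

        Console.WriteLine(string.Join(", ", normalized)); // 0.125, -0.375, 0.5
    }
}
```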
37 changes: 18 additions & 19 deletions src/Microsoft.ML.Data/Transforms/NormalizeColumn.cs
@@ -11,7 +11,6 @@
using Microsoft.ML;
using Microsoft.ML.CommandLine;
using Microsoft.ML.Data;
-using Microsoft.ML.EntryPoints;
using Microsoft.ML.Internal.Internallearn;
using Microsoft.ML.Model.OnnxConverter;
using Microsoft.ML.Model.Pfa;
@@ -51,7 +50,7 @@ internal sealed partial class NormalizeTransform
public abstract class ColumnBase : OneToOneColumn
{
[Argument(ArgumentType.AtMostOnce, HelpText = "Max number of examples used to train the normalizer", ShortName = "maxtrain")]
-public long? MaxTrainingExamples;
+public long? MaximumExampleCount;

private protected ColumnBase()
{
@@ -60,7 +59,7 @@ private protected ColumnBase()
private protected override bool TryUnparseCore(StringBuilder sb)
{
Contracts.AssertValue(sb);
-if (MaxTrainingExamples != null)
+if (MaximumExampleCount != null)
return false;
return base.TryUnparseCore(sb);
}
@@ -183,7 +182,7 @@ public sealed class MeanVarArguments : AffineArgumentsBase
public abstract class ArgumentsBase : TransformInputBase
{
[Argument(ArgumentType.AtMostOnce, HelpText = "Max number of examples used to train the normalizer", ShortName = "maxtrain")]
-public long MaxTrainingExamples = 1000000000;
+public long MaximumExampleCount = 1000000000;

public abstract OneToOneColumn[] GetColumns();

@@ -291,7 +290,7 @@ internal static IDataTransform Create(IHostEnvironment env, MinMaxArguments args
.Select(col => new NormalizingEstimator.MinMaxColumnOptions(
col.Name,
col.Source ?? col.Name,
-col.MaxTrainingExamples ?? args.MaxTrainingExamples,
+col.MaximumExampleCount ?? args.MaximumExampleCount,
col.FixZero ?? args.FixZero))
.ToArray();
var normalizer = new NormalizingEstimator(env, columns);
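
One pattern worth noting in these `Create` methods: each per-column option falls back to the transform-wide default through C#'s null-coalescing operator, which is why the column fields are nullable (`long?`) while the `ArgumentsBase` field carries a concrete default. A minimal sketch of the idiom, with hypothetical stand-in types rather than the actual ML.NET classes:

```csharp
// Hypothetical stand-ins for the ML.NET types above, to show the idiom only.
class ColumnOptions { public long? MaximumExampleCount; }            // per-column value, optional
class Arguments { public long MaximumExampleCount = 1_000_000_000; } // transform-wide default

static class FallbackDemo
{
    // The column-level setting wins when present; otherwise the global default applies.
    static long Resolve(ColumnOptions col, Arguments args)
        => col.MaximumExampleCount ?? args.MaximumExampleCount;
}
```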
@@ -306,10 +305,10 @@ internal static IDataTransform Create(IHostEnvironment env, MeanVarArguments arg
env.CheckValue(args.Columns, nameof(args.Columns));

var columns = args.Columns
-.Select(col => new NormalizingEstimator.MeanVarColumnOptions(
+.Select(col => new NormalizingEstimator.MeanVarianceColumnOptions(
col.Name,
col.Source ?? col.Name,
-col.MaxTrainingExamples ?? args.MaxTrainingExamples,
+col.MaximumExampleCount ?? args.MaximumExampleCount,
col.FixZero ?? args.FixZero))
.ToArray();
var normalizer = new NormalizingEstimator(env, columns);
@@ -326,10 +325,10 @@ internal static IDataTransform Create(IHostEnvironment env, LogMeanVarArguments
env.CheckValue(args.Columns, nameof(args.Columns));

var columns = args.Columns
-.Select(col => new NormalizingEstimator.LogMeanVarColumnOptions(
+.Select(col => new NormalizingEstimator.LogMeanVarianceColumnOptions(
col.Name,
col.Source ?? col.Name,
-col.MaxTrainingExamples ?? args.MaxTrainingExamples,
+col.MaximumExampleCount ?? args.MaximumExampleCount,
args.UseCdf))
.ToArray();
var normalizer = new NormalizingEstimator(env, columns);
@@ -349,7 +348,7 @@ internal static IDataTransform Create(IHostEnvironment env, BinArguments args, I
.Select(col => new NormalizingEstimator.BinningColumnOptions(
col.Name,
col.Source ?? col.Name,
-col.MaxTrainingExamples ?? args.MaxTrainingExamples,
+col.MaximumExampleCount ?? args.MaximumExampleCount,
col.FixZero ?? args.FixZero,
col.NumBins ?? args.NumBins))
.ToArray();
@@ -926,7 +925,7 @@ public static IColumnFunctionBuilder CreateBuilder(MinMaxArguments args, IHost h
return CreateBuilder(new NormalizingEstimator.MinMaxColumnOptions(
args.Columns[icol].Name,
args.Columns[icol].Source ?? args.Columns[icol].Name,
-args.Columns[icol].MaxTrainingExamples ?? args.MaxTrainingExamples,
+args.Columns[icol].MaximumExampleCount ?? args.MaximumExampleCount,
args.Columns[icol].FixZero ?? args.FixZero), host, srcIndex, srcType, cursor);
}

@@ -959,15 +958,15 @@ public static IColumnFunctionBuilder CreateBuilder(MeanVarArguments args, IHost
Contracts.AssertValue(host);
host.AssertValue(args);

-return CreateBuilder(new NormalizingEstimator.MeanVarColumnOptions(
+return CreateBuilder(new NormalizingEstimator.MeanVarianceColumnOptions(
args.Columns[icol].Name,
args.Columns[icol].Source ?? args.Columns[icol].Name,
-args.Columns[icol].MaxTrainingExamples ?? args.MaxTrainingExamples,
+args.Columns[icol].MaximumExampleCount ?? args.MaximumExampleCount,
args.Columns[icol].FixZero ?? args.FixZero,
args.UseCdf), host, srcIndex, srcType, cursor);
}

-public static IColumnFunctionBuilder CreateBuilder(NormalizingEstimator.MeanVarColumnOptions column, IHost host,
+public static IColumnFunctionBuilder CreateBuilder(NormalizingEstimator.MeanVarianceColumnOptions column, IHost host,
int srcIndex, DataViewType srcType, DataViewRowCursor cursor)
{
Contracts.AssertValue(host);
@@ -999,14 +998,14 @@ public static IColumnFunctionBuilder CreateBuilder(LogMeanVarArguments args, IHo
Contracts.AssertValue(host);
host.AssertValue(args);

-return CreateBuilder(new NormalizingEstimator.LogMeanVarColumnOptions(
+return CreateBuilder(new NormalizingEstimator.LogMeanVarianceColumnOptions(
args.Columns[icol].Name,
args.Columns[icol].Source ?? args.Columns[icol].Name,
-args.Columns[icol].MaxTrainingExamples ?? args.MaxTrainingExamples,
+args.Columns[icol].MaximumExampleCount ?? args.MaximumExampleCount,
args.UseCdf), host, srcIndex, srcType, cursor);
}

-public static IColumnFunctionBuilder CreateBuilder(NormalizingEstimator.LogMeanVarColumnOptions column, IHost host,
+public static IColumnFunctionBuilder CreateBuilder(NormalizingEstimator.LogMeanVarianceColumnOptions column, IHost host,
int srcIndex, DataViewType srcType, DataViewRowCursor cursor)
{
Contracts.AssertValue(host);
@@ -1041,7 +1040,7 @@ public static IColumnFunctionBuilder CreateBuilder(BinArguments args, IHost host
return CreateBuilder(new NormalizingEstimator.BinningColumnOptions(
args.Columns[icol].Name,
args.Columns[icol].Source ?? args.Columns[icol].Name,
-args.Columns[icol].MaxTrainingExamples ?? args.MaxTrainingExamples,
+args.Columns[icol].MaximumExampleCount ?? args.MaximumExampleCount,
args.Columns[icol].FixZero ?? args.FixZero,
args.Columns[icol].NumBins ?? args.NumBins), host, srcIndex, srcType, cursor);
}
@@ -1091,7 +1090,7 @@ public static IColumnFunctionBuilder CreateBuilder(SupervisedBinArguments args,
args.Columns[icol].Name,
args.Columns[icol].Source ?? args.Columns[icol].Name,
args.LabelColumn ?? DefaultColumnNames.Label,
-args.Columns[icol].MaxTrainingExamples ?? args.MaxTrainingExamples,
+args.Columns[icol].MaximumExampleCount ?? args.MaximumExampleCount,
args.Columns[icol].FixZero ?? args.FixZero,
args.Columns[icol].NumBins ?? args.NumBins,
args.MinBinSize),