Conversion of DropSlots, MutualInformationFeatureSelection, and CountFeatureSelection into estimator and transformers #1683

Merged
merged 49 commits into from
Nov 27, 2018
49 commits
e129445
beginning to work on dropslotstransform
artidoro Nov 8, 2018
4968169
started working on dropslotstransform
artidoro Nov 8, 2018
ce6e733
work on dropslots
artidoro Nov 8, 2018
4b0d716
workign on dropslots
artidoro Nov 8, 2018
733a8d2
working dropslots
artidoro Nov 9, 2018
5cd5211
workign on dropslots
artidoro Nov 14, 2018
75df87e
tried to change argument of mutualinformationfeatureselection buyt it…
artidoro Nov 14, 2018
d9975bc
reverted arguemnts back
artidoro Nov 14, 2018
e2ffe99
started working on tests
artidoro Nov 15, 2018
50076cf
Merge branch 'master' of https://github.com/dotnet/machinelearning in…
artidoro Nov 15, 2018
827700d
workign on nop
artidoro Nov 15, 2018
49e6f83
merge
artidoro Nov 15, 2018
b05110b
dropslots starting to get some tests working
artidoro Nov 16, 2018
7af2057
merge
artidoro Nov 16, 2018
a5a91ba
completing merge
artidoro Nov 16, 2018
a98c8a2
work on mutualinformation
artidoro Nov 18, 2018
8ff7e23
first non columninfo version of mutualinformation
artidoro Nov 19, 2018
15e2eb2
commenting
artidoro Nov 19, 2018
336732e
commenting
artidoro Nov 19, 2018
c7dccec
Merge branch 'master' of https://github.com/dotnet/machinelearning in…
artidoro Nov 19, 2018
e5cc371
merging
artidoro Nov 19, 2018
209492c
tests working besides the command line tests
artidoro Nov 20, 2018
3e9217e
Merge branch 'master' of https://github.com/dotnet/machinelearning in…
artidoro Nov 20, 2018
7156f6a
merge
artidoro Nov 20, 2018
098bcf9
fixed some tests
artidoro Nov 20, 2018
e40f42f
renaming
artidoro Nov 20, 2018
2539fa0
fixed more tests
artidoro Nov 20, 2018
0d4605a
more tests
artidoro Nov 20, 2018
c3a5149
fixed all tests
artidoro Nov 20, 2018
df9ef2b
review comments
artidoro Nov 20, 2018
7226216
adding samples for featureselection
artidoro Nov 21, 2018
f0bc2e6
remaning file
artidoro Nov 21, 2018
f093fae
review comments
artidoro Nov 27, 2018
8e9c307
Merge branch 'master' of https://github.com/dotnet/machinelearning in…
artidoro Nov 27, 2018
4675885
merge
artidoro Nov 27, 2018
05ea809
review comments
artidoro Nov 27, 2018
fbb7475
changed mlextension names
artidoro Nov 27, 2018
0f2e815
review comment
artidoro Nov 27, 2018
3bc1fdd
merge conflict
artidoro Nov 27, 2018
318a1ae
Merge branch 'master' of https://github.com/dotnet/machinelearning in…
artidoro Nov 27, 2018
8d19b36
rename of file
artidoro Nov 27, 2018
695cdc9
csharpapi regenerated
artidoro Nov 27, 2018
eea19f4
merge
artidoro Nov 27, 2018
2fbfbba
merge
artidoro Nov 27, 2018
fb6120d
Merge branch 'master' of https://github.com/dotnet/machinelearning in…
artidoro Nov 27, 2018
7e5ec2d
regenerate csharpapi
artidoro Nov 27, 2018
2669fa5
use master libmf
artidoro Nov 27, 2018
8000682
Merge branch 'master' into dropslotspr
artidoro Nov 27, 2018
3fb399b
libmf
artidoro Nov 27, 2018
2 changes: 1 addition & 1 deletion docs/code/MlNetCookBook.md
@@ -1055,7 +1055,7 @@ var dynamicPipeline =
.Append(mlContext.Transforms.Categorical.OneHotEncoding("CategoricalFeatures", "CategoricalBag", CategoricalTransform.OutputKind.Bag))
// One-hot encode the workclass column, then drop all the categories that have fewer than 10 instances in the train set.
.Append(mlContext.Transforms.Categorical.OneHotEncoding("Workclass", "WorkclassOneHot"))
- .Append(new CountFeatureSelector(mlContext, "WorkclassOneHot", "WorkclassOneHotTrimmed", count: 10));
+ .Append(mlContext.Transforms.FeatureSelection.CountFeatureSelectingEstimator("WorkclassOneHot", "WorkclassOneHotTrimmed", count: 10));

// Let's train our pipeline, and then apply it to the same data.
var transformedData = dynamicPipeline.Fit(data).Transform(data);
123 changes: 123 additions & 0 deletions docs/samples/Microsoft.ML.Samples/Dynamic/FeatureSelectionTransform.cs
@@ -0,0 +1,123 @@
using Microsoft.ML.Data;
using Microsoft.ML.Runtime.Data;
using System;
using System.Collections.Generic;

namespace Microsoft.ML.Samples.Dynamic
{
public class FeatureSelectionTransformExample
{
public static void FeatureSelectionTransform()
{
// Downloading a classification dataset from github.com/dotnet/machinelearning.
// It will be stored in the same path as the executable
string dataFilePath = SamplesUtils.DatasetUtils.DownloadBreastCancerDataset();

// Data Preview
// 1. Label 0=benign, 1=malignant
// 2. Clump Thickness 1 - 10
// 3. Uniformity of Cell Size 1 - 10
// 4. Uniformity of Cell Shape 1 - 10
// 5. Marginal Adhesion 1 - 10
// 6. Single Epithelial Cell Size 1 - 10
// 7. Bare Nuclei 1 - 10
// 8. Bland Chromatin 1 - 10
// 9. Normal Nucleoli 1 - 10
// 10. Mitoses 1 - 10

// Create a new ML context, for ML.NET operations. It can be used for exception tracking and logging,
// as well as the source of randomness.
var ml = new MLContext();

// First, we define the reader: specify the data columns and where to find them in the text file. Notice that we combine entries from
// all the feature columns into entries of a vector of a single column named "Features".
var reader = ml.Data.TextReader(new TextLoader.Arguments()
{
Separator = "tab",
HasHeader = true,
Column = new[]
{
new TextLoader.Column("Label", DataKind.BL, 0),
new TextLoader.Column("Features", DataKind.Num, new [] { new TextLoader.Range(1, 9) })
}
});

// Then, we use the reader to read the data as an IDataView.
var data = reader.Read(dataFilePath);

// Second, we define the transformations that we apply on the data. Remember that an Estimator does not transform data
// directly, but it needs to be trained on data using .Fit(), and it will output a Transformer, which can transform data.

// In this example we define a CountFeatureSelectingEstimator, that selects slots in a feature vector that have more non-default
// values than the specified count. This transformation can be used to remove slots with too many missing values.
var countSelectEst = ml.Transforms.FeatureSelection.SelectFeaturesBasedOnCount(
inputColumn: "Features", outputColumn: "FeaturesCountSelect", count: 695);

// We also define a MutualInformationFeatureSelectingEstimator that selects the top k slots in a feature
// vector based on highest mutual information between that slot and a specified label. Notice that it is possible to
// specify the parameter `numBins`, which controls the number of bins used in the approximation of the mutual information
// between features and label.
var mutualInfoEst = ml.Transforms.FeatureSelection.SelectFeaturesBasedOnMutualInformation(
inputColumn: "FeaturesCountSelect", outputColumn: "FeaturesMISelect", labelColumn: "Label", slotsInOutput: 5);

// Now, we can put the previous two transformations together in a pipeline.
var pipeline = countSelectEst.Append(mutualInfoEst);

// The pipeline can then be trained, using .Fit(), and the resulting transformer can be used to transform data.
var transformedData = pipeline.Fit(data).Transform(data);

// Small helper to print the data inside a column, in the console. Only prints the first 10 rows.
Action<string, IEnumerable<VBuffer<float>>> printHelper = (columnName, column) =>
{
Console.WriteLine($"{columnName} column obtained post-transformation.");
int count = 0;
foreach (var row in column)
{
foreach (var value in row.GetValues())
Console.Write($"{value}\t");
Console.WriteLine("");
count++;
if (count >= 10)
break;
}

Console.WriteLine("===================================================");
};

// Print the data that results from the transformations.
var countSelectColumn = transformedData.GetColumn<VBuffer<float>>(ml, "FeaturesCountSelect");
var MISelectColumn = transformedData.GetColumn<VBuffer<float>>(ml, "FeaturesMISelect");
printHelper("FeaturesCountSelect", countSelectColumn);
printHelper("FeaturesMISelect", MISelectColumn);

// Below is the output of this code. We see that some slots have been dropped by the first transformation.
// Among the remaining slots, the second transformation only preserves the top 5 slots based on mutual information
// with the label column.

// FeaturesCountSelect column obtained post-transformation.
// 5 4 4 5 7 3 2 1
// 3 1 1 1 2 3 1 1
// 6 8 8 1 3 3 7 1
// 4 1 1 3 2 3 1 1
// 8 10 10 8 7 9 7 1
// 1 1 1 1 2 3 1 1
// 2 1 2 1 2 3 1 1
// 2 1 1 1 2 1 1 5
// 4 2 1 1 2 2 1 1
// 1 1 1 1 1 3 1 1
// ===================================================
// FeaturesMISelect column obtained post-transformation.
// 4 4 7 3 2
// 1 1 2 3 1
// 8 8 3 3 7
// 1 1 2 3 1
// 10 10 7 9 7
// 1 1 2 3 1
// 1 2 2 3 1
// 1 1 2 1 1
// 2 1 2 2 1
// 1 1 1 3 1
// ===================================================
}
}
}
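The count-based selection described in the sample comments can be illustrated without ML.NET. The following is a minimal sketch, not the estimator's implementation: `CountSelectionSketch`, `SelectSlots`, and `Apply` are hypothetical names, the strict "more than `count`" comparison follows the wording of the sample comment (the real estimator may treat the boundary differently), and treating both `0` and `NaN` as "default" is an assumption made here so that missing values do not count.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of ML.NET: plain-C# illustration of count-based slot selection.
public static class CountSelectionSketch
{
    // Returns the indices of slots whose number of non-default values across
    // all rows is strictly greater than `count`. Assumes all rows have the same length.
    public static int[] SelectSlots(IReadOnlyList<float[]> rows, int count)
    {
        if (rows.Count == 0)
            return Array.Empty<int>();

        int slotCount = rows[0].Length;
        var nonDefault = new int[slotCount];
        foreach (var row in rows)
        {
            for (int i = 0; i < slotCount; i++)
            {
                // Assumption: 0 and NaN are both treated as "default" for float slots.
                if (row[i] != 0f && !float.IsNaN(row[i]))
                    nonDefault[i]++;
            }
        }

        return Enumerable.Range(0, slotCount)
            .Where(i => nonDefault[i] > count)
            .ToArray();
    }

    // Projects every row onto the kept slots, preserving slot order.
    public static float[][] Apply(IReadOnlyList<float[]> rows, int[] keptSlots)
    {
        return rows.Select(row => keptSlots.Select(i => row[i]).ToArray()).ToArray();
    }
}
```

As in the sample, the selection is computed once on the training rows (the "fit" step) and the resulting slot indices are then applied to any data (the "transform" step).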
@@ -2,6 +2,7 @@
using Microsoft.ML.StaticPipe;
using Microsoft.ML.Transforms;
using Microsoft.ML.Transforms.Categorical;
+ using Microsoft.ML.Transforms.FeatureSelection;
using System;

namespace Microsoft.ML.Samples.Static
122 changes: 122 additions & 0 deletions docs/samples/Microsoft.ML.Samples/Static/FeatureSelectionTransform.cs
@@ -0,0 +1,122 @@
using Microsoft.ML.Data;
using Microsoft.ML.Runtime.Data;
using Microsoft.ML.StaticPipe;
using System;
using System.Collections.Generic;

namespace Microsoft.ML.Samples.Dynamic
{
public class FeatureSelectionTransformStaticExample
{
public static void FeatureSelectionTransform()
Member

@abgoswam abgoswam Nov 21, 2018

FeatureSelectionTransform [](start = 27, length = 25)

do we want to refer to this example in the FeatureSelectionCatalog.cs file ?
#Resolved

{
// Downloading a classification dataset from github.com/dotnet/machinelearning.
// It will be stored in the same path as the executable
string dataFilePath = SamplesUtils.DatasetUtils.DownloadBreastCancerDataset();

// Data Preview
// 1. Label 0=benign, 1=malignant
// 2. Clump Thickness 1 - 10
// 3. Uniformity of Cell Size 1 - 10
// 4. Uniformity of Cell Shape 1 - 10
// 5. Marginal Adhesion 1 - 10
// 6. Single Epithelial Cell Size 1 - 10
// 7. Bare Nuclei 1 - 10
// 8. Bland Chromatin 1 - 10
// 9. Normal Nucleoli 1 - 10
// 10. Mitoses 1 - 10

// Create a new ML context, for ML.NET operations. It can be used for exception tracking and logging,
// as well as the source of randomness.
var ml = new MLContext();

// First, we define the reader: specify the data columns and where to find them in the text file. Notice that we combine entries from
// all the feature columns into entries of a vector of a single column named "Features".
var reader = TextLoader.CreateReader(ml, c => (
Label: c.LoadBool(0),
Features: c.LoadFloat(1, 9)
),
separator: '\t', hasHeader: true);

// Then, we use the reader to read the data as an IDataView.
var data = reader.Read(dataFilePath);

// Second, we define the transformations that we apply on the data. Remember that an Estimator does not transform data
// directly, but it needs to be trained on data using .Fit(), and it will output a Transformer, which can transform data.

// In this example we define a CountFeatureSelectingEstimator, that selects slots in a feature vector that have more non-default
// values than the specified count. This transformation can be used to remove slots with too many missing values.
// We also define a MutualInformationFeatureSelectingEstimator that selects the top k slots in a feature
// vector based on highest mutual information between that slot and a specified label. Notice that it is possible to
// specify the parameter `numBins`, which controls the number of bins used in the approximation of the mutual information
// between features and label.
var pipeline = reader.MakeNewEstimator()
.Append(r =>(
FeaturesCountSelect: r.Features.SelectFeaturesBasedOnCount(count: 695),
Contributor Author

@artidoro artidoro Nov 21, 2018

SelectFeaturesBasedOnCount [](start = 52, length = 26)

Is this the correct way to show the static api? Or should I use MLContext to find the estimator? #Resolved

Label: r.Label
))
.Append(r => (
FeaturesCountSelect: r.FeaturesCountSelect,
FeaturesMISelect: r.FeaturesCountSelect.SelectFeaturesBasedOnMutualInformation(r.Label, slotsInOutput: 5),
Label: r.Label
));


// The pipeline can then be trained, using .Fit(), and the resulting transformer can be used to transform data.
var transformedData = pipeline.Fit(data).Transform(data);

// Small helper to print the data inside a column, in the console. Only prints the first 10 rows.
Action<string, IEnumerable<VBuffer<float>>> printHelper = (columnName, column) =>
{
Console.WriteLine($"{columnName} column obtained post-transformation.");
int count = 0;
foreach (var row in column)
{
foreach (var value in row.GetValues())
Console.Write($"{value}\t");
Console.WriteLine("");
count++;
if (count >= 10)
break;
}

Console.WriteLine("===================================================");
};

// Print the data that results from the transformations.
var countSelectColumn = transformedData.AsDynamic.GetColumn<VBuffer<float>>(ml, "FeaturesCountSelect");
var MISelectColumn = transformedData.AsDynamic.GetColumn<VBuffer<float>>(ml, "FeaturesMISelect");
printHelper("FeaturesCountSelect", countSelectColumn);
printHelper("FeaturesMISelect", MISelectColumn);

// Below is the output of this code. We see that some slots have been dropped by the first transformation.
// Among the remaining slots, the second transformation only preserves the top 5 slots based on mutual information
// with the label column.

// FeaturesCountSelect column obtained post-transformation.
// 5 4 4 5 7 3 2 1
// 3 1 1 1 2 3 1 1
// 6 8 8 1 3 3 7 1
// 4 1 1 3 2 3 1 1
// 8 10 10 8 7 9 7 1
// 1 1 1 1 2 3 1 1
// 2 1 2 1 2 3 1 1
// 2 1 1 1 2 1 1 5
// 4 2 1 1 2 2 1 1
// 1 1 1 1 1 3 1 1
// ===================================================
// FeaturesMISelect column obtained post-transformation.
// 4 4 7 3 2
// 1 1 2 3 1
// 8 8 3 3 7
// 1 1 2 3 1
// 10 10 7 9 7
// 1 1 2 3 1
// 1 2 2 3 1
// 1 1 2 1 1
// 2 1 2 2 1
// 1 1 1 3 1
// ===================================================
}
}
}
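The sample comments describe selecting the top k slots by mutual information with the label, approximated over `numBins` bins. A minimal sketch of that idea follows, under stated assumptions: `MutualInformationSketch`, `Score`, and `TopSlots` are hypothetical names, equal-width binning is used purely for illustration (the binning used by the ML.NET estimator is not shown in this diff), the label is binary, and the input is non-empty.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of ML.NET: mutual information between one binned slot and a binary label.
public static class MutualInformationSketch
{
    // I(X;Y) = sum over (x, y) of p(x,y) * ln(p(x,y) / (p(x) * p(y))), with X binned into numBins equal-width bins.
    public static double Score(float[] feature, bool[] label, int numBins)
    {
        int n = feature.Length;
        float min = feature.Min();
        float max = feature.Max();
        double width = (double)(max - min) / numBins;

        // joint[b, y] holds the empirical probability of (bin b, label y).
        var joint = new double[numBins, 2];
        for (int r = 0; r < n; r++)
        {
            // A constant slot (width == 0) falls entirely into bin 0.
            int bin = width == 0 ? 0 : Math.Min(numBins - 1, (int)((feature[r] - min) / width));
            joint[bin, label[r] ? 1 : 0] += 1.0 / n;
        }

        // Marginal distribution of the label.
        var py = new double[2];
        for (int b = 0; b < numBins; b++)
        {
            py[0] += joint[b, 0];
            py[1] += joint[b, 1];
        }

        double mi = 0;
        for (int b = 0; b < numBins; b++)
        {
            double px = joint[b, 0] + joint[b, 1];
            for (int y = 0; y < 2; y++)
            {
                double p = joint[b, y];
                if (p > 0)
                    mi += p * Math.Log(p / (px * py[y]));
            }
        }
        return mi;
    }

    // Returns the indices of the slotsInOutput highest-scoring slots, in their original slot order.
    public static int[] TopSlots(IReadOnlyList<float[]> rows, bool[] labels, int slotsInOutput, int numBins)
    {
        int slotCount = rows[0].Length;
        var scores = Enumerable.Range(0, slotCount)
            .Select(i => Score(rows.Select(row => row[i]).ToArray(), labels, numBins))
            .ToArray();

        return Enumerable.Range(0, slotCount)
            .OrderByDescending(i => scores[i])
            .Take(slotsInOutput)
            .OrderBy(i => i)
            .ToArray();
    }
}
```

A slot that is constant, or that is distributed identically for both label values, scores zero here and is therefore the first to be discarded when `slotsInOutput` is smaller than the number of slots.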
@@ -2,6 +2,7 @@
using Microsoft.ML.StaticPipe;
using Microsoft.ML.Transforms;
using Microsoft.ML.Transforms.Categorical;
+ using Microsoft.ML.Transforms.FeatureSelection;
using System;

namespace Microsoft.ML.Samples.Static
@@ -2,6 +2,7 @@
using Microsoft.ML.StaticPipe;
using Microsoft.ML.Transforms;
using Microsoft.ML.Transforms.Categorical;
+ using Microsoft.ML.Transforms.FeatureSelection;
using System;

namespace Microsoft.ML.Samples.Static
42 changes: 3 additions & 39 deletions src/Microsoft.ML.Data/Evaluators/ClusteringEvaluator.cs
@@ -10,7 +10,7 @@
using Microsoft.ML.Runtime.Internal.Utilities;
using Microsoft.ML.Runtime.Model;
using Microsoft.ML.Runtime.Numeric;
- using Microsoft.ML.Transforms;
+ using Microsoft.ML.Transforms.FeatureSelection;
using System;
using System.Collections.Generic;
using System.Linq;
@@ -877,50 +877,14 @@ protected override IDataView GetPerInstanceMetricsCore(IDataView perInst, RoleMa
{
var type = perInst.Schema.GetColumnType(index);
if (_numTopClusters < type.VectorSize)
- {
- var args = new DropSlotsTransform.Arguments
- {
- Column = new DropSlotsTransform.Column[]
- {
- new DropSlotsTransform.Column()
- {
- Name = ClusteringPerInstanceEvaluator.SortedClusters,
- Slots = new[] {
- new DropSlotsTransform.Range()
- {
- Min = _numTopClusters
- }
- }
- }
- }
- };
- perInst = new DropSlotsTransform(Host, args, perInst);
- }
+ perInst = new SlotsDroppingTransformer(Host, ClusteringPerInstanceEvaluator.SortedClusters, min: _numTopClusters).Transform(perInst);
}

if (perInst.Schema.TryGetColumnIndex(ClusteringPerInstanceEvaluator.SortedClusterScores, out index))
{
var type = perInst.Schema.GetColumnType(index);
if (_numTopClusters < type.VectorSize)
- {
- var args = new DropSlotsTransform.Arguments
- {
- Column = new DropSlotsTransform.Column[]
- {
- new DropSlotsTransform.Column()
- {
- Name = ClusteringPerInstanceEvaluator.SortedClusterScores,
- Slots = new[] {
- new DropSlotsTransform.Range()
- {
- Min = _numTopClusters
- }
- }
- }
- }
- };
- perInst = new DropSlotsTransform(Host, args, perInst);
- }
+ perInst = new SlotsDroppingTransformer(Host, ClusteringPerInstanceEvaluator.SortedClusterScores, min: _numTopClusters).Transform(perInst);
}
return perInst;
}
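The hunk above replaces a nested `Arguments` object with a single `min:` argument. What that range does to a vector can be sketched in plain C#. This is an illustration only: `DropSlotsSketch` and `DropRange` are hypothetical names, and it assumes that a range with only a lower bound drops every slot from that index through the end, which is consistent with the evaluator keeping only the first `_numTopClusters` entries of the sorted columns.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of ML.NET: drops a contiguous range of slots from a vector.
public static class DropSlotsSketch
{
    // Removes slots with index in [min, max]. A null max means "through the last slot" (assumption).
    public static T[] DropRange<T>(T[] values, int min, int? max = null)
    {
        int last = max ?? values.Length - 1;
        return values.Where((value, i) => i < min || i > last).ToArray();
    }
}
```

For example, with a sorted cluster vector of length 5 and `min` set to 3, only the first three entries survive, mirroring the `_numTopClusters < type.VectorSize` guard in the evaluator.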