Conversion of DropSlots, MutualInformationFeatureSelection, and CountFeatureSelection into estimator and transformers #1683
Merged
49 commits
e129445
beginning to work on dropslotstransform
artidoro 4968169
started working on dropslotstransform
artidoro ce6e733
work on dropslots
artidoro 4b0d716
workign on dropslots
artidoro 733a8d2
working dropslots
artidoro 5cd5211
workign on dropslots
artidoro 75df87e
tried to change argument of mutualinformationfeatureselection buyt it…
artidoro d9975bc
reverted arguemnts back
artidoro e2ffe99
started working on tests
artidoro 50076cf
Merge branch 'master' of https://github.com/dotnet/machinelearning in…
artidoro 827700d
workign on nop
artidoro 49e6f83
merge
artidoro b05110b
dropslots starting to get some tests working
artidoro 7af2057
merge
artidoro a5a91ba
completing merge
artidoro a98c8a2
work on mutualinformation
artidoro 8ff7e23
first non columninfo version of mutualinformation
artidoro 15e2eb2
commenting
artidoro 336732e
commenting
artidoro c7dccec
Merge branch 'master' of https://github.com/dotnet/machinelearning in…
artidoro e5cc371
merging
artidoro 209492c
tests working besides the command line tests
artidoro 3e9217e
Merge branch 'master' of https://github.com/dotnet/machinelearning in…
artidoro 7156f6a
merge
artidoro 098bcf9
fixed some tests
artidoro e40f42f
renaming
artidoro 2539fa0
fixed more tests
artidoro 0d4605a
more tests
artidoro c3a5149
fixed all tests
artidoro df9ef2b
review comments
artidoro 7226216
adding samples for featureselection
artidoro f0bc2e6
remaning file
artidoro f093fae
review comments
artidoro 8e9c307
Merge branch 'master' of https://github.com/dotnet/machinelearning in…
artidoro 4675885
merge
artidoro 05ea809
review comments
artidoro fbb7475
changed mlextension names
artidoro 0f2e815
review comment
artidoro 3bc1fdd
merge conflict
artidoro 318a1ae
Merge branch 'master' of https://github.com/dotnet/machinelearning in…
artidoro 8d19b36
rename of file
artidoro 695cdc9
csharpapi regenerated
artidoro eea19f4
merge
artidoro 2fbfbba
merge
artidoro fb6120d
Merge branch 'master' of https://github.com/dotnet/machinelearning in…
artidoro 7e5ec2d
regenerate csharpapi
artidoro 2669fa5
use master libmf
artidoro 8000682
Merge branch 'master' into dropslotspr
artidoro 3fb399b
libmf
artidoro
123 changes: 123 additions & 0 deletions
docs/samples/Microsoft.ML.Samples/Dynamic/FeatureSelectionTransform.cs
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,123 @@ | ||
using Microsoft.ML.Data; | ||
using Microsoft.ML.Runtime.Data; | ||
using System; | ||
using System.Collections.Generic; | ||
|
||
namespace Microsoft.ML.Samples.Dynamic | ||
{ | ||
public class FeatureSelectionTransformExample | ||
{ | ||
public static void FeatureSelectionTransform() | ||
{ | ||
// Download a classification dataset from github.com/dotnet/machinelearning. | ||
// It will be stored in the same path as the executable. | ||
string dataFilePath = SamplesUtils.DatasetUtils.DownloadBreastCancerDataset(); | ||
|
||
// Data Preview | ||
// 1. Label 0=benign, 1=malignant | ||
// 2. Clump Thickness 1 - 10 | ||
// 3. Uniformity of Cell Size 1 - 10 | ||
// 4. Uniformity of Cell Shape 1 - 10 | ||
// 5. Marginal Adhesion 1 - 10 | ||
// 6. Single Epithelial Cell Size 1 - 10 | ||
// 7. Bare Nuclei 1 - 10 | ||
// 8. Bland Chromatin 1 - 10 | ||
// 9. Normal Nucleoli 1 - 10 | ||
// 10. Mitoses 1 - 10 | ||
|
||
// Create a new ML context, for ML.NET operations. It can be used for exception tracking and logging, | ||
// as well as the source of randomness. | ||
var ml = new MLContext(); | ||
|
||
// First, we define the reader: specify the data columns and where to find them in the text file. Notice that we combine | ||
// all the feature columns into a single vector column named "Features". | ||
var reader = ml.Data.TextReader(new TextLoader.Arguments() | ||
{ | ||
Separator = "tab", | ||
HasHeader = true, | ||
Column = new[] | ||
{ | ||
new TextLoader.Column("Label", DataKind.BL, 0), | ||
new TextLoader.Column("Features", DataKind.Num, new [] { new TextLoader.Range(1, 9) }) | ||
} | ||
}); | ||
|
||
// Then, we use the reader to read the data as an IDataView. | ||
var data = reader.Read(dataFilePath); | ||
|
||
// Second, we define the transformations that we apply on the data. Remember that an Estimator does not transform data | ||
// directly: it must first be trained on data using .Fit(), which outputs a Transformer that can transform data. | ||
|
||
// In this example we define a CountFeatureSelectingEstimator, which selects the slots in a feature vector that have more non-default | ||
// values than the specified count. This transformation can be used to remove slots with too many missing values. | ||
var countSelectEst = ml.Transforms.FeatureSelection.SelectFeaturesBasedOnCount( | ||
inputColumn: "Features", outputColumn: "FeaturesCountSelect", count: 695); | ||
|
||
// We also define a MutualInformationFeatureSelectingEstimator, which selects the top k slots in a feature | ||
// vector based on the highest mutual information between that slot and a specified label. Notice that it is possible to | ||
// specify the parameter `numBins`, which controls the number of bins used in the approximation of the mutual information | ||
// between features and label. | ||
var mutualInfoEst = ml.Transforms.FeatureSelection.SelectFeaturesBasedOnMutualInformation( | ||
inputColumn: "FeaturesCountSelect", outputColumn: "FeaturesMISelect", labelColumn: "Label", slotsInOutput: 5); | ||
|
||
// Now, we can put the previous two transformations together in a pipeline. | ||
var pipeline = countSelectEst.Append(mutualInfoEst); | ||
|
||
// The pipeline can then be trained, using .Fit(), and the resulting transformer can be used to transform data. | ||
var transformedData = pipeline.Fit(data).Transform(data); | ||
|
||
// Small helper that prints the data inside a column to the console. Only the first 10 rows are printed. | ||
Action<string, IEnumerable<VBuffer<float>>> printHelper = (columnName, column) => | ||
{ | ||
Console.WriteLine($"{columnName} column obtained post-transformation."); | ||
int count = 0; | ||
foreach (var row in column) | ||
{ | ||
foreach (var value in row.GetValues()) | ||
Console.Write($"{value}\t"); | ||
Console.WriteLine(""); | ||
count++; | ||
if (count >= 10) | ||
break; | ||
} | ||
|
||
Console.WriteLine("==================================================="); | ||
}; | ||
|
||
// Print the data that results from the transformations. | ||
var countSelectColumn = transformedData.GetColumn<VBuffer<float>>(ml, "FeaturesCountSelect"); | ||
var MISelectColumn = transformedData.GetColumn<VBuffer<float>>(ml, "FeaturesMISelect"); | ||
printHelper("FeaturesCountSelect", countSelectColumn); | ||
printHelper("FeaturesMISelect", MISelectColumn); | ||
|
||
// Below is the output of this code. We see that some slots have been dropped by the first transformation. | ||
// Among the remaining slots, the second transformation only preserves the top 5 slots based on mutual information | ||
// with the label column. | ||
|
||
// FeaturesCountSelect column obtained post-transformation. | ||
// 5 4 4 5 7 3 2 1 | ||
// 3 1 1 1 2 3 1 1 | ||
// 6 8 8 1 3 3 7 1 | ||
// 4 1 1 3 2 3 1 1 | ||
// 8 10 10 8 7 9 7 1 | ||
// 1 1 1 1 2 3 1 1 | ||
// 2 1 2 1 2 3 1 1 | ||
// 2 1 1 1 2 1 1 5 | ||
// 4 2 1 1 2 2 1 1 | ||
// 1 1 1 1 1 3 1 1 | ||
// =================================================== | ||
// FeaturesMISelect column obtained post-transformation. | ||
// 4 4 7 3 2 | ||
// 1 1 2 3 1 | ||
// 8 8 3 3 7 | ||
// 1 1 2 3 1 | ||
// 10 10 7 9 7 | ||
// 1 1 2 3 1 | ||
// 1 2 2 3 1 | ||
// 1 1 2 1 1 | ||
// 2 1 2 2 1 | ||
// 1 1 1 3 1 | ||
// =================================================== | ||
} | ||
} | ||
} |
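The count-based step in the sample above can be pictured with a small library-free sketch. This is not the ML.NET implementation: the class and method names (`CountSelectionSketch`, `SelectSlots`, `KeepSlots`) are hypothetical, and it assumes that a "non-default" value is one that is neither 0 nor NaN and that the comparison is strict ("more than"), following the wording of the sample's comment.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: a hand-rolled version of count-based slot selection.
public static class CountSelectionSketch
{
    // Returns the indices of the slots whose number of present values
    // (assumed here to mean neither 0 nor NaN) is strictly greater than minCount.
    public static int[] SelectSlots(IReadOnlyList<float[]> rows, int minCount)
    {
        if (rows.Count == 0)
            return Array.Empty<int>();

        int slotCount = rows[0].Length;
        var counts = new int[slotCount];
        foreach (var row in rows)
        {
            for (int i = 0; i < slotCount; i++)
            {
                // NaN compares unequal to 0, so it needs its own check.
                if (row[i] != 0f && !float.IsNaN(row[i]))
                    counts[i]++;
            }
        }

        return Enumerable.Range(0, slotCount)
            .Where(i => counts[i] > minCount)
            .ToArray();
    }

    // Projects every row onto the kept slots, preserving slot order.
    public static float[][] KeepSlots(IReadOnlyList<float[]> rows, int[] kept)
    {
        return rows.Select(r => kept.Select(i => r[i]).ToArray()).ToArray();
    }
}
```

For three rows `{1, NaN, 3}`, `{2, NaN, 0}`, `{4, 5, 6}` and `minCount = 1`, the per-slot counts are 3, 1 and 2, so `SelectSlots` keeps slots 0 and 2. In the sample, `count: 695` plays the role of `minCount` over the whole training set.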
122 changes: 122 additions & 0 deletions
docs/samples/Microsoft.ML.Samples/Static/FeatureSelectionTransform.cs
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,122 @@ | ||
using Microsoft.ML.Data; | ||
using Microsoft.ML.Runtime.Data; | ||
using Microsoft.ML.StaticPipe; | ||
using System; | ||
using System.Collections.Generic; | ||
|
||
namespace Microsoft.ML.Samples.Dynamic | ||
{ | ||
public class FeatureSelectionTransformStaticExample | ||
{ | ||
public static void FeatureSelectionTransform() | ||
{ | ||
// Download a classification dataset from github.com/dotnet/machinelearning. | ||
// It will be stored in the same path as the executable. | ||
string dataFilePath = SamplesUtils.DatasetUtils.DownloadBreastCancerDataset(); | ||
|
||
// Data Preview | ||
// 1. Label 0=benign, 1=malignant | ||
// 2. Clump Thickness 1 - 10 | ||
// 3. Uniformity of Cell Size 1 - 10 | ||
// 4. Uniformity of Cell Shape 1 - 10 | ||
// 5. Marginal Adhesion 1 - 10 | ||
// 6. Single Epithelial Cell Size 1 - 10 | ||
// 7. Bare Nuclei 1 - 10 | ||
// 8. Bland Chromatin 1 - 10 | ||
// 9. Normal Nucleoli 1 - 10 | ||
// 10. Mitoses 1 - 10 | ||
|
||
// Create a new ML context, for ML.NET operations. It can be used for exception tracking and logging, | ||
// as well as the source of randomness. | ||
var ml = new MLContext(); | ||
|
||
// First, we define the reader: specify the data columns and where to find them in the text file. Notice that we combine | ||
// all the feature columns into a single vector column named "Features". | ||
var reader = TextLoader.CreateReader(ml, c => ( | ||
Label: c.LoadBool(0), | ||
Features: c.LoadFloat(1, 9) | ||
), | ||
separator: '\t', hasHeader: true); | ||
|
||
// Then, we use the reader to read the data as an IDataView. | ||
var data = reader.Read(dataFilePath); | ||
|
||
// Second, we define the transformations that we apply on the data. Remember that an Estimator does not transform data | ||
// directly: it must first be trained on data using .Fit(), which outputs a Transformer that can transform data. | ||
|
||
// In this example we define a CountFeatureSelectingEstimator, which selects the slots in a feature vector that have more non-default | ||
// values than the specified count. This transformation can be used to remove slots with too many missing values. | ||
// We also define a MutualInformationFeatureSelectingEstimator, which selects the top k slots in a feature | ||
// vector based on the highest mutual information between that slot and a specified label. Notice that it is possible to | ||
// specify the parameter `numBins`, which controls the number of bins used in the approximation of the mutual information | ||
// between features and label. | ||
var pipeline = reader.MakeNewEstimator() | ||
.Append(r => ( | ||
FeaturesCountSelect: r.Features.SelectFeaturesBasedOnCount(count: 695), | ||
Is this the correct way to show the static API? Or should I use MLContext to find the estimator? #Resolved |
||
Label: r.Label | ||
)) | ||
.Append(r => ( | ||
FeaturesCountSelect: r.FeaturesCountSelect, | ||
FeaturesMISelect: r.FeaturesCountSelect.SelectFeaturesBasedOnMutualInformation(r.Label, slotsInOutput: 5), | ||
Label: r.Label | ||
)); | ||
|
||
|
||
// The pipeline can then be trained, using .Fit(), and the resulting transformer can be used to transform data. | ||
var transformedData = pipeline.Fit(data).Transform(data); | ||
|
||
// Small helper that prints the data inside a column to the console. Only the first 10 rows are printed. | ||
Action<string, IEnumerable<VBuffer<float>>> printHelper = (columnName, column) => | ||
{ | ||
Console.WriteLine($"{columnName} column obtained post-transformation."); | ||
int count = 0; | ||
foreach (var row in column) | ||
{ | ||
foreach (var value in row.GetValues()) | ||
Console.Write($"{value}\t"); | ||
Console.WriteLine(""); | ||
count++; | ||
if (count >= 10) | ||
break; | ||
} | ||
|
||
Console.WriteLine("==================================================="); | ||
}; | ||
|
||
// Print the data that results from the transformations. | ||
var countSelectColumn = transformedData.AsDynamic.GetColumn<VBuffer<float>>(ml, "FeaturesCountSelect"); | ||
var MISelectColumn = transformedData.AsDynamic.GetColumn<VBuffer<float>>(ml, "FeaturesMISelect"); | ||
printHelper("FeaturesCountSelect", countSelectColumn); | ||
printHelper("FeaturesMISelect", MISelectColumn); | ||
|
||
// Below is the output of this code. We see that some slots have been dropped by the first transformation. | ||
// Among the remaining slots, the second transformation only preserves the top 5 slots based on mutual information
// with the label column. | ||
|
||
// FeaturesCountSelect column obtained post-transformation. | ||
// 5 4 4 5 7 3 2 1 | ||
// 3 1 1 1 2 3 1 1 | ||
// 6 8 8 1 3 3 7 1 | ||
// 4 1 1 3 2 3 1 1 | ||
// 8 10 10 8 7 9 7 1 | ||
// 1 1 1 1 2 3 1 1 | ||
// 2 1 2 1 2 3 1 1 | ||
// 2 1 1 1 2 1 1 5 | ||
// 4 2 1 1 2 2 1 1 | ||
// 1 1 1 1 1 3 1 1 | ||
// =================================================== | ||
// FeaturesMISelect column obtained post-transformation. | ||
// 4 4 7 3 2 | ||
// 1 1 2 3 1 | ||
// 8 8 3 3 7 | ||
// 1 1 2 3 1 | ||
// 10 10 7 9 7 | ||
// 1 1 2 3 1 | ||
// 1 2 2 3 1 | ||
// 1 1 2 1 1 | ||
// 2 1 2 2 1 | ||
// 1 1 1 3 1 | ||
// =================================================== | ||
} | ||
} | ||
} |
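The mutual-information step can likewise be sketched without the library. This is not how ML.NET computes its scores: the names (`MutualInformationSketch`, `MutualInformation`, `TopSlots`) are hypothetical, every distinct feature value is treated as its own bin instead of using a configurable number of bins such as `numBins`, and the score is the plain empirical estimate in nats. Selected slots are returned in their original order, which matches what the sample output shows.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative only: empirical mutual information between discrete slots and a boolean label.
public static class MutualInformationSketch
{
    // I(X;Y) = sum over (x, y) of p(x,y) * ln( p(x,y) / (p(x) * p(y)) ), estimated from counts.
    public static double MutualInformation(IReadOnlyList<int> feature, IReadOnlyList<bool> label)
    {
        int n = feature.Count;
        if (n == 0 || n != label.Count)
            throw new ArgumentException("feature and label must be non-empty and of equal length");

        // value -> [count with label false, count with label true]
        var perValue = new Dictionary<int, int[]>();
        int positives = 0;
        for (int i = 0; i < n; i++)
        {
            int[] c;
            if (!perValue.TryGetValue(feature[i], out c))
            {
                c = new int[2];
                perValue[feature[i]] = c;
            }
            c[label[i] ? 1 : 0]++;
            if (label[i])
                positives++;
        }

        double[] py = { (double)(n - positives) / n, (double)positives / n };
        double mi = 0.0;
        foreach (var c in perValue.Values)
        {
            double px = (double)(c[0] + c[1]) / n;
            for (int y = 0; y < 2; y++)
            {
                if (c[y] == 0)
                    continue; // an empty cell contributes nothing to the sum
                double pxy = (double)c[y] / n;
                mi += pxy * Math.Log(pxy / (px * py[y]));
            }
        }
        return mi;
    }

    // Returns the indices of the k highest-scoring slots, in their original order.
    public static int[] TopSlots(IReadOnlyList<int[]> rows, IReadOnlyList<bool> labels, int k)
    {
        int slotCount = rows[0].Length;
        return Enumerable.Range(0, slotCount)
            .OrderByDescending(s => MutualInformation(rows.Select(r => r[s]).ToArray(), labels))
            .Take(k)
            .OrderBy(s => s)
            .ToArray();
    }
}
```

With labels `{false, false, true, true}`, a slot holding `{0, 0, 1, 1}` scores ln 2 (it determines the label exactly), while a slot holding `{0, 1, 0, 1}` scores 0 (it is independent of the label), so `TopSlots` with `k = 1` keeps the former.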
Do we want to refer to this example in the FeatureSelectionCatalog.cs file?
#Resolved