Scrubbing FieldAwareFactorizationMachine learner. #2730

Merged: 10 commits merged on Feb 27, 2019
Changes from 7 commits
2 changes: 1 addition & 1 deletion docs/samples/Microsoft.ML.Samples/Dynamic/Calibrator.cs
@@ -16,7 +16,7 @@ public static void Example()
// This will create a sentiment.tsv file in the filesystem.
// The string, dataFile, is the path to the downloaded file.
// You can open this file, if you want to see the data.
string dataFile = SamplesUtils.DatasetUtils.DownloadSentimentDataset();
string dataFile = SamplesUtils.DatasetUtils.DownloadSentimentDataset()[0];

// A preview of the data.
// Sentiment SentimentText

This file was deleted.

@@ -0,0 +1,74 @@
using System;
using System.Linq;
using Microsoft.ML.Data;
namespace Microsoft.ML.Samples.Dynamic
{
public static class FFMBinaryClassification
{
public static void Example()
{
// Create a new context for ML.NET operations. It can be used for exception tracking and logging,
// as a catalog of available operations and as the source of randomness.
var mlContext = new MLContext();

// Download and featurize the dataset.
var dataviews = SamplesUtils.DatasetUtils.LoadFeaturizedSentimentDataset(mlContext);
var trainData = dataviews[0];
var testData = dataviews[1];

// ML.NET doesn't cache data sets by default. Therefore, if one reads a data set from a file and accesses it many times, it can be slow due to
// expensive featurization and disk operations. When the considered data can fit into memory, a solution is to cache the data in memory. Caching is especially
// helpful when working with iterative algorithms which need many data passes. Since field-aware factorization machine is an iterative algorithm, we cache. Inserting a
// cache step in a pipeline is also possible; please see the construction of the pipeline below.
trainData = mlContext.Data.Cache(trainData);

// Step 2: Pipeline
// Create the 'FieldAwareFactorizationMachine' binary classifier, setting the "Sentiment" column as the label of the dataset, and
// the "Features" column as the features column.
var pipeline = new EstimatorChain<ITransformer>().AppendCacheCheckpoint(mlContext)
.Append(mlContext.BinaryClassification.Trainers.
FieldAwareFactorizationMachine(labelColumnName: "Sentiment", featureColumnNames: new[] { "Features" }));

// Fit the model.
var model = pipeline.Fit(trainData);

// Let's get the model parameters from the model.
var modelParams = model.LastTransformer.Model;

// Let's inspect the model parameters.
var featureCount = modelParams.FeatureCount;
var fieldCount = modelParams.FieldCount;
var latentDim = modelParams.LatentDimension;
var linearWeights = modelParams.GetLinearWeights();
@Ivanidzo4ka (Contributor) commented on Feb 26, 2019, on the line:
var linearWeights = modelParams.GetLinearWeights();

I think right now I can do the following:
Evaluate(model.Transform(Data)) -> AUC = X1

var linearWeights = modelParams.GetLinearWeights();
linearWeights[0] = 100;
linearWeights[1] = 200;

Evaluate(model.Transform(Data)) -> AUC = X2

X1 != X2

Which is awful.

Can you check MatrixFactorizationPredictor and how it handles arrays?

I also don't understand the GetFeatureCount() functions; why can't I just do modelParams.FeatureCount?
#Resolved
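
A minimal sketch of one way to address this concern, using a hypothetical class and field name (an illustration of the defensive-copy pattern only, not the change made in this PR): have the accessor return a copy, or expose a read-only view, so callers can inspect but not mutate the trained weights.

using System;
using System.Collections.Generic;

// Hypothetical parameter holder illustrating the defensive-copy pattern discussed above.
public sealed class ModelParametersSketch
{
    private readonly float[] _linearWeights;

    public ModelParametersSketch(float[] linearWeights)
        => _linearWeights = linearWeights ?? throw new ArgumentNullException(nameof(linearWeights));

    // Returns a fresh copy, so writes like linearWeights[0] = 100 do not change the model.
    public float[] GetLinearWeights() => (float[])_linearWeights.Clone();

    // Alternative that avoids the copy: expose a read-only view of the same buffer.
    public IReadOnlyList<float> LinearWeights => _linearWeights;
}

With either shape, the Evaluate-mutate-Evaluate experiment above would give X1 == X2, because the internal weight array is never handed out for writing.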

var latentWeights = modelParams.GetLatentWeights();

Console.WriteLine("The feature count is: " + featureCount);
Console.WriteLine("The number of fields is: " + fieldCount);
Console.WriteLine("The latent dimension is: " + latentDim);
Console.WriteLine("The linear weights of some of the features are: " +
string.Concat(Enumerable.Range(1, 10).Select(i => $"{linearWeights[i]:F4} ")));
Console.WriteLine("The weights of some of the latent features are: " +
string.Concat(Enumerable.Range(1, 10).Select(i => $"{latentWeights[i]:F4} ")));

// The feature count is: 9374
// The number of fields is: 1
// The latent dimension is: 20
// The linear weights of some of the features are: 0.0196 0.0000 -0.0045 -0.0205 0.0000 0.0032 0.0682 0.0091 -0.0151 0.0089
// The weights of some of the latent features are: 0.3316 0.2140 0.0752 0.0908 -0.0495 -0.0810 0.0761 0.0966 0.0090 -0.0962

// Evaluate how the model is doing on the test data.
var dataWithPredictions = model.Transform(testData);

var metrics = mlContext.BinaryClassification.Evaluate(dataWithPredictions, "Sentiment");
SamplesUtils.ConsoleUtils.PrintMetrics(metrics);

// Accuracy: 0.72
// AUC: 0.75
// F1 Score: 0.74
// Negative Precision: 0.75
// Negative Recall: 0.67
// Positive Precision: 0.70
// Positive Recall: 0.78
}
}
}
@@ -0,0 +1,83 @@
using System;
using System.Linq;
using Microsoft.ML.Data;
using Microsoft.ML.FactorizationMachine;

namespace Microsoft.ML.Samples.Dynamic
{
public static class FFMBinaryClassificationWithOptions
{
public static void Example()
{
// Create a new context for ML.NET operations. It can be used for exception tracking and logging,
// as a catalog of available operations and as the source of randomness.
var mlContext = new MLContext();

// Download and featurize the dataset.
var dataviews = SamplesUtils.DatasetUtils.LoadFeaturizedSentimentDataset(mlContext);
var trainData = dataviews[0];
var testData = dataviews[1];

// ML.NET doesn't cache data sets by default. Therefore, if one reads a data set from a file and accesses it many times, it can be slow due to
// expensive featurization and disk operations. When the considered data can fit into memory, a solution is to cache the data in memory. Caching is especially
// helpful when working with iterative algorithms which need many data passes. Since field-aware factorization machine is an iterative algorithm, we cache. Inserting a
// cache step in a pipeline is also possible; please see the construction of the pipeline below.
trainData = mlContext.Data.Cache(trainData);

// Step 2: Pipeline
// Create the 'FieldAwareFactorizationMachine' binary classifier, setting the "Sentiment" column as the label of the dataset, and
// the "Features" column as the features column.
var pipeline = new EstimatorChain<ITransformer>().AppendCacheCheckpoint(mlContext)
.Append(mlContext.BinaryClassification.Trainers.
FieldAwareFactorizationMachine(
new FieldAwareFactorizationMachineTrainer.Options
{
FeatureColumn = "Features",
LabelColumn = "Sentiment",
LearningRate = 0.1f,
Iterations = 10
}));

// Fit the model.
var model = pipeline.Fit(trainData);

// Let's get the model parameters from the model.
var modelParams = model.LastTransformer.Model;

// Let's inspect the model parameters.
var featureCount = modelParams.FeatureCount;
var fieldCount = modelParams.FieldCount;
var latentDim = modelParams.LatentDimension;
var linearWeights = modelParams.GetLinearWeights();
var latentWeights = modelParams.GetLatentWeights();

Console.WriteLine("The feature count is: " + featureCount);
Console.WriteLine("The number of fields is: " + fieldCount);
Console.WriteLine("The latent dimension is: " + latentDim);
Console.WriteLine("The linear weights of some of the features are: " +
string.Concat(Enumerable.Range(1, 10).Select(i => $"{linearWeights[i]:F4} ")));
@wschin (Member) commented on Feb 26, 2019:
string.Concats are not aligned. #Resolved

Console.WriteLine("The weights of some of the latent features are: " +
string.Concat(Enumerable.Range(1, 10).Select(i => $"{latentWeights[i]:F4} ")));

// The feature count is: 9374
// The number of fields is: 1
// The latent dimension is: 20
// The linear weights of some of the features are: 0.0410 0.0000 -0.0078 -0.0285 0.0000 0.0114 0.1313 0.0183 -0.0224 0.0166
// The weights of some of the latent features are: -0.0326 0.1127 0.0621 0.1446 0.2038 0.1608 0.2084 0.0141 0.2458 -0.0625

// Evaluate how the model is doing on the test data.
var dataWithPredictions = model.Transform(testData);

var metrics = mlContext.BinaryClassification.Evaluate(dataWithPredictions, "Sentiment");
SamplesUtils.ConsoleUtils.PrintMetrics(metrics);

// Accuracy: 0.78
// AUC: 0.81
// F1 Score: 0.78
// Negative Precision: 0.78
// Negative Recall: 0.78
// Positive Precision: 0.78
// Positive Recall: 0.78
}
}
}
@@ -12,7 +12,7 @@ public static void Example()
// Downloading the dataset from github.com/dotnet/machinelearning.
// This will create a sentiment.tsv file in the filesystem.
// You can open this file, if you want to see the data.
string dataFile = SamplesUtils.DatasetUtils.DownloadSentimentDataset();
string dataFile = SamplesUtils.DatasetUtils.DownloadSentimentDataset()[0];

// A preview of the data.
// Sentiment SentimentText
@@ -10,7 +10,7 @@ public static void Example()
// Downloading the dataset from github.com/dotnet/machinelearning.
// This will create a sentiment.tsv file in the filesystem.
// You can open this file, if you want to see the data.
string dataFile = SamplesUtils.DatasetUtils.DownloadSentimentDataset();
string dataFile = SamplesUtils.DatasetUtils.DownloadSentimentDataset()[0];

// A preview of the data.
// Sentiment SentimentText
@@ -10,7 +10,7 @@ public static void Example()
// Downloading the dataset from github.com/dotnet/machinelearning.
// This will create a sentiment.tsv file in the filesystem.
// You can open this file, if you want to see the data.
string dataFile = SamplesUtils.DatasetUtils.DownloadSentimentDataset();
string dataFile = SamplesUtils.DatasetUtils.DownloadSentimentDataset()[0];

// A preview of the data.
// Sentiment SentimentText
@@ -22,6 +22,7 @@
<NativeAssemblyReference Include="CpuMathNative" />
<NativeAssemblyReference Include="FastTreeNative" />
<NativeAssemblyReference Include="MatrixFactorizationNative" />
<NativeAssemblyReference Include="FactorizationMachineNative" />
<NativeAssemblyReference Include="LdaNative" />
<NativeAssemblyReference Include="SymSgdNative" />
<PackageReference Include="Microsoft.ML.TensorFlow.Redist" Version="0.10.0" />
44 changes: 39 additions & 5 deletions src/Microsoft.ML.SamplesUtils/SamplesDatasetUtils.cs
@@ -77,14 +77,48 @@ public sealed class HousingRegression
/// <summary>
/// Downloads the wikipedia detox dataset from the ML.NET repo.
/// </summary>
public static string DownloadSentimentDataset()
=> Download("https://raw.githubusercontent.com/dotnet/machinelearning/76cb2cdf5cc8b6c88ca44b8969153836e589df04/test/data/wikipedia-detox-250-line-data.tsv", "sentiment.tsv");
public static string[] DownloadSentimentDataset()
{
var trainFile = Download("https://raw.githubusercontent.com/dotnet/machinelearning/76cb2cdf5cc8b6c88ca44b8969153836e589df04/test/data/wikipedia-detox-250-line-data.tsv", "sentiment.tsv");
var testFile = Download("https://raw.githubusercontent.com/dotnet/machinelearning/76cb2cdf5cc8b6c88ca44b8969153836e589df04/test/data/wikipedia-detox-250-line-test.tsv", "sentimenttest.tsv");
return new[] { trainFile, testFile };
}

/// <summary>
/// Downloads the adult dataset from the ML.NET repo.
/// </summary>
public static string DownloadAdultDataset()
=> Download("https://raw.githubusercontent.com/dotnet/machinelearning/244a8c2ac832657af282aa312d568211698790aa/test/data/adult.train", "adult.txt");

/// <summary>
/// Downloads the adult dataset from the ML.NET repo.
/// Downloads the wikipedia detox dataset and featurizes it to be suitable for sentiment classification tasks.
/// </summary>
public static string DownloadAdultDataset()
=> Download("https://raw.githubusercontent.com/dotnet/machinelearning/244a8c2ac832657af282aa312d568211698790aa/test/data/adult.train", "adult.txt");
/// <param name="mlContext"><see cref="MLContext"/> used for data loading and processing.</param>
/// <returns>Featurized train and test dataset.</returns>
public static IDataView[] LoadFeaturizedSentimentDataset(MLContext mlContext)
{
// Download the files
var dataFiles = DownloadSentimentDataset();

// Define the columns to read
var reader = mlContext.Data.CreateTextLoader(
columns: new[]
{
new TextLoader.Column("Sentiment", DataKind.Boolean, 0),
new TextLoader.Column("SentimentText", DataKind.String, 1)
},
hasHeader: true
);

// Create data featurizing pipeline
var pipeline = mlContext.Transforms.Text.FeaturizeText("Features", "SentimentText");

var data = reader.Read(dataFiles[0]);
var model = pipeline.Fit(data);
var featurizedDataTrain = model.Transform(data);
var featurizedDataTest = model.Transform(reader.Read(dataFiles[1]));
return new[] { featurizedDataTrain, featurizedDataTest };
}

/// <summary>
/// Downloads the Adult UCI dataset and featurizes it to be suitable for classification tasks.
@@ -23,7 +23,7 @@ public static class FactorizationMachineExtensions
/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[FieldAwareFactorizationMachine](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/FieldAwareFactorizationMachine.cs)]
/// [!code-csharp[FieldAwareFactorizationMachine](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/BinaryClassification/FieldAwareFactorizationMachine.cs)]
/// ]]></format>
/// </example>
public static FieldAwareFactorizationMachineTrainer FieldAwareFactorizationMachine(this BinaryClassificationCatalog.BinaryClassificationTrainers catalog,
@@ -41,6 +41,12 @@ public static FieldAwareFactorizationMachineTrainer FieldAwareFactorizationMachi
/// </summary>
/// <param name="catalog">The binary classification catalog trainer object.</param>
/// <param name="options">Advanced arguments to the algorithm.</param>
/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[FieldAwareFactorizationMachine](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/Trainers/BinaryClassification/FieldAwareFactorizationMachineWithOptions.cs)]
/// ]]></format>
/// </example>
public static FieldAwareFactorizationMachineTrainer FieldAwareFactorizationMachine(this BinaryClassificationCatalog.BinaryClassificationTrainers catalog,
FieldAwareFactorizationMachineTrainer.Options options)
{