Added samples: GitHubLabeler and GettingStarted #198

Closed
wants to merge 16 commits into from
@@ -0,0 +1,26 @@
<Project Sdk="Microsoft.NET.Sdk">

<PropertyGroup>
<OutputType>Exe</OutputType>
<TargetFramework>netcoreapp2.0</TargetFramework>
</PropertyGroup>

<PropertyGroup Condition="'$(Configuration)|$(Platform)'=='Debug|AnyCPU'">
Member

This PropertyGroup is incorrect, as the property below won't be set for Release builds.

If you want to set <LangVersion> to latest, it should be done unconditionally. You can move <LangVersion>latest</LangVersion> to the above PropertyGroup.

<LangVersion>latest</LangVersion>
</PropertyGroup>

<ItemGroup>
<PackageReference Include="Microsoft.ML" Version="0.1.0" />
</ItemGroup>

<ItemGroup>
<Folder Include="Models\" />
</ItemGroup>

<ItemGroup>
<None Update="Models\SentimentModel.zip">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
</ItemGroup>

</Project>
Binary file not shown.
@@ -0,0 +1,102 @@
using Microsoft.ML;
using Microsoft.ML.Models;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

namespace BinaryClassification_SentimentAnalysis
{
internal static class Program
{
private static string AppPath => Path.GetDirectoryName(Environment.GetCommandLineArgs()[0]);
private static string TrainDataPath => Path.Combine(AppPath, @"..\..\..\..\datasets\", "imdb_labelled.txt");
private static string TestDataPath => Path.Combine(AppPath, @"..\..\..\..\datasets\", "yelp_labelled.txt");
private static string ModelPath => Path.Combine(AppPath, "Models", "SentimentModel.zip");
Contributor

"Models", "SentimentModel.zip");

I noticed that the model files (the .zips) are also included in your PR. I think that was unintentional and is just an artifact of the run. Either way, we have the mechanism to create this artifact here, and we generally try to avoid checking in binary artifacts if it can be helped. Could you remove them?


private static async Task Main(string[] args)
{
var model = await TrainAsync();

Evaluate(model);

var predictions = model.Predict(TestSentimentData.Sentiments);

var sentimentsAndPredictions =
TestSentimentData.Sentiments.Zip(predictions, (sentiment, prediction) => (sentiment, prediction));
foreach (var item in sentimentsAndPredictions)
{
Console.WriteLine(
$"Sentiment: {item.sentiment.SentimentText} | Prediction: {(item.prediction.Sentiment ? "Positive" : "Negative")} sentiment");
}

Console.ReadLine();
}

public static async Task<PredictionModel<SentimentData, SentimentPrediction>> TrainAsync()
{
// LearningPipeline allows us to add steps in order to keep everything together
// during the learning process.
var pipeline = new LearningPipeline();

// The TextLoader loads a dataset with comments and corresponding positive or negative sentiment.
// When you create a loader you specify the schema by passing a class to the loader containing
// all the column names and their types. This will be used to create the model, and train it.
pipeline.Add(new TextLoader<SentimentData>(TrainDataPath, useHeader: false, separator: "tab"));

// TextFeaturizer is a transform that will be used to featurize an input column.
// This is used to format and clean the data.
pipeline.Add(new TextFeaturizer("Features", "SentimentText"));
Contributor

Would be a good opportunity to show off some of the hyperparameters of the TextFeaturizer.

pipeline.Add(new TextFeaturizer("Features", "SentimentText")
{
    WordFeatureExtractor = new NGramNgramExtractor() { NgramLength = 2, AllLengths = true },
    CharFeatureExtractor = new NGramNgramExtractor() { NgramLength = 3, AllLengths = false }      
});

Others of course are available:

pipeline.Add(new TextFeaturizer("Features", "SentimentText")

Contributor

This makes sense for the other samples we are adding or going to add. For a getting-started sample, the simpler the better.
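
For readers following this thread, here is a minimal sketch of what the two n-gram settings in the suggestion above would produce, written in plain C# without ML.NET. `NgramSketch` and `Extract` are hypothetical names. The reading of `AllLengths` used here (every length from 1 up to `NgramLength` when true, only `NgramLength` when false) is an assumption, not something this PR confirms, and the real TextFeaturizer likely does more than this.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of Microsoft.ML.
public static class NgramSketch
{
    // Returns the n-grams of `tokens`, joined with `separator`.
    // allLengths == true  -> lengths 1..maxLength are all emitted (assumed meaning).
    // allLengths == false -> only n-grams of exactly maxLength are emitted.
    public static List<string> Extract(
        IReadOnlyList<string> tokens, int maxLength, bool allLengths, string separator)
    {
        var result = new List<string>();
        int shortest = allLengths ? 1 : maxLength;
        for (int n = shortest; n <= maxLength; n++)
        {
            for (int start = 0; start + n <= tokens.Count; start++)
            {
                result.Add(string.Join(separator, tokens.Skip(start).Take(n)));
            }
        }
        return result;
    }

    public static void Main()
    {
        string text = "very bad film";

        // Word features: NgramLength = 2, AllLengths = true.
        string[] words = text.Split(' ');
        Console.WriteLine(string.Join(", ", Extract(words, 2, true, " ")));

        // Character features: NgramLength = 3, AllLengths = false.
        string[] chars = text.Select(c => c.ToString()).ToArray();
        Console.WriteLine(string.Join(", ", Extract(chars, 3, false, "")));
    }
}
```

With word bigrams and `allLengths` set to true, the three-word input yields three unigrams and two bigrams, which is why the feature space grows quickly as the n-gram length increases.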


// Add a FastTreeBinaryClassifier, the decision tree learner for this project, and
// three hyperparameters to be used for tuning decision tree performance.
pipeline.Add(new FastTreeBinaryClassifier() {NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2});
Contributor

I'd recommend Bigrams+Trichargrams w/ AveragedPerceptronBinaryClassifier{iter=10} for text. AveragedPerceptron generally wins on text vs. FastTree in terms of accuracy & speed.

Contributor Author

Thank you! I will use them for the advanced-scenario samples. This one is a "Hello world" type of sample where I'm trying to make the pipeline as simple as possible. In the next samples I'll demonstrate how the results can be improved with different transforms.


Console.WriteLine("=============== Training model ===============");
// We train our pipeline based on the dataset that has been loaded and transformed
var model = pipeline.Train<SentimentData, SentimentPrediction>();

await model.WriteAsync(ModelPath);

Console.WriteLine("=============== End training ===============");
Console.WriteLine("The model is saved to {0}", ModelPath);

return model;
}

private static void Evaluate(PredictionModel<SentimentData, SentimentPrediction> model)
{
var testData = new TextLoader<SentimentData>(TestDataPath, useHeader: true, separator: "tab");

// BinaryClassificationEvaluator computes the quality metrics for the PredictionModel
// using the specified data set.
var evaluator = new BinaryClassificationEvaluator();

Console.WriteLine("=============== Evaluating model ===============");

// BinaryClassificationMetrics contains the overall metrics computed by binary classification evaluators
var metrics = evaluator.Evaluate(model, testData);

// The Accuracy metric gets the accuracy of a classifier, which is the proportion
// of correct predictions in the test set.

// The Auc metric gets the area under the ROC curve.
// The area under the ROC curve is equal to the probability that the classifier ranks
// a randomly chosen positive instance higher than a randomly chosen negative one
// (assuming 'positive' ranks higher than 'negative').

// The F1Score metric gets the classifier's F1 score.
// The F1 score is the harmonic mean of precision and recall:
// 2 * precision * recall / (precision + recall).

Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
Console.WriteLine($"Auc: {metrics.Auc:P2}");
Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
Console.WriteLine("=============== End evaluating ===============");
Console.WriteLine();
}
}
}
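
The metric definitions in the comments of `Evaluate` above can be checked by hand on a tiny example. The following is a minimal sketch in plain C#, independent of ML.NET; `BinaryMetricsSketch`, its methods, and the five labels and scores are made up for illustration, and the 0.5 decision threshold is an assumption. It is not how `BinaryClassificationEvaluator` is implemented.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration only; not part of Microsoft.ML.
public static class BinaryMetricsSketch
{
    // Proportion of predictions that match the actual labels.
    public static double Accuracy(bool[] actual, bool[] predicted)
    {
        int correct = 0;
        for (int i = 0; i < actual.Length; i++)
        {
            if (actual[i] == predicted[i]) correct++;
        }
        return correct / (double)actual.Length;
    }

    // Harmonic mean of precision and recall: 2 * p * r / (p + r).
    public static double F1(bool[] actual, bool[] predicted)
    {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < actual.Length; i++)
        {
            if (predicted[i] && actual[i]) tp++;
            else if (predicted[i] && !actual[i]) fp++;
            else if (!predicted[i] && actual[i]) fn++;
        }
        if (tp == 0) return 0.0; // no true positives: treat F1 as zero
        double precision = tp / (double)(tp + fp);
        double recall = tp / (double)(tp + fn);
        return 2 * precision * recall / (precision + recall);
    }

    // Fraction of (positive, negative) pairs in which the positive example
    // receives the higher score; ties count as half.
    public static double Auc(bool[] actual, double[] scores)
    {
        double wins = 0;
        long pairs = 0;
        for (int i = 0; i < actual.Length; i++)
        {
            if (!actual[i]) continue;
            for (int j = 0; j < actual.Length; j++)
            {
                if (actual[j]) continue;
                pairs++;
                if (scores[i] > scores[j]) wins += 1.0;
                else if (scores[i] == scores[j]) wins += 0.5;
            }
        }
        return pairs == 0 ? double.NaN : wins / pairs;
    }

    public static void Main()
    {
        // Made-up labels and scores, for illustration only.
        bool[] actual = { true, true, false, false, true };
        double[] scores = { 0.9, 0.4, 0.6, 0.1, 0.8 };
        bool[] predicted = scores.Select(s => s >= 0.5).ToArray();

        Console.WriteLine($"Accuracy: {Accuracy(actual, predicted):P2}");
        Console.WriteLine($"Auc: {Auc(actual, scores):P2}");
        Console.WriteLine($"F1Score: {F1(actual, predicted):P2}");
    }
}
```

On this data, three of five predictions are correct, five of the six positive/negative pairs are ranked correctly, and precision and recall are both 2/3, so the three metrics differ even though they are computed from the same scores.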
@@ -0,0 +1,11 @@
using Microsoft.ML.Runtime.Api;

namespace BinaryClassification_SentimentAnalysis
{
public class SentimentData
{
[Column("0")] public string SentimentText;

I suggest putting the attributes on separate lines as that's more in line with default formatting.


[Column("1", name: "Label")] public float Sentiment;
}
}
@@ -0,0 +1,9 @@
using Microsoft.ML.Runtime.Api;

namespace BinaryClassification_SentimentAnalysis
{
public class SentimentPrediction
{
[ColumnName("PredictedLabel")] public bool Sentiment;
}
}
@@ -0,0 +1,26 @@
using System.Collections.Generic;

namespace BinaryClassification_SentimentAnalysis
{
internal class TestSentimentData
{
internal static readonly IEnumerable<SentimentData> Sentiments = new[]
{
new SentimentData
{
SentimentText = "Contoso's 11 is a wonderful experience",
Sentiment = 0
},
new SentimentData
{
SentimentText = "The acting in this movie is very bad",
Sentiment = 0
},
new SentimentData
{
SentimentText = "Joe versus the Volcano Coffee Company is a great film.",
Sentiment = 0
}
};
}
}
37 changes: 37 additions & 0 deletions Samples/GettingStarted/GettingStarted.sln
@@ -0,0 +1,37 @@

Microsoft Visual Studio Solution File, Format Version 12.00
# Visual Studio 15
VisualStudioVersion = 15.0.27703.2000
MinimumVisualStudioVersion = 10.0.40219.1
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "Regression_TaxiFarePrediction", "Regression_TaxiFarePrediction\Regression_TaxiFarePrediction.csproj", "{C7301D08-10E3-4A51-A70D-7C0BCB39F6E6}"
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "BinaryClassification_SentimentAnalysis", "BinaryClassification_SentimentAnalysis\BinaryClassification_SentimentAnalysis.csproj", "{ED877F56-5304-4F0D-A75C-4C77219C8D0E}"
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "MulticlassClassification_Iris", "MulticlassClassification_Iris\MulticlassClassification_Iris.csproj", "{EEC2E07E-7482-4F37-8F7A-135EBDEC75B4}"
EndProject
Global
GlobalSection(SolutionConfigurationPlatforms) = preSolution
Debug|Any CPU = Debug|Any CPU
Release|Any CPU = Release|Any CPU
EndGlobalSection
GlobalSection(ProjectConfigurationPlatforms) = postSolution
{C7301D08-10E3-4A51-A70D-7C0BCB39F6E6}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{C7301D08-10E3-4A51-A70D-7C0BCB39F6E6}.Debug|Any CPU.Build.0 = Debug|Any CPU
{C7301D08-10E3-4A51-A70D-7C0BCB39F6E6}.Release|Any CPU.ActiveCfg = Release|Any CPU
{C7301D08-10E3-4A51-A70D-7C0BCB39F6E6}.Release|Any CPU.Build.0 = Release|Any CPU
{ED877F56-5304-4F0D-A75C-4C77219C8D0E}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{ED877F56-5304-4F0D-A75C-4C77219C8D0E}.Debug|Any CPU.Build.0 = Debug|Any CPU
{ED877F56-5304-4F0D-A75C-4C77219C8D0E}.Release|Any CPU.ActiveCfg = Release|Any CPU
{ED877F56-5304-4F0D-A75C-4C77219C8D0E}.Release|Any CPU.Build.0 = Release|Any CPU
{EEC2E07E-7482-4F37-8F7A-135EBDEC75B4}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{EEC2E07E-7482-4F37-8F7A-135EBDEC75B4}.Debug|Any CPU.Build.0 = Debug|Any CPU
{EEC2E07E-7482-4F37-8F7A-135EBDEC75B4}.Release|Any CPU.ActiveCfg = Release|Any CPU
{EEC2E07E-7482-4F37-8F7A-135EBDEC75B4}.Release|Any CPU.Build.0 = Release|Any CPU
EndGlobalSection
GlobalSection(SolutionProperties) = preSolution
HideSolutionNode = FALSE
EndGlobalSection
GlobalSection(ExtensibilityGlobals) = postSolution
SolutionGuid = {B84E804C-06CA-45C8-9B9F-8F69CA930535}
EndGlobalSection
EndGlobal
17 changes: 17 additions & 0 deletions Samples/GettingStarted/MulticlassClassification_Iris/IrisData.cs
@@ -0,0 +1,17 @@
using Microsoft.ML.Runtime.Api;

namespace MulticlassClassification_Iris
{
public class IrisData
{
[Column("0")] public float Label;

[Column("1")] public float SepalLength;

[Column("2")] public float SepalWidth;

[Column("3")] public float PetalLength;

[Column("4")] public float PetalWidth;
}
}
@@ -0,0 +1,9 @@
using Microsoft.ML.Runtime.Api;

namespace MulticlassClassification_Iris
{
public class IrisPrediction
{
[ColumnName("Score")] public float[] Score;
}
}
Binary file not shown.
@@ -0,0 +1,22 @@
<Project Sdk="Microsoft.NET.Sdk">

<PropertyGroup>
<OutputType>Exe</OutputType>
<TargetFramework>netcoreapp2.0</TargetFramework>
</PropertyGroup>

<ItemGroup>
<PackageReference Include="Microsoft.ML" Version="0.1.0" />
</ItemGroup>

<ItemGroup>
<Folder Include="Models\" />
</ItemGroup>

<ItemGroup>
<None Update="Models\IrisModel.zip">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</None>
</ItemGroup>

</Project>
103 changes: 103 additions & 0 deletions Samples/GettingStarted/MulticlassClassification_Iris/Program.cs
@@ -0,0 +1,103 @@
using System;
using System.IO;
using Microsoft.ML.Models;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;
using Microsoft.ML;
using System.Threading.Tasks;

namespace MulticlassClassification_Iris
{
public static partial class Program
{
private static string AppPath => Path.GetDirectoryName(Environment.GetCommandLineArgs()[0]);
private static string TrainDataPath => Path.Combine(AppPath, @"..\..\..\..\datasets\", "iris_train.txt");
private static string TestDataPath => Path.Combine(AppPath, @"..\..\..\..\datasets\", "iris_test.txt");

Don't use backslashes as this doesn't work cross-platform. You should do:

Path.Combine(AppPath, "..", "..", "..", "..", "datasets", "iris_train.txt");
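
As a rough illustration of the suggestion above, here is a small self-contained sketch. `DatasetPaths` and `Resolve` are hypothetical names, and `AppContext.BaseDirectory` is swapped in for the sample's `AppPath` purely to keep the snippet standalone. Passing each segment separately lets `Path.Combine` insert the platform's own separator, whereas a literal `..\..\` string is only split into directories on Windows.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration only.
public static class DatasetPaths
{
    // Builds <appPath>/../../../../datasets/<fileName> one segment at a time,
    // so no separator character is hard-coded.
    public static string Resolve(string appPath, string fileName)
    {
        return Path.Combine(appPath, "..", "..", "..", "..", "datasets", fileName);
    }

    public static void Main()
    {
        // Assumption: the application base directory stands in for AppPath.
        string appPath = AppContext.BaseDirectory;
        string relative = Resolve(appPath, "iris_train.txt");

        Console.WriteLine(relative);
        // GetFullPath collapses the ".." segments into an absolute path.
        Console.WriteLine(Path.GetFullPath(relative));
    }
}
```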

private static string ModelPath => Path.Combine(AppPath, "Models", "IrisModel.zip");

private static async Task Main(string[] args)
{
var model = await TrainAsync();

Evaluate(model);

Console.WriteLine();
var prediction = model.Predict(TestIrisData.Iris1);
Console.WriteLine($"Actual: type 1. Predicted probability: type 1: {prediction.Score[0]:0.####}");
Console.WriteLine($" type 2: {prediction.Score[1]:0.####}");
Console.WriteLine($" type 3: {prediction.Score[2]:0.####}");
Console.WriteLine();

prediction = model.Predict(TestIrisData.Iris2);
Console.WriteLine($"Actual: type 3. Predicted probability: type 1: {prediction.Score[0]:0.####}");
Console.WriteLine($" type 2: {prediction.Score[1]:0.####}");
Console.WriteLine($" type 3: {prediction.Score[2]:0.####}");
Console.WriteLine();

prediction = model.Predict(TestIrisData.Iris3);
Console.WriteLine($"Actual: type 2. Predicted probability: type 1: {prediction.Score[0]:0.####}");
Console.WriteLine($" type 2: {prediction.Score[1]:0.####}");
Console.WriteLine($" type 3: {prediction.Score[2]:0.####}");

Console.ReadLine();
}

internal static async Task<PredictionModel<IrisData, IrisPrediction>> TrainAsync()
{
var pipeline = new LearningPipeline
{
new TextLoader<IrisData>(TrainDataPath, useHeader: false),
new ColumnConcatenator("Features",
"SepalLength",
"SepalWidth",
"PetalLength",
"PetalWidth"),
new StochasticDualCoordinateAscentClassifier()
};

Console.WriteLine("=============== Training model ===============");

var model = pipeline.Train<IrisData, IrisPrediction>();

await model.WriteAsync(ModelPath);

Console.WriteLine("=============== End training ===============");
Console.WriteLine("The model is saved to {0}", ModelPath);

return model;
}

private static void Evaluate(PredictionModel<IrisData, IrisPrediction> model)
{
var testData = new TextLoader<IrisData>(TestDataPath, useHeader: false);

var evaluator = new ClassificationEvaluator {OutputTopKAcc = 3};

Console.WriteLine("=============== Evaluating model ===============");

var metrics = evaluator.Evaluate(model, testData);
Console.WriteLine("Metrics:");
Console.WriteLine($" AccuracyMacro = {metrics.AccuracyMacro:0.####}, a value between 0 and 1, the closer to 1, the better");
Console.WriteLine($" AccuracyMicro = {metrics.AccuracyMicro:0.####}, a value between 0 and 1, the closer to 1, the better");
Console.WriteLine($" LogLoss = {metrics.LogLoss:0.####}, the closer to 0, the better");
Console.WriteLine($" LogLoss for class 1 = {metrics.PerClassLogLoss[0]:0.####}, the closer to 0, the better");
Contributor

@justinormont justinormont May 22, 2018

Would it be possible to get the actual name for class 1? When a user runs on their own dataset, they will likely be interested in their actual class names.
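
One possible shape for this, sketched in plain C# without ML.NET: pair each per-class value with a name instead of a hard-coded index. `PerClassReport` and `Format` are hypothetical names, and the class names and loss values below are placeholders, not output from this sample. Whether `metrics.ConfusionMatrix.ClassNames` carries the original label names in this version of the library is not confirmed here.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical helper for illustration only; not part of Microsoft.ML.
public static class PerClassReport
{
    // Produces one line per class, labelled with the class name rather than its index.
    public static List<string> Format(
        IReadOnlyList<string> classNames, IReadOnlyList<double> perClassLogLoss)
    {
        if (classNames.Count != perClassLogLoss.Count)
        {
            throw new ArgumentException("Expected one log-loss value per class name.");
        }

        return classNames
            .Select((name, i) => string.Format(
                CultureInfo.InvariantCulture,
                "LogLoss for class {0} = {1:0.####}, the closer to 0, the better",
                name,
                perClassLogLoss[i]))
            .ToList();
    }

    public static void Main()
    {
        // Placeholder names and values, for illustration only.
        string[] classNames = { "setosa", "versicolor", "virginica" };
        double[] perClassLogLoss = { 0.05, 0.12, 0.2 };

        foreach (string line in Format(classNames, perClassLogLoss))
        {
            Console.WriteLine(line);
        }
    }
}
```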

Console.WriteLine($" LogLoss for class 2 = {metrics.PerClassLogLoss[1]:0.####}, the closer to 0, the better");
Console.WriteLine($" LogLoss for class 3 = {metrics.PerClassLogLoss[2]:0.####}, the closer to 0, the better");
Console.WriteLine();
Console.WriteLine($" ConfusionMatrix:");

// Print confusion matrix
for (var i = 0; i < metrics.ConfusionMatrix.Order; i++)
{
for (var j = 0; j < metrics.ConfusionMatrix.ClassNames.Count; j++)
{
Console.Write("\t" + metrics.ConfusionMatrix[i, j] + "\t");
}
Console.WriteLine();
}

Console.WriteLine("=============== End evaluating ===============");
Console.WriteLine();
}
}
}