Add an example for static pipeline with in-memory data and show how to get class probabilities #1953

Merged
merged 6 commits into from
Jan 8, 2019
@@ -0,0 +1,105 @@
using Microsoft.ML.Data;
using Microsoft.ML.LightGBM.StaticPipe;
using Microsoft.ML.SamplesUtils;
using Microsoft.ML.StaticPipe;
using System;
using System.Collections.Generic;
using System.Linq;

namespace Microsoft.ML.Samples.Static
{
class LightGBMMulticlassWithInMemoryData
Member

@sfilipi sfilipi Jan 2, 2019

LightGBMMulticlassWithInMemoryData

this needs to be referenced from some other file, otherwise it won't display in the documentation.

Maybe from the LightGBMStatics catalog, if we have one already? #Closed

Member Author

@wschin wschin Jan 7, 2019

I will do it following pattern for MF. Please take a look at LightGbmStaticExtensions.cs in the next iteration.



{
public void MultiClassLightGbmStaticPipelineWithInMemoryData()
Member

@sfilipi sfilipi Jan 2, 2019

MultiClassLightGbmStaticPipelineWithInMemoryData

I think most of the emphasis in 1.0 is on the dynamic API. Would this example add value as a dynamic sample? #WontFix

Member Author

Maybe in another PR if really needed.



{
// Create a general context for ML.NET operations. It can be used for exception tracking and logging,
// as a catalog of available operations and as the source of randomness.
var mlContext = new MLContext();

// Create in-memory examples as C# native class.
var examples = DatasetUtils.GenerateRandomMulticlassClassificationExamples(1000);

// Convert native C# class to IDataView, a format consumable by ML.NET functions.
var dataView = ComponentCreation.CreateDataView(mlContext, examples);

// IDataView is the data format used in dynamic-typed pipeline. To use static-typed pipeline, we need to convert
// IDataView to DataView by calling AssertStatic(...). The basic idea is to specify the static type for each column
// in IDataView in a lambda function.
var staticDataView = dataView.AssertStatic(mlContext, c => (
Features: c.R4.Vector,
Label: c.Text.Scalar));

// Create static pipeline. First, we make an estimator out of static DataView as the starting of a pipeline.
// Then, we append necessary transforms and a classifier to the starting estimator.
var pipe = staticDataView.MakeNewEstimator()
.Append(mapper: r => (
r.Label,
// Train multi-class LightGBM. The trained model maps Features to Label and probability of each class.
// The call to ToKey() is needed to convert string labels to integer indexes.
Predictions: mlContext.MulticlassClassification.Trainers.LightGbm(r.Label.ToKey(), r.Features)
))
.Append(r => (
// Actual label.
r.Label,
// Labels are converted to keys when training LightGBM, so we convert them here again to call the evaluation function.
LabelIndex: r.Label.ToKey(),
// Used to compute metrics such as accuracy.
r.Predictions,
// Assign a new name to predicted class index.
PredictedLabelIndex: r.Predictions.predictedLabel,
// Assign a new name to class probabilities.
Scores: r.Predictions.score
));

// Split the static-typed data into training and test sets. Only training set is used in fitting
// the created pipeline. Metrics are computed on the test set.
var (trainingData, testingData) = mlContext.MulticlassClassification.TrainTestSplit(staticDataView, testFraction: 0.5);

// Train the model.
var model = pipe.Fit(trainingData);

// Do prediction on the test set.
var prediction = model.Transform(testingData);

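// Evaluate the trained model on the test set.
var metrics = mlContext.MulticlassClassification.Evaluate(prediction, r => r.LabelIndex, r => r.Predictions);

// Check if metrics are reasonable.
// Console output (values taken from the accompanying test, which seeds MLContext; an unseeded run may differ):
// Macro accuracy: 0.863482146891263, Micro accuracy: 0.86309523809523814.
Console.WriteLine("Macro accuracy: {0}, Micro accuracy: {1}.", metrics.AccuracyMacro, metrics.AccuracyMicro);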

// Convert prediction in ML.NET format to native C# class.
var nativePredictions = new List<DatasetUtils.MulticlassClassificationExample>(prediction.AsDynamic.AsEnumerable<DatasetUtils.MulticlassClassificationExample>(mlContext, false));

// Get schema object out of the prediction. It contains metadata such as the mapping from predicted label index
// (e.g., 1) to its actual label (e.g., "AA"). The call to "AsDynamic" converts our statically-typed pipeline into
// a dynamically-typed one only for extracting metadata. In the future, metadata in statically-typed pipeline should
// be accessible without dynamically-typed things.
var schema = prediction.AsDynamic.Schema;

// Retrieve the mapping from label indexes to labels.
var labelBuffer = new VBuffer<ReadOnlyMemory<char>>();
schema[nameof(DatasetUtils.MulticlassClassificationExample.PredictedLabelIndex)].Metadata.GetValue("KeyValues", ref labelBuffer);
// nativeLabels is { "AA" , "BB", "CC", "DD" }
var nativeLabels = labelBuffer.DenseValues().ToArray(); // nativeLabels[nativePrediction.PredictedLabelIndex - 1] is the original label indexed by nativePrediction.PredictedLabelIndex.


// Show prediction result for the 3rd example.
var nativePrediction = nativePredictions[2];
// Console output:
// Our predicted label to this example is "AA" with probability 0.922597349.
Console.WriteLine("Our predicted label to this example is {0} with probability {1}",
Member

@sfilipi sfilipi Jan 2, 2019

Console.WriteLine("

have a comment for the WriteLines showing here directly what the output would be. #Closed

Member Author

No problem. Will do the same for Line 159 below.



nativeLabels[(int)nativePrediction.PredictedLabelIndex - 1],
nativePrediction.Scores[(int)nativePrediction.PredictedLabelIndex - 1]);

var expectedProbabilities = new float[] { 0.922597349f, 0.07508608f, 0.00221699756f, 9.95488E-05f };
// Scores and nativeLabels are two parallel attributes; that is, Scores[i] is the probability of being nativeLabels[i].
// Console output:
// The probability of being class "AA" is 0.922597349.
// The probability of being class "BB" is 0.07508608.
// The probability of being class "CC" is 0.00221699756.
// The probability of being class "DD" is 9.95488E-05.
for (int i = 0; i < labelBuffer.Length; ++i)
Console.WriteLine("The probability of being class {0} is {1}.", nativeLabels[i], nativePrediction.Scores[i]);
}
}
}
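The indexing convention at the end of the sample (key values are 1-based, so `PredictedLabelIndex - 1` indexes both `nativeLabels` and `Scores`) can be sketched without ML.NET. This is only an illustration under stated assumptions: `KeyLookupSketch` and `Describe` are hypothetical names, and the label and score arrays are copied from the console output quoted in the sample rather than produced by a trained model.

```csharp
using System;

static class KeyLookupSketch
{
    // Hypothetical helper: maps a 1-based key index to its label and probability.
    // Index 0 is rejected here; ML.NET key types use 0 to denote a missing value.
    public static (string Label, float Probability) Describe(
        uint predictedLabelIndex, string[] labels, float[] scores)
    {
        if (predictedLabelIndex == 0 || predictedLabelIndex > labels.Length)
            throw new ArgumentOutOfRangeException(nameof(predictedLabelIndex));

        // labels and scores are parallel arrays: scores[i] belongs to labels[i].
        int i = (int)predictedLabelIndex - 1;
        return (labels[i], scores[i]);
    }

    static void Main()
    {
        // Values copied from the sample's quoted console output.
        var labels = new[] { "AA", "BB", "CC", "DD" };
        var scores = new[] { 0.922597349f, 0.07508608f, 0.00221699756f, 9.95488E-05f };

        var (label, probability) = Describe(1, labels, scores);
        Console.WriteLine("Predicted label {0} with probability {1}", label, probability);
    }
}
```

Keeping the `- 1` offset in one helper avoids repeating it at every call site, which is where off-by-one mistakes tend to appear.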
@@ -181,6 +181,13 @@ public static Scalar<float> LightGbm<TVal>(this RankingContext.RankingTrainers c
/// the linear model that was trained. Note that this action cannot change the
/// result in any way; it is only a way for the caller to be informed about what was learnt.</param>
/// <returns>The set of output columns including in order the predicted per-class likelihoods (between 0 and 1, and summing up to 1), and the predicted label.</returns>
/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[MF](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Static/LightGBMMulticlassWithInMemoryData.cs)]
/// ]]>
/// </format>
/// </example>
public static (Vector<float> score, Key<uint, TVal> predictedLabel)
LightGbm<TVal>(this MulticlassClassificationContext.MulticlassClassificationTrainers ctx,
Key<uint, TVal> label,
63 changes: 63 additions & 0 deletions src/Microsoft.ML.SamplesUtils/SamplesDatasetUtils.cs
@@ -237,5 +237,68 @@ public static IEnumerable<SampleVectorOfNumbersData> GetVectorOfNumbersData()
});
return data;
}

/// <summary>
/// Feature vector's length in <see cref="MulticlassClassificationExample"/>.
/// </summary>
private const int _featureVectorLength = 10;

public class MulticlassClassificationExample
{
[VectorType(_featureVectorLength)]
public float[] Features;
[ColumnName("Label")]
public string Label;
public uint LabelIndex;
public uint PredictedLabelIndex;
[VectorType(4)]
// The probabilities of being "AA", "BB", "CC", and "DD".
public float[] Scores;

public MulticlassClassificationExample()
{
Features = new float[_featureVectorLength];
}
}

/// <summary>
/// Helper function used to generate random <see cref="MulticlassClassificationExample"/>s.
/// </summary>
/// <param name="count">Number of generated examples.</param>
/// <returns>A list of random examples.</returns>
public static List<MulticlassClassificationExample> GenerateRandomMulticlassClassificationExamples(int count)
{
var examples = new List<MulticlassClassificationExample>();
var rnd = new Random(0);
for (int i = 0; i < count; ++i)
{
var example = new MulticlassClassificationExample();
var res = i % 4;
// Generate random float feature values.
for (int j = 0; j < _featureVectorLength; ++j)
{
var value = (float)rnd.NextDouble() + res * 0.2f;
example.Features[j] = value;
}

// Generate label based on the class index (i % 4), which also shifts the feature values above.
if (res == 0)
example.Label = "AA";
else if (res == 1)
example.Label = "BB";
else if (res == 2)
example.Label = "CC";
else
example.Label = "DD";

// The following three attributes are just placeholders for storing prediction results.
example.LabelIndex = default;
example.PredictedLabelIndex = default;
example.Scores = new float[4];

examples.Add(example);
}
return examples;
}
}
}
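The generator above assigns labels round-robin (`i % 4`) and shifts every feature by `res * 0.2f`, so the classes are balanced and neighbouring classes overlap. A minimal sketch of just that rule, without the ML.NET attributes, is shown below; `GeneratorSketch` and `Generate` are hypothetical names introduced for illustration, not part of the PR.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class GeneratorSketch
{
    const int FeatureVectorLength = 10;
    static readonly string[] Labels = { "AA", "BB", "CC", "DD" };

    // Hypothetical stand-in for the generator: same label and feature rule as above.
    public static List<(string Label, float[] Features)> Generate(int count)
    {
        var rnd = new Random(0);
        var examples = new List<(string Label, float[] Features)>();
        for (int i = 0; i < count; ++i)
        {
            int res = i % 4;
            var features = new float[FeatureVectorLength];
            // Uniform noise in [0, 1] plus a class-dependent offset of res * 0.2.
            for (int j = 0; j < FeatureVectorLength; ++j)
                features[j] = (float)rnd.NextDouble() + res * 0.2f;
            examples.Add((Labels[res], features));
        }
        return examples;
    }

    static void Main()
    {
        // Print the class counts and the mean feature value of each class.
        foreach (var group in Generate(1000).GroupBy(e => e.Label))
            Console.WriteLine("{0}: {1} examples, mean feature value {2}",
                group.Key, group.Count(), group.SelectMany(e => e.Features).Average());
    }
}
```

With 1000 examples each class receives 250, and the per-class mean feature value should sit near `0.5 + 0.2 * res`. Because the offset is small relative to the noise range, the classes are not perfectly separable.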
@@ -9,6 +9,7 @@
<ProjectReference Include="..\..\src\Microsoft.ML.ImageAnalytics\Microsoft.ML.ImageAnalytics.csproj" />
<ProjectReference Include="..\..\src\Microsoft.ML.LightGBM.StaticPipe\Microsoft.ML.LightGBM.StaticPipe.csproj" />
<ProjectReference Include="..\..\src\Microsoft.ML.LightGBM\Microsoft.ML.LightGBM.csproj" />
<ProjectReference Include="..\..\src\Microsoft.ML.SamplesUtils\Microsoft.ML.SamplesUtils.csproj" />
<ProjectReference Include="..\..\src\Microsoft.ML.StandardLearners\Microsoft.ML.StandardLearners.csproj" />
<ProjectReference Include="..\..\src\Microsoft.ML.StaticPipe\Microsoft.ML.StaticPipe.csproj" />
<ProjectReference Include="..\Microsoft.ML.TestFramework\Microsoft.ML.TestFramework.csproj" />
86 changes: 86 additions & 0 deletions test/Microsoft.ML.StaticPipelineTesting/Training.cs
@@ -3,6 +3,7 @@
// See the LICENSE file in the project root for more information.

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
@@ -13,6 +14,7 @@
using Microsoft.ML.LightGBM;
using Microsoft.ML.LightGBM.StaticPipe;
using Microsoft.ML.RunTests;
using Microsoft.ML.SamplesUtils;
using Microsoft.ML.StaticPipe;
using Microsoft.ML.Trainers;
using Microsoft.ML.Trainers.FastTree;
@@ -1009,5 +1011,89 @@ public void MatrixFactorization()
// Naive test. Just make sure the pipeline runs.
Assert.InRange(metrics.L2, 0, 0.5);
}

[ConditionalFact(typeof(Environment), nameof(Environment.Is64BitProcess))] // LightGBM is 64-bit only
public void MultiClassLightGbmStaticPipelineWithInMemoryData()
{
// Create a general context for ML.NET operations. It can be used for exception tracking and logging,
// as a catalog of available operations and as the source of randomness.
var mlContext = new MLContext(seed: 1, conc: 1);

// Create in-memory examples as C# native class.
var examples = DatasetUtils.GenerateRandomMulticlassClassificationExamples(1000);

// Convert native C# class to IDataView, a format consumable by ML.NET functions.
var dataView = ComponentCreation.CreateDataView(mlContext, examples);

// IDataView is the data format used in dynamic-typed pipeline. To use static-typed pipeline, we need to convert
// IDataView to DataView by calling AssertStatic(...). The basic idea is to specify the static type for each column
// in IDataView in a lambda function.
var staticDataView = dataView.AssertStatic(mlContext, c => (
Features: c.R4.Vector,
Label: c.Text.Scalar));

// Create static pipeline. First, we make an estimator out of static DataView as the starting of a pipeline.
// Then, we append necessary transforms and a classifier to the starting estimator.
var pipe = staticDataView.MakeNewEstimator()
.Append(mapper: r => (
r.Label,
// Train multi-class LightGBM. The trained model maps Features to Label and probability of each class.
// The call to ToKey() is needed to convert string labels to integer indexes.
Predictions: mlContext.MulticlassClassification.Trainers.LightGbm(r.Label.ToKey(), r.Features)
))
.Append(r => (
// Actual label.
r.Label,
// Labels are converted to keys when training LightGBM, so we convert them here again to call the evaluation function.
LabelIndex: r.Label.ToKey(),
// Used to compute metrics such as accuracy.
r.Predictions,
// Assign a new name to predicted class index.
PredictedLabelIndex: r.Predictions.predictedLabel,
// Assign a new name to class probabilities.
Scores: r.Predictions.score
));

// Split the static-typed data into training and test sets. Only training set is used in fitting
// the created pipeline. Metrics are computed on the test set.
var (trainingData, testingData) = mlContext.MulticlassClassification.TrainTestSplit(staticDataView, testFraction: 0.5);

// Train the model.
var model = pipe.Fit(trainingData);

// Do prediction on the test set.
var prediction = model.Transform(testingData);

// Evaluate the trained model on the test set.
var metrics = mlContext.MulticlassClassification.Evaluate(prediction, r => r.LabelIndex, r => r.Predictions);

// Check if metrics are reasonable.
Assert.Equal(0.863482146891263, metrics.AccuracyMacro, 6);
Assert.Equal(0.86309523809523814, metrics.AccuracyMicro, 6);

// Convert prediction in ML.NET format to native C# class.
var nativePredictions = new List<DatasetUtils.MulticlassClassificationExample>(prediction.AsDynamic.AsEnumerable<DatasetUtils.MulticlassClassificationExample>(mlContext, false));

// Get schema object of the prediction. It contains metadata such as the mapping from predicted label index
// (e.g., 1) to its actual label (e.g., "AA").
var schema = prediction.AsDynamic.Schema;

// Retrieve the mapping from label indexes to labels.
var labelBuffer = new VBuffer<ReadOnlyMemory<char>>();
schema[nameof(DatasetUtils.MulticlassClassificationExample.PredictedLabelIndex)].Metadata.GetValue("KeyValues", ref labelBuffer);
var nativeLabels = labelBuffer.DenseValues().ToList(); // nativeLabels[nativePrediction.PredictedLabelIndex-1] is the original label indexed by nativePrediction.PredictedLabelIndex.

// Show prediction result for the 3rd example.
var nativePrediction = nativePredictions[2];
var expectedProbabilities = new float[] { 0.922597349f, 0.07508608f, 0.00221699756f, 9.95488E-05f };
// Scores and nativeLabels are two parallel attributes; that is, Scores[i] is the probability of being nativeLabels[i].
for (int i = 0; i < labelBuffer.Length; ++i)
Assert.Equal(expectedProbabilities[i], nativePrediction.Scores[i], 6);

// The predicted label below should be with probability 0.922597349.
Console.WriteLine("Our predicted label to this example is {0} with probability {1}",
nativeLabels[(int)nativePrediction.PredictedLabelIndex-1],
nativePrediction.Scores[(int)nativePrediction.PredictedLabelIndex-1]);
}
}
}
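The test asserts on both `AccuracyMacro` and `AccuracyMicro`. How the two differ can be sketched in plain C# using the usual definitions: micro accuracy is the fraction of all examples classified correctly, and macro accuracy is the per-class accuracy averaged with equal weight per class. This is not ML.NET's evaluator code; `AccuracySketch` and its methods are hypothetical names, and the toy data is made up for illustration.

```csharp
using System;
using System.Linq;

static class AccuracySketch
{
    // Micro accuracy: fraction of all examples whose prediction matches the label.
    public static double MicroAccuracy(string[] actual, string[] predicted)
        => actual.Zip(predicted, (a, p) => a == p ? 1.0 : 0.0).Average();

    // Macro accuracy: accuracy computed per actual class, then averaged across classes.
    public static double MacroAccuracy(string[] actual, string[] predicted)
        => actual.Zip(predicted, (a, p) => (Actual: a, Correct: a == p ? 1.0 : 0.0))
                 .GroupBy(x => x.Actual)
                 .Average(g => g.Average(x => x.Correct));

    static void Main()
    {
        // Toy data: class "AA" is frequent and always right, class "BB" is rare and always wrong.
        var actual = new[] { "AA", "AA", "AA", "BB" };
        var predicted = new[] { "AA", "AA", "AA", "AA" };

        Console.WriteLine("Micro: {0}, Macro: {1}",
            MicroAccuracy(actual, predicted), MacroAccuracy(actual, predicted));
    }
}
```

On this toy data micro accuracy is 3/4 = 0.75 while macro accuracy is (1 + 0)/2 = 0.5. On the balanced four-class data used in the test the two metrics are expected to be close, which matches the asserted values.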