Multi classification - Probability #1881

mgolois · 2018-12-14T16:48:43Z

Hello,
I have an application that uses an ML.NET multiclass classification trainer to predict a category. However, it seems the TryGetScoreLabelNames() method was removed from the library. The application also needs to output how confident (as a percentage) the model is in the predicted label. How can I achieve that in ML.NET 0.8?

wschin · 2018-12-14T20:39:36Z

You can access Score column in the output.

mgolois · 2018-12-14T21:05:43Z

Yes, @wschin, I have the Score column. How can I use it to get the probability, or is there a way to calculate it from Score?

wschin · 2018-12-14T21:26:12Z

In most cases (99%), it's called Score. Do you have sample code showing how you use the APIs? Here is a possible example.

// Number of examples
private const int _rowNumber = 1000;
// Number of features
private const int _columnNumber = 5;
// Number of classes
private const int _classNumber = 3;
private class GbmExample
{
    [VectorType(_columnNumber)]
    public float[] Features;
    [KeyType(Contiguous = true, Count =_classNumber, Min = 0)]
    public uint Label;
    [VectorType(_classNumber)]
    public float[] Score;
}

[ConditionalFact(typeof(Environment), nameof(Environment.Is64BitProcess))] // LightGBM is 64-bit only
public void LightGbmMultiClassEstimatorCompare()
{
    // Training matrix. It contains all feature vectors.
    var dataMatrix = new float[_rowNumber * _columnNumber];
    // Labels for multi-class classification
    var labels = new uint[_rowNumber];
    // Training list, which is equivalent to the training matrix above.
    var dataList = new List<GbmExample>();
    for (/*row index*/ int i = 0; i < _rowNumber; ++i)
    {
        int featureSum = 0;
        var featureVector = new float[_columnNumber];
        for (/*column index*/ int j = 0; j < _columnNumber; ++j)
        {
            int featureValue = (j + i * _columnNumber) % 10;
            featureSum += featureValue;
            dataMatrix[j + i * _columnNumber] = featureValue;
            featureVector[j] = featureValue;
        }
        labels[i] = (uint)featureSum % _classNumber;
        dataList.Add(new GbmExample { Features = featureVector, Label = labels[i], Score = new float[_classNumber] });
    }

    var mlContext = new MLContext(seed: 0, conc: 1);
    var dataView = ComponentCreation.CreateDataView(mlContext, dataList);
    var gbmTrainer = new LightGbmMulticlassTrainer(mlContext, labelColumn: "Label", featureColumn: "Features", numBoostRound: 3,
        advancedSettings: s => { s.MinDataPerGroup = 1; s.MinDataPerLeaf = 1; });
    var gbm = gbmTrainer.Fit(dataView);
    var predicted = gbm.Transform(dataView);
    var predictions = new List<GbmExample>(predicted.AsEnumerable<GbmExample>(mlContext, false));
}

mgolois · 2018-12-14T21:43:18Z

@wschin Here is my sample code:

static void Main(string[] args)
        {
            MLContext mlContext = new MLContext();
            string dataPath = Path.Combine(Environment.CurrentDirectory, "Data", "all.csv");

            TextLoader textLoader = mlContext.Data.TextReader(new TextLoader.Arguments()
            {
                Separator = ",",
                HasHeader = false,
                Column = new[]
                {
                    new TextLoader.Column("Description", DataKind.Text, 1),
                    new TextLoader.Column("Type", DataKind.Text, 2),
                    new TextLoader.Column("Category", DataKind.Text, 3),
                }
            });

            var data = textLoader.Read(dataPath);
            var (trainData, testData) = mlContext.MulticlassClassification.TrainTestSplit(data, testFraction: 0.2);

         
            var dataProcessingPipeline = mlContext.Transforms.Categorical.OneHotEncoding("Type", "TypeEncoded")
                 .Append(mlContext.Transforms.Text.FeaturizeText("Description", "DescriptionEncoded"))
                 .Append(mlContext.Transforms.Concatenate("Features", "TypeEncoded", "DescriptionEncoded"))
                 .Append(mlContext.Transforms.Conversion.MapValueToKey("Category", "Label"))
                 .Append(mlContext.MulticlassClassification.Trainers.StochasticDualCoordinateAscent())
                 .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

       

            var model = dataProcessingPipeline.Fit(trainData);

            var metrics = mlContext.MulticlassClassification.Evaluate(model.Transform(testData));
            Console.WriteLine($"Accuracy Micro: {metrics.AccuracyMicro}");
            Console.WriteLine($"Accuracy Macro: {metrics.AccuracyMacro}");
          
            var predictionFunction = model.MakePredictionFunction<CategoryDetail, CategoryPrediction>(mlContext);
            var cat = predictionFunction.Predict(new CategoryDetail
                                                {
                                                    Description = "this is a great jacket",
                                                    Type = "Women"
                                                });

           //Now I need to output to the user the prediction cat.Prediction as well as the probability of that prediction

            Console.ReadKey();
        }


    public class CategoryDetail
    {
        [Column("1")]
        public string Description;
        [Column("2")]
        public string Type;
        [Column("3")]
        public string Category;
    }

    public class CategoryPrediction
    {
        [ColumnName("PredictedLabel")]
        public string Prediction { get; set; }

        [ColumnName("Score")]
        public float[] Score { get; set; }
    }

wschin · 2018-12-14T21:56:44Z

Can you access Score field in cat?

mgolois · 2018-12-14T22:10:52Z

Yes @wschin, I can; it's just an array of floats. How do I get a percentage out of it?

wschin · 2018-12-14T22:49:41Z

Scores[i] stores the probability of being the i-th class.

mgolois · 2018-12-14T23:10:38Z

@wschin OK, how do I get the list of labels from the model? There used to be an extension method on the model in 0.7.
If I have the list of label names from the model, I can do
index = labelNames.IndexOf(predictedLabel) and then confidently read
Scores[index]

OR

is it safe to assume that the maximum score was the predicted label?
scores.Max()

wschin · 2018-12-14T23:15:21Z

I'd expect the predicted name (column name: PredictedLabel) to be there too, as Prediction in CategoryPrediction. You can also use scores.Max() and then find the label index associated with the best probability.
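
As a rough illustration of that last suggestion, here is a minimal sketch in plain C# with no ML.NET dependency. `ScoreHelper` and `ArgMax` are hypothetical names made up for this example, and it assumes the trainer emits per-class probabilities in the Score array, as stated earlier in the thread.

```csharp
using System;

// Hypothetical helper; ScoreHelper and ArgMax are illustrative names, not ML.NET APIs.
public static class ScoreHelper
{
    // Returns the index of the largest value in the score array,
    // i.e. the slot of the most probable class.
    public static int ArgMax(float[] scores)
    {
        if (scores == null || scores.Length == 0)
            throw new ArgumentException("scores must be non-empty", nameof(scores));

        int best = 0;
        for (int i = 1; i < scores.Length; i++)
        {
            if (scores[i] > scores[best])
                best = i;
        }
        return best;
    }
}
```

With the `cat` object from the sample above, `cat.Score[ScoreHelper.ArgMax(cat.Score)] * 100` would then be the confidence of the predicted label as a percentage, under the same assumption that Score holds probabilities.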

polymorp · 2018-12-17T22:57:27Z

I would love to know how to do this too.
I can see the array of scores, but I cannot be sure which label each Scores[index] relates to. I know the percentage for the PredictedLabel result, since that is scores.Max(), but sometimes I also want to know the percentages for what came 2nd, 3rd, etc.

rauhs · 2018-12-18T07:25:03Z

IMO there should really be a public API to get the label names out of a model. Since TryGetScoreLabelNames was removed, there is currently no easy way. I think it's very common to want the Top-k predictions (sorted) when doing multiclass prediction. Many of the big image classification tasks (AlexNet, etc.) are compared by Top-5 accuracy.

What I'm currently doing is this:

    public static string[] GetKeyValues(this Schema.Column column)
    {
      var metadata = column.Metadata;
      var labels = default(VBuffer<ReadOnlyMemory<char>>);
      metadata.GetValue(MetadataUtils.Kinds.KeyValues, ref labels);
      var names = new string[labels.Length];
      int index = 0;
      foreach (var label in labels.DenseValues())
        names[index++] = label.ToString();

      return names;
    }

And then doing something like this to get the label names:

      var encodedSample = encoder.Transform(ctx.CreateDataView(new[] { new YourInputInstanceClass() }));
      var encodedSchema = encodedSample.Schema;
      var labelNames = encodedSchema[LabelLabel].GetKeyValues();

Could we get a public API so that we can get the label names back out? Especially once you save the model to a file and no longer have a big training data-set loaded, you'd also need to get the label names out of the model somehow.
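
Once the label names are available, the sorted Top-k list mentioned above could be built roughly like this. This is only a sketch in plain C#: `TopKHelper` and `TopK` are hypothetical names, and it assumes that `labelNames[i]` describes the same class as `scores[i]`, which is worth verifying against your own pipeline.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; TopKHelper and TopK are illustrative names, not ML.NET APIs.
public static class TopKHelper
{
    // Pairs each label name with its score and returns the k highest-scoring pairs,
    // best first. ASSUMPTION: labelNames[i] and scores[i] refer to the same class.
    public static List<KeyValuePair<string, float>> TopK(string[] labelNames, float[] scores, int k)
    {
        if (labelNames == null) throw new ArgumentNullException(nameof(labelNames));
        if (scores == null) throw new ArgumentNullException(nameof(scores));
        if (labelNames.Length != scores.Length)
            throw new ArgumentException("labelNames and scores must have the same length");

        return labelNames
            .Zip(scores, (name, score) => new KeyValuePair<string, float>(name, score))
            .OrderByDescending(pair => pair.Value)
            .Take(k)
            .ToList();
    }
}
```

For example, `TopKHelper.TopK(labelNames, prediction.Score, 3)` would give the three most likely labels with their scores, where `prediction` stands for whatever prediction object exposes the Score array.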

wschin · 2018-12-18T21:29:07Z

Thanks a lot. This sounds like a good idea, but we need to investigate whether it's doable. In your solution, the KeyValues metadata is accessible because
(1) you know that's the label column
(2) its metadata was not corrupted.
Some feature engineering steps don't preserve metadata. At a minimum, we need to properly propagate metadata at each step in a pipeline and then call

        /// <summary>
        /// Add slot names metadata.
        /// </summary>
        /// <param name="size">The size of the slot names vector.</param>
        /// <param name="getter">The getter delegate for the slot names.</param>
        public void AddSlotNames(int size, ValueGetter<VBuffer<ReadOnlyMemory<char>>> getter)
            => Add(MetadataUtils.Kinds.SlotNames, new VectorType(TextType.Instance, size), getter);

in each classifier to attach label names to their scores.

karllarssonusawest · 2018-12-21T23:32:45Z

Just jumping in to agree that returning the top N predictions with matching score is vital, something we've lost with the removal of 'TryGetScoreLabelNames'.

I'm able to do it when training the model, when I have all the data to hand (the IDataView dataView parameter below):

private static bool TryGetScoreLabelNames(TransformerChain<KeyToValueMappingTransformer> model, IDataView dataView, out string[] names, string scoreColumnName = DefaultColumnNames.Score)
        {
            names = (string[])null;
            Schema outputSchema = model.GetOutputSchema(dataView.Schema);
            int col = -1;
            if (!outputSchema.TryGetColumnIndex(scoreColumnName, out col))
            {
                return false;
            }

            VectorType valueCount = (VectorType)outputSchema.GetColumnType(col);

            if (!outputSchema.HasSlotNames(col, valueCount.Size))
            {
                return false;
            }

            VBuffer<ReadOnlyMemory<char>> vbuffer = new VBuffer<ReadOnlyMemory<char>>();
            outputSchema.GetMetadata<VBuffer<ReadOnlyMemory<char>>>("SlotNames", col, ref vbuffer);
            if (vbuffer.Length != valueCount.Size)
            {
                return false;
            }
            names = new string[valueCount.Size];
            int num = 0;
            foreach (ReadOnlyMemory<char> denseValue in vbuffer.DenseValues())
            {
                names[num++] = denseValue.ToString();
            }
            return true;
        }

However, I haven't figured out how to do that when I'm loading the model later on to make a single prediction. At that point, I don't have an IDataView object I can pass into the function.

wschin · 2018-12-22T01:22:35Z

I just made an example for extracting label names from learned pipeline. Please take a look at #1953.

karllarssonusawest · 2018-12-28T22:54:25Z

Thanks, wschin - I couldn't translate your example into something that would work for me without a lot of retrofitting, but I did stumble on my own solution, I think. I keep the 'TryGetScoreLabelNames' function as is, but instead of using model.MakePredictionFunction to make my prediction, I broke that back down into its individual steps:

var inputData = mlContext.CreateDataView(new ClassOfItemToPredict[] { itemToPredict});
var outputData = _trainedModel.Transform(inputData);
var prediction = outputData.AsEnumerable<PredictionClass>(mlContext, reuseRowObject: false).Single();

 //Get labels
string[] scoresLabels;
if (TryGetScoreLabelNames(_trainedModel, outputData, out scoresLabels)) { };

That then gives me the IDataView I can pass into TryGetScoreLabelNames to get the schema etc...

Ivanidzo4ka · 2019-01-31T01:50:38Z

Just a heads up.
It's still a bit complicated and requires working with metadata, which you already do, but in 0.10 the prediction function will contain OutputSchema.
https://github.com/dotnet/machinelearning/blob/master/test/Microsoft.ML.Tests/Scenarios/Api/Estimators/PredictAndMetadata.cs
So that should slightly ease the process of getting this information. Working with metadata is still a mess, and we need to come up with a more user-friendly way to access it. But at least now you don't need extra steps to access the schema.

biztechprogramming · 2019-02-27T16:07:32Z

I looked at the sample code that supposedly closes this issue. It is extremely ugly code. This issue should not be closed.

tangbinbinyes · 2021-04-17T11:06:54Z

"Scores[i] stores the probability of being the i-th class."

Thanks. But how do I know which category "i" corresponds to?

singlis added the question label Dec 14, 2018

wschin mentioned this issue Dec 22, 2018

Add an example for static pipeline with in-memory data and show how to get class probabilities #1953

Merged

pekspro mentioned this issue Jan 1, 2019

Why do I need to add a property to my class to get Score? #1991

Closed

wschin closed this as completed in #1953 Jan 8, 2019

ghost locked as resolved and limited conversation to collaborators Mar 26, 2022
