Multi classification - Probability #1881

Closed
mgolois opened this issue Dec 14, 2018 · 18 comments · Fixed by #1953
Labels
question Further information is requested

Comments

@mgolois

mgolois commented Dec 14, 2018

Hello,
I have an application that uses an ML.NET multiclass classification trainer to predict a category. However, it seems the TryGetScoreLabelNames() method has been removed from the library. The application should also output how confident the model is in the predicted label, as a percentage. How can I achieve that in ML.NET 0.8?

@wschin
Member

wschin commented Dec 14, 2018

You can access the Score column in the output.

@mgolois
Author

mgolois commented Dec 14, 2018

Yes, @wschin, I have the Score column. How can I use it to get the probability, or is there a way to calculate it from Score?

@wschin
Member

wschin commented Dec 14, 2018

In most cases (99%), the column is called Score. Do you have sample code showing how you use the APIs? Here is a possible example.

// Number of examples
private const int _rowNumber = 1000;
// Number of features
private const int _columnNumber = 5;
// Number of classes
private const int _classNumber = 3;
private class GbmExample
{
    [VectorType(_columnNumber)]
    public float[] Features;
    [KeyType(Contiguous = true, Count =_classNumber, Min = 0)]
    public uint Label;
    [VectorType(_classNumber)]
    public float[] Score;
}

[ConditionalFact(typeof(Environment), nameof(Environment.Is64BitProcess))] // LightGBM is 64-bit only
public void LightGbmMultiClassEstimatorCompare()
{
    // Training matrix. It contains all feature vectors.
    var dataMatrix = new float[_rowNumber * _columnNumber];
    // Labels for multi-class classification
    var labels = new uint[_rowNumber];
    // Training list, which is equivalent to the training matrix above.
    var dataList = new List<GbmExample>();
    for (/*row index*/ int i = 0; i < _rowNumber; ++i)
    {
        int featureSum = 0;
        var featureVector = new float[_columnNumber];
        for (/*column index*/ int j = 0; j < _columnNumber; ++j)
        {
            int featureValue = (j + i * _columnNumber) % 10;
            featureSum += featureValue;
            dataMatrix[j + i * _columnNumber] = featureValue;
            featureVector[j] = featureValue;
        }
        labels[i] = (uint)featureSum % _classNumber;
        dataList.Add(new GbmExample { Features = featureVector, Label = labels[i], Score = new float[_classNumber] });
    }

    var mlContext = new MLContext(seed: 0, conc: 1);
    var dataView = ComponentCreation.CreateDataView(mlContext, dataList);
    var gbmTrainer = new LightGbmMulticlassTrainer(mlContext, labelColumn: "Label", featureColumn: "Features", numBoostRound: 3,
        advancedSettings: s => { s.MinDataPerGroup = 1; s.MinDataPerLeaf = 1; });
    var gbm = gbmTrainer.Fit(dataView);
    var predicted = gbm.Transform(dataView);
    var predictions = new List<GbmExample>(predicted.AsEnumerable<GbmExample>(mlContext, false));
}

@mgolois
Author

mgolois commented Dec 14, 2018

@wschin Here is my sample code:

static void Main(string[] args)
        {
            MLContext mlContext = new MLContext();
            string dataPath = Path.Combine(Environment.CurrentDirectory, "Data", "all.csv");

            TextLoader textLoader = mlContext.Data.TextReader(new TextLoader.Arguments()
            {
                Separator = ",",
                HasHeader = false,
                Column = new[]
                {
                    new TextLoader.Column("Description", DataKind.Text, 1),
                    new TextLoader.Column("Type", DataKind.Text, 2),
                    new TextLoader.Column("Category", DataKind.Text, 3),
                }
            });

            var data = textLoader.Read(dataPath);
            var (trainData, testData) = mlContext.MulticlassClassification.TrainTestSplit(data, testFraction: 0.2);

         
            var dataProcessingPipeline = mlContext.Transforms.Categorical.OneHotEncoding("Type", "TypeEncoded")
                 .Append(mlContext.Transforms.Text.FeaturizeText("Description", "DescriptionEncoded"))
                 .Append(mlContext.Transforms.Concatenate("Features", "TypeEncoded", "DescriptionEncoded"))
                 .Append(mlContext.Transforms.Conversion.MapValueToKey("Category", "Label"))
                 .Append(mlContext.MulticlassClassification.Trainers.StochasticDualCoordinateAscent())
                 .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

       

            var model = dataProcessingPipeline.Fit(trainData);

            var metrics = mlContext.MulticlassClassification.Evaluate(model.Transform(testData));
            Console.WriteLine($"Accuracy Micro: {metrics.AccuracyMicro}");
            Console.WriteLine($"Accuracy Macro: {metrics.AccuracyMacro}");
          
            var predictionFunction = model.MakePredictionFunction<CategoryDetail, CategoryPrediction>(mlContext);
            var cat = predictionFunction.Predict(new CategoryDetail
                                                {
                                                    Description = "this is a great jacket",
                                                    Type = "Women"
                                                });

           //Now I need to output to the user the prediction cat.Prediction as well as the probability of that prediction

            Console.ReadKey();
        }


    public class CategoryDetail
    {
        [Column("1")]
        public string Description;
        [Column("2")]
        public string Type;
        [Column("3")]
        public string Category;
    }

    public class CategoryPrediction
    {
        [ColumnName("PredictedLabel")]
        public string Prediction { get; set; }

        [ColumnName("Score")]
        public float[] Score { get; set; }
    }

@wschin
Member

wschin commented Dec 14, 2018

Can you access the Score field in cat?

@mgolois
Author

mgolois commented Dec 14, 2018

Yes, @wschin, I can; it's just an array of floats. How do I get a percentage out of it?

@wschin
Member

wschin commented Dec 14, 2018

Scores[i] stores the probability of being the i-th class.

@mgolois
Author

mgolois commented Dec 14, 2018

@wschin OK, how do I get the list of labels from the model? There used to be an extension method on the model in 0.7.
If I have the list of label names from the model, I can do
index = labelNames.indexOf(predictedLabel) and then confidently read
Scores[index]

OR

Is it safe to assume that the maximum score belongs to the predicted label?
scores.Max()

@wschin
Member

wschin commented Dec 14, 2018

I'd expect that the predicted name (column name: PredictedLabel) is there too, Prediction in CategoryPrediction. You can do scores.Max() as well and then find the label index associated with the best probability.
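
A minimal sketch of the scores.Max() approach in plain C#, with no ML.NET calls. It assumes, as stated earlier in the thread, that the Score array holds one probability per class. The names ScoreHelpers, ArgMax, and ToPercent are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Globalization;

public static class ScoreHelpers
{
    // Returns the index of the largest score, i.e. the slot of the predicted class.
    // Ties resolve to the lowest index.
    public static int ArgMax(float[] scores)
    {
        if (scores == null || scores.Length == 0)
            throw new ArgumentException("scores must be non-empty", nameof(scores));

        int best = 0;
        for (int i = 1; i < scores.Length; i++)
        {
            if (scores[i] > scores[best])
                best = i;
        }
        return best;
    }

    // Formats a probability in [0, 1] as a percentage with one decimal place.
    public static string ToPercent(float probability)
    {
        return (probability * 100f).ToString("F1", CultureInfo.InvariantCulture) + "%";
    }
}
```

With the CategoryPrediction class posted above, ScoreHelpers.ToPercent(cat.Score[ScoreHelpers.ArgMax(cat.Score)]) would give the confidence of the top-scoring class.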

@singlis added the question (Further information is requested) label Dec 14, 2018

@polymorp

I would love to know how to do this too.
I can see the array of scores, but I cannot be sure which label each Scores[index] relates to. I know the percentage for the PredictedLabel result, since that is scores.Max(), but sometimes I also want to know the percentages for what came 2nd, 3rd, etc.
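
For illustration, here is a sketch of the ranking step itself in plain C#. It only helps once the label names are available in the same slot order as the scores, which is exactly the open question in this comment, so labelNames is an assumed input here. TopK and Rank are hypothetical names, not ML.NET APIs.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TopK
{
    // Pairs each label with its score and returns the k highest-scoring pairs, best first.
    // Assumes labelNames[i] names the class whose probability is scores[i].
    public static List<KeyValuePair<string, float>> Rank(string[] labelNames, float[] scores, int k)
    {
        if (labelNames.Length != scores.Length)
            throw new ArgumentException("labelNames and scores must have the same length");

        return labelNames
            .Zip(scores, (name, score) => new KeyValuePair<string, float>(name, score))
            .OrderByDescending(pair => pair.Value)
            .Take(k)
            .ToList();
    }
}
```

Calling TopK.Rank(labelNames, prediction.Score, 3) would then list the first, second, and third most likely labels with their probabilities.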

@rauhs
Contributor

rauhs commented Dec 18, 2018

IMO there should really be a public API to get the label names out of a model. Since TryGetScoreLabelNames was removed, there is currently no easy way. I think it's very common to want the top-k predictions (sorted) when doing multiclass prediction. Many of the big image classification tasks (AlexNet, etc.) are compared by top-5 accuracy.

What I'm currently doing is this:

    public static string[] GetKeyValues(this Schema.Column column)
    {
      var metadata = column.Metadata;
      var labels = default(VBuffer<ReadOnlyMemory<char>>);
      metadata.GetValue(MetadataUtils.Kinds.KeyValues, ref labels);
      var names = new string[labels.Length];
      int index = 0;
      foreach (var label in labels.DenseValues())
        names[index++] = label.ToString();

      return names;
    }

And then doing something like this to get the label names:

      var encodedSample = encoder.Transform(ctx.CreateDataView(new[] { new YourInputInstanceClass() }));
      var encodedSchema = encodedSample.Schema;
      var labelNames = encodedSchema[LabelLabel].GetKeyValues();

Could we get a public API so that we can get the label names back out? Especially once you save the model to a file and no longer have a big training dataset loaded, you'd need to get the label names out of the model somehow.

@wschin
Member

wschin commented Dec 18, 2018

Thanks a lot. This sounds like a good idea, but we need to investigate whether it's doable. In your solution, the KeyValue attribute can be accessed because
(1) you know that's the label column
(2) its metadata was not corrupted.
Some feature engineering steps don't preserve metadata. At a minimum, we need to properly propagate metadata at each step in a pipeline and then call

       /// <summary>
        /// Add slot names metadata.
        /// </summary>
        /// <param name="size">The size of the slot names vector.</param>
        /// <param name="getter">The getter delegate for the slot names.</param>
        public void AddSlotNames(int size, ValueGetter<VBuffer<ReadOnlyMemory<char>>> getter)
            => Add(MetadataUtils.Kinds.SlotNames, new VectorType(TextType.Instance, size), getter);

in each classifier to attach label names to their scores.

@karllarssonusawest

Just jumping in to agree that returning the top N predictions with matching score is vital, something we've lost with the removal of 'TryGetScoreLabelNames'.

I'm able to do it when training the model and have all the data to hand (IDataView dataView parameter below) :

private static bool TryGetScoreLabelNames(TransformerChain<KeyToValueMappingTransformer> model, IDataView dataView, out string[] names, string scoreColumnName = DefaultColumnNames.Score)
        {
            names = (string[])null;
            Schema outputSchema = model.GetOutputSchema(dataView.Schema);
            int col = -1;
            if (!outputSchema.TryGetColumnIndex(scoreColumnName, out col))
            {
                return false;
            }

            VectorType valueCount = (VectorType)outputSchema.GetColumnType(col);

            if (!outputSchema.HasSlotNames(col, valueCount.Size))
            {
                return false;
            }

            VBuffer<ReadOnlyMemory<char>> vbuffer = new VBuffer<ReadOnlyMemory<char>>();
            outputSchema.GetMetadata<VBuffer<ReadOnlyMemory<char>>>("SlotNames", col, ref vbuffer);
            if (vbuffer.Length != valueCount.Size)
            {
                return false;
            }
            names = new string[valueCount.Size];
            int num = 0;
            foreach (ReadOnlyMemory<char> denseValue in vbuffer.DenseValues())
            {
                names[num++] = denseValue.ToString();
            }
            return true;
        }

However, I haven't figured out how to do that when I load the model later on to make a single prediction. At that point, I don't have an IDataView object I can pass into the function.

@wschin
Member

wschin commented Dec 22, 2018

I just made an example for extracting label names from a learned pipeline. Please take a look at #1953.

@karllarssonusawest

Thanks, wschin - I couldn't translate your example into something that would work for me without a lot of retrofitting, but I did stumble on my own solution, I think. I keep the 'TryGetScoreLabelNames' function as is, but instead of using model.MakePredictionFunction to make my prediction, I broke that back down into its individual steps:

var inputData = mlContext.CreateDataView(new ClassOfItemToPredict[] { itemToPredict});
var outputData = _trainedModel.Transform(inputData);
var prediction = outputData.AsEnumerable<PredictionClass>(mlContext, reuseRowObject: false).Single();

 //Get labels
string[] scoresLabels;
if (TryGetScoreLabelNames(_trainedModel, outputData, out scoresLabels)) { };

That then gives me the IDataView I can pass into TryGetScoreLabelNames to get the schema etc...

@Ivanidzo4ka
Contributor

Just a heads up.
It's still a bit complicated and requires working with metadata, which you already do, but in 0.10 the prediction function will contain OutputSchema.
https://github.com/dotnet/machinelearning/blob/master/test/Microsoft.ML.Tests/Scenarios/Api/Estimators/PredictAndMetadata.cs
That should slightly ease the process of getting this information. Working with metadata is still a mess, and we need to come up with a more user-friendly way to access it. But at least now you don't need extra steps to access the schema.

@biztechprogramming

I looked at the sample code that supposedly closes this issue. It is extremely ugly code. This issue should not be closed.

@tangbinbinyes

Scores[i] stores the probability of being the i-th class.

Thanks. But how do I know which category "i" is?

@ghost ghost locked as resolved and limited conversation to collaborators Mar 26, 2022
9 participants