
Created sample for 'LatentDirichletAllocation' API. #3191


Merged: 5 commits, Apr 5, 2019
61 changes: 0 additions & 61 deletions docs/samples/Microsoft.ML.Samples/Dynamic/LdaTransform.cs

This file was deleted.

74 changes: 74 additions & 0 deletions docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Text/LatentDirichletAllocation.cs
@@ -0,0 +1,74 @@
using System;
using System.Collections.Generic;
using Microsoft.ML.Data;

namespace Microsoft.ML.Samples.Dynamic
{
    public static class LatentDirichletAllocation
    {
        public static void Example()
        {
            // Create a new ML context for ML.NET operations. It can be used for
            // exception tracking and logging, as well as the source of randomness.
            var mlContext = new MLContext();

            // Create a small dataset as an IEnumerable.
            var samples = new List<TextData>()
            {
                new TextData(){ Text = "ML.NET's LatentDirichletAllocation API computes topic models." },
                new TextData(){ Text = "ML.NET's LatentDirichletAllocation API is the best for topic models." },
                new TextData(){ Text = "I like to eat broccoli and bananas." },
                new TextData(){ Text = "I eat bananas for breakfast." },
                new TextData(){ Text = "This car is expensive compared to last week's price." },
                new TextData(){ Text = "This car was $X last week." },
            };

            // Convert training data to IDataView.
            var dataview = mlContext.Data.LoadFromEnumerable(samples);

            // A pipeline for featurizing the text/string using the LatentDirichletAllocation API.
            // To compute more accurate LDA features, the pipeline first normalizes the text and removes stop words
            // before passing the tokens (the individual words, lower-cased, with common words removed) to LatentDirichletAllocation.
            var pipeline = mlContext.Transforms.Text.NormalizeText("NormalizedText", "Text")
                .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "NormalizedText"))
                .Append(mlContext.Transforms.Text.RemoveDefaultStopWords("Tokens"))
                .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
                .Append(mlContext.Transforms.Text.ProduceNgrams("Tokens"))
Ivanidzo4ka (Contributor), Apr 3, 2019, on ProduceNgrams:

Do we actually want to run LDA on top of 2-grams, since 2 is the default value for ProduceNgrams, or should we recommend ngramLength: 1? #Resolved

Contributor Author replied:

2 is fine, as it backs off to unigrams (useAllLengths = true). I think higher is better when there is a lot of data available.
                .Append(mlContext.Transforms.Text.LatentDirichletAllocation("Features", "Tokens", numberOfTopics: 3));

            // Fit to data.
            var transformer = pipeline.Fit(dataview);

            // Create the prediction engine to get the LDA features extracted from the text.
            var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData, TransformedTextData>(transformer);
Contributor, on predictionEngine:

Similar to the other PR, I wonder if we should stay entirely within IDataView and not create a prediction engine. That is, use a TakeRows filter followed by a CreateEnumerable.

Contributor Author replied:

This is done because some of the text-processing transforms (NormalizeText, TokenizeIntoWords, etc.) don't need training data. In such cases, a prediction engine seems more appropriate. But we can definitely reach consensus on this; I will follow up.

            // Convert the sample text into LDA features and print them.
            PrintLdaFeatures(predictionEngine.Predict(samples[0]));
            PrintLdaFeatures(predictionEngine.Predict(samples[1]));

            // Features obtained post-transformation.
            // For LatentDirichletAllocation, we specified numberOfTopics: 3. Hence each prediction is featurized as a vector of floats of length 3.

            //  Topic1  Topic2  Topic3
            //  0.6364  0.2727  0.0909
            //  0.5455  0.1818  0.2727
        }

        private static void PrintLdaFeatures(TransformedTextData prediction)
        {
            for (int i = 0; i < prediction.Features.Length; i++)
                Console.Write($"{prediction.Features[i]:F4} ");
            Console.WriteLine();
        }

        private class TextData
        {
            public string Text { get; set; }
        }

        private class TransformedTextData : TextData
        {
            public float[] Features { get; set; }
        }
    }
}
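As an aside on the review discussion about staying within IDataView: a minimal sketch of the suggested alternative (a TakeRows filter followed by CreateEnumerable), assuming the same transformer, dataview, and helper types as in the sample above; this is the reviewer's suggestion, not what the merged sample does:

```csharp
// Stay entirely within IDataView: transform the whole dataset, take the
// first two rows, and materialize them as TransformedTextData objects.
var transformed = transformer.Transform(dataview);
var firstRows = mlContext.Data.TakeRows(transformed, 2);
foreach (var row in mlContext.Data.CreateEnumerable<TransformedTextData>(firstRows, reuseRowObject: false))
    PrintLdaFeatures(row);
```

This avoids the per-row prediction-engine machinery and keeps the batch-oriented IDataView semantics end to end.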
2 changes: 1 addition & 1 deletion src/Microsoft.ML.Transforms/Text/TextCatalog.cs
@@ -509,7 +509,7 @@ internal static NgramHashingEstimator ProduceHashedNgrams(this TransformsCatalog
/// <example>
/// <format type="text/markdown">
/// <![CDATA[
- /// [!code-csharp[LatentDirichletAllocation](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/LdaTransform.cs)]
+ /// [!code-csharp[LatentDirichletAllocation](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/Text/LatentDirichletAllocation.cs)]
/// ]]>
/// </format>
/// </example>