Created sample for 'LatentDirichletAllocation' API. #3191
@@ -0,0 +1,74 @@
using System;
using System.Collections.Generic;
using Microsoft.ML.Data;

namespace Microsoft.ML.Samples.Dynamic
{
    public static class LatentDirichletAllocation
    {
        public static void Example()
        {
            // Create a new ML context for ML.NET operations. It can be used for
            // exception tracking and logging, as well as the source of randomness.
            var mlContext = new MLContext();

            // Create a small dataset as an IEnumerable.
            var samples = new List<TextData>()
            {
                new TextData(){ Text = "ML.NET's LatentDirichletAllocation API computes topic models." },
                new TextData(){ Text = "ML.NET's LatentDirichletAllocation API is the best for topic models." },
                new TextData(){ Text = "I like to eat broccoli and bananas." },
                new TextData(){ Text = "I eat bananas for breakfast." },
                new TextData(){ Text = "This car is expensive compared to last week's price." },
                new TextData(){ Text = "This car was $X last week." },
            };

            // Convert training data to an IDataView.
            var dataview = mlContext.Data.LoadFromEnumerable(samples);

            // A pipeline for featurizing the text using the LatentDirichletAllocation API.
            // To compute more accurate LDA features, the pipeline first normalizes the text and removes stop words
            // before passing the tokens (the individual words, lower-cased, with common words removed) to LatentDirichletAllocation.
            var pipeline = mlContext.Transforms.Text.NormalizeText("NormalizedText", "Text")
                .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "NormalizedText"))
                .Append(mlContext.Transforms.Text.RemoveDefaultStopWords("Tokens"))
                .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
                .Append(mlContext.Transforms.Text.ProduceNgrams("Tokens"))
                .Append(mlContext.Transforms.Text.LatentDirichletAllocation("Features", "Tokens", numberOfTopics: 3));

            // Fit to data.
            var transformer = pipeline.Fit(dataview);

            // Create the prediction engine to get the LDA features extracted from the text.
            var predictionEngine = mlContext.Model.CreatePredictionEngine<TextData, TransformedTextData>(transformer);

Review comment: Similar to the other PR, I wonder if we should stay entirely within IDataView and not create a prediction engine. That is, use a …

Reply: This is done because some of the text-processing transforms (NormalizeText, TokenizeIntoWords, etc.) don't need training data. In such cases, a prediction engine seems more appropriate. But we can definitely reach consensus on this. I will follow up.

In reply to: 271981536 [](ancestors = 271981536)
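A minimal sketch of one IDataView-only approach along the lines the reviewer suggests, assuming the transformer and dataview from this sample (the variable names below are illustrative, not part of the PR):

            // Transform the whole dataset and read the LDA features back as a
            // column, instead of scoring one example at a time.
            var transformedData = transformer.Transform(dataview);
            var ldaFeatures = transformedData.GetColumn<float[]>("Features");
            foreach (var features in ldaFeatures)
                Console.WriteLine(string.Join(" ", features));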

            // Convert the sample text into LDA features and print them.
            PrintLdaFeatures(predictionEngine.Predict(samples[0]));
            PrintLdaFeatures(predictionEngine.Predict(samples[1]));

            // Features obtained post-transformation.
            // For LatentDirichletAllocation, we specified numberOfTopics: 3, so each
            // prediction is featurized as a vector of floats of length 3.

            //  Topic1  Topic2  Topic3
            //  0.6364  0.2727  0.0909
            //  0.5455  0.1818  0.2727
        }

        private static void PrintLdaFeatures(TransformedTextData prediction)
        {
            for (int i = 0; i < prediction.Features.Length; i++)
                Console.Write($"{prediction.Features[i]:F4} ");
            Console.WriteLine();
        }

        private class TextData
        {
            public string Text { get; set; }
        }

        private class TransformedTextData : TextData
        {
            public float[] Features { get; set; }
        }
    }
}
Review comment: Do we actually want to run LDA on top of 2-grams, since 2 is the default value for ProduceNgrams, or should we recommend using ngrams: 1? #Resolved
Reply: 2 is fine since it backs off to unigrams (useAllLengths = true). I think higher is better when there is a lot of data available.

In reply to: 271930804 [](ancestors = 271930804)
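A minimal sketch of the two choices discussed above, assuming the same mlContext and "Tokens" column as the sample (the variable names are illustrative):

    // Reviewer's alternative: unigrams only.
    var unigramStep = mlContext.Transforms.Text.ProduceNgrams("Tokens", ngramLength: 1);

    // What the sample gets by default: bigrams that back off to unigrams,
    // because useAllLengths defaults to true.
    var bigramStep = mlContext.Transforms.Text.ProduceNgrams("Tokens", ngramLength: 2, useAllLengths: true);

Either estimator can take the place of the ProduceNgrams call in the sample's pipeline.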