
Commit b301aec

Lda snapping to template (#3442)
1 parent f3f54cf commit b301aec

5 files changed: +62 −42 lines changed


src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs

+1 −1

@@ -28,7 +28,7 @@ public static class ConversionsExtensionsCatalog
 /// are vectors or scalars.</param>
 /// <param name="inputColumnName">Name of the column whose data will be hashed.
 /// If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.
-/// This estimator operates over text, numeric, boolean, key or <see cref="DataViewRowId"/> data types. </param>
+/// This estimator operates over vectors or scalars of text, numeric, boolean, key or <see cref="DataViewRowId"/> data types. </param>
 /// <param name="numberOfBits">Number of bits to hash into. Must be between 1 and 31, inclusive.</param>
 /// <param name="maximumNumberOfInverts">During hashing we construct mappings between original values and the produced hash values.
 /// Text representation of original values are stored in the slot names of the annotations for the new column. Hashing, as such, can map many initial values to one.
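For context, a minimal sketch of creating the hashing estimator this doc change describes; the column names and settings below are hypothetical, not part of the commit:

    using Microsoft.ML;

    var mlContext = new MLContext();

    // Hash "Category" values into a 2^16-slot space. A positive
    // maximumNumberOfInverts keeps text representations of up to that many
    // original values per hash slot in the output column's slot names.
    var hashEstimator = mlContext.Transforms.Conversion.Hash(
        outputColumnName: "CategoryHashed",
        inputColumnName: "Category",
        numberOfBits: 16,
        maximumNumberOfInverts: 2);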

src/Microsoft.ML.Data/Transforms/ValueToKeyMappingEstimator.cs

+1 −1

@@ -21,7 +21,7 @@ namespace Microsoft.ML.Transforms
 /// | | |
 /// | -- | -- |
 /// | Does this estimator need to look at the data to train its parameters? | Yes |
-/// | Input column data type | Vector or primitive numeric, boolean, [text](xref:Microsoft.ML.Data.TextDataViewType), [System.DateTime](xref:System.DateTime) and [key](xref:Microsoft.ML.Data.KeyDataViewType) data types.|
+/// | Input column data type | Scalar numeric, boolean, [text](xref:Microsoft.ML.Data.TextDataViewType), [System.DateTime](xref:System.DateTime) or [key](xref:Microsoft.ML.Data.KeyDataViewType) data types.|
 /// | Output column data type | [key](xref:Microsoft.ML.Data.KeyDataViewType)|
 ///
 /// The ValueToKeyMappingEstimator builds up term vocabularies (dictionaries) mapping the input values to the keys in the dictionary.
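For reference, a minimal sketch of the estimator this table documents, assuming a hypothetical string column named "Label":

    using Microsoft.ML;

    var mlContext = new MLContext();

    // Scans the training data to build a dictionary of distinct "Label" values,
    // then maps each value to its key (a category index) in the output column.
    var valueToKeyEstimator = mlContext.Transforms.Conversion.MapValueToKey("Label");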

src/Microsoft.ML.Transforms/Text/LdaTransform.cs

+53 −2

@@ -46,7 +46,9 @@ namespace Microsoft.ML.Transforms.Text
 //
 // See <a href="https://github.com/dotnet/machinelearning/blob/master/test/Microsoft.ML.TestFramework/DataPipe/TestDataPipe.cs"/>
 // for an example on how to use LatentDirichletAllocationTransformer.
-/// <include file='doc.xml' path='doc/members/member[@name="LightLDA"]/*' />
+/// <summary>
+/// <see cref="ITransformer"/> resulting from fitting a <see cref="LatentDirichletAllocationEstimator"/>.
+/// </summary>
 public sealed class LatentDirichletAllocationTransformer : OneToOneTransformerBase
 {
     internal sealed class Options : TransformInputBase

@@ -936,7 +938,56 @@ private protected override IRowMapper MakeRowMapper(DataViewSchema schema)
     => new Mapper(this, schema);
 }

-/// <include file='doc.xml' path='doc/members/member[@name="LightLDA"]/*' />
+/// <summary>
+/// The LDA transform implements <a href="https://arxiv.org/abs/1412.1576">LightLDA</a>, a state-of-the-art implementation of Latent Dirichlet Allocation.
+/// </summary>
+/// <remarks>
+/// <format type="text/markdown"><![CDATA[
+///
+/// ### Estimator Characteristics
+/// | | |
+/// | -- | -- |
+/// | Does this estimator need to look at the data to train its parameters? | Yes |
+/// | Input column data type | Vector of <xref:System.Single> |
+/// | Output column data type | Vector of <xref:System.Single> |
+///
+/// Latent Dirichlet Allocation is a well-known [topic modeling](https://en.wikipedia.org/wiki/Topic_model) algorithm that infers semantic structure from text data,
+/// and ultimately helps answer the question "what is this document about?".
+/// It can be used to featurize any text fields as low-dimensional topical vectors.
+/// LightLDA is an extremely efficient implementation of LDA that incorporates a number of
+/// optimization techniques.
+/// With the LDA transform, ML.NET users can train a topic model to produce 1 million topics with a vocabulary of 1 million words
+/// on a 1-billion-token document set on a single machine in a few hours (typically, LDA at this scale takes days and requires large clusters).
+/// The most significant innovation is a super-efficient $O(1)$ [Metropolis-Hastings sampling algorithm](https://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm),
+/// whose running cost is agnostic of model size, allowing it to converge nearly an order of magnitude faster than other [Gibbs samplers](https://en.wikipedia.org/wiki/Gibbs_sampling).
+///
+/// In an ML.NET pipeline, this estimator requires the output of some preprocessing as its input.
+/// A typical pipeline operating on text would require text normalization, tokenization and producing n-grams to supply to the LDA estimator.
+/// See the example in the See Also section for usage suggestions.
+///
+/// If we have the following three examples of text, as data points, and use the LDA transform with the number of topics set to 3,
+/// we would get the results displayed in the table below. Example documents:
+/// * I like to eat bananas.
+/// * I eat bananas everyday.
+/// * First celebrated in 1970, Earth Day now includes events in more than 193 countries,
+/// which are now coordinated globally by the Earth Day Network.
+///
+/// Notice the similarity in values of the first and second row, compared to the third,
+/// and see how those values are indicative of similarities between those two (small) bodies of text.
+///
+/// | Topic 1 | Topic 2 | Topic 3 |
+/// | ------- | ------- | ------- |
+/// | 0.5714  | 0.0000  | 0.4286  |
+/// | 0.5714  | 0.0000  | 0.4286  |
+/// | 0.2400  | 0.3200  | 0.4400  |
+///
+/// For more technical details, you can consult the following papers.
+/// * [LightLDA: Big Topic Models on Modest Computer Clusters](https://arxiv.org/abs/1412.1576)
+/// * [LightLDA](https://github.com/Microsoft/LightLDA)
+///
+/// ]]></format>
+/// </remarks>
+/// <seealso cref="TextCatalog.LatentDirichletAllocation(TransformsCatalog.TextTransforms, string, string, int, float, float, int, int, int, int, int, int, int, bool)"/>
 public sealed class LatentDirichletAllocationEstimator : IEstimator<LatentDirichletAllocationTransformer>
 {
     [BestFriend]
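The remarks above call for text normalization, tokenization and n-gram production ahead of the LDA estimator. A minimal sketch of such a pipeline follows, mirroring ML.NET's public samples; the "Text" input column and the other column names are hypothetical:

    using Microsoft.ML;

    var mlContext = new MLContext();

    // Preprocess raw text into n-gram counts, the vector of Single that the
    // LDA estimator operates over, then fit a 3-topic model as in the
    // bananas / Earth Day example above.
    var pipeline = mlContext.Transforms.Text.NormalizeText("NormalizedText", "Text")
        // Split the normalized text into word tokens.
        .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "NormalizedText"))
        // Map each token to a key so n-gram counts can be produced.
        .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
        // Produce n-gram counts as a vector of Single.
        .Append(mlContext.Transforms.Text.ProduceNgrams("Ngrams", "Tokens"))
        .Append(mlContext.Transforms.Text.LatentDirichletAllocation("Topics", "Ngrams", numberOfTopics: 3));

    // var model = pipeline.Fit(dataView);  // dataView: an IDataView of documents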

src/Microsoft.ML.Transforms/Text/TextCatalog.cs

+7 −4

@@ -556,12 +556,15 @@ internal static NgramHashingEstimator ProduceHashedNgrams(this TransformsCatalog
     => new NgramHashingEstimator(Contracts.CheckRef(catalog, nameof(catalog)).GetEnvironment(), columns);

 /// <summary>
-/// Uses <a href="https://arxiv.org/abs/1412.1576">LightLDA</a> to transform a document (represented as a vector of floats)
-/// into a vector of floats over a set of topics.
+/// Create a <see cref="LatentDirichletAllocationEstimator"/>, which uses <a href="https://arxiv.org/abs/1412.1576">LightLDA</a> to transform text (represented as a vector of floats)
+/// into a vector of <see cref="System.Single"/> indicating the similarity of the text with each topic identified.
 /// </summary>
 /// <param name="catalog">The transform's catalog.</param>
-/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param>
-/// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
+/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
+/// This estimator outputs a vector of <see cref="System.Single"/>.</param>
+/// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.
+/// This estimator operates over a vector of <see cref="System.Single"/>.
+/// </param>
 /// <param name="numberOfTopics">The number of topics.</param>
 /// <param name="alphaSum">Dirichlet prior on document-topic vectors.</param>
 /// <param name="beta">Dirichlet prior on vocab-topic vectors.</param>

src/Microsoft.ML.Transforms/Text/doc.xml

−34

@@ -150,40 +150,6 @@
 </example>
 </member>

-<member name="LightLDA">
-<summary>
-The LDA transform implements LightLDA, a state-of-the-art implementation of Latent Dirichlet Allocation.
-</summary>
-<remarks>
-Latent Dirichlet Allocation is a well-known topic modeling algorithm that infers topical structure from text data,
-and can be used to featurize any text fields as low-dimensional topical vectors.
-<para>LightLDA is an extremely efficient implementation of LDA developed in MSR-Asia that incorporates a number of
-optimization techniques. See <a href="https://arxiv.org/abs/1412.1576">LightLDA: Big Topic Models on Modest Compute Clusters</a>.
-</para>
-<para>
-With the LDA transform, ML.NET users can train a topic model to produce 1 million topics with 1 million vocabulary
-on a 1-billion-token document set one a single machine in a few hours (typically, LDA at this scale takes days and requires large clusters).
-The most significant innovation is a super-efficient O(1) <a href="https://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm">Metropolis-Hastings sampling algorithm</a>,
-whose running cost is (surprisingly) agnostic of model size,
-allowing it to converges nearly an order of magnitude faster than other <a href="https://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs samplers.</a>
-</para>
-<para>
-For more details please see original LightLDA paper, and its open source implementation.
-<list type="bullet">
-<item><description><a href="https://arxiv.org/abs/1412.1576"> LightLDA: Big Topic Models on Modest Computer Clusters</a></description></item>
-<item><description><a href=" https://github.com/Microsoft/LightLDA">LightLDA </a></description></item>
-</list>
-</para>
-</remarks>
-</member>
-<example name="LightLDA">
-<example>
-<code language="csharp">
-pipeline.Add(new LightLda(("InTextCol", "OutTextCol")));
-</code>
-</example>
-</example>
-
 <member name="WordEmbeddings">
 <summary>
 Word Embeddings transform is a text featurizer which converts vectors of text tokens into sentence vectors using a pre-trained model.
