src/Microsoft.ML.Data/Transforms/ConversionsExtensionsCatalog.cs
@@ -28,7 +28,7 @@ public static class ConversionsExtensionsCatalog
  /// are vectors or scalars.</param>
  /// <param name="inputColumnName">Name of the column whose data will be hashed.
  /// If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.
- /// This estimator operates over text, numeric, boolean, key or <see cref="DataViewRowId"/> data types. </param>
+ /// This estimator operates over vectors or scalars of text, numeric, boolean, key or <see cref="DataViewRowId"/> data types. </param>
  /// <param name="numberOfBits">Number of bits to hash into. Must be between 1 and 31, inclusive.</param>
  /// <param name="maximumNumberOfInverts">During hashing we construct mappings between original values and the produced hash values.
  /// Text representations of the original values are stored in the slot names of the annotations for the new column. Hashing, as such, can map many initial values to one.
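The hashing options documented above are easiest to see in code. The following is a minimal sketch, not part of this diff, using the `Hash` extension this file documents; the `TextData` class, column names, and parameter values are illustrative assumptions.

```csharp
using System;
using Microsoft.ML;

// Illustrative data class (an assumption, not from the PR).
public class TextData
{
    public string Category { get; set; }
}

public static class HashSketch
{
    public static void Run()
    {
        var mlContext = new MLContext();
        var data = mlContext.Data.LoadFromEnumerable(new[]
        {
            new TextData { Category = "Dog" },
            new TextData { Category = "Cat" },
            new TextData { Category = "Dog" },
        });

        // Hash the scalar text column into a key column. A non-zero
        // maximumNumberOfInverts keeps text representations of the original
        // values as slot-name annotations, so the hashes stay inspectable.
        var pipeline = mlContext.Transforms.Conversion.Hash(
            "CategoryHashed", "Category", numberOfBits: 16, maximumNumberOfInverts: -1);

        var transformed = pipeline.Fit(data).Transform(data);
        Console.WriteLine(transformed.Schema["CategoryHashed"].Type);
    }
}
```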
/// | Does this estimator need to look at the data to train its parameters? | Yes |
- /// | Input column data type | Vector or primitive numeric, boolean, [text](xref:Microsoft.ML.Data.TextDataViewType), [System.DateTime](xref:System.DateTime) and [key](xref:Microsoft.ML.Data.KeyDataViewType) data types.|
+ /// | Input column data type | Scalar numeric, boolean, [text](xref:Microsoft.ML.Data.TextDataViewType), [System.DateTime](xref:System.DateTime) or [key](xref:Microsoft.ML.Data.KeyDataViewType) data types.|
  /// | Output column data type | [key](xref:Microsoft.ML.Data.KeyDataViewType)|
  ///
  /// The ValueToKeyMappingEstimator builds up term vocabularies (dictionaries) mapping the input values to keys in the dictionary.
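Since this estimator trains on the data (it builds the dictionary from the values it sees), a short sketch may help. It is not part of this diff; the `LabelData` class and column names are illustrative assumptions.

```csharp
using Microsoft.ML;

// Illustrative data class (an assumption, not from the PR).
public class LabelData
{
    public string Label { get; set; }
}

public static class ValueToKeySketch
{
    public static void Run()
    {
        var mlContext = new MLContext();
        var data = mlContext.Data.LoadFromEnumerable(new[]
        {
            new LabelData { Label = "positive" },
            new LabelData { Label = "negative" },
            new LabelData { Label = "positive" },
        });

        // Fit builds the term dictionary from the data; Transform then maps
        // each input value to its key in that dictionary.
        var pipeline = mlContext.Transforms.Conversion.MapValueToKey("LabelKey", "Label");
        var transformed = pipeline.Fit(data).Transform(data);
        // "LabelKey" is now a key-typed column, suitable as a learner's label.
    }
}
```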
/// The LDA transform implements <a href="https://arxiv.org/abs/1412.1576">LightLDA</a>, a state-of-the-art implementation of Latent Dirichlet Allocation.
+ /// </summary>
+ /// <remarks>
+ /// <format type="text/markdown"><![CDATA[
⋮
+ /// Latent Dirichlet Allocation is a well-known topic modeling algorithm that infers semantic structure from text data,
+ /// and ultimately helps answer the question "what is this document about?".
+ /// It can be used to featurize any text fields as low-dimensional topical vectors.
+ /// LightLDA is an extremely efficient implementation of LDA that incorporates a number of
+ /// optimization techniques.
+ /// With the LDA transform, ML.NET users can train a topic model to produce 1 million topics with a 1-million-word vocabulary
+ /// on a 1-billion-token document set on a single machine in a few hours (typically, LDA at this scale takes days and requires large clusters).
+ /// The most significant innovation is a super-efficient $O(1)$ [Metropolis-Hastings sampling algorithm](https://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm),
+ /// whose running cost is agnostic of model size, allowing it to converge nearly an order of magnitude faster than other [Gibbs samplers](https://en.wikipedia.org/wiki/Gibbs_sampling).
+ ///
+ /// In an ML.NET pipeline, this estimator requires the output of some preprocessing as its input.
+ /// A typical pipeline operating on text would require text normalization, tokenization, and n-gram production to supply to the LDA estimator.
+ /// See the links in the See Also section for usage examples.
+ ///
+ /// If we have the following three examples of text as data points, and use the LDA transform with the number of topics set to 3,
+ /// we would get the results displayed in the table below. Example documents:
+ /// * I like to eat bananas.
+ /// * I eat bananas everyday.
+ /// * First celebrated in 1970, Earth Day now includes events in more than 193 countries,
+ /// which are now coordinated globally by the Earth Day Network.
+ ///
+ /// Notice the similarity in values of the first and second row, compared to the third,
+ /// and see how those values are indicative of similarities between those two (small) bodies of text.
+ ///
+ /// | Topic 1 | Topic 2 | Topic 3 |
+ /// | ------- | ------- | ------- |
+ /// | 0.5714 | 0.0000 | 0.4286 |
+ /// | 0.5714 | 0.0000 | 0.4286 |
+ /// | 0.2400 | 0.3200 | 0.4400 |
+ ///
+ /// For more technical details, you can consult the following papers.
+ /// * [LightLDA: Big Topic Models on Modest Computer Clusters](https://arxiv.org/abs/1412.1576)
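The preprocessing chain the remarks describe (normalization, tokenization, n-grams, then LDA) can be sketched as below. This is not part of the diff; it assumes a `Document` class and column names chosen for illustration, and reuses the three example documents from the remarks.

```csharp
using Microsoft.ML;

// Illustrative data class (an assumption, not from the PR).
public class Document
{
    public string Text { get; set; }
}

public static class LdaSketch
{
    public static void Run()
    {
        var mlContext = new MLContext();
        var data = mlContext.Data.LoadFromEnumerable(new[]
        {
            new Document { Text = "I like to eat bananas." },
            new Document { Text = "I eat bananas everyday." },
            new Document { Text = "First celebrated in 1970, Earth Day now includes events " +
                                  "in more than 193 countries, which are now coordinated " +
                                  "globally by the Earth Day Network." },
        });

        // Normalize -> tokenize -> map tokens to keys (ProduceNgrams expects
        // key-typed input) -> n-gram counts -> LDA with three topics.
        var pipeline = mlContext.Transforms.Text.NormalizeText("NormalizedText", "Text")
            .Append(mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "NormalizedText"))
            .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
            .Append(mlContext.Transforms.Text.ProduceNgrams("Ngrams", "Tokens"))
            .Append(mlContext.Transforms.Text.LatentDirichletAllocation(
                "Topics", "Ngrams", numberOfTopics: 3));

        var transformed = pipeline.Fit(data).Transform(data);
        // Each row of "Topics" is a vector of 3 singles, as in the table above.
    }
}
```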
- /// Uses <a href="https://arxiv.org/abs/1412.1576">LightLDA</a> to transform a document (represented as a vector of floats)
- /// into a vector of floats over a set of topics.
+ /// Create a <see cref="LatentDirichletAllocationEstimator"/>, which uses <a href="https://arxiv.org/abs/1412.1576">LightLDA</a> to transform text (represented as a vector of floats)
+ /// into a vector of <see cref="System.Single"/> indicating the similarity of the text with each topic identified.
⋮
- /// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param>
- /// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
+ /// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
+ /// This estimator outputs a vector of <see cref="System.Single"/>.</param>
+ /// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.
+ /// This estimator operates over a vector of <see cref="System.Single"/>.
+ /// </param>
  /// <param name="numberOfTopics">The number of topics.</param>
  /// <param name="alphaSum">Dirichlet prior on document-topic vectors.</param>
  /// <param name="beta">Dirichlet prior on vocab-topic vectors.</param>
src/Microsoft.ML.Transforms/Text/doc.xml
@@ -150,40 +150,6 @@
  </example>
  </member>

- <member name="LightLDA">
- <summary>
- The LDA transform implements LightLDA, a state-of-the-art implementation of Latent Dirichlet Allocation.
- </summary>
- <remarks>
- Latent Dirichlet Allocation is a well-known topic modeling algorithm that infers topical structure from text data,
- and can be used to featurize any text fields as low-dimensional topical vectors.
- <para>LightLDA is an extremely efficient implementation of LDA developed in MSR-Asia that incorporates a number of
- optimization techniques. See <a href="https://arxiv.org/abs/1412.1576">LightLDA: Big Topic Models on Modest Compute Clusters</a>.
- </para>
- <para>
- With the LDA transform, ML.NET users can train a topic model to produce 1 million topics with 1 million vocabulary
- on a 1-billion-token document set one a single machine in a few hours (typically, LDA at this scale takes days and requires large clusters).
- The most significant innovation is a super-efficient O(1) <a href="https://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm">Metropolis-Hastings sampling algorithm</a>,
- whose running cost is (surprisingly) agnostic of model size,
- allowing it to converges nearly an order of magnitude faster than other <a href="https://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs samplers.</a>
- </para>
- <para>
- For more details please see original LightLDA paper, and its open source implementation.
- <list type="bullet">
- <item><description><a href="https://arxiv.org/abs/1412.1576"> LightLDA: Big Topic Models on Modest Computer Clusters</a></description></item>