
Update documentation for WordBag #3440


Merged
4 changes: 2 additions & 2 deletions src/Microsoft.ML.Transforms/Text/NgramHashingTransformer.cs
@@ -866,11 +866,11 @@ public VBuffer<ReadOnlyMemory<char>>[] SlotNamesMetadata(out VectorDataViewType[
/// | | |
/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | Yes |
/// | Input column data type | Vector of [Key](<xref:Microsoft.ML.Data.KeyDataViewType>) |
/// | Input column data type | Vector of [Key](xref:Microsoft.ML.Data.KeyDataViewType) |
@singlis (Member) commented on Apr 19, 2019:

Key [](start = 46, length = 3)

I would just do xref:Microsoft.ML.Data.KeyDataViewType #ByDesign

Contributor:

I think it's desirable to keep it as Ivan did, at least I believe that's the convention for key type: name it key, and link the KeyDataViewType.


In reply to: 277109991 [](ancestors = 277109991)

/// | Output column data type | Vector of known size of <xref:System.Single> |
///
/// The resulting <xref:Microsoft.ML.Transforms.Text.NgramHashingTransformer/> creates a new column, named as specified in the output column name parameters, and
/// produces a vector of counts of n-grams (sequences of consecutive words of length 1-n) from a given data.
/// produces a vector of n-gram counts (sequences of consecutive words of length 1-n) from a given data.
/// It does so by hashing each n-gram and using the hash value as the index in the bag.
///
/// <xref:Microsoft.ML.Transforms.Text.NgramHashingEstimator> is different from <xref:Microsoft.ML.Transforms.Text.WordHashBagEstimator>
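To illustrate the distinction these remarks draw, here is a minimal sketch of the two pipelines; the `TextData` class, column names, and parameter values are illustrative assumptions, not part of this change.

```csharp
using System.Collections.Generic;
using Microsoft.ML;

internal static class HashedNgramSketch
{
    private class TextData
    {
        public string Text { get; set; }
    }

    public static void Run()
    {
        var mlContext = new MLContext();
        var data = mlContext.Data.LoadFromEnumerable(new List<TextData>
        {
            new TextData { Text = "NgramHashingEstimator expects tokenized text as input." },
            new TextData { Text = "WordHashBagEstimator tokenizes text internally." }
        });

        // NgramHashingEstimator path: tokenize and map tokens to keys first,
        // then hash n-grams of the key vector.
        var ngramHashPipeline = mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text")
            .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
            .Append(mlContext.Transforms.Text.ProduceHashedNgrams(
                "Features", "Tokens", numberOfBits: 16, ngramLength: 2));

        // WordHashBagEstimator path: feed the raw text column directly;
        // tokenization happens inside the estimator.
        var wordHashBagPipeline = mlContext.Transforms.Text.ProduceHashedWordBags(
            "Features", "Text", numberOfBits: 16, ngramLength: 2);

        var ngramHashOutput = ngramHashPipeline.Fit(data).Transform(data);
        var wordHashBagOutput = wordHashBagPipeline.Fit(data).Transform(data);
    }
}
```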
64 changes: 44 additions & 20 deletions src/Microsoft.ML.Transforms/Text/TextCatalog.cs
@@ -293,12 +293,18 @@ public static CustomStopWordsRemovingEstimator RemoveStopWords(this TransformsCa
=> new CustomStopWordsRemovingEstimator(Contracts.CheckRef(catalog, nameof(catalog)).GetEnvironment(), outputColumnName, inputColumnName, stopwords);

/// <summary>
/// Produces a bag of counts of ngrams (sequences of consecutive words) in <paramref name="inputColumnName"/>
/// and outputs bag of word vector as <paramref name="outputColumnName"/>
/// Create a <see cref="WordHashBagEstimator"/>, which maps the column specified in <paramref name="inputColumnName"/>
/// to a vector of n-gram counts in a new column named <paramref name="outputColumnName"/>.
/// </summary>
/// <param name="catalog">The text-related transform's catalog.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param>
/// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
/// <remarks>
/// <see cref="WordBagEstimator"/> is different from <see cref="NgramExtractingEstimator"/> in that the former
Contributor:

Suggest removing this and adding a clarification to the input column definition.

/// tokenizes text internally and the latter takes tokenized text as input.
/// </remarks>
/// <param name="catalog">The transform's catalog.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
/// This column's data type will be known-size vector of <see cref="System.Single"/>.</param>
/// <param name="inputColumnName">Name of the column to take the data from.
/// This estimator operates over vector of text.</param>
@singlis (Member) commented on Apr 19, 2019:

text [](start = 51, length = 4)

no cref=System.String? I know text is pretty obvious, I am just thinking for consistency. #ByDesign

Contributor (Author):

We had a discussion a few days ago; we use just "text" in normal comments.


In reply to: 277113448 [](ancestors = 277113448)

Contributor:

This could be a good place to put the clarification about the input text, i.e. that it is tokenized.

This estimator operates over a vector of tokenized text. This is different from the <see cref="NgramExtractingEstimator"/>, which tokenizes the text itself.

/// <param name="ngramLength">Ngram length.</param>
/// <param name="skipLength">Maximum number of tokens to skip when constructing an ngram.</param>
/// <param name="useAllLengths">Whether to include all ngram lengths up to <paramref name="ngramLength"/> or only <paramref name="ngramLength"/>.</param>
@@ -316,12 +322,18 @@ public static WordBagEstimator ProduceWordBags(this TransformsCatalog.TextTransf
outputColumnName, inputColumnName, ngramLength, skipLength, useAllLengths, maximumNgramsCount, weighting);
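For reference, a minimal usage sketch of the single-column overload above; the `TextData` class and column names follow the ML.NET samples and are assumptions rather than part of this diff.

```csharp
using System.Collections.Generic;
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

internal static class ProduceWordBagsSketch
{
    private class TextData
    {
        public string Text { get; set; }
    }

    public static void Run()
    {
        var mlContext = new MLContext();
        var data = mlContext.Data.LoadFromEnumerable(new List<TextData>
        {
            new TextData { Text = "ProduceWordBags computes a bag of word counts." },
            new TextData { Text = "ProduceWordBags can also count n-grams." }
        });

        // Map the "Text" column to a known-size vector of Single holding n-gram counts.
        var pipeline = mlContext.Transforms.Text.ProduceWordBags(
            "BagOfWordFeatures", "Text",
            ngramLength: 2, useAllLengths: true,
            weighting: NgramExtractingEstimator.WeightingCriteria.Tf);

        var transformed = pipeline.Fit(data).Transform(data);
    }
}
```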

/// <summary>
/// Produces a bag of counts of ngrams (sequences of consecutive words) in <paramref name="inputColumnNames"/>
/// and outputs bag of word vector as <paramref name="outputColumnName"/>
/// Create a <see cref="WordHashBagEstimator"/>, which maps the multiple columns specified in <paramref name="inputColumnNames"/>
/// to a vector of n-gram counts in a new column named <paramref name="outputColumnName"/>.
/// </summary>
/// <param name="catalog">The text-related transform's catalog.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnNames"/>.</param>
/// <param name="inputColumnNames">Name of the columns to transform.</param>
/// <remarks>
/// <see cref="WordBagEstimator"/> is different from <see cref="NgramExtractingEstimator"/> in that the former
/// tokenizes text internally and the latter takes tokenized text as input.
/// </remarks>
/// <param name="catalog">The transform's catalog.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnNames"/>.
/// This column's data type will be known-size vector of <see cref="System.Single"/>.</param>
/// <param name="inputColumnNames">Names of the multiple columns to take the data from.
/// This estimator operates over vector of text.</param>
@singlis (Member) commented on Apr 19, 2019:

text [](start = 51, length = 4)

cref=system.String? #ByDesign

Member:

If you take this change, please update all functions


In reply to: 277113589 [](ancestors = 277113589)

/// <param name="ngramLength">Ngram length.</param>
/// <param name="skipLength">Maximum number of tokens to skip when constructing an ngram.</param>
/// <param name="useAllLengths">Whether to include all ngram lengths up to <paramref name="ngramLength"/> or only <paramref name="ngramLength"/>.</param>
@@ -339,12 +351,18 @@ public static WordBagEstimator ProduceWordBags(this TransformsCatalog.TextTransf
outputColumnName, inputColumnNames, ngramLength, skipLength, useAllLengths, maximumNgramsCount, weighting);
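A corresponding sketch for the multi-column overload above; the `Title` and `Body` columns are illustrative assumptions.

```csharp
using Microsoft.ML;

internal static class MultiColumnWordBagsSketch
{
    private class PageData
    {
        public string Title { get; set; }
        public string Body { get; set; }
    }

    public static void Run()
    {
        var mlContext = new MLContext();
        var data = mlContext.Data.LoadFromEnumerable(new[]
        {
            new PageData { Title = "Word bags", Body = "Counts of n-grams over several columns." }
        });

        // Both input columns feed a single bag-of-words output column.
        var pipeline = mlContext.Transforms.Text.ProduceWordBags(
            "Features", new[] { "Title", "Body" }, ngramLength: 1);

        var transformed = pipeline.Fit(data).Transform(data);
    }
}
```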

/// <summary>
/// Produces a bag of counts of hashed ngrams in <paramref name="inputColumnName"/>
/// and outputs bag of word vector as <paramref name="outputColumnName"/>
/// Create a <see cref="WordHashBagEstimator"/>, which maps the column specified in <paramref name="inputColumnName"/>
/// to a vector of counts of hashed n-grams in a new column named <paramref name="outputColumnName"/>.
@singlis (Member) commented on Apr 19, 2019:

counts of hashed n-grams [](start = 27, length = 24)

vector of hashed n-gram counts? #Resolved

/// </summary>
/// <param name="catalog">The text-related transform's catalog.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param>
/// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
/// <remarks>
/// <see cref="WordHashBagEstimator"/> is different from <see cref="NgramHashingEstimator"/> in that the former
/// tokenizes text internally and the latter takes tokenized text as input.
/// </remarks>
/// <param name="catalog">The transform's catalog.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
/// This column's data type will be known-size vector of <see cref="System.Single"/>.</param>
/// <param name="inputColumnName">Name of the column to take the data from.
/// This estimator operates over vector of text.</param>
/// <param name="numberOfBits">Number of bits to hash into. Must be between 1 and 30, inclusive.</param>
/// <param name="ngramLength">Ngram length.</param>
/// <param name="skipLength">Maximum number of tokens to skip when constructing an ngram.</param>
@@ -371,12 +389,18 @@ public static WordHashBagEstimator ProduceHashedWordBags(this TransformsCatalog.
maximumNumberOfInverts: maximumNumberOfInverts);
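A usage sketch of the single-column ProduceHashedWordBags overload above; the sample class, column names, and parameter values are assumptions for illustration.

```csharp
using Microsoft.ML;

internal static class HashedWordBagsSketch
{
    private class TextData
    {
        public string Text { get; set; }
    }

    public static void Run()
    {
        var mlContext = new MLContext();
        var data = mlContext.Data.LoadFromEnumerable(new[]
        {
            new TextData { Text = "Hashed word bags map n-grams to a fixed-size vector." }
        });

        // 2^10 = 1024 hash slots; ngramLength: 2 counts unigrams and bigrams
        // because useAllLengths defaults to true.
        var pipeline = mlContext.Transforms.Text.ProduceHashedWordBags(
            "Features", "Text", numberOfBits: 10, ngramLength: 2);

        var transformed = pipeline.Fit(data).Transform(data);
    }
}
```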

/// <summary>
/// Produces a bag of counts of hashed ngrams in <paramref name="inputColumnNames"/>
/// and outputs bag of word vector as <paramref name="outputColumnName"/>
/// Create a <see cref="WordHashBagEstimator"/>, which maps the multiple columns specified in <paramref name="inputColumnNames"/>
/// to a vector of counts of hashed n-grams in a new column named <paramref name="outputColumnName"/>.
/// </summary>
/// <param name="catalog">The text-related transform's catalog.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnNames"/>.</param>
/// <param name="inputColumnNames">Name of the columns to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
/// <remarks>
/// <see cref="WordHashBagEstimator"/> is different from <see cref="NgramHashingEstimator"/> in that the former
/// tokenizes text internally and the latter takes tokenized text as input.
/// </remarks>
/// <param name="catalog">The transform's catalog.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnNames"/>.
/// This column's data type will be known-size vector of <see cref="System.Single"/>.</param>
/// <param name="inputColumnNames">Names of the multiple columns to take the data from.
/// This estimator operates over vector of text.</param>
/// <param name="numberOfBits">Number of bits to hash into. Must be between 1 and 30, inclusive.</param>
/// <param name="ngramLength">Ngram length.</param>
/// <param name="skipLength">Maximum number of tokens to skip when constructing an ngram.</param>
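And a brief sketch of the multi-column hashed overload documented above; the `Title`/`Body` columns are assumptions, and `maximumNumberOfInverts` is included only to show the slot-name option.

```csharp
using Microsoft.ML;

internal static class MultiColumnHashedWordBagsSketch
{
    private class PageData
    {
        public string Title { get; set; }
        public string Body { get; set; }
    }

    public static void Run()
    {
        var mlContext = new MLContext();
        var data = mlContext.Data.LoadFromEnumerable(new[]
        {
            new PageData { Title = "Hashed bags", Body = "Several text columns, one hashed feature vector." }
        });

        // All listed columns are hashed into the same output vector;
        // maximumNumberOfInverts: 1 keeps a mapping from hash slots back to n-grams for slot names.
        var pipeline = mlContext.Transforms.Text.ProduceHashedWordBags(
            "Features", new[] { "Title", "Body" },
            numberOfBits: 12, maximumNumberOfInverts: 1);

        var transformed = pipeline.Fit(data).Transform(data);
    }
}
```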
49 changes: 44 additions & 5 deletions src/Microsoft.ML.Transforms/Text/WrappedTextTransformers.cs
Expand Up @@ -10,11 +10,30 @@

namespace Microsoft.ML.Transforms.Text
{

/// <summary>
/// Produces a bag of counts of ngrams (sequences of consecutive words) in a given text.
/// It does so by building a dictionary of ngrams and using the id in the dictionary as the index in the bag.
/// <see cref="IEstimator{TTransformer}"/> for the <see cref="ITransformer"/>.
/// </summary>
/// <remarks>
/// <format type="text/markdown"><![CDATA[
/// ### Estimator Characteristics
/// | | |
/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | Yes |
/// | Input column data type | Vector of [Text](xref:Microsoft.ML.Data.TextDataViewType) |
/// | Output column data type | Vector of known-size of <xref:System.Single> |
///
/// The resulting <xref:Microsoft.ML.ITransformer> creates a new column, named as specified in the output column name parameters, and
/// produces a vector of n-gram counts (sequences of n consecutive words) from a given data.
/// It does so by building a dictionary of ngrams and using the id in the dictionary as the index in the bag.
///
/// <xref:Microsoft.ML.Transforms.Text.WordBagEstimator> is different from <xref:Microsoft.ML.Transforms.Text.NgramExtractingEstimator>
Contributor:

Same comment as above

/// in that the former tokenizes text internally while the latter takes tokenized text as input.
/// See the See Also section for links to examples of the usage.
/// ]]>
/// </format>
/// </remarks>
/// <seealso cref="TextCatalog.ProduceWordBags(TransformsCatalog.TextTransforms, string, string, int, int, bool, int, NgramExtractingEstimator.WeightingCriteria)" />
/// <seealso cref="TextCatalog.ProduceWordBags(TransformsCatalog.TextTransforms, string, string[], int, int, bool, int, NgramExtractingEstimator.WeightingCriteria)" />
public sealed class WordBagEstimator : IEstimator<ITransformer>
{
private readonly IHost _host;
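To make the contrast in the new remarks concrete, here is a hedged sketch of the two estimators side by side; the data class and column names are assumptions, not taken from this change.

```csharp
using Microsoft.ML;
using Microsoft.ML.Transforms.Text;

internal static class WordBagVsNgramSketch
{
    private class TextData
    {
        public string Text { get; set; }
    }

    public static void Run()
    {
        var mlContext = new MLContext();
        var data = mlContext.Data.LoadFromEnumerable(new[]
        {
            new TextData { Text = "WordBagEstimator tokenizes text internally." },
            new TextData { Text = "NgramExtractingEstimator expects tokenized keys." }
        });

        // WordBagEstimator: raw text in, bag of n-gram counts out.
        var wordBags = mlContext.Transforms.Text.ProduceWordBags("Features", "Text");

        // NgramExtractingEstimator: the caller tokenizes and maps tokens to keys first.
        var ngrams = mlContext.Transforms.Text.TokenizeIntoWords("Tokens", "Text")
            .Append(mlContext.Transforms.Conversion.MapValueToKey("Tokens"))
            .Append(mlContext.Transforms.Text.ProduceNgrams("Features", "Tokens",
                ngramLength: 2, weighting: NgramExtractingEstimator.WeightingCriteria.Tf));

        var bagOutput = wordBags.Fit(data).Transform(data);
        var ngramOutput = ngrams.Fit(data).Transform(data);
    }
}
```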
@@ -182,9 +201,29 @@ public SchemaShape GetOutputSchema(SchemaShape inputSchema)
}

/// <summary>
/// Produces a bag of counts of ngrams (sequences of consecutive words of length 1-n) in a given text.
/// It does so by hashing each ngram and using the hash value as the index in the bag.
/// <see cref="IEstimator{TTransformer}"/> for the <see cref="ITransformer"/>.
/// </summary>
/// <remarks>
/// <format type="text/markdown"><![CDATA[
/// ### Estimator Characteristics
/// | | |
/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | Yes |
/// | Input column data type | Vector of [Text](xref:Microsoft.ML.Data.TextDataViewType) |
/// | Output column data type | Vector of known-size of <xref:System.Single> |
///
/// The resulting <xref:Microsoft.ML.ITransformer> creates a new column, named as specified in the output column name parameters, and
/// produces a vector of n-gram counts (sequences of n consecutive words) from a given data.
/// It does so by hashing each ngram and using the hash value as the index in the bag.
///
/// <xref:Microsoft.ML.Transforms.Text.WordHashBagEstimator> is different from <xref:Microsoft.ML.Transforms.Text.NgramHashingEstimator>
/// in that the former tokenizes text internally while the latter takes tokenized text as input.
/// See the See Also section for links to examples of the usage.
/// ]]>
/// </format>
/// </remarks>
/// <seealso cref="TextCatalog.ProduceHashedWordBags(TransformsCatalog.TextTransforms, string, string, int, int, int, bool, uint, bool, int)" />
/// <seealso cref="TextCatalog.ProduceHashedWordBags(TransformsCatalog.TextTransforms, string, string[], int, int, int, bool, uint, bool, int)" />
public sealed class WordHashBagEstimator : IEstimator<ITransformer>
{
private readonly IHost _host;
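Finally, an end-to-end sketch of WordHashBagEstimator that also reads the resulting known-size Single vector back out; the classes and column names are illustrative assumptions.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

internal static class WordHashBagEndToEndSketch
{
    private class TextData
    {
        public string Text { get; set; }
    }

    private class TransformedData
    {
        public float[] Features { get; set; }
    }

    public static void Run()
    {
        var mlContext = new MLContext();
        var data = mlContext.Data.LoadFromEnumerable(new[]
        {
            new TextData { Text = "Hash each n-gram and use the hash value as the index in the bag." }
        });

        var pipeline = mlContext.Transforms.Text.ProduceHashedWordBags(
            "Features", "Text", numberOfBits: 8, ngramLength: 2);

        var transformed = pipeline.Fit(data).Transform(data);

        // Read the fixed-size feature vector back out of the IDataView.
        var row = mlContext.Data
            .CreateEnumerable<TransformedData>(transformed, reuseRowObject: false)
            .First();
        Console.WriteLine($"Feature vector length: {row.Features.Length}"); // 256 slots for 8 bits
    }
}
```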