Update documentation for WordBag #3440
Changes from all commits
8de7ce4
10bf564
d6d39f7
8d40275
d56383e
d66ec54
5efcc2c
685860c
@@ -293,12 +293,18 @@ public static CustomStopWordsRemovingEstimator RemoveStopWords(this TransformsCa
  => new CustomStopWordsRemovingEstimator(Contracts.CheckRef(catalog, nameof(catalog)).GetEnvironment(), outputColumnName, inputColumnName, stopwords);

  /// <summary>
- /// Produces a bag of counts of ngrams (sequences of consecutive words) in <paramref name="inputColumnName"/>
- /// and outputs bag of word vector as <paramref name="outputColumnName"/>
+ /// Create a <see cref="WordBagEstimator"/>, which maps the column specified in <paramref name="inputColumnName"/>
+ /// to a vector of n-gram counts in a new column named <paramref name="outputColumnName"/>.
  /// </summary>
- /// <param name="catalog">The text-related transform's catalog.</param>
- /// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param>
- /// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
+ /// <remarks>
+ /// <see cref="WordBagEstimator"/> is different from <see cref="NgramExtractingEstimator"/> in that the former
Review comment: Suggest removing this and adding a clarification to the input column definition.
+ /// tokenizes text internally and the latter takes tokenized text as input.
+ /// </remarks>
+ /// <param name="catalog">The transform's catalog.</param>
+ /// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
+ /// This column's data type will be a known-size vector of <see cref="System.Single"/>.</param>
+ /// <param name="inputColumnName">Name of the column to take the data from.
+ /// This estimator operates over a vector of text.</param>
Review comment: No cref=System.String? I know text is pretty obvious; I am just thinking of consistency. #ByDesign
Reply: We had a discussion a few days ago; we use just "text" in normal comments.
Review comment: This could be a good place to put the clarification about the input text, i.e. that it is tokenized: This estimator operates over a vector of tokenized text. This is different from the see cref="NgramExtractingEstimator", which tokenizes the text itself.
  /// <param name="ngramLength">Ngram length.</param>
  /// <param name="skipLength">Maximum number of tokens to skip when constructing an ngram.</param>
  /// <param name="useAllLengths">Whether to include all ngram lengths up to <paramref name="ngramLength"/> or only <paramref name="ngramLength"/>.</param>
@@ -316,12 +322,18 @@ public static WordBagEstimator ProduceWordBags(this TransformsCatalog.TextTransf
  outputColumnName, inputColumnName, ngramLength, skipLength, useAllLengths, maximumNgramsCount, weighting);

  /// <summary>
- /// Produces a bag of counts of ngrams (sequences of consecutive words) in <paramref name="inputColumnNames"/>
- /// and outputs bag of word vector as <paramref name="outputColumnName"/>
+ /// Create a <see cref="WordBagEstimator"/>, which maps the multiple columns specified in <paramref name="inputColumnNames"/>
+ /// to a vector of n-gram counts in a new column named <paramref name="outputColumnName"/>.
  /// </summary>
- /// <param name="catalog">The text-related transform's catalog.</param>
- /// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnNames"/>.</param>
- /// <param name="inputColumnNames">Name of the columns to transform.</param>
+ /// <remarks>
+ /// <see cref="WordBagEstimator"/> is different from <see cref="NgramExtractingEstimator"/> in that the former
+ /// tokenizes text internally and the latter takes tokenized text as input.
+ /// </remarks>
+ /// <param name="catalog">The transform's catalog.</param>
+ /// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnNames"/>.
+ /// This column's data type will be a known-size vector of <see cref="System.Single"/>.</param>
+ /// <param name="inputColumnNames">Names of the multiple columns to take the data from.
+ /// This estimator operates over a vector of text.</param>
Review comment: cref=System.String? #ByDesign
Reply: If you take this change, please update all functions.
  /// <param name="ngramLength">Ngram length.</param>
  /// <param name="skipLength">Maximum number of tokens to skip when constructing an ngram.</param>
  /// <param name="useAllLengths">Whether to include all ngram lengths up to <paramref name="ngramLength"/> or only <paramref name="ngramLength"/>.</param>
@@ -339,12 +351,18 @@ public static WordBagEstimator ProduceWordBags(this TransformsCatalog.TextTransf
  outputColumnName, inputColumnNames, ngramLength, skipLength, useAllLengths, maximumNgramsCount, weighting);

  /// <summary>
- /// Produces a bag of counts of hashed ngrams in <paramref name="inputColumnName"/>
- /// and outputs bag of word vector as <paramref name="outputColumnName"/>
+ /// Create a <see cref="WordHashBagEstimator"/>, which maps the column specified in <paramref name="inputColumnName"/>
+ /// to a vector of counts of hashed n-grams in a new column named <paramref name="outputColumnName"/>.
Review comment: vector of hashed n-gram counts? #Resolved
  /// </summary>
- /// <param name="catalog">The text-related transform's catalog.</param>
- /// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param>
- /// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
+ /// <remarks>
+ /// <see cref="WordHashBagEstimator"/> is different from <see cref="NgramHashingEstimator"/> in that the former
+ /// tokenizes text internally and the latter takes tokenized text as input.
+ /// </remarks>
+ /// <param name="catalog">The transform's catalog.</param>
+ /// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
+ /// This column's data type will be a known-size vector of <see cref="System.Single"/>.</param>
+ /// <param name="inputColumnName">Name of the column to take the data from.
+ /// This estimator operates over a vector of text.</param>
  /// <param name="numberOfBits">Number of bits to hash into. Must be between 1 and 30, inclusive.</param>
  /// <param name="ngramLength">Ngram length.</param>
  /// <param name="skipLength">Maximum number of tokens to skip when constructing an ngram.</param>
@@ -371,12 +389,18 @@ public static WordHashBagEstimator ProduceHashedWordBags(this TransformsCatalog.
  maximumNumberOfInverts: maximumNumberOfInverts);

  /// <summary>
- /// Produces a bag of counts of hashed ngrams in <paramref name="inputColumnNames"/>
- /// and outputs bag of word vector as <paramref name="outputColumnName"/>
+ /// Create a <see cref="WordHashBagEstimator"/>, which maps the multiple columns specified in <paramref name="inputColumnNames"/>
+ /// to a vector of counts of hashed n-grams in a new column named <paramref name="outputColumnName"/>.
  /// </summary>
- /// <param name="catalog">The text-related transform's catalog.</param>
- /// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnNames"/>.</param>
- /// <param name="inputColumnNames">Name of the columns to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
+ /// <remarks>
+ /// <see cref="WordHashBagEstimator"/> is different from <see cref="NgramHashingEstimator"/> in that the former
+ /// tokenizes text internally and the latter takes tokenized text as input.
+ /// </remarks>
+ /// <param name="catalog">The transform's catalog.</param>
+ /// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnNames"/>.
+ /// This column's data type will be a known-size vector of <see cref="System.Single"/>.</param>
+ /// <param name="inputColumnNames">Names of the multiple columns to take the data from.
+ /// This estimator operates over a vector of text.</param>
  /// <param name="numberOfBits">Number of bits to hash into. Must be between 1 and 30, inclusive.</param>
  /// <param name="ngramLength">Ngram length.</param>
  /// <param name="skipLength">Maximum number of tokens to skip when constructing an ngram.</param>

@@ -10,11 +10,30 @@

  namespace Microsoft.ML.Transforms.Text
  {

  /// <summary>
- /// Produces a bag of counts of ngrams (sequences of consecutive words) in a given text.
- /// It does so by building a dictionary of ngrams and using the id in the dictionary as the index in the bag.
+ /// <see cref="IEstimator{TTransformer}"/> for the <see cref="ITransformer"/>.
  /// </summary>
+ /// <remarks>
+ /// <format type="text/markdown"><![CDATA[
+ /// | Output column data type | Vector of known-size of <xref:System.Single> |
+ ///
+ /// The resulting <xref:Microsoft.ML.ITransformer> creates a new column, named as specified in the output column name parameters, and
+ /// produces a vector of counts of n-grams (sequences of n consecutive words) from the given data.
+ /// It does so by building a dictionary of ngrams and using the id in the dictionary as the index in the bag.
+ ///
+ /// <xref:Microsoft.ML.Transforms.Text.WordBagEstimator> is different from <xref:Microsoft.ML.Transforms.Text.NgramExtractingEstimator>
Review comment: Same comment as above.
+ /// in that the former tokenizes text internally while the latter takes tokenized text as input.
+ /// See the See Also section for links to examples of the usage.
+ /// ]]>
+ /// </format>
+ /// </remarks>
+ /// <seealso cref="TextCatalog.ProduceWordBags(TransformsCatalog.TextTransforms, string, string, int, int, bool, int, NgramExtractingEstimator.WeightingCriteria)" />
+ /// <seealso cref="TextCatalog.ProduceWordBags(TransformsCatalog.TextTransforms, string, string[], int, int, bool, int, NgramExtractingEstimator.WeightingCriteria)" />
  public sealed class WordBagEstimator : IEstimator<ITransformer>
  {
  private readonly IHost _host;
@@ -182,9 +201,29 @@ public SchemaShape GetOutputSchema(SchemaShape inputSchema)
  }

  /// <summary>
- /// Produces a bag of counts of ngrams (sequences of consecutive words of length 1-n) in a given text.
- /// It does so by hashing each ngram and using the hash value as the index in the bag.
+ /// <see cref="IEstimator{TTransformer}"/> for the <see cref="ITransformer"/>.
  /// </summary>
+ /// <remarks>
+ /// <format type="text/markdown"><![CDATA[
+ /// | Output column data type | Vector of known-size of <xref:System.Single> |
+ ///
+ /// The resulting <xref:Microsoft.ML.ITransformer> creates a new column, named as specified in the output column name parameters, and
+ /// produces a vector of counts of n-grams (sequences of n consecutive words) from the given data.
+ /// It does so by hashing each ngram and using the hash value as the index in the bag.
+ ///
+ /// <xref:Microsoft.ML.Transforms.Text.WordHashBagEstimator> is different from <xref:Microsoft.ML.Transforms.Text.NgramHashingEstimator>
+ /// in that the former tokenizes text internally while the latter takes tokenized text as input.
+ /// See the See Also section for links to examples of the usage.
+ /// ]]>
+ /// </format>
+ /// </remarks>
+ /// <seealso cref="TextCatalog.ProduceHashedWordBags(TransformsCatalog.TextTransforms, string, string, int, int, int, bool, uint, bool, int)" />
+ /// <seealso cref="TextCatalog.ProduceHashedWordBags(TransformsCatalog.TextTransforms, string, string[], int, int, int, bool, uint, bool, int)" />
  public sealed class WordHashBagEstimator : IEstimator<ITransformer>
  {
  private readonly IHost _host;
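
The remarks added to the two estimator classes describe how each one picks a slot in the output vector: the id of the n-gram in a dictionary in one case, the hash value of the n-gram in the other. The following is a minimal, self-contained sketch of that difference. It does not call ML.NET: the class `WordBagSketch`, its method names, the whitespace tokenization, and the FNV-1a hash are all assumptions made for illustration, and the tokenizer and hash function that ML.NET actually uses are not shown in this diff.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration only; none of these names exist in ML.NET.
public static class WordBagSketch
{
    // Splits on spaces and joins each run of n consecutive tokens into one n-gram.
    public static IEnumerable<string> Ngrams(string text, int n)
    {
        string[] tokens = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i + n <= tokens.Length; i++)
            yield return string.Join(" ", tokens, i, n);
    }

    // Dictionary variant, fitting pass: each distinct n-gram gets the next free id.
    public static Dictionary<string, int> Fit(IEnumerable<string> corpus, int n)
    {
        var ids = new Dictionary<string, int>();
        foreach (string text in corpus)
            foreach (string gram in Ngrams(text, n))
                if (!ids.ContainsKey(gram))
                    ids.Add(gram, ids.Count);
        return ids;
    }

    // Dictionary variant, transform pass: the dictionary id is the index in the bag.
    public static float[] DictionaryBag(string text, int n, Dictionary<string, int> ids)
    {
        var bag = new float[ids.Count];
        foreach (string gram in Ngrams(text, n))
            if (ids.TryGetValue(gram, out int slot))
                bag[slot] += 1f; // n-grams not seen during fitting are dropped
        return bag;
    }

    // Hashed variant: no fitting pass; the index is the hash reduced to numberOfBits bits.
    public static float[] HashedBag(string text, int n, int numberOfBits)
    {
        var bag = new float[1 << numberOfBits];
        uint mask = (uint)(bag.Length - 1);
        foreach (string gram in Ngrams(text, n))
            bag[Fnv1a(gram) & mask] += 1f; // distinct n-grams may collide in one slot
        return bag;
    }

    // FNV-1a over UTF-16 code units; a stand-in hash chosen for determinism.
    private static uint Fnv1a(string s)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in s)
            {
                hash ^= c;
                hash *= 16777619;
            }
            return hash;
        }
    }

    public static void Main()
    {
        string[] corpus = { "the cat sat", "the cat ran" };
        Dictionary<string, int> ids = Fit(corpus, 2);

        float[] dictBag = DictionaryBag("the cat sat on the cat", 2, ids);
        Console.WriteLine(string.Join(", ", dictBag)); // prints "2, 1, 0"

        float[] hashBag = HashedBag("the cat sat on the cat", 2, 4);
        Console.WriteLine(hashBag.Length + " slots, total count " + hashBag.Sum()); // prints "16 slots, total count 5"
    }
}
```

In this sketch the dictionary variant sizes its output from the fitted vocabulary and ignores unseen n-grams, while the hashed variant fixes the output size up front from the number of bits and accepts collisions in exchange for skipping the vocabulary pass.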
Review comment: I would just do xref:Microsoft.ML.Data.KeyDataViewType #ByDesign
Reply: I think it's desirable to keep it as Ivan did; at least I believe that's the convention for key type: name it key, and link the KeyDataViewType.