Skip to content

XML documentation for five text related transforms #3418

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 19, 2019

Conversation

artidoro
Copy link
Contributor

@artidoro artidoro commented Apr 18, 2019

Tracked by #3204.

This PR covers:

  • TokenizeCharacters
  • NormalizeText
  • ExtractWordEmbeddings
  • TokenizeWords
  • ProduceNgrams

@artidoro artidoro self-assigned this Apr 18, 2019
@artidoro artidoro added the documentation Related to documentation of ML.NET label Apr 18, 2019
@codecov
Copy link

codecov bot commented Apr 18, 2019

Codecov Report

Merging #3418 into master will increase coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3418      +/-   ##
==========================================
+ Coverage    72.7%   72.73%   +0.02%     
==========================================
  Files         807      807              
  Lines      145171   145206      +35     
  Branches    16225    16230       +5     
==========================================
+ Hits       105551   105613      +62     
+ Misses      35201    35176      -25     
+ Partials     4419     4417       -2
Flag Coverage Δ
#Debug 72.73% <100%> (+0.02%) ⬆️
#production 68.27% <100%> (+0.03%) ⬆️
#test 88.98% <ø> (ø) ⬆️
Impacted Files Coverage Δ
...osoft.ML.Transforms/Text/TokenizingByCharacters.cs 95.32% <ø> (ø) ⬆️
...soft.ML.Transforms/Text/WordEmbeddingsExtractor.cs 87.52% <ø> (ø) ⬆️
src/Microsoft.ML.Transforms/Text/WordTokenizing.cs 79.15% <ø> (ø) ⬆️
src/Microsoft.ML.Transforms/Text/TextCatalog.cs 41.66% <ø> (ø) ⬆️
src/Microsoft.ML.Transforms/Text/NgramUtils.cs 89.25% <ø> (ø) ⬆️
src/Microsoft.ML.Transforms/Text/NgramTransform.cs 87.42% <ø> (ø) ⬆️
...rc/Microsoft.ML.Transforms/Text/TextNormalizing.cs 97.11% <100%> (ø) ⬆️
src/Microsoft.ML.Core/Data/ProgressReporter.cs 70.95% <0%> (-6.99%) ⬇️
src/Microsoft.ML.Maml/MAML.cs 24.75% <0%> (-1.46%) ⬇️
src/Microsoft.ML.Data/Transforms/KeyToValue.cs 79.16% <0%> (-0.65%) ⬇️
... and 19 more

/// | Input column data type | Vector of [Keys](<xref:Microsoft.ML.Data.KeyDataViewType>) |
/// | Output column data type | Known-sized vector of <xref:System.Single> |
///
/// The resulting [NgramExtractingTransformer]<xref:Microsoft.ML.Transforms.Text.NgramExtractingTransformer>
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NgramExtractingTransformer] [](start = 23, length = 27)

If you make it as link you should put <xref> in braces [text](url) #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And i'm not sure we even want make it url, what is wrong with just <xref>


In reply to: 276868333 [](ancestors = 276868333)

Copy link

@shmoradims shmoradims Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the anchor text is the same as the class/type name, you don't really need to use anchor pattern. You can save some time by using autolinks: xref:UID

also we don't have []<> pattern. this won't work as is.


In reply to: 276868428 [](ancestors = 276868428,276868333)

[EnumValueDisplay("TF (Term Frequency)")]
Tf = 0,

/// <summary>Inverse Document Frequency. A ratio (the logarithm of inverse relative frequency)
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// <summary [](start = 12, length = 12)

thank you for adding this.
Should we make it

<summary>
text
</summary>

? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you are right!


In reply to: 276869289 [](ancestors = 276869289)

/// | | |
/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | Yes |
/// | Input column data type | <xref:System.ReadOnlyMemory{System.Char}> |
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xref:System.ReadOnlyMemory{System.Char} [](start = 35, length = 41)

Scalar or vector ? #Resolved

/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param>
/// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
/// This column's data type will be a vector of keys.</param>
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vector [](start = 46, length = 6)

unknown size vector #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is variable-sized vector fine?


In reply to: 276869836 [](ancestors = 276869836)

/// Normalizes incoming text in <paramref name="inputColumnName"/> by changing case, removing diacritical marks, punctuation marks and/or numbers
/// and outputs new text as <paramref name="outputColumnName"/>.
/// Creates a <see cref="TextNormalizingEstimator"/>, which normalizes incoming text in <paramref name="inputColumnName"/> by changing case,
/// removing diacritical marks, punctuation marks and/or numbers and outputs new text as <paramref name="outputColumnName"/>.
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it reads as it's doing casing removing diacritics and punctuations, and only for numbers you have option to decide. #Closed

/// This column's data type will remain scalar of text or a vector of text depending on the input column data type.</param>
/// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>,
/// the value of the <paramref name="outputColumnName"/> will be used as source.
/// This estimator operates on text and vector of text data types.</param>
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and [](start = 44, length = 3)

or? #Closed

/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param>
/// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
/// The output column is of type variable vector of string.</param>
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

string [](start = 60, length = 6)

text here and below #WontFix

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not public member I will not fix it for now.


In reply to: 276870603 [](ancestors = 276870603)

/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param>
/// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
/// This column's data type will be a vector of text.</param>
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vector [](start = 46, length = 6)

unknown size vector #ByDesign

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is variable-sized fine?


In reply to: 276870655 [](ancestors = 276870655)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

variable-sized (as discussed in the chat)


In reply to: 277070083 [](ancestors = 277070083,276870655)

@@ -124,10 +130,16 @@ public static class TextCatalog
=> new TextNormalizingEstimator(Contracts.CheckRef(catalog, nameof(catalog)).GetEnvironment(),
outputColumnName, inputColumnName, caseMode, keepDiacritics, keepPunctuations, keepNumbers);

/// <include file='doc.xml' path='doc/members/member[@name="WordEmbeddings"]/*' />
Copy link

@shmoradims shmoradims Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'doc.xml' [](start = 26, length = 9)

if doc.xml is no longer used, please remove the file #Resolved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The file is still used by the trasnforms that Ivan, and senja are workign on . I would rather remove it at the end, and not modify it to avoid conflicts.


In reply to: 277032552 [](ancestors = 277032552)

/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
/// This column's data type will be a vector of <see cref="System.Single"/>.</param>
/// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>,
/// the value of the <paramref name="outputColumnName"/> will be used as source.
Copy link

@shmoradims shmoradims Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

source [](start = 81, length = 6)

let's not use 'source'. just drop 'as source'. it reads fine without it. #Pending

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can modify it here, but it's everywhere, that's the pattern that a lot of our transforms follow.


In reply to: 277032960 [](ancestors = 277032960)

@@ -161,10 +179,13 @@ public static class TextCatalog
=> new WordEmbeddingEstimator(Contracts.CheckRef(catalog, nameof(catalog)).GetEnvironment(),
outputColumnName, customModelFile, inputColumnName ?? outputColumnName);

/// <include file='doc.xml' path='doc/members/member[@name="WordEmbeddings"]/*' />
/// <summary>
/// Create an <see cref="WordEmbeddingEstimator"/>, which is a text featurizer that converts vectors
Copy link

@shmoradims shmoradims Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vectors [](start = 101, length = 7)

singular or plural? if it's plural, please say 'multiple vectors' or 'one or more vectors' #Resolved

/// <param name="ngramLength">Ngram length.</param>
/// <param name="skipLength">Maximum number of tokens to skip when constructing an ngram.</param>
/// <param name="skipLength">Number of tokens to skip between each ngram. By defaults no token is skipped.</param>
Copy link

@shmoradims shmoradims Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

defaults [](start = 85, length = 8)

default (no s) #Resolved

/// | | |
/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | Yes |
/// | Input column data type | <xref:System.ReadOnlyMemory{System.Char}> |
Copy link

@shmoradims shmoradims Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xref:System.ReadOnlyMemory{System.Char} [](start = 35, length = 41)

we might need to change this. i'll let you know. #Resolved

/// | | |
/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | No |
/// | Input column data type | Vector of <xref:System.ReadOnlyMemory{System.Char}> |
Copy link

@shmoradims shmoradims Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<x [](start = 45, length = 2)

please add [text] as anchor-text #Resolved

/// | Input column data type | Vector of <xref:System.ReadOnlyMemory{System.Char}> |
/// | Output column data type | Known-sized vector of <xref:System.Single> |
///
/// The [WordEmbeddingTransformer](<xref:Microsoft.ML.Transforms.Text.WordEmbeddingTransformer>) produces a new column,
Copy link

@shmoradims shmoradims Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(xref:Microsoft.ML.Transforms.Text.WordEmbeddingTransformer) [](start = 38, length = 62)

fix #Resolved

@@ -795,33 +826,43 @@ internal WordEmbeddingEstimator(IHostEnvironment env, string customModelFile, pa
/// </summary>
public enum PretrainedModelKind
{
/// <summary>GloVe 50 dimensional word embeddings.</summary>
Copy link

@shmoradims shmoradims Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

///

GloVe 50 dimensional word embeddings. [](start = 12, length = 60)

///


/// GloVe 50 dimensional word embeddings.

#Resolved

/// | | |
/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | No |
/// | Input column data type | <xref:System.ReadOnlyMemory{System.Char}> |
Copy link

@shmoradims shmoradims Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[](start = 34, length = 1)

add [text] for anchor text #Resolved

Copy link

@shmoradims shmoradims left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
/// The output column is of type variable vector of string.</param>
/// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.
/// This column should be of type string.</param>
/// <param name="separators">The separators to use (uses space character by default).</param>
Copy link

@shmoradims shmoradims Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please don't change 'internal' stuff. it's extra work for you and for review. #Resolved

/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.</param>
/// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
/// This column's data type will be a vector of <see cref="System.Single"/>.</param>
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vector [](start = 46, length = 6)

known-sized #Resolved

/// <param name="inputColumnName">Name of the column to transform.</param>
/// <param name="customModelFile">The path of the pre-trained embeddings model to use.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
/// This column's data type will be a vector of <see cref="System.Single"/>.</param>
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vector [](start = 46, length = 6)

known-sized #Resolved

/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
/// This column's data type will be a variable-sized vector of text.</param>
/// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.
/// This estimator operates of text data type.</param>
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

text data type [](start = 39, length = 14)

we support just scalar or vector #Resolved

/// | | |
/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | No |
/// | Input column data type | [Text](xref:Microsoft.ML.Data.TextDataViewType) or Vector of [Text](xref:Microsoft.ML.Data.TextDataViewType) |
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[](start = 82, length = 2)

double space #Resolved

/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | No |
/// | Input column data type | [Text](xref:Microsoft.ML.Data.TextDataViewType) or Vector of [Text](xref:Microsoft.ML.Data.TextDataViewType) |
/// | Output column data type | The same as the data type in the input column |
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same as the data type in the input column [](start = 36, length = 45)

We drop empty strings, so your vector would end-up variable vector. #Resolved

@Ivanidzo4ka
Copy link
Contributor

Ivanidzo4ka commented Apr 19, 2019

            result[colInfo.outputColumnName] = new SchemaShape.Column(colInfo.outputColumnName, col.Kind == SchemaShape.Column.VectorKind.Vector ? SchemaShape.Column.VectorKind.VariableVector : SchemaShape.Column.VectorKind.Scalar, col.ItemType, false);

We don't support variable vector? That's weird. #Resolved


Refers to: src/Microsoft.ML.Transforms/Text/TextNormalizing.cs:545 in 3f4b997. [](commit_id = 3f4b997, deletion_comment = False)

@Ivanidzo4ka
Copy link
Contributor

            result[colInfo.outputColumnName] = new SchemaShape.Column(colInfo.outputColumnName, col.Kind == SchemaShape.Column.VectorKind.Vector ? SchemaShape.Column.VectorKind.VariableVector : SchemaShape.Column.VectorKind.Scalar, col.ItemType, false);

I mean if it's vector make it variable vector and in any other case let it be scalar.
Looks like bug to me.


In reply to: 485010348 [](ancestors = 485010348)


Refers to: src/Microsoft.ML.Transforms/Text/TextNormalizing.cs:545 in 3f4b997. [](commit_id = 3f4b997, deletion_comment = False)

/// | | |
/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | Yes |
/// | Input column data type | Scalar of [Text](xref:Microsoft.ML.Data.TextDataViewType) |
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scalar [](start = 35, length = 6)

I don't see any checks for that. We just check that type is text, which can be true for Scalar or Vector. #Resolved

/// | | |
/// | -- | -- |
/// | Does this estimator need to look at the data to train its parameters? | No |
/// | Input column data type | Scalar of [Text](xref:Microsoft.ML.Data.TextDataViewType) |
Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka Apr 19, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scalar [](start = 35, length = 6)

Check IsColumnTypeValid() we don't care about is it scalar or vector as long as underlying type is Text #Resolved

@artidoro
Copy link
Contributor Author

            result[colInfo.outputColumnName] = new SchemaShape.Column(colInfo.outputColumnName, col.Kind == SchemaShape.Column.VectorKind.Vector ? SchemaShape.Column.VectorKind.VariableVector : SchemaShape.Column.VectorKind.Scalar, col.ItemType, false);

Good catch just fixed it.


In reply to: 485010465 [](ancestors = 485010465,485010348)


Refers to: src/Microsoft.ML.Transforms/Text/TextNormalizing.cs:545 in 3f4b997. [](commit_id = 3f4b997, deletion_comment = False)

Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Copy link
Contributor

@Ivanidzo4ka Ivanidzo4ka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@artidoro artidoro merged commit 7d26e61 into dotnet:master Apr 19, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 22, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
documentation Related to documentation of ML.NET
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants