OutputTokens option in FeaturizeText API #2985

Merged
merged 6 commits into from
Mar 19, 2019

@abgoswam (Member) commented Mar 15, 2019

@codecov bot commented Mar 15, 2019

Codecov Report

Merging #2985 into master will increase coverage by <.01%.
The diff coverage is 90%.

@@            Coverage Diff             @@
##           master    #2985      +/-   ##
==========================================
+ Coverage   72.35%   72.35%   +<.01%     
==========================================
  Files         803      803              
  Lines      143296   143296              
  Branches    16155    16155              
==========================================
+ Hits       103675   103678       +3     
+ Misses      35194    35192       -2     
+ Partials     4427     4426       -1
Flag Coverage Δ
#Debug 72.35% <90%> (ø) ⬆️
#production 68.06% <71.42%> (ø) ⬆️
#test 88.52% <100%> (ø) ⬆️
Impacted Files Coverage Δ
test/Microsoft.ML.Functional.Tests/Debugging.cs 100% <100%> (ø) ⬆️
...osoft.ML.Tests/Transformers/TextFeaturizerTests.cs 99.56% <100%> (ø) ⬆️
...oft.ML.Transforms/Text/TextFeaturizingEstimator.cs 83.2% <71.42%> (ø) ⬆️
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs 85.31% <0%> (+0.6%) ⬆️

@abgoswam changed the title from "WIP : OutputTokens option in FeaturizeText API" to "OutputTokens option in FeaturizeText API" Mar 18, 2019
@@ -111,8 +111,8 @@ public sealed class Options : TransformInputBase
[Argument(ArgumentType.AtMostOnce, HelpText = "Whether to keep numbers or remove them.", ShortName = "num", SortOrder = 8)]
public bool KeepNumbers = TextNormalizingEstimator.Defaults.KeepNumbers;

- [Argument(ArgumentType.AtMostOnce, HelpText = "Whether to output the transformed text tokens as an additional column.", ShortName = "tokens,showtext,showTransformedText", SortOrder = 9)]
- public bool OutputTokens;
+ [Argument(ArgumentType.AtMostOnce, HelpText = "Column containing the transformed text tokens.", ShortName = "OutputTokens,tokens,showtext,showTransformedText", SortOrder = 9)]

@Ivanidzo4ka (Contributor) commented Mar 18, 2019

OutputTokens,tokens,showtext,showTransformedText

Not sure it makes much sense to preserve the old names, since you basically changed the logic of this field. #Resolved

@Ivanidzo4ka (Contributor) commented Mar 18, 2019

                verReadableCur: 0x00010001,

I would verify that your changes don't affect loading/saving of old and new models. #Resolved

Refers to: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:675 in 5d6a2f6.

@@ -43,7 +43,7 @@ public void TextFeaturizerWorkout()
.AsDynamic;

var feat = data.MakeNewEstimator()
- .Append(row => row.text.FeaturizeText(options: new TextFeaturizingEstimator.Options { OutputTokens = true, }));
+ .Append(row => row.text.FeaturizeText(options: new TextFeaturizingEstimator.Options { OutputTokensColumnName = "Data_TransformedText", }));

@Ivanidzo4ka (Contributor) commented Mar 18, 2019

Data_TransformedText

No need for the _ anymore; you can name it whatever you want! #Resolved

@abgoswam (Member, Author) commented Mar 18, 2019

                verReadableCur: 0x00010001,

I verified it and it works.

  • Using the master branch, I saved the model used by the test InspectIntermediatePipelineSteps. The test sets OutputTokens=true and validates that the transformed data has the tokens column Features_TransformedText.

  • I then loaded the model using this branch and used the loaded model to confirm that InspectIntermediatePipelineSteps passes the same checks. Note: since I am using the saved model, I commented out the pipeline creation and model fitting steps when running the test on this PR branch.

The test passed, confirming that this change does not affect loading or saving of old models.

I believe this works because, when saving the model, the FeaturizeText API saves any sub-models it used during training, and when loading a saved model it simply loads those saved sub-models. We have only modified the high-level flag used to indicate saving the tokenization sub-model. (Line 255)


In reply to: 474023125

Refers to: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:675 in 5d6a2f6.

@Ivanidzo4ka (Contributor) left a comment

:shipit:

@@ -35,7 +35,7 @@ public static void Example()
{
KeepPunctuations = false,
KeepNumbers = false,
- OutputTokens = true,
+ OutputTokensColumnName = "OutputTokens",

@rogancarr (Contributor) commented Mar 18, 2019

OutputTokens

TokenizedText perhaps? #Pending

@rogancarr (Contributor) commented Mar 18, 2019

        OutputColumn = name;

Check for collision of OutputTokensColumnName with OutputColumn. #Pending


Refers to: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:334 in 40eb3f4.
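
For illustration, the collision check requested above could look roughly like the sketch below. It is not the code from this PR or from the follow-up fix: `ColumnNameChecks` and `EnsureDistinct` are hypothetical names, and treating a null tokens column name as "no tokens column requested" is an assumption made for the sketch.

```csharp
using System;

// Hypothetical helper; not part of ML.NET.
public static class ColumnNameChecks
{
    // Throws if the tokens column would be written under the same name as the
    // main output column. A null tokens column name is assumed to mean that
    // no tokens column was requested, so there is nothing to check.
    public static void EnsureDistinct(string outputColumnName, string outputTokensColumnName)
    {
        if (outputTokensColumnName == null)
            return;

        if (string.Equals(outputColumnName, outputTokensColumnName, StringComparison.Ordinal))
        {
            throw new ArgumentException(
                "The tokens column name must differ from the output column name '" + outputColumnName + "'.",
                nameof(outputTokensColumnName));
        }
    }
}
```

Under these assumptions, a call such as `ColumnNameChecks.EnsureDistinct("Features", "Features")` would throw, while `ColumnNameChecks.EnsureDistinct("Features", "TokenizedText")` would pass.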

@rogancarr (Contributor) left a comment

Needs one safety check.

☘️

@abgoswam (Member, Author) commented

        OutputColumn = name;

Will treat it as a separate bug fix (#3002).

I do not want to block this PR on it, since we are near code complete for the Project 13 issues.


In reply to: 474100886


Refers to: src/Microsoft.ML.Transforms/Text/TextFeaturizingEstimator.cs:334 in 40eb3f4.

@abgoswam abgoswam merged commit fbbc222 into dotnet:master Mar 19, 2019
@abgoswam abgoswam deleted the abgoswam/featurize_text branch March 20, 2019 20:13
@ghost ghost locked as resolved and limited conversation to collaborators Mar 23, 2022
Successfully merging this pull request may close these issues.

FeaturizeText outputTokens uses a magical string to name a new column
3 participants