-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Transform wrappers and a reference implementation for tokenizers #931
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
text: ctx.LoadFloat(1)), hasHeader: true) | ||
.Read(new MultiFileSource(sentimentDataPath)); | ||
|
||
var est = new WordTokenizer(Env, "text", "words") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WordTokenizer [](start = 26, length = 13)
Whats the purpose of TransformWrapper? Is it to quickly manufacture transformers from existing transform? because I don't see the Pigsty extensions to it? Will it require additional work? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is exactly for that.
Pigsty comes extra, see the subsequent iterations
In reply to: 218245812 [](ancestors = 218245812)
/// <param name="env">The environment.</param> | ||
/// <param name="columns">Pairs of columns to run the tokenization on.</param> | ||
/// <param name="advancedSettings">Any advanced settings to be applied.</param> | ||
public CharacterTokenizer(IHostEnvironment env, (string input, string output)[] columns, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(string input, string output) [](start = 56, length = 29)
name-tuple array is sometime quite confusing when passing arguments e.g. when there are 2 or more tuples.
Also, maybe you are missing params
keyword. #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the pattern for the transformer conversion so far.
params
is not possible here because of the useMarkerCharacters
In reply to: 218246352 [](ancestors = 218246352)
|
||
/// <summary> | ||
/// A fake schema that is manufactured out of a SchemaShape. | ||
/// It will pretend that all vector sizes are equal to 10, all key value counts are equal to 10, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
10 [](start = 59, length = 2)
Is there any specific reason for choosing 10?...:) #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be more than one, that's the only limitation I'm aware of.
In reply to: 218253037 [](ancestors = 218253037,218248145)
{ | ||
ColumnType curType = inputCol.ItemType; | ||
if (inputCol.IsKey) | ||
curType = new KeyType(curType.AsPrimitive.RawKind, 0, 10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
10 [](start = 70, length = 2)
I see it being used at multiple places. Can we define a const
for it? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
var dataPipe = _xf; | ||
var transforms = new List<IDataTransform>(); | ||
while (dataPipe is IDataTransform xf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dataPipe is IDataTransform xf [](start = 19, length = 29)
It seems only for linear chain like pipeline. Is it possible that pipelines have graph like transformation structure? #Closed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -2,6 +2,7 @@ | |||
// The .NET Foundation licenses this file to you under the MIT license. | |||
// See the LICENSE file in the project root for more information. | |||
|
|||
using Microsoft.ML.Transforms.Text; | |||
using Microsoft.ML.Data.StaticPipe; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhhh. So nice to see a nice fresh PR. And you know what I'm going to say as my very first comment?
You know as well as I do. I always do it of course. I'd even go so far as to call it a tradition. A hoary tradition, by this point.
Just pausing for a moment, to relish the moment. You understand. :D Because, you know, I figure I deserve it at this point. It may seem a little self indulgent, and I suppose it is, but you know, I have to do it. Because, without it, I just wouldn't be me, you feel me?
So here it is:
Do you want to sort your using
s? :) #Resolved
/// <param name="inputColumn">The column containing text to tokenize.</param> | ||
/// <param name="outputColumn">The column containing output tokens. Null means <paramref name="inputColumn"/> is replaced.</param> | ||
/// <param name="separators">The separators to use (comma separated).</param> | ||
public WordTokenizer(IHostEnvironment env, string inputColumn, string outputColumn = null, string separators = "space") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
string separators = "space" [](start = 99, length = 27)
This is a great thing to have as the arguments to a command line. To an API, we should use actual char
or char[]
. #ByDesign
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started with char
, then I discovered that multiple separators can be used.
I am all for making opportunistic changes to the transforms to make them friendlier and more API-appropriate, but only if this doesn't take too much extra time / lines of code.
I also put a REVIEW somewhere saying exactly that.
In reply to: 218261276 [](ancestors = 218261276)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to not block this PR on this issue though.
In reply to: 218288071 [](ancestors = 218288071,218261276)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just looked at the code changes. They are not too big, but I'd rather file an issue and do it later than do it now. Are you OK with that?
In reply to: 218288153 [](ancestors = 218288153,218288071,218261276)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a comment on line 55 and will create a GH issue
In reply to: 218289358 [](ancestors = 218289358,218288153,218288071,218261276)
using Microsoft.ML.Runtime.Data; | ||
using Microsoft.ML.Runtime.Internal.Utilities; | ||
using System.Collections.Generic; | ||
using System.Linq; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to follow on the list of people that get picky about using sorting, i have seen both Microsoft.ML than System, and vice-versa.
we gotta sort this out #ByDesign
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest we use what VS does when you do Ctrl-K-G (sort usings). It sorts alphabetically.
In reply to: 218328459 [](ancestors = 218328459)
/// <summary> | ||
/// Estimator for untrained wrapped transformers. | ||
/// </summary> | ||
public abstract class TrivialWrapperEstimator : TrivialEstimator<TransformWrapper> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TrivialWrapperEstimator [](start = 26, length = 23)
why not call it like the comment: UntrainedWrapperEstimator #ByDesign
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this name. 'Untrained' could mean 'not yet trained', or 'not trainable at all'. The current name 'trivial wrapper estimator' makes more sense as it's a TrivialEstimator of a Wrapper.
In reply to: 218329322 [](ancestors = 218329322)
…learning into feature/transform-wrapper
/// </summary> | ||
/// <param name="input">The column to apply to.</param> | ||
/// <param name="useMarkerCharacters">Whether to use marker characters to separate words.</param> | ||
public static VarVector<Key<ushort, string>> TokenizeIntoCharacters(this Scalar<string> input, bool useMarkerCharacters = true) => new OutPipelineColumn(input, useMarkerCharacters); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this Scalar [](start = 76, length = 20)
I think char tokenizer also works on vector of string. Do you plan to support that extension as well? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I like that implicit behavior. For now, I would rather only have uncontroversial Pigsty extensions, go by what makes sense vs. what is currently supported/possible
In reply to: 218547643 [](ancestors = 218547643)
/// </summary> | ||
/// <param name="input">The column to apply to.</param> | ||
/// <param name="separators">The separators to use (comma separated).</param> | ||
public static VarVector<string> TokenizeText(this Scalar<string> input, string separators = "space") => new OutPipelineColumn(input, separators); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"space" [](start = 100, length = 7)
I think we should make them actual space(' ') or tab ('\t') on the front facing API instead of using keywords. What do you think? #ByDesign
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not know there is an issue already to handle that. I am ok with that.
In reply to: 218574836 [](ancestors = 218574836,218548057)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// <param name="columns">Pairs of columns to run the tokenization on.</param> | ||
/// <param name="useMarkerCharacters">Whether to use marker characters to separate words.</param> | ||
public CharacterTokenizer(IHostEnvironment env, (string input, string output)[] columns, bool useMarkerCharacters = true) | ||
: base(Contracts.CheckRef(env, nameof(env)).Register(nameof(WordTokenizer)), MakeTransformer(env, columns, useMarkerCharacters)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WordTokenizer [](start = 72, length = 13)
CharacterTokenizer??? #Resolved
/// <summary> | ||
/// Estimator for trained wrapped transformers. | ||
/// </summary> | ||
internal abstract class TrainedWrapperEstimatorBase : IEstimator<TransformWrapper> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
internal [](start = 4, length = 8)
Is it internal
by design? I need to use it to convert Ngram transform.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
public abstract TransformWrapper Fit(IDataView input); | ||
|
||
public SchemaShape GetOutputSchema(SchemaShape inputSchema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
public [](start = 8, length = 6)
we should all make it a habit of documenting everything public .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it apply to interface methods? IEstimator.GetOutputSchema
is pretty well documented.
In reply to: 218630155 [](ancestors = 218630155)
public void Tokenize() | ||
{ | ||
var env = new ConsoleEnvironment(seed: 0); | ||
var dataPath = GetDataPath("wikipedia-detox-250-line-data.tsv"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
h("wikipedia-detox-250-line-data.tsv" [](start = 37, length = 37)
should use the datasets from the TestDatasets.cs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I merged before I saw your comments. Will do in one of the next PRs.
In reply to: 218631455 [](ancestors = 218631455)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No description provided.