Fixed the TextTransform bug where chargrams were being computed differently when using with/without word tokenizer. #548
Conversation
…erently for different settings.
Shall we bump the version if we change the transform implementation? @TomFinley #Resolved

Refers to: src/Microsoft.ML.Transforms/Text/CharTokenizeTransform.cs:71 in 868a72f.
values[index++] = TextStartMarker;

if (i > 0)
    values[index++] = ' ';
`' '`
I'm struggling a bit to understand this solution.
Correct me if I'm wrong.
You get a sentence broken into pieces by the word extractor.
You assume it was broken on the space character, so you just add a space to the chargram.
You distinguish the processed/not-processed-by-wordgram cases by the size of your vector; is that always true?
You handle only space; what if I broke the words on a comma, tab, or semicolon?
#Resolved
Right. Space was not selected as the breaking character between tokens in a vector explicitly because it was quite likely to occur within individual tokens. This is why we chose a character unlikely to occur in text (vs. one of the most frequently observed characters) in the first place. :)
In reply to: 203213571
So, there is no way to get the token that was used for tokenization? #Resolved
In reality, if a statistical technique for word tokenization is applied, then there is no reliable way to recover the actual text after it has been tokenized.
So the use of space is just an approximation; otherwise it is not possible to get the actual text string back. #Resolved
> So, there is no way to get the token that was used for tokenization?
No. Indeed, you have absolutely no idea whether the tokenization scheme involved breaking on single characters at all; a vector of text could have come from anywhere. More elaborate tokenization schemes exist that involve an actual language model, and so are considerably more involved than a simple `string.Split(char)`, which of course is the assumption made by this "fix."
I suggest that we not try to get into the business of guessing how the data that was fed into a transform was produced, since doing so degrades the system. It would do more harm than good: we've built a modular system of transforms and data pipelines, and the consequence of a modular system is that you must accept that your data could come from literally anywhere.
What I believe is a reasonable compromise is that some sort of special character be used instead of space... perhaps a character not used often in text, but that is nonetheless specially reserved to indicate the start and end of a given block of text.
Happily, I believe that was the state of affairs of this transform in its existing implementation, before it was "fixed." 😄 #Resolved
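The point above, that the fix implicitly assumes the tokens came from a simple split on a single space, can be seen in a small sketch. This is illustrative Python rather than ML.NET code, and `rejoin_with_space` is a hypothetical helper:

```python
def rejoin_with_space(tokens):
    """Naive reconstruction that assumes tokens were split on a single space."""
    return " ".join(tokens)

original = "red,green,blue"
tokens = original.split(",")          # a tokenizer that breaks on commas instead
restored = rejoin_with_space(tokens)
print(restored)                       # red green blue
print(restored == original)           # False: the delimiter information is gone
```

Once tokenized, the original delimiters are unrecoverable, which is exactly why guessing at them is lossy.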
Sorry, I misclicked and marked this as approve, which is not correct.
Thanks @zeahmed. This code is problematic. In particular, treating a vector of text as a distinct scenario from a scalar input text is actually deliberate, as we may guess from the fact that it was coded that way.
The solution to the problem of "make transform X behave the same whether or not transform Y was applied ahead of it" is for the user to simply not freakin' apply transform Y. That is unambiguously the correct solution, since it will also be clearer and more efficient.
@TomFinley, actually that's not the solution in the context of TextTransform... :) In TextTransform, it is quite often desirable to compute both wordgrams and chargrams. If wordgrams are being computed, then the text will be tokenized and the character tokenizer will see a vector of text instead of just scalar text. The text will also be tokenized when the "stop words removal" option is set, where the user wants to remove stop words before computing chargrams. So the character tokenizer will compute chargrams that are not valid chargrams (cf. #530). The first case can be avoided by applying the character tokenizer directly on the original text column (before the word tokenizer). However, the second case cannot be avoided. That's the only solution I see right now. #Closed
Certainly if we were to change the transform along these lines we would have to do that. Backcompat is important.

In reply to: 405757792

Refers to: src/Microsoft.ML.Transforms/Text/CharTokenizeTransform.cs:71 in 868a72f.
The original issue was: the chargrams produced differed depending on whether n-grams were also produced. This oddity should be fixed. I'm unsure of the order we run the transforms in, but I assume we can run as:
Let me know if I'm wrong...
I think to produce consistency we can have the character tokenizer read from
This still produces the artifact of
One stronger way to produce consistency is to always have the character tokenizer read from
I prefer [Option A]. Thoughts @Ivanidzo4ka, @TomFinley, @zeahmed? #Resolved
I'm not so familiar with the Text transform pipeline, but I would prefer not to apply the char tokenizer on top of the word tokenizer. They should be treated independently and consume the same data source.

In reply to: 405858371
Hi @justinormont , a single character delimiter to represent a transition vs. two seems fine and beneficial to me. The unit separator seems like a fine choice, considering its description. Of course we must ensure that we don't break backcompat of previously serialized and composed models.
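For reference, the "unit separator" mentioned here is the ASCII control character U+001F (INFORMATION SEPARATOR ONE), which is reserved for delimiting units of data and essentially never occurs in natural text. A quick check, in Python purely for illustration:

```python
import unicodedata

US = "\u001f"  # ASCII unit separator (INFORMATION SEPARATOR ONE)

print(ord(US))                   # 31
print(unicodedata.category(US))  # Cc: a control character
print(US.isprintable())          # False: it does not occur in printable text
```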
Then you have one path that has stopwords removed, and another that does not. I don't see how that is more consistent. I don't even know what we're arguing is inconsistent in the current scheme.
@Ivanidzo4ka, why? I wasn't involved in engineering this, but it seems like the right choice for the n-grams (for words or chars) to happen over the post-processed data (including stopword removal, if requested), rather than prior to processing. #Resolved
I think this is in my current fix in this PR except that
The only problem I see from the previous comments is whether it is good to do this fix in CharTokenizeTransform or not. In my opinion, it should be done in TextTransform. However, that does not seem possible unless there is some kind of string concat transform that can concatenate strings with a separator (let me know if there is a way to do that?).
@justinormont, yes, [Option A] does seem a reasonable approach with
Thanks @justinormont, @Ivanidzo4ka and @TomFinley for your useful insights, comments and feedback.

In reply to: 406017378
Right. That's a big difference, and a poor choice, since the space is likely to occur in actual text. Not to harp on this, but the major problem with your PR, of course, is that you broke backwards compatibility for models. #Resolved
Currently, if word tokenization is requested, TextNormalization is applied after word tokenization. Should it be reversed? Is there any significance to TextNormalization being more effective after word tokenization is applied?

In reply to: 405858371
Following are the changes in the new commit.
    dst = new VBuffer<ushort>(len, values, dst.Indices);
};
return CurrentModelVersion < version.VerReadableCur ? valueGetterOldVersion : valueGetterCurrentVersion;
@TomFinley and @Ivanidzo4ka, is this the correct way to maintain backward compatibility? #Resolved
Not exactly. For one thing, you've stored this in a static variable, which is not correct, since this behavior will be specific to the instance. So if you read two instances of the transform, one with the old version and one with the new version, you'd break one or the other (depending on the order in which you read them). As a general rule, not just in ML.NET but in any coding anywhere, storing instance-specific data inside a static variable is generally a bad idea. :)
For the sake of resaving an old model to the new format, you ought to store a boolean variable in the new version of the file, to control whether it is the old or new version. So the saver will have `ctx.Writer.WriteBoolByte(_isStartEnd)`, and the reader will have something like `_isStartEnd = ctx.Header.ModelVerReadable < 0x00010002 || ctx.Reader.ReadBoolByte()`. This hypothetical `_isStartEnd` will be `private readonly`, and emphatically not `public static`.
If anything about that is not clear let me know, thanks @zeahmed!
In reply to: 203896119
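The version-gated read described above can be sketched in Python (the real code is C#; `read_is_start_end` and `NEW_VERSION` are hypothetical names used only to illustrate the pattern): models older than the version bump imply the legacy start/end-marker behavior, while newer models carry an explicit flag.

```python
NEW_VERSION = 0x00010002  # illustrative: version that changed the separator behavior

def read_is_start_end(model_version, read_bool):
    """Old models always used start/end markers; new models store an explicit flag."""
    if model_version < NEW_VERSION:
        return True
    return read_bool()

# A model written before the change implies the legacy behavior:
print(read_is_start_end(0x00010001, read_bool=lambda: False))  # True
# A model written at the new version carries its own flag:
print(read_is_start_end(0x00010002, read_bool=lambda: False))  # False
```

The flag is read from the stream only when the version says it is present, so old files (which lack the byte) still deserialize correctly.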
@@ -262,6 +262,30 @@ public static IDataTransform Create(IHostEnvironment env, Arguments args, IDataV
    view = new ConcatTransform(h, new ConcatTransform.Arguments() { Column = xfCols }, view);
}

if (tparams.NeedsNormalizeTransform)
@justinormont, I moved `TextNormalizer` above `WordTokenizer`. Let me know if there is any adverse effect of it. #Resolved
I don't know if there are adverse effects of running `TextNormalizer` before `WordTokenizer`. I expect it is different, but assume not adversely so. For instance (depending on the specifics of the tokenizer), `I don't` can be split into `{I, do, n't}`, but if removing punctuation first, it may be split into `{I, dont}`.
If you make me a build, I'll run a manual regression test. Unfortunately we don't currently have nightly regression tests running to tell us if the text datasets decreased in their core metrics.
#Resolved
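The contraction example can be made concrete with a toy sketch (Python for illustration; the actual `TextNormalizer` and `WordTokenizer` are far more elaborate): whether punctuation removal runs before or after tokenization changes the resulting tokens.

```python
import re

def normalize(text):
    # Toy normalizer: strip punctuation, as a punctuation-removing step might.
    return re.sub(r"[^\w\s]", "", text)

def tokenize(text):
    # Toy tokenizer that splits the contraction "n't" off as its own token.
    return re.findall(r"\w+(?=n't)|n't|\w+", text)

text = "I don't"
print(tokenize(text))             # tokenize first:  ['I', 'do', "n't"]
print(tokenize(normalize(text)))  # normalize first: ['I', 'dont']
```

Because the two orderings yield different token sets, any downstream n-grams differ too, which is why the reordering deserves a regression test.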
I don't like this change. It's far from clear to me that if someone asks to remove stopwords, they meant for the content of the stopwords to be retained in the chargrams. At the very least this ought to be a configurable option.
In reply to: 204000786
@TomFinley, can you elaborate more on this? Your point is not clear to me. Don't you think it is good to have `TextNormalizer` applied before `WordTokenizer`?

In reply to: 204201054
Sorry, I thought you were saying something else; for some reason I got confused. #Resolved
@@ -64,12 +64,13 @@ public sealed class Arguments : TransformInputBase
public const string LoaderSignature = "CharToken";
public const string UserName = "Character Tokenizer Transform";

public static uint CurrentModelVersion = 0x00010002;
`CurrentModelVersion`
Hmmm. Just to clarify something: because having this be a constant is essential for the correctness of your serialization and deserialization routines, turning it into a public static mutable field is probably not a good idea.
I question the worth of making this a variable at all, but even if you do, please don't make it public, and it should definitely be a `const`. Also, "current" version is not descriptive. It is the version where you changed the behavior of the separating characters to be the single character `<US>` vs. having start and end characters. If you really want a variable, it should be something like `VariableSeparatorIsUnitSeparator` or somesuch, to actually describe what version `0x00010002` is. But again, I'd question how much worth such a const as this would provide. #Resolved
private static VersionInfo GetVersionInfo()
{
    return new VersionInfo(
        modelSignature: "CHARTOKN",
        verWrittenCur: 0x00010001, // Initial
        verReadableCur: 0x00010001,
        verWrittenCur: CurrentModelVersion,
`verWrittenCur: CurrentModelVersion,`
Please refer to other components where we bumped the version number. (If I search for `0x00010002` I see 43 matching files, so you should be able to find examples easily.) You will see that in those other versions we have a comment that describes why the version was changed. This is blank. #Resolved
private static VersionInfo GetVersionInfo()
{
    return new VersionInfo(
        modelSignature: "CHARTOKN",
        verWrittenCur: 0x00010001, // Initial
`verWrittenCur: 0x00010001, // Initial`
See other examples. You will see that we keep the version strings from old versions commented out, to track them and so we know why we bumped the version each time. This is important: we have live code in our deserializers that is meant to handle those version bumps, and for that reason we like to have documentation on what exactly changed, so that when we have tests on this or that version in our deserializers, we know why they changed. #Resolved
@@ -119,6 +121,7 @@ private CharTokenizeTransform(IHost host, ModelLoadContext ctx, IDataView input)
// <base>
// byte: _useMarkerChars value.
_useMarkerChars = ctx.Reader.ReadBoolByte();
CurrentModelVersion = ctx.Header.ModelVerReadable;
`CurrentModelVersion`
Oh, I see what you've done here... this is not good: static variables are essentially global state. They are not particular to an instance. #Resolved
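The hazard of keeping per-model state in a static member can be shown with a minimal sketch (Python here for brevity; the same applies to a C# `static` field, and the class and field names are hypothetical): loading a second model silently changes the first one's behavior.

```python
class CharTokenizer:
    # BUG: class-level ("static") state, shared by every instance.
    model_version = None

    def __init__(self, version):
        # Each load overwrites the shared value: the last load wins for everyone.
        CharTokenizer.model_version = version

old = CharTokenizer(0x00010001)  # load an old-format model
new = CharTokenizer(0x00010002)  # load a new-format model
print(old.model_version == new.model_version)  # True: 'old' now sees the new version
```

An instance field (`self.model_version = version`) keeps the two models independent, which mirrors the `private readonly` advice given earlier in the thread.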
@@ -120,6 +125,9 @@ private CharTokenizeTransform(IHost host, ModelLoadContext ctx, IDataView input)
// byte: _useMarkerChars value.
_useMarkerChars = ctx.Reader.ReadBoolByte();

var version = GetVersionInfo();
_isSeparatorStartEnd = ctx.Header.ModelVerReadable < version.VerReadableCur || ctx.Reader.ReadBoolByte();
`version.VerReadableCur`
This is fragile and will break as soon as we rev the version again. Just please compare to `0x00010002`. #Closed
Hi @zeahmed, looking pretty good. I think my only concern is the fragile version check I mentioned; otherwise it looks fine. #Resolved
@@ -102,6 +107,7 @@ public CharTokenizeTransform(IHostEnvironment env, Arguments args, IDataView inp

_type = GetOutputColumnType();
SetMetadata();
_isSeparatorStartEnd = false;
FYI there is no need for this. The default value of bool is false. #Resolved
Thanks @zeahmed!
Thanks @TomFinley, @justinormont and @Ivanidzo4ka!
…erently when using with/without word tokenizer. (dotnet#548)
This PR fixes #530. The cause of the problem was the `word tokenizer` being applied before the `char tokenizer`, causing scalar-valued text (e.g. `This is a cat`) to become a vector (e.g. `<This, is, a, cat>`). Previously, the char tokenizer treated every vector item as a separate text item (computing chargrams on each item by placing start and end markers, `<STX>token<ETX>`, instead of taking `This is a cat` as a single text item).

The fix is in CharTokenizeTransform. The CharTokenizeTransform can take either a scalar or a vector column as input. The processing of chargrams is done as follows.

If the input column is a scalar with text type, then chargrams are computed by prepending `<STX>` and appending `<ETX>` at the start and the end of the text respectively. For example, if the input is `This is a cat`, then chargrams will be computed on `<STX>This is a cat<ETX>`.

If the input column is a vector with text type (it could be the result of concatenating several text columns together, or of applying the word tokenizer before the char tokenizer), then chargrams will be computed by prepending `<STX>` and appending `<ETX>` at the start and the end of the vector respectively. Also, `<US>` characters are inserted between the items of the vector. For example, if the input is `<This, is, a, cat>`, then chargrams will be computed on `<STX>This<US>is<US>a<US>cat<ETX>`.

To be backward compatible, the CharTokenizerTransform version was bumped and support for loading models saved with the previous version was added.
Moving forward, the chargrams will be computed as follows:
- If `stop word removal` is requested, chargrams will be computed after `StopwordRemovalTransform` is applied, e.g. on `<STX>This<US>is<US>a<US>cat<ETX>`.
- Otherwise, chargrams will be computed on the scalar text, e.g. `This is a cat`.
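The two cases above can be sketched as follows (illustrative Python; the real implementation is C# in CharTokenizeTransform.cs, and `char_tokens` is a hypothetical helper; `<STX>`, `<ETX>`, and `<US>` stand for the ASCII control characters U+0002, U+0003, and U+001F):

```python
# Illustrative marker code points: STX = start of text, ETX = end of text,
# US = unit separator.
STX, ETX, US = "\u0002", "\u0003", "\u001f"

def char_tokens(column):
    """Build the character stream that chargrams are computed over."""
    if isinstance(column, str):            # scalar text column
        return STX + column + ETX
    return STX + US.join(column) + ETX     # vector of tokens

s = char_tokens("This is a cat")           # the stream <STX>This is a cat<ETX>
v = char_tokens(["This", "is", "a", "cat"])  # <STX>This<US>is<US>a<US>cat<ETX>
```

Because the markers are control characters that essentially never occur in real text, the chargrams of a scalar and of its tokenized vector differ only at token boundaries, rather than producing spurious grams from per-item start/end markers.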