# Improve word tokenization for non-Latin characters #328

base: master · Changes from all commits
```diff
@@ -1,25 +1,19 @@
 import Diff from './base';
 import {generateOptions} from '../util/params';
 
-// Based on https://en.wikipedia.org/wiki/Latin_script_in_Unicode
-//
-// Ranges and exceptions:
-// Latin-1 Supplement, 0080–00FF
-//  - U+00D7  × Multiplication sign
-//  - U+00F7  ÷ Division sign
-// Latin Extended-A, 0100–017F
-// Latin Extended-B, 0180–024F
-// IPA Extensions, 0250–02AF
-// Spacing Modifier Letters, 02B0–02FF
-//  - U+02C7  ˇ  Caron
-//  - U+02D8  ˘  Breve
-//  - U+02D9  ˙  Dot Above
-//  - U+02DA  ˚  Ring Above
-//  - U+02DB  ˛  Ogonek
-//  - U+02DC  ˜  Small Tilde
-//  - U+02DD  ˝  Double Acute Accent
-// Latin Extended Additional, 1E00–1EFF
-const extendedWordChars = /^[a-zA-Z\u{C0}-\u{FF}\u{D8}-\u{F6}\u{F8}-\u{2C6}\u{2C8}-\u{2D7}\u{2DE}-\u{2FF}\u{1E00}-\u{1EFF}]+$/u;
+const spaceChars = ' \f\t\v\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff';
+let charsCannotBecomeWord = '';
+charsCannotBecomeWord += '\n\r';
+charsCannotBecomeWord += '\u0000-\u002F\u003A-\u0040\u005B-\u0060\u007B-\u007E'; // Basic Latin
+charsCannotBecomeWord += '\u00A0-\u00BF\u00D7\u00F7'; // Latin-1 Supplement
+charsCannotBecomeWord += '\u02B9-\u02DD\u02E5-\u02FF'; // Spacing Modifier Letters
+charsCannotBecomeWord += '\u0300-\u036F'; // Combining Diacritical Marks
+charsCannotBecomeWord += '\u{1F000}-\u{1FAFF}'; // Mahjong Tiles - Symbols and Pictographs Extended-A (astral range, so the regexes need the u flag)
+charsCannotBecomeWord += '\u2000-\u2BFF'; // General Punctuation - Miscellaneous Symbols and Arrows
+charsCannotBecomeWord += '\u3000-\u303F'; // CJK Symbols and Punctuation
+const spaceRegExp = new RegExp(`[${spaceChars}]`, 'u');
+const cannotBecomeWordRegExp = new RegExp(`[${charsCannotBecomeWord}]`, 'u');
 
 const reWhitespace = /\S/;
```
```diff
@@ -32,21 +26,29 @@ wordDiff.equals = function(left, right) {
   return left === right || (this.options.ignoreWhitespace && !reWhitespace.test(left) && !reWhitespace.test(right));
 };
 
 wordDiff.tokenize = function(value) {
-  // All whitespace symbols except newline group into one token, each newline - in separate token
-  let tokens = value.split(/([^\S\r\n]+|[()[\]{}'"\r\n]|\b)/);
-
-  // Join the boundary splits that we do not consider to be boundaries. This is primarily the extended Latin character set.
-  for (let i = 0; i < tokens.length - 1; i++) {
-    // If we have an empty string in the next field and we have only word chars before and after, merge
-    if (!tokens[i + 1] && tokens[i + 2]
-          && extendedWordChars.test(tokens[i])
-          && extendedWordChars.test(tokens[i + 2])) {
-      tokens[i] += tokens[i + 2];
-      tokens.splice(i + 1, 2);
-      i--;
+  const tokens = [];
+  let prevCharType = '';
+  for (let i = 0; i < value.length; i++) {
+    const char = value[i];
+    if (spaceRegExp.test(char)) {
+      if (prevCharType === 'space') {
+        tokens[tokens.length - 1] += ' ';
```
> **Review comment** (on the whitespace branch above): In the edge case where there's a long run of spaces in the text somewhere, this is going to take O(n²) time, where n is the number of consecutive spaces. It also rewrites all space characters of any kind to an ordinary space (U+0020). Both these aspects of the behaviour here seem wrong!
>
> **Reply:** I agree. It seems like an excessive change.
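One way to address both points at once (a hedged sketch, not part of the PR; `takeSpaceRun` is a hypothetical helper): scan to the end of the whitespace run once and slice it out verbatim, which is linear in the run length and preserves the original characters.

```javascript
// Sketch: extract a whitespace run as ONE token in O(run length),
// keeping the original characters instead of rewriting them to ' '.
const spaceRegExp = /[ \f\t\v\u00a0\u2028\u2029]/;

function takeSpaceRun(value, start) {
  let end = start;
  while (end < value.length && spaceRegExp.test(value[end])) {
    end++;
  }
  // A single slice() copies the run once, avoiding the repeated
  // `+= ' '` appends to the last token.
  return value.slice(start, end);
}
```

A tokenizer loop could then push the slice as a single token and advance `i` past the run.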
```diff
+      } else {
+        tokens.push(' ');
+      }
+      prevCharType = 'space';
+    } else if (cannotBecomeWordRegExp.test(char)) {
+      tokens.push(char);
+      prevCharType = '';
+    } else {
+      if (prevCharType === 'word') {
+        tokens[tokens.length - 1] += char;
+      } else {
+        tokens.push(char);
+      }
+      prevCharType = 'word';
```
> **Review comment** (on lines +29 to +49): Why do we need logic changes here? It's not obvious to me why the fix for Korean text should involve anything more than just treating Korean letters as letters instead of word boundaries...
>
> **Reply:** It seems like this is also an excessive change. My intention was to separate characters that should exist as single characters into individual tokens. (For example: U+0021, U+0022, ...)
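For context on the reviewer's point, the removed split-based tokenizer can be exercised on its own (a hedged sketch copied from the diff above, without the module wrapper): its `\b` alternative already emits punctuation as separate tokens, interleaved with the empty strings that the capturing, zero-width boundary matches produce.

```javascript
// The removed tokenizer: split on whitespace runs, bracket/quote
// characters, newlines, and word boundaries, keeping the separators
// because the alternation is a capturing group.
function oldTokenize(value) {
  return value.split(/([^\S\r\n]+|[()[\]{}'"\r\n]|\b)/);
}
```

`oldTokenize('foo!bar')` returns `['foo', '', '!', '', 'bar']`: `\b` isolates the `!`, and the empty strings are the zero-width boundary captures. Hangul contains no `\w` characters, so a run like `'안녕'` comes back as the single token `['안녕']`.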
```diff
     }
   }
 
   return tokens;
 };
```
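For reference, the loop this PR adds can be exercised in isolation. The following is a hedged, self-contained sketch: the two character classes are trimmed to a few of the BMP ranges from the diff, so it only approximates the PR's exact classification.

```javascript
// Standalone sketch of the tokenize() added in this PR (trimmed classes).
const spaceRegExp = /[ \f\t\v\u00a0]/;
const cannotBecomeWordRegExp =
  /[\u0000-\u002F\u003A-\u0040\u005B-\u0060\u007B-\u007E\u3000-\u303F]/;

function tokenize(value) {
  const tokens = [];
  let prevCharType = '';
  for (let i = 0; i < value.length; i++) {
    const char = value[i];
    if (spaceRegExp.test(char)) {
      // Runs of whitespace collapse into one token of plain spaces.
      if (prevCharType === 'space') {
        tokens[tokens.length - 1] += ' ';
      } else {
        tokens.push(' ');
      }
      prevCharType = 'space';
    } else if (cannotBecomeWordRegExp.test(char)) {
      // Punctuation and symbols become single-character tokens.
      tokens.push(char);
      prevCharType = '';
    } else {
      // Everything else (including Hangul) accretes into a word token.
      if (prevCharType === 'word') {
        tokens[tokens.length - 1] += char;
      } else {
        tokens.push(char);
      }
      prevCharType = 'word';
    }
  }
  return tokens;
}
```

With these classes, Hangul falls through to the word branch, so `tokenize('안녕 world!')` yields `['안녕', ' ', 'world', '!']`.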
> **Review comment** (on the Combining Diacritical Marks range): Hmm... this can't be right. Treating combining diacritics as word breaks is definitely wrong for the European languages that I am familiar with (and I would've thought that it was wrong universally), since it leads to outcomes like this: *(example diff not captured on this page)*
>
> That diff ought to just show the one word in the old text being deleted and a totally new word from the new text being added, but instead it shows us preserving the code point for the grave accent in the middle of the word and adding and deleting stuff around the accent. That's nonsense from the perspective of French or any other language I know with accents; the `è` (an e with a grave accent) in *impubère* is simply a letter in the word, just like the `b` and `r` next to it, and doesn't semantically represent a word break.
>
> Do combining diacritics work differently in Korean (...if they even exist in Korean?) or some other language you're familiar with, such that it makes sense to treat them as word breaks? I am afraid I don't speak any non-European languages and am a bit out of my depth! I can imagine in theory a language where accents get used as punctuation, but I have never encountered such a thing...
>
> **Reply:** I hadn't considered such cases due to my limited experience with other languages. It seems like an aspect that requires further thought.
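The reviewer's concern can be made concrete with a short, hedged illustration (not code from the PR): in decomposed (NFD) form, `è` is an `e` followed by the combining grave accent U+0300, which falls in the Combining Diacritical Marks range the PR classifies as "cannot become word", so the mark would be cut out of the word as its own token. Normalizing to NFC before tokenizing is one possible mitigation.

```javascript
// A decomposed "è" is "e" (U+0065) plus a combining grave accent (U+0300).
const decomposed = 'impube\u0300re'; // "impubère" in NFD form
const combiningMark = /[\u0300-\u036F]/;

// Under the PR's classification, the mark at index 6 is a
// "cannot become word" character, splitting the word into
// 'impube', '\u0300', 're'.
const splitsWord = combiningMark.test(decomposed[6]);

// One mitigation: normalize to NFC first, folding "e" + U+0300
// into the single precomposed letter "è" (U+00E8).
const precomposed = decomposed.normalize('NFC');
```

After normalization the word contains no combining marks at all, so this particular failure mode disappears (though scripts with no precomposed forms would still need separate handling).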