Emoji Unicode Category #277

Closed
codingedgar opened this issue Jan 16, 2020 · 8 comments

Comments

@codingedgar

I want a regex to match emojis. I saw this example:

// Using flag A to match astral code points
XRegExp('^\\pS$').test('💩'); // -> false
XRegExp('^\\pS$', 'A').test('💩'); // -> true

But nothing says that '(?A)^\\pS$' would match only emojis. Maybe there could be a \\p{Emoji} category:

// Using flag A to match astral code points
XRegExp('^\\p{Emoji}$').test('💩'); // -> true
XRegExp('^\\pS$', 'A').test('\uD83D\uDCA9'); // -> false

For some spec background: Unicode has other categories not listed under RL1.2, and it covers emoji in TR51 (Unicode Emoji).

At the end of the day, I just want to be able to match the characters available on the iPhone and/or Android keyboard, but I cannot find the list of characters anywhere 😿

@josephfrazier
Collaborator

I'm a little rusty on how this works, but it looks like https://www.unicode.org/reports/tr51/#Emoji_Characters does indeed link to the data files needed in order to add this category to XRegExp as part of Unicode 12. I think we'll need to adapt the approach in #248 to upgrade to unicode-12.1.0. Might take a shot at it myself!

@josephfrazier
Collaborator

josephfrazier commented Jan 17, 2020

Actually, it looks like @mathiasbynens (the publisher of the aforementioned unicode packages that we use in XRegExp) has published an emoji-regex package that you might be able to use, at least while XRegExp doesn't support the Emoji category.

@codingedgar
Author

Thanks for pointing to the packages. I saw that the list of code points for Emoji is huge 😅. If Unicode 12.1.0 is added, does that mean there would be an Emoji category? Sorry, I don't know much about Unicode/regex.

@mathiasbynens
Collaborator

Unicode defines a character property named "Emoji" but it only expands to single code points. It already works in JS RegExp: \p{Emoji}.

You probably want to match emoji sequences as well, though. https://github.com/tc39/proposal-regexp-unicode-sequence-properties aims to address this at the standards level. Until that happens, the emoji-regex package @josephfrazier pointed to above can be used.
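
To illustrate the single-code-point limitation, here is a minimal sketch using native JS property escapes. It assumes an engine with ES2018 property-escape support and the `u` flag; the variable names are only for illustration.

```javascript
// \p{Emoji} is a property of individual code points; it requires the `u` flag.
const singleEmoji = /^\p{Emoji}$/u;

// U+1F4A9 PILE OF POO is one code point with the Emoji property.
const poo = '\u{1F4A9}';
// "Woman technologist" is a ZWJ sequence: U+1F469, U+200D, U+1F4BB.
const womanTechnologist = '\u{1F469}\u200D\u{1F4BB}';

console.log(singleEmoji.test(poo)); // true
console.log(singleEmoji.test(womanTechnologist)); // false: three code points
console.log([...womanTechnologist].length); // 3
```

The sequence fails the anchored test because the property only ever consumes one code point at a time, which is why a sequence-aware pattern or package is suggested above.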

@juliovedovatto

I'm trying to find a way to use xregexp to replace all non-Latin characters while preserving the basic emoji set.

So far I have this regex:

const regex = xregexp("[^\\s\\p{Latin}\\p{Common}]+", "g");

It works fine for allowing only Latin characters (including accented ones), but it also removes emojis. Is there a way to combine it with the emoji-regex package, or something else, to keep emojis as well?

@mathiasbynens @josephfrazier
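
One possible direction for the question above, sketched with native JS property escapes rather than XRegExp. This is an illustration, not an answer given in the thread. It relies on the fact that the emoji code points used here have Script=Common, while the zero-width joiner (U+200D) and Variation Selector-16 (U+FE0F) have Script=Inherited, so Inherited is added to the allowed set. The names `nonLatin`, `input`, and `cleaned` are hypothetical.

```javascript
// Matches runs of characters that are neither whitespace nor in the
// Latin, Common, or Inherited scripts. The `u` flag makes the class
// operate on whole code points, including astral ones.
const nonLatin = /[^\s\p{Script=Latin}\p{Script=Common}\p{Script=Inherited}]+/gu;

// "héllo 💩 мир 世界 👩‍💻" written with escapes to avoid encoding issues.
const input =
  'h\u00E9llo \u{1F4A9} \u043C\u0438\u0440 \u4E16\u754C \u{1F469}\u200D\u{1F4BB}';

const cleaned = input.replace(nonLatin, '');

// Cyrillic and Han text is removed; Latin text and both emoji are kept.
console.log(cleaned);
```

Note that, like the original pattern, this also keeps everything else in the Common script (digits, punctuation, many symbols), so it may be broader than intended.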

@eljeffeg

I see the library is now using Unicode 14.0.0. Has emoji support been added? When I try \\p{Emoji}, I get the error: Unknown Unicode token \p{Emoji}

@slevithan
Owner

slevithan commented Jan 21, 2022

XRegExp indeed uses Unicode 14.0.0 character data. It supports all Unicode scripts (via e.g. \p{Latin} or the native JS-compatible \p{Script=Latin}), all categories (e.g., \p{Letter} or its shorthand \p{L}), and just a few additional properties that are needed to meet the UTS 18 Level 1 RL1.2 requirements for Unicode regex support (see https://github.com/slevithan/xregexp/blob/master/src/addons/unicode-properties.js).

XRegExp does not include the Emoji property (which is now supported by native JS), but as @mathiasbynens pointed out above, that's probably not what you want anyway since it doesn't fully match the majority of emoji sequences, and does match things you almost certainly don't want to include such as 0, 1, and *.
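
A quick way to see the unwanted matches, as a sketch in native JS (assuming ES2018 property escapes; `emojiProperty` and `results` are illustrative names):

```javascript
// Digits, "*", and "#" carry the Emoji property because they are the
// base characters of keycap sequences.
const emojiProperty = /^\p{Emoji}$/u;

const results = {};
for (const ch of ['0', '1', '*', '#', 'a']) {
  results[ch] = emojiProperty.test(ch);
}

console.log(results);
// { '0': true, '1': true, '*': true, '#': true, a: false }
```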

It turns out, matching all/only what most people recognize as emoji is complicated. You probably want something like this, which works in native JS:

(?:\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F)(?:\u200D(?:\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F))*|[🇦-🇿]{2}

Note that this is significantly more robust than the example emoji regex @mathiasbynens gave at https://github.com/tc39/proposal-regexp-unicode-property-escapes#matching-emoji, since it includes standard flag sequences as well as emoji character sequences that include zero-width joiners (U+200D). However, this regex still does not match private use area (PUA) characters like https://emojipedia.org/apple-logo/ that will be rendered as emoji on some platforms. It does match platform-specific ZWJ sequences like https://emojipedia.org/ninja-cat/, but not e.g. https://emojipedia.org/refugee-nation-flag/ which uses a \p{Emoji} character not followed by U+FE0F (Variation Selector-16), unlike e.g. https://emojipedia.org/transgender-flag/ which correctly includes U+FE0F and therefore will be matched. The regex intentionally does not match Unicode characters intended to be displayed only as black and white glyphs (some of which are matched by \p{Emoji}) without a following U+FE0F. It also does not match tag sequences like https://emojipedia.org/flag-for-texas-ustx/ and https://emojipedia.org/flag-england/, but of course it could be updated to include those.
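
The pattern above can be tried in native JS along these lines. This is a sketch, not a definitive implementation: it assumes the `u` flag (required for property escapes), writes the regional-indicator range [🇦-🇿] as \u{1F1E6}-\u{1F1FF}, and builds the pattern from a repeated piece only to keep the source readable. The names `unit`, `emojiPattern`, and `text` are hypothetical.

```javascript
// One "unit": a modifier base with an optional skin-tone modifier,
// a character with default emoji presentation, or any Emoji code point
// explicitly followed by U+FE0F (Variation Selector-16).
const unit = String.raw`(?:\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F)`;

// Units joined by U+200D (ZWJ), or a pair of regional indicators.
const emojiPattern = new RegExp(
  String.raw`${unit}(?:\u200D${unit})*|[\u{1F1E6}-\u{1F1FF}]{2}`,
  'gu'
);

// "ok 👍🏽 0 ❤️" written with escapes.
const text = 'ok \u{1F44D}\u{1F3FD} 0 \u2764\uFE0F';

// Two matches: thumbs-up with skin tone, and red heart with U+FE0F.
console.log(text.match(emojiPattern));

// The bare digit is left alone because it is not followed by U+FE0F.
console.log(JSON.stringify(text.replace(emojiPattern, ''))); // "ok  0 "
```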

Aside: I'm curious how this compares to the /\p{RGI_Emoji}/u sequence property @mathiasbynens proposed support for in https://github.com/tc39/proposal-regexp-unicode-sequence-properties.

@eljeffeg

eljeffeg commented Jan 22, 2022

Thanks for that. Just curious: why not add your own token to this library to do just that? If they ever come out with \p{RGI_Emoji}, then just deprecate your token and alias it (if it's different) to the standard. Make the first move and they can point to it as already in use. Browsers do it all the time with proposed standards.
