Emoji Unicode Category #277

codingedgar · 2020-01-16T18:04:10Z

I want a regex to match emojis. I saw this example:

// Using flag A to match astral code points
XRegExp('^\\pS$').test('💩'); // -> false
XRegExp('^\\pS$', 'A').test('💩'); // -> true

But it isn't specified that '(?A)^\\pS$' would match only emojis. Maybe there could be a \\p{Emoji} category:

// Using flag A to match astral code points
XRegExp('^\\p{Emoji}$').test('💩'); // -> true
XRegExp('^\\pS$', 'A').test('\uD83D\uDCA9'); // -> false

To bring in some spec: Unicode has other categories not listed under RL1.2, and it covers emoji in TR51.

At the end of the day I just want to be able to match characters from the iPhone and/or Android keyboard, but I cannot find the list of characters anywhere 😿


josephfrazier · 2020-01-17T01:09:52Z

I'm a little rusty on how this works, but it looks like https://www.unicode.org/reports/tr51/#Emoji_Characters does indeed link to the data files needed in order to add this category to XRegExp as part of Unicode 12. I think we'll need to adapt the approach in #248 to upgrade to unicode-12.1.0. Might take a shot at it myself!

josephfrazier · 2020-01-17T01:26:49Z

Actually, it looks like @mathiasbynens (the publisher of the aforementioned unicode packages that we use in XRegExp) has published an emoji-regex package that you might be able to use, at least while XRegExp doesn't support the Emoji category.

codingedgar · 2020-01-18T20:34:18Z

Thanks for pointing to the packages. I saw that the list of codes for Emoji is huge 😅. If Unicode 12.1.0 is added, does that mean there would be an Emoji category? Sorry, I don't know much about Unicode or regex.

mathiasbynens · 2020-01-19T18:34:20Z

Unicode defines a character property named "Emoji" but it only expands to single code points. It already works in JS RegExp: \p{Emoji}.

You probably want to match emoji sequences as well, though. https://github.com/tc39/proposal-regexp-unicode-sequence-properties aims to address this at the standards level. Until that happens, the emoji-regex package @josephfrazier pointed to above can be used.
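To illustrate the single-code-point limitation described above, here is a minimal sketch in native JavaScript (not XRegExp). The sample strings and variable names are chosen for illustration only; the one assumption is an engine with ES2018 Unicode property escapes.

```javascript
// Native JavaScript (ES2018+); Unicode property escapes require the `u` flag.
const singleEmoji = /^\p{Emoji}$/u;

// U+1F4A9 PILE OF POO is a single code point with the Emoji property.
const poo = '\u{1F4A9}';

// Woman technologist is a ZWJ sequence: U+1F469 U+200D U+1F4BB.
const womanTechnologist = '\u{1F469}\u200D\u{1F4BB}';

const results = {
  poo: singleEmoji.test(poo),
  womanTechnologist: singleEmoji.test(womanTechnologist),
  // Spreading a string iterates by code point, not by UTF-16 code unit.
  codePoints: [...womanTechnologist].length,
};

console.log(results);
// -> { poo: true, womanTechnologist: false, codePoints: 3 }
```

The anchored pattern accepts exactly one code point, so a sequence that renders as one emoji but spans several code points is rejected.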

juliovedovatto · 2020-04-08T23:19:02Z

I'm trying to find a way to use XRegExp to replace all non-Latin characters while preserving the basic set of emojis.

So far I have this regex:

const regex = xregexp("[^\\s\\p{Latin}\\p{Common}]+", "g");

It works fine for allowing just Latin characters (including accented ones), but it also removes emojis. Is there a way to combine it with the emoji-regex package or something else to keep emojis as well?

@mathiasbynens @josephfrazier
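One possible approach, offered as a sketch in native JavaScript rather than as an XRegExp feature: with the `u` flag a character class operates on whole code points, and emoji such as U+1F4A9 carry Script=Common, so they are not caught by the negated class. Adding Script=Inherited also keeps joiners and variation selectors (U+200D, U+FE0F). The function name `stripNonLatin` and the sample input are hypothetical.

```javascript
// Remove runs of characters that are not whitespace, Latin, Common, or Inherited.
// Assumes an engine with ES2018 Unicode property escapes.
const nonLatin = /[^\s\p{Script=Latin}\p{Script=Common}\p{Script=Inherited}]+/gu;

function stripNonLatin(text) {
  return text.replace(nonLatin, '');
}

// "héllo", two Han characters, PILE OF POO, and a Cyrillic word.
const input = 'h\u00E9llo \u4F60\u597D \u{1F4A9} \u043C\u0438\u0440';
const output = stripNonLatin(input);

console.log(JSON.stringify(output));
```

Note that Script=Common also covers digits, punctuation, and many non-emoji symbols, so this keeps more than just emoji.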

eljeffeg · 2022-01-21T13:35:24Z

I see the library is now using Unicode 14.0.0. Has the emoji support been added? When I try \\p{Emoji} I get the error Unknown Unicode token \p{Emoji}

slevithan · 2022-01-21T19:20:31Z

XRegExp indeed uses Unicode 14.0.0 character data. It supports all Unicode scripts (via e.g. \p{Latin} or the native JS-compatible \p{Script=Latin}), all categories (e.g., \p{Letter} or its shorthand \p{L}), and just a few additional properties that are needed to meet the UTS 18 Level 1 RL1.2 requirements for Unicode regex support (see https://github.com/slevithan/xregexp/blob/master/src/addons/unicode-properties.js).

XRegExp does not include the Emoji property (which is now supported by native JS), but as @mathiasbynens pointed out above, that's probably not what you want anyway since it doesn't fully match the majority of emoji sequences, and does match things you almost certainly don't want to include such as 0, 1, and *.
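The point about digits and `*` can be checked directly in native JavaScript. This is a small sketch with illustrative sample characters; it contrasts `\p{Emoji}` with `\p{Emoji_Presentation}`, which does not include those keycap base characters.

```javascript
// Native JavaScript (ES2018+), not XRegExp.
const hasEmoji = (ch) => /^\p{Emoji}$/u.test(ch);
const hasEmojiPresentation = (ch) => /^\p{Emoji_Presentation}$/u.test(ch);

const samples = ['0', '1', '*', 'a', '\u{1F4A9}'];

const table = samples.map((ch) => ({
  ch,
  emoji: hasEmoji(ch),
  presentation: hasEmojiPresentation(ch),
}));

console.table(table);
// '0', '1', '*' -> emoji: true,  presentation: false
// 'a'           -> emoji: false, presentation: false
// U+1F4A9       -> emoji: true,  presentation: true
```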

It turns out, matching all/only what most people recognize as emoji is complicated. You probably want something like this, which works in native JS:

(?:\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F)(?:\u200D(?:\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F))*|[🇦-🇿]{2}

Note that this is significantly more robust than the example emoji regex @mathiasbynens gave at https://github.com/tc39/proposal-regexp-unicode-property-escapes#matching-emoji, since it includes standard flag sequences as well as emoji character sequences that include zero-width joiners (U+200D). However, this regex still does not match private use area (PUA) characters like https://emojipedia.org/apple-logo/ that will be rendered as emoji on some platforms. It does match platform-specific ZWJ sequences like https://emojipedia.org/ninja-cat/, but not e.g. https://emojipedia.org/refugee-nation-flag/ which uses a \p{Emoji} character not followed by U+FE0F (Variation Selector-16), unlike e.g. https://emojipedia.org/transgender-flag/ which correctly includes U+FE0F and therefore will be matched. The regex intentionally does not match Unicode characters intended to be displayed only as black and white glyphs (some of which are matched by \p{Emoji}) without a following U+FE0F. It also does not match tag sequences like https://emojipedia.org/flag-for-texas-ustx/ and https://emojipedia.org/flag-england/, but of course it could be updated to include those.
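As a usage sketch, the pattern above can be written as a native JavaScript regex literal with the `g` and `u` flags. The only change here is that the regional indicator range is spelled with `\u{...}` escapes instead of literal characters; the sample text and variable names are illustrative assumptions.

```javascript
// The pattern from the comment above, as a native regex literal (requires the `u` flag).
// [\u{1F1E6}-\u{1F1FF}] is the regional indicator range written with escapes.
const emojiPattern = /(?:\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F)(?:\u200D(?:\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F))*|[\u{1F1E6}-\u{1F1FF}]{2}/gu;

// Sample text: PILE OF POO, thumbs up with a skin tone modifier,
// the woman technologist ZWJ sequence, then a bare digit and asterisk.
const text = 'Hi \u{1F4A9} \u{1F44D}\u{1F3FD} \u{1F469}\u200D\u{1F4BB} 1 *';

const found = text.match(emojiPattern);

console.log(found);
// -> [ '💩', '👍🏽', '👩‍💻' ]
```

The bare `1` and `*` are skipped because neither is followed by U+FE0F, which matches the intent stated above of ignoring text-presentation characters.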

Aside: I'm curious how this compares to the /\p{RGI_Emoji}/u sequence property @mathiasbynens proposed support for in https://github.com/tc39/proposal-regexp-unicode-sequence-properties.

eljeffeg · 2022-01-22T15:42:03Z

Thanks for that. Just curious: why not add your own token to this library to do just that? If they ever come out with \p{RGI_Emoji}, then just deprecate your token and alias it (if it's different) to the standard. Make the first move and they can point to it as already in use. Browsers do it all the time with proposed standards.

josephfrazier mentioned this issue Jan 17, 2020

Upgrade to Unicode 12.1.0 #278

Merged

slevithan closed this as completed Oct 30, 2020
