Use custom unicode regex filter in place of emoji-regex #25

Closed
wants to merge 1 commit into from

Flet
Owner

@Flet Flet commented Jun 26, 2019

This isn't quite ready, but I wanted to share it in case someone wants to help! :)

OK, so emoji-regex does not cover a bunch of unicode characters that GitHub actually filters out, including the black heart and lozenge (#22).

So, digging around, I found regenerate, which can be used to create a unicode regex string, and unicode-12.1.0, which has all of the unicode character blocks nicely defined. In fact, emoji-regex uses unicode-12.1.0 to build its regex.

This PR removes emoji-regex as a dependency and instead uses a script to build a custom regex from the unicode blocks encoded in emoji-regex plus additional blocks that GitHub filters out when it creates slugs.

So, I've been going through each unicode block (like this one) and validating that GitHub filters it by pasting a sample of characters into a markdown header in a GitHub Gist like this. I have a few covered but not all (it's a tedious task 😢).

I'm guessing it will end up being a big range (or a few ranges) that GH filters from their slugs. If anyone has a better idea on how to do this, please feel free to help! :)
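
To make the "few ranges" idea concrete, here is a minimal dependency-free sketch. It does not use regenerate or the unicode-12.1.0 data files; `buildFilter` is a hypothetical helper, and the three ranges are only the standard boundaries of some blocks named in this PR, not a verified list of what GitHub strips.

```javascript
// Hypothetical helper: turn inclusive code point ranges into one /u regex.
// This only illustrates the idea; the PR's script uses regenerate instead.
function buildFilter(ranges) {
  const body = ranges
    .map(([start, end]) => `\\u{${start.toString(16)}}-\\u{${end.toString(16)}}`)
    .join('')
  return new RegExp(`[${body}]`, 'gu')
}

// Assumed sample of blocks; the full set GitHub filters is not known here.
const ranges = [
  [0x2300, 0x23ff], // Miscellaneous Technical
  [0x2600, 0x26ff], // Miscellaneous Symbols
  [0x1f300, 0x1f5ff] // Miscellaneous Symbols and Pictographs
]

const filter = buildFilter(ranges)

// U+2603 SNOWMAN sits in Miscellaneous Symbols, so it is removed.
console.log('snow \u2603 day'.replace(filter, ''))
```

Each block confirmed through a gist would add one more entry to `ranges`.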

@wooorm
Collaborator

wooorm commented Jun 26, 2019

I maintain a project of GitHub emoji, at https://github.com/wooorm/gemoji. That pulls stuff in from GitHub Gemoji itself. Some things:

  • They’re about to release a big new batch of gemoji, but that could be a while
  • Previously, their shortcodes sometimes mapped to values that are not seen as emoji (whole thing with font variant selectors and unicode); that’s likely to change in the coming release

Regenerate makes sense so we can add only the values that GitHub supports.
But how do we know which values GH strips? Where’s the code they’re using? If we knew this, we could stay closer to it in the future and have fewer of these problems.

@wooorm
Collaborator

wooorm commented Jun 26, 2019

@zeke maybe you could help: How does GH create slugs from headings? Is the code open somewhere?

This project (github-slugger) is used by npm/unified/remark/others to mimic GH, but we don’t know exactly how GH does it, and that leads to bugs.

P.S. I checked https://github.com/jch/html-pipeline but that doesn’t seem to do it.

@zeke

zeke commented Jun 26, 2019

> How does GH create slugs from headings? Is the code open somewhere?

Hey friends. I remember asking around about this before and the answer was no, the slug generation code is not open source. But I can ask again.

@wooorm
Collaborator

wooorm commented Jun 26, 2019

👋

Ahh okay. Much appreciated!
FWIW being able to peek at the code would be great, but in general pointers on how it works, instead of guessing, would help!

@Flet
Owner Author

Flet commented Jun 26, 2019

Thanks @zeke you're still the best! :)

I've been pasting each unicode block of characters into a markdown header in a gist and seeing what it pops out for the slug 😅. Knowing the ranges/blocks of unicode that are filtered would save a lot of time!

@parthpp
Contributor

parthpp commented Jul 15, 2019

Hi @Flet and @wooorm, I am just following up. When can we expect a new version of this library to be available on NPM? Basically, we are interested in #23 being available to use via NPM.

@zeke

zeke commented Jul 15, 2019

@parthpp in the short term you can probably use npm i Flet/github-slugger#custom-regex-filter to install directly from this branch on GitHub.

I haven't made much headway on figuring out GitHub's internal process/codebase for generating slugs, but I will nudge the internal issue again.

.add(require('unicode-12.1.0/Sequence_Property/Emoji_Flag_Sequence/index.js'))
.add(require('unicode-12.1.0/Sequence_Property/Emoji_Modifier_Sequence/index.js'))
.add(require('unicode-12.1.0/Sequence_Property/Emoji_Tag_Sequence/index.js'))
.add(require('unicode-12.1.0/Sequence_Property/Emoji_ZWJ_Sequence/index.js'))
Contributor

Suggested change
-.add(require('unicode-12.1.0/Sequence_Property/Emoji_ZWJ_Sequence/index.js'))
+.add(require('unicode-12.1.0/Sequence_Property/Emoji_ZWJ_Sequence/index.js'))
+.add(require('unicode-12.1.0/Block/Halfwidth_And_Fullwidth_Forms/code-points.js'))

I ran into this today. Would you consider adding the Halfwidth and Fullwidth Forms block to this PR?

Here's a link to the heading at issue: https://github.com/DefinitelyTyped/DefinitelyTyped/blob/master/README.ja.md#型定義ファイルとは何ですか-またどのように入手できますか

Here's the current github-slugger result vs. the correct slug:

-型定義ファイルとは何ですか?-またどのように入手できますか?
+型定義ファイルとは何ですか-またどのように入手できますか

I confirmed that this change (adding the Halfwidth and Fullwidth Forms block) fixes the slug for this heading.
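
As a rough illustration of why this block matters for the heading above: the trailing question marks are FULLWIDTH QUESTION MARK (U+FF1F), which lies inside the Halfwidth and Fullwidth Forms block (U+FF00 to U+FFEF). The sketch below hard-codes that range instead of loading the unicode-12.1.0 data, and the variable names are illustrative only.

```javascript
// Halfwidth and Fullwidth Forms block: U+FF00 to U+FFEF.
const halfwidthAndFullwidthForms = /[\uFF00-\uFFEF]/g

// The slug github-slugger currently produces, with U+FF1F written as escapes.
const current =
  '型定義ファイルとは何ですか\uFF1F-またどのように入手できますか\uFF1F'

// Dropping every code point in the block leaves the slug GitHub generates.
const fixed = current.replace(halfwidthAndFullwidthForms, '')
console.log(fixed)
```

One caveat, not verified here: the block also contains fullwidth letters and digits and halfwidth katakana, so stripping the whole block could remove characters that GitHub keeps.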

Comment on lines +7 to +17
.add(require('unicode-12.1.0/Sequence_Property/Basic_Emoji/index.js'))
.add(require('unicode-12.1.0/Sequence_Property/Emoji_Flag_Sequence/index.js'))
.add(require('unicode-12.1.0/Sequence_Property/Emoji_Modifier_Sequence/index.js'))
.add(require('unicode-12.1.0/Sequence_Property/Emoji_Tag_Sequence/index.js'))
.add(require('unicode-12.1.0/Sequence_Property/Emoji_ZWJ_Sequence/index.js'))
.add(require('unicode-12.1.0/Block/Miscellaneous_Mathematical_Symbols_A/code-points.js'))
.add(require('unicode-12.1.0/Block/Miscellaneous_Mathematical_Symbols_B/code-points.js'))
.add(require('unicode-12.1.0/Block/Miscellaneous_Symbols/code-points.js'))
.add(require('unicode-12.1.0/Block/Miscellaneous_Symbols_And_Arrows/code-points.js'))
.add(require('unicode-12.1.0/Block/Miscellaneous_Symbols_And_Pictographs/code-points.js'))
.add(require('unicode-12.1.0/Block/Miscellaneous_Technical/code-points.js'))
Contributor

Alternatively the following also works, going by General_Category vs. Block/Sequence_Property:

Suggested change
-.add(require('unicode-12.1.0/Sequence_Property/Basic_Emoji/index.js'))
-.add(require('unicode-12.1.0/Sequence_Property/Emoji_Flag_Sequence/index.js'))
-.add(require('unicode-12.1.0/Sequence_Property/Emoji_Modifier_Sequence/index.js'))
-.add(require('unicode-12.1.0/Sequence_Property/Emoji_Tag_Sequence/index.js'))
-.add(require('unicode-12.1.0/Sequence_Property/Emoji_ZWJ_Sequence/index.js'))
-.add(require('unicode-12.1.0/Block/Miscellaneous_Mathematical_Symbols_A/code-points.js'))
-.add(require('unicode-12.1.0/Block/Miscellaneous_Mathematical_Symbols_B/code-points.js'))
-.add(require('unicode-12.1.0/Block/Miscellaneous_Symbols/code-points.js'))
-.add(require('unicode-12.1.0/Block/Miscellaneous_Symbols_And_Arrows/code-points.js'))
-.add(require('unicode-12.1.0/Block/Miscellaneous_Symbols_And_Pictographs/code-points.js'))
-.add(require('unicode-12.1.0/Block/Miscellaneous_Technical/code-points.js'))
+.add(require('unicode-12.1.0/General_Category/Close_Punctuation/code-points.js'))
+.add(require('unicode-12.1.0/General_Category/Open_Punctuation/code-points.js'))
+.add(require('unicode-12.1.0/General_Category/Other_Punctuation/code-points.js'))
+.add(require('unicode-12.1.0/General_Category/Symbol/code-points.js'))
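
For a quick experiment with the General_Category approach without the data files, the same four categories can be approximated with built-in Unicode property escapes. This is only a sketch: `\p{}` uses whatever Unicode version the JavaScript engine ships, which may differ from 12.1.0, and `strip` is a hypothetical helper name.

```javascript
// Ps = Open_Punctuation, Pe = Close_Punctuation,
// Po = Other_Punctuation, S = Symbol (all Symbol subcategories).
const punctuationAndSymbols = /[\p{Ps}\p{Pe}\p{Po}\p{S}]/gu

// Hypothetical helper: remove every character in those categories.
function strip(text) {
  return text.replace(punctuationAndSymbols, '')
}

console.log(strip('a(b)c!')) // brackets and "!" are punctuation
console.log(strip('何ですか\uFF1F')) // U+FF1F is Other_Punctuation
console.log(strip('a-b_c')) // "-" (Pd) and "_" (Pc) are not in the set
```

Because dash and connector punctuation are left out of the set, `-` and `_` survive, which matters since `-` is the slug separator.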

wooorm added a commit that referenced this pull request Aug 22, 2021
I reverse engineered GitHub’s slugging algorithm.
Somewhat based on #25 and #35.

To do that, I created two scripts:

* `generate-fixtures.mjs`, which generates a markdown file, in part
  from manual fixtures and in part on the Unicode General Categories,
  creates a gist, crawls the gist, removes it, and saves fixtures
  annotated with the expected result from GitHub
* `generate-regex.mjs`, which generates the regex that GitHub uses for
  characters to ignore.

The regex is about 2.5kb minzipped.
This increases the file size of this project a bit.
But matching GitHub is worth it in my opinion.
I also investigated regex `\p{}` classes in `/u` regexes. They work
mostly fine, with two caveats:
a) they don’t work everywhere, so would be a major release,
b) GitHub does not implement the same Unicode version as browsers.
I tested with Unicode 13 and 14, and they include characters that
GitHub handles differently.
In the end, GitHub’s algorithm is mostly fine: strip
non-alphanumericals, allow `-`, and turn ` ` (space) into `-`.

Finally, I removed the trim functionality, because it is not
implemented by GitHub.
To assert this, make a heading like so in a readme: `# &#x20;`.
This is a space encoded as a character reference, meaning that the
markdown does not see it as the whitespace between the `#` and the
content.
In fact, this makes it the content.
And GitHub creates a slug of `-` for it.
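
The algorithm summarized above, including the absence of trimming, can be sketched roughly as follows. This is not the generated regex: `roughSlug` is a hypothetical name, "alphanumerical" is approximated with `\p{L}` and `\p{N}` (which, as noted, does not match GitHub's Unicode version exactly), and the lowercasing step is an assumption.

```javascript
// Rough approximation only; the real project uses a generated regex.
function roughSlug(heading) {
  return heading
    .toLowerCase() // assumption: slugs are lowercased
    .replace(/[^\p{L}\p{N}\- ]/gu, '') // strip non-alphanumericals, keep "-" and " "
    .replace(/ /g, '-') // turn each space into "-", with no trimming
}

console.log(roughSlug('Hello, World!'))
console.log(roughSlug(' ')) // a heading whose only content is a space
```

Because nothing is trimmed, a lone space becomes `-`, which is the behavior described for the `# &#x20;` heading.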

Further work: I think it would be nice to release this as is.
Then, afterwards, I’d like to modernize the project, add GH Actions
to generate the build, add types, and move to ESM.

/cc @Flet @jablkojablko

Closes GH-22.
Closes GH-25.
Closes GH-35.

Co-authored-by: Dan Flettre <[email protected]>
Co-authored-by: Jack Bates <[email protected]>
@wooorm wooorm closed this in #38 Aug 24, 2021
wooorm added a commit that referenced this pull request Aug 24, 2021
I reverse engineered GitHub’s slugging algorithm.
Somewhat based on #25 and #35.

To do that, I created two scripts:

* `generate-fixtures.mjs`, which generates a markdown file, in part
  from manual fixtures and in part on the Unicode General Categories,
  creates a gist, crawls the gist, removes it, and saves fixtures
  annotated with the expected result from GitHub
* `generate-regex.mjs`, which generates the regex that GitHub uses for
  characters to ignore.

The regex is about 2.5kb minzipped.
This increases the file size of this project a bit.
But matching GitHub is worth it in my opinion.
I also investigated regex `\p{}` classes in `/u` regexes. They work
mostly fine, with two caveats:
a) they don’t work everywhere, so would be a major release,
b) GitHub does not implement the same Unicode version as browsers.
I tested with Unicode 13 and 14, and they include characters that
GitHub handles differently.
In the end, GitHub’s algorithm is mostly fine: strip
non-alphanumericals, allow `-`, and turn ` ` (space) into `-`.

Finally, I removed the trim functionality, because it is not
implemented by GitHub.
To assert this, make a heading like so in a readme: `# &#x20;`.
This is a space encoded as a character reference, meaning that the
markdown does not see it as the whitespace between the `#` and the
content.
In fact, this makes it the content.
And GitHub creates a slug of `-` for it.

Closes GH-22.
Closes GH-25.
Closes GH-35.
Closes GH-38.

Co-authored-by: Dan Flettre <[email protected]>
Co-authored-by: Jack Bates <[email protected]>
@wooorm wooorm deleted the custom-regex-filter branch October 27, 2022 10:33