Use custom unicode regex filter in place of emoji-regex #25
I maintain a project of GitHub emoji, at https://github.com/wooorm/gemoji. That pulls stuff in from GitHub Gemoji itself. Some things:
Regenerate makes sense so we can add only the values that GitHub supports.
@zeke maybe you could help: How does GH create slugs from headings? Is the code open somewhere? This project ( P.S. I checked https://github.com/jch/html-pipeline but that doesn’t seem to do it.
Hey friends. I remember asking around about this before and the answer was no, the slug generation code is not open source. But I can ask again though.
👋 Ahh okay. Much appreciated!
Thanks @zeke you're still the best! :) I've been pasting each unicode block of characters into markdown header on a gist and seeing what it pops out for the slug 😅. Knowing the ranges/blocks of unicode that are filtered would save a lot of time!
@parthpp in the short term you can probably use I haven't made much headway on figuring out GitHub's internal process/codebase for generating slugs, but I will nudge the internal issue again.
.add(require('unicode-12.1.0/Sequence_Property/Emoji_Flag_Sequence/index.js'))
.add(require('unicode-12.1.0/Sequence_Property/Emoji_Modifier_Sequence/index.js'))
.add(require('unicode-12.1.0/Sequence_Property/Emoji_Tag_Sequence/index.js'))
.add(require('unicode-12.1.0/Sequence_Property/Emoji_ZWJ_Sequence/index.js'))
.add(require('unicode-12.1.0/Sequence_Property/Emoji_ZWJ_Sequence/index.js'))
.add(require('unicode-12.1.0/Block/Halfwidth_And_Fullwidth_Forms/code-points.js'))
I ran into this today. Would you consider adding the Halfwidth and Fullwidth Forms block to this PR?
Here's a link to the heading at issue: https://github.com/DefinitelyTyped/DefinitelyTyped/blob/master/README.ja.md#型定義ファイルとは何ですか-またどのように入手できますか
Here's the current github-slugger result vs. the correct slug:
-型定義ファイルとは何ですか?-またどのように入手できますか?
+型定義ファイルとは何ですか-またどのように入手できますか
I confirmed that this change (adding the Halfwidth and Fullwidth Forms block) fixes the slug for this heading.
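To illustrate why adding that block fixes this heading, here is a minimal sketch. It assumes only that the Halfwidth and Fullwidth Forms block spans U+FF00 to U+FFEF (which contains the fullwidth question mark, U+FF1F). The variable names are made up for the example and are not part of github-slugger.

```javascript
// The Halfwidth and Fullwidth Forms block spans U+FF00 to U+FFEF.
const halfwidthAndFullwidthForms = /[\uFF00-\uFFEF]/g

// The slug github-slugger currently produces for the heading above.
const current = '型定義ファイルとは何ですか？-またどのように入手できますか？'

// Filtering the block removes both fullwidth question marks (U+FF1F).
const fixed = current.replace(halfwidthAndFullwidthForms, '')

console.log(fixed)
// 型定義ファイルとは何ですか-またどのように入手できますか
```

One caveat with filtering the whole block: it also contains fullwidth Latin letters and halfwidth katakana, so whether GitHub strips those too would still need to be checked against a real heading.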
.add(require('unicode-12.1.0/Sequence_Property/Basic_Emoji/index.js'))
.add(require('unicode-12.1.0/Sequence_Property/Emoji_Flag_Sequence/index.js'))
.add(require('unicode-12.1.0/Sequence_Property/Emoji_Modifier_Sequence/index.js'))
.add(require('unicode-12.1.0/Sequence_Property/Emoji_Tag_Sequence/index.js'))
.add(require('unicode-12.1.0/Sequence_Property/Emoji_ZWJ_Sequence/index.js'))
.add(require('unicode-12.1.0/Block/Miscellaneous_Mathematical_Symbols_A/code-points.js'))
.add(require('unicode-12.1.0/Block/Miscellaneous_Mathematical_Symbols_B/code-points.js'))
.add(require('unicode-12.1.0/Block/Miscellaneous_Symbols/code-points.js'))
.add(require('unicode-12.1.0/Block/Miscellaneous_Symbols_And_Arrows/code-points.js'))
.add(require('unicode-12.1.0/Block/Miscellaneous_Symbols_And_Pictographs/code-points.js'))
.add(require('unicode-12.1.0/Block/Miscellaneous_Technical/code-points.js'))
Alternatively the following also works, going by General_Category vs. Block/Sequence_Property:
.add(require('unicode-12.1.0/Sequence_Property/Basic_Emoji/index.js'))
.add(require('unicode-12.1.0/Sequence_Property/Emoji_Flag_Sequence/index.js'))
.add(require('unicode-12.1.0/Sequence_Property/Emoji_Modifier_Sequence/index.js'))
.add(require('unicode-12.1.0/Sequence_Property/Emoji_Tag_Sequence/index.js'))
.add(require('unicode-12.1.0/Sequence_Property/Emoji_ZWJ_Sequence/index.js'))
.add(require('unicode-12.1.0/Block/Miscellaneous_Mathematical_Symbols_A/code-points.js'))
.add(require('unicode-12.1.0/Block/Miscellaneous_Mathematical_Symbols_B/code-points.js'))
.add(require('unicode-12.1.0/Block/Miscellaneous_Symbols/code-points.js'))
.add(require('unicode-12.1.0/Block/Miscellaneous_Symbols_And_Arrows/code-points.js'))
.add(require('unicode-12.1.0/Block/Miscellaneous_Symbols_And_Pictographs/code-points.js'))
.add(require('unicode-12.1.0/Block/Miscellaneous_Technical/code-points.js'))
.add(require('unicode-12.1.0/General_Category/Close_Punctuation/code-points.js'))
.add(require('unicode-12.1.0/General_Category/Open_Punctuation/code-points.js'))
.add(require('unicode-12.1.0/General_Category/Other_Punctuation/code-points.js'))
.add(require('unicode-12.1.0/General_Category/Symbol/code-points.js'))
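For quick experiments with the General_Category idea, the same four categories can be approximated at runtime with Unicode property escapes instead of build-time data. This is only a sketch: `stripByCategory` is a hypothetical helper, and the JavaScript engine's Unicode version will not necessarily match `unicode-12.1.0` or whatever GitHub uses.

```javascript
// General_Category short names:
//   Pe = Close_Punctuation, Ps = Open_Punctuation,
//   Po = Other_Punctuation, S  = Symbol
const punctuationAndSymbols = /[\p{Pe}\p{Ps}\p{Po}\p{S}]/gu

// Hypothetical helper: remove every character in those categories.
function stripByCategory(text) {
  return text.replace(punctuationAndSymbols, '')
}

console.log(stripByCategory('a\u2764b')) // ab  (U+2764 HEAVY BLACK HEART is a Symbol)
console.log(stripByCategory('a\u25CAb')) // ab  (U+25CA LOZENGE is a Symbol)
console.log(stripByCategory('(what\uFF1F)')) // what
console.log(stripByCategory('foo-bar_baz')) // foo-bar_baz
```

`-` (Dash_Punctuation) and `_` (Connector_Punctuation) fall outside these four categories, so they survive, which is convenient for slugs.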
I reverse engineered GitHub’s slugging algorithm. Somewhat based on #25 and #35.
To do that, I created two scripts:
* `generate-fixtures.mjs`, which generates a markdown file, in part from manual fixtures and in part on the Unicode General Categories, creates a gist, crawls the gist, removes it, and saves fixtures annotated with the expected result from GitHub
* `generate-regex.mjs`, which generates the regex that GitHub uses for characters to ignore.
The regex is about 2.5kb minzipped. This increases the file size of this project a bit. But matching GitHub is worth it in my opinion.
I also investigated regex `\p{}` classes in `/u` regexes. They work mostly fine, with two caveats: a) they don’t work everywhere, so would be a major release, b) GitHub does not implement the same Unicode version as browsers. I tested with Unicode 13 and 14, and they include characters that GitHub handles differently.
In the end, GitHub’s algorithm is mostly fine: strip non-alphanumericals, allow `-`, and turn ` ` (space) into `-`.
Finally, I removed the trim functionality, because it is not implemented by GitHub. To assert this, make a heading like so in a readme: `# &#x20;`. This is a space encoded as a character reference, meaning that the markdown does not see it as the whitespace between the `#` and the content. In fact, this makes it the content. And GitHub creates a slug of `-` for it.
Further work: I think it would be nice to release this as is. Then, afterwards, I’d like to modernize the project, add GH Actions to generate the build, add types, and move to ESM.
/cc @Flet @jablkojablko
Closes GH-22. Closes GH-25. Closes GH-35. Closes GH-38.
Co-authored-by: Dan Flettre
Co-authored-by: Jack Bates
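The algorithm summarized in that commit message (strip non-alphanumericals, allow `-`, turn spaces into `-`, no trimming) can be sketched roughly as follows. This is an approximation, not the library's implementation: `slugSketch` is a hypothetical name, treating "alphanumerical" as letters, marks, numbers, and connector punctuation is an assumption, and so is the lowercasing step. The commit message itself notes that `\p{}` classes do not match GitHub exactly, which is why the project generates its own regex.

```javascript
// Assumption: "alphanumerical" is approximated as letters (L), marks (M),
// numbers (N), and connector punctuation (Pc, for example `_`).
// Everything else except `-` and space is removed.
const notAllowed = /[^\p{L}\p{M}\p{N}\p{Pc}\- ]/gu

// Hypothetical helper, not github-slugger's API.
function slugSketch(heading) {
  return heading
    .toLowerCase() // assumption: anchors are lowercased
    .replace(notAllowed, '') // strip non-alphanumericals
    .replace(/ /g, '-') // each space becomes `-`; nothing is trimmed
}

console.log(slugSketch('Hello, World!')) // hello-world
console.log(slugSketch(' ')) // -
```

Because nothing is trimmed, a heading whose content is a single space yields `-`, matching the behaviour the commit message describes.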
This isn't quite ready, but I wanted to share it in case someone wants to help! :)
OK, so `emoji-regex` does not cover a bunch of unicode characters that GitHub actually filters out, including the black heart and lozenge (#22).
So, digging around, I found `regenerate`, which can be used to create a unicode regex string, and `unicode-12.1.0`, which has all of the unicode character blocks nicely defined. In fact, `emoji-regex` uses `unicode-12.1.0` to build its regex.
This PR removes `emoji-regex` as a dependency and instead uses a script to build a custom regex from the unicode blocks encoded in `emoji-regex` plus additional blocks that GitHub filters out when it creates slugs.
So, I've been going through each unicode block (like this one) and validating that GitHub filters it by pasting a sample of characters into a markdown header in a GitHub Gist like this. I have a few covered but not all (it's a tedious task 😢).
I'm guessing it will end up being a big range (or a few ranges) that GH filters from their slugs. If anyone has a better idea on how to do this, please feel free to help! :)
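To make the manual gist check less tedious, the headings to paste could be generated rather than typed. The sketch below is hypothetical (the `sampleHeadings` name, the sampling step, and the heading layout are all assumptions): it emits one markdown heading per sampled code point in a range, wrapping the character in ASCII letters so a filtered character is easy to spot in the resulting slug.

```javascript
// Hypothetical helper: one markdown heading per sampled code point.
// A heading looks like `# u2600 a☀b`; if GitHub filters the character,
// the anchor ends in `-ab`, otherwise the character stays between a and b.
function sampleHeadings(start, end, step = 16) {
  const lines = []
  for (let codePoint = start; codePoint <= end; codePoint += step) {
    const hex = codePoint.toString(16).toUpperCase().padStart(4, '0')
    lines.push(`# u${hex} a${String.fromCodePoint(codePoint)}b`)
  }
  // Blank lines between headings keep each one a separate block.
  return lines.join('\n\n')
}

// Start of the Miscellaneous Symbols block, sampled every 16 code points.
console.log(sampleHeadings(0x2600, 0x2620))
```

The output can be pasted into a gist, and the rendered anchors compared against the generated labels.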