Use custom unicode regex filter in place of emoji-regex #25

Flet · 2019-06-26T01:51:16Z

This isn't quite ready, but I wanted to share it in case someone wants to help! :)

OK, so emoji-regex does not cover a bunch of unicode characters that GitHub actually filters out, including the black heart and lozenge (#22)

So, digging around, I found regenerate which can be used to create a unicode regex string and unicode-12.1.0 which has all of the unicode character blocks nicely defined. In fact, emoji-regex uses unicode-12.1.0 to build its regex.

This PR removes emoji-regex as a dependency and instead uses a script to build a custom regex from the unicode blocks encoded in emoji-regex plus additional blocks that GitHub filters out when it creates slugs.

So, I've been going through each unicode block (like this one) and validating that GitHub filters it by pasting a sample of characters into a markdown header in a GitHub Gist like this. I have a few covered but not all (it's a tedious task 😢).
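The manual check described above could be partly scripted. The following is a minimal sketch, not part of this PR: `headingsForRange` is a hypothetical helper that turns a code point range into markdown headings, each wrapped in an ASCII label so that the resulting anchor shows what GitHub kept. The range used in the example is the Miscellaneous Symbols block (U+2600 to U+26FF).

```javascript
// Hypothetical helper (not part of github-slugger): build markdown headings
// that each contain a chunk of code points from the range [start, end].
function headingsForRange(start, end, perHeading = 16) {
  const headings = []
  for (let from = start; from <= end; from += perHeading) {
    const to = Math.min(from + perHeading - 1, end)
    let sample = ''
    for (let codePoint = from; codePoint <= to; codePoint++) {
      // Surrogates are not valid on their own, so skip them.
      if (codePoint >= 0xd800 && codePoint <= 0xdfff) continue
      sample += String.fromCodePoint(codePoint)
    }
    // The hex label and the trailing "end" survive slugging, which makes
    // it easy to see whether the characters in between were filtered.
    headings.push('# ' + from.toString(16) + ' ' + sample + ' end')
  }
  return headings.join('\n\n')
}

// Markdown for the Miscellaneous Symbols block, ready to paste into a gist.
console.log(headingsForRange(0x2600, 0x26ff))
```

Pasting the output into a gist and reading the generated anchors still has to be done by hand in this sketch.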

I'm guessing it will end up being a big range (or a few ranges) that GH filters from their slugs. If anyone has a better idea on how to do this, please feel free to help! :)

wooorm · 2019-06-26T07:04:11Z

I maintain a project of GitHub emoji, at https://github.com/wooorm/gemoji. That pulls stuff in from GitHub Gemoji itself. Some things:

They’re about to release a big new batch of gemoji, but that could be a while
Previously, their shortcodes sometimes mapped to values that are not seen as emoji (whole thing with font variant selectors and unicode), that’s likely to change in the coming release

Regenerate makes sense so we can add only the values that GitHub supports.
But, how do we know which values GH strips? Where’s the code they’re using? If we knew this, we could stay closer to it in the future and have fewer of these problems.

wooorm · 2019-06-26T07:08:49Z

@zeke maybe you could help: How does GH create slugs from headings? Is the code open somewhere?

This project (github-slugger) is used by npm/unified/remark/others to mimic GH, but we don’t know exactly how GH does it, and that leads to bugs.

P.S. I checked https://github.com/jch/html-pipeline but that doesn’t seem to do it.

zeke · 2019-06-26T15:06:22Z

How does GH create slugs from headings? Is the code open somewhere?

Hey friends. I remember asking around about this before and the answer was no, the slug generation code is not open source. I can ask again, though.

wooorm · 2019-06-26T15:52:05Z

👋

Ahh okay. Much appreciated!
FWIW being able to peek at the code would be great, but in general pointers on how it works, instead of guessing, would help!

Flet · 2019-06-26T16:42:56Z

Thanks @zeke you're still the best! :)

I've been pasting each unicode block of characters into a markdown header on a gist and seeing what it pops out for the slug 😅. Knowing the ranges/blocks of unicode that are filtered would save a lot of time!

parthpp · 2019-07-15T16:57:51Z

Hi @Flet and @wooorm, I am just following up. When can we expect a new version of this library to be available on NPM? Basically, we are interested in #23 being available to use via NPM.

zeke · 2019-07-15T21:05:23Z

@parthpp in the short term you can probably use npm i Flet/github#custom-regex-filter to install directly from this branch on GitHub.

I haven't made much headway on figuring out GitHub's internal process/codebase for generating slugs, but I will nudge the internal issue again.

jablko · 2020-12-16T21:24:21Z

script/generate-regex.js

+  .add(require('unicode-12.1.0/Sequence_Property/Emoji_Flag_Sequence/index.js'))
+  .add(require('unicode-12.1.0/Sequence_Property/Emoji_Modifier_Sequence/index.js'))
+  .add(require('unicode-12.1.0/Sequence_Property/Emoji_Tag_Sequence/index.js'))
+  .add(require('unicode-12.1.0/Sequence_Property/Emoji_ZWJ_Sequence/index.js'))


Suggested change

.add(require('unicode-12.1.0/Sequence_Property/Emoji_ZWJ_Sequence/index.js'))

.add(require('unicode-12.1.0/Sequence_Property/Emoji_ZWJ_Sequence/index.js'))

.add(require('unicode-12.1.0/Block/Halfwidth_And_Fullwidth_Forms/code-points.js'))

I ran into this today. Would you consider adding the Halfwidth and Fullwidth Forms block to this PR?

Here's a link to the heading at issue: https://github.com/DefinitelyTyped/DefinitelyTyped/blob/master/README.ja.md#型定義ファイルとは何ですか-またどのように入手できますか

Here's the current github-slugger result vs. the correct slug:

-型定義ファイルとは何ですか？-またどのように入手できますか？
+型定義ファイルとは何ですか-またどのように入手できますか

I confirmed that this change (adding the Halfwidth and Fullwidth Forms block) fixes the slug for this heading.
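To illustrate why this block matters for the heading, here is a small sketch that does not use `regenerate` or the `unicode-12.1.0` data. It assumes the heading text can be reconstructed from the two slugs above, hard-codes the block's range (U+FF00 to U+FFEF), and uses a simplified stand-in for the slugger that only removes the block and hyphenates spaces.

```javascript
// U+FF00..U+FFEF is the Halfwidth and Fullwidth Forms block.
// The fullwidth question mark in the heading is U+FF1F, inside that block.
const halfwidthAndFullwidthForms = /[\uFF00-\uFFEF]/g

// Heading text reconstructed from the slugs above (an assumption).
const heading = '型定義ファイルとは何ですか？ またどのように入手できますか？'

// Simplified stand-in for the slugger: drop the block, then turn spaces into hyphens.
const slug = heading.replace(halfwidthAndFullwidthForms, '').replace(/ /g, '-')

console.log(slug)
```

Note that removing the whole block also removes fullwidth letters and digits, not only punctuation.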

jablko · 2020-12-16T22:24:03Z

script/generate-regex.js

+  .add(require('unicode-12.1.0/Sequence_Property/Basic_Emoji/index.js'))
+  .add(require('unicode-12.1.0/Sequence_Property/Emoji_Flag_Sequence/index.js'))
+  .add(require('unicode-12.1.0/Sequence_Property/Emoji_Modifier_Sequence/index.js'))
+  .add(require('unicode-12.1.0/Sequence_Property/Emoji_Tag_Sequence/index.js'))
+  .add(require('unicode-12.1.0/Sequence_Property/Emoji_ZWJ_Sequence/index.js'))
+  .add(require('unicode-12.1.0/Block/Miscellaneous_Mathematical_Symbols_A/code-points.js'))
+  .add(require('unicode-12.1.0/Block/Miscellaneous_Mathematical_Symbols_B/code-points.js'))
+  .add(require('unicode-12.1.0/Block/Miscellaneous_Symbols/code-points.js'))
+  .add(require('unicode-12.1.0/Block/Miscellaneous_Symbols_And_Arrows/code-points.js'))
+  .add(require('unicode-12.1.0/Block/Miscellaneous_Symbols_And_Pictographs/code-points.js'))
+  .add(require('unicode-12.1.0/Block/Miscellaneous_Technical/code-points.js'))


Alternatively the following also works, going by General_Category vs. Block/Sequence_Property:

Suggested change

.add(require('unicode-12.1.0/Sequence_Property/Basic_Emoji/index.js'))

.add(require('unicode-12.1.0/Sequence_Property/Emoji_Flag_Sequence/index.js'))

.add(require('unicode-12.1.0/Sequence_Property/Emoji_Modifier_Sequence/index.js'))

.add(require('unicode-12.1.0/Sequence_Property/Emoji_Tag_Sequence/index.js'))

.add(require('unicode-12.1.0/Sequence_Property/Emoji_ZWJ_Sequence/index.js'))

.add(require('unicode-12.1.0/Block/Miscellaneous_Mathematical_Symbols_A/code-points.js'))

.add(require('unicode-12.1.0/Block/Miscellaneous_Mathematical_Symbols_B/code-points.js'))

.add(require('unicode-12.1.0/Block/Miscellaneous_Symbols/code-points.js'))

.add(require('unicode-12.1.0/Block/Miscellaneous_Symbols_And_Arrows/code-points.js'))

.add(require('unicode-12.1.0/Block/Miscellaneous_Symbols_And_Pictographs/code-points.js'))

.add(require('unicode-12.1.0/Block/Miscellaneous_Technical/code-points.js'))

.add(require('unicode-12.1.0/General_Category/Close_Punctuation/code-points.js'))

.add(require('unicode-12.1.0/General_Category/Open_Punctuation/code-points.js'))

.add(require('unicode-12.1.0/General_Category/Other_Punctuation/code-points.js'))

.add(require('unicode-12.1.0/General_Category/Symbol/code-points.js'))
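For comparison, the General_Category approach can be approximated with built-in Unicode property escapes. This is only a sketch: it relies on whatever Unicode version the JavaScript runtime ships rather than the pinned 12.1.0 data, and the sample characters are chosen here for illustration.

```javascript
// Rough built-in equivalent of the four General_Category entries above:
// Open_Punctuation (Ps), Close_Punctuation (Pe), Other_Punctuation (Po), Symbol (S).
const punctuationOrSymbol = /[\p{Ps}\p{Pe}\p{Po}\p{S}]/u

// Illustrative samples (assumptions, not taken from the PR).
const samples = {
  'black heart suit': '\u2665',
  lozenge: '\u25CA',
  'fullwidth question mark': '\uFF1F',
  'left corner bracket': '\u300C',
  'fullwidth letter A': '\uFF21',
  'hiragana ka': '\u304B'
}

// Collect whether each sample would be matched (and so filtered).
const matched = {}
for (const [name, character] of Object.entries(samples)) {
  matched[name] = punctuationOrSymbol.test(character)
}

console.log(matched)
```

Unlike filtering the whole Halfwidth and Fullwidth Forms block, this pattern matches the fullwidth question mark but not fullwidth letters, because those belong to a letter category.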

@Flet

I reverse engineered GitHub’s slugging algorithm. Somewhat based on #25 and #35.

To do that, I created two scripts:

* `generate-fixtures.mjs`, which generates a markdown file, in part from manual fixtures and in part on the Unicode General Categories, creates a gist, crawls the gist, removes it, and saves fixtures annotated with the expected result from GitHub
* `generate-regex.mjs`, which generates the regex that GitHub uses for characters to ignore.

The regex is about 2.5kb minzipped. This increases the file size of this project a bit. But matching GitHub is worth it in my opinion.

I also investigated regex `\p{}` classes in `/u` regexes. They work mostly fine, with two caveats: a) they don’t work everywhere, so would be a major release, b) GitHub does not implement the same Unicode version as browsers. I tested with Unicode 13 and 14, and they include characters that GitHub handles differently.

In the end, GitHub’s algorithm is mostly fine: strip non-alphanumericals, allow `-`, and turn ` ` (space) into `-`.

Finally, I removed the trim functionality, because it is not implemented by GitHub. To assert this, make a heading like so in a readme: `# &#x20;`. This is a space encoded as a character reference, meaning that the markdown does not see it as the whitespace between the `#` and the content. In fact, this makes it the content. And GitHub creates a slug of `-` for it.

Further work: I think it would be nice to release this as is. Then, afterwards, I’d like to modernize the project, add GH Actions to generate the build, add types, and move to ESM.

/cc @Flet @jablko

Closes GH-22.
Closes GH-25.
Closes GH-35.

Co-authored-by: Dan Flettre
Co-authored-by: Jack Bates
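The algorithm summarized in the commit message above (strip non-alphanumericals, allow `-`, turn spaces into `-`, no trimming) can be sketched with `\p{}` classes. This is an approximation and not the generated regex from `generate-regex.mjs`: as the message notes, the runtime's Unicode version may differ from GitHub's. The name `approximateSlug`, the lowercasing step, and keeping marks and connector punctuation such as `_` are assumptions made for this sketch.

```javascript
// Characters to drop: everything except letters, marks, numbers,
// connector punctuation (such as "_"), hyphens, and spaces.
// Assumption: this character set only approximates what GitHub ignores.
const ignored = /[^\p{L}\p{M}\p{N}\p{Pc}\- ]/gu

// Hypothetical function name; not the API of github-slugger.
function approximateSlug(heading) {
  return heading
    .toLowerCase() // assumed: anchors are lowercased
    .replace(ignored, '') // strip ignored characters
    .replace(/ /g, '-') // turn each space into a hyphen, without trimming
}

console.log(approximateSlug('Hello, World!'))
console.log(approximateSlug('型定義ファイルとは何ですか？ またどのように入手できますか？'))
console.log(approximateSlug(' '))
```

Because nothing is trimmed, a heading whose content is a single space yields `-`, which matches the behavior described in the message.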

Use custom unicode regex filter in place of emoji-regex

25cdb15

jablko reviewed Dec 16, 2020

jablko mentioned this pull request Dec 17, 2020

Filter by Unicode General_Category #35

Closed

UziTech mentioned this pull request Mar 31, 2021

Document how header slugs get generated github/markup#1326

Closed

wooorm mentioned this pull request Aug 22, 2021

Fix to match GitHub’s algorithm on unicode #38

Merged

wooorm closed this in #38 Aug 24, 2021

UziTech mentioned this pull request Sep 15, 2021

Fix footnotes plus fix footnote reference labels and backrefs github/cmark-gfm#230

Merged

wooorm deleted the custom-regex-filter branch October 27, 2022 10:33

wooorm mentioned this pull request Mar 26, 2024

Slug generation for headings in web-viewed README.md in GitHub repos github/cmark-gfm#361

Open
