Add more Unicode planes to regular_char #174

Closed
stasm opened this issue Aug 28, 2018 · 2 comments

Comments

@stasm
Contributor

stasm commented Aug 28, 2018

Let's extend the regular_char production to also accept characters from outside the BMP. A good standard to follow is the Char production in the XML spec: https://www.w3.org/TR/REC-xml/#NT-Char.

fluent/syntax/grammar.mjs

Lines 411 to 417 in 3c7cd30

/* Any Unicode character from BMP excluding C0 control characters, space,
 * surrogate blocks and non-characters (U+FFFE, U+FFFF).
 * Cf. https://www.w3.org/TR/REC-xml/#NT-Char
 * TODO Add characters from other planes: U+10000 to U+10FFFF.
 */
let regular_char =
    charset("\u0021-\uD7FF\uE000-\uFFFD");

@stasm
Contributor Author

stasm commented Oct 12, 2018

Extending the reference parser to support astral Unicode planes turned out to be easy thanks to Unicode-aware regexes in ES2015. I opened #179 with the proposed implementation.
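
For readers unfamiliar with the `u` flag, here is a minimal sketch of the idea. It is not the code from #179: `REGULAR_CHAR` and `isRegularChar` are hypothetical names, and the character class simply appends the astral range from the TODO to the existing BMP ranges. With the `u` flag, the regex matches whole code points rather than UTF-16 code units, and the `\u{...}` escape becomes available.

```javascript
// Hypothetical sketch, not the implementation proposed in #179.
// BMP ranges from the current grammar, plus U+10000 to U+10FFFF.
// The `u` flag makes the class operate on code points, so a character
// encoded as a surrogate pair counts as a single match.
const REGULAR_CHAR = /^[\u0021-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]$/u;

// Returns true if `ch` is exactly one character allowed by the class above.
function isRegularChar(ch) {
    return REGULAR_CHAR.test(ch);
}
```

Under these assumptions, `isRegularChar("\u{1F600}")` is true, while a space, U+FFFE, and a lone surrogate such as `"\uD800"` are all rejected.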

@stasm
Contributor Author

stasm commented Oct 12, 2018

The definition of NT-Char in the XML spec comes with the following note:

Note:

    Document authors are encouraged to avoid "compatibility characters",
    as defined in section 2.3 of [Unicode]. The characters defined in the
    following ranges are also discouraged. They are either control characters
    or permanently undefined Unicode characters:

    [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDEF],
    [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
    [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
    [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
    [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
    [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
    [#x10FFFE-#x10FFFF].

Should we include something similar in the Fluent spec?
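
To make the quoted ranges concrete, here is a rough sketch of how a tool might test a code point against them. `isDiscouraged` is a hypothetical helper, not something in Fluent or in the XML spec. It relies on the observation that the plane-level entries in the note are the last two code points of each plane from 1 to 16.

```javascript
// Hypothetical helper: reports whether a code point falls in one of the
// ranges that the XML note above lists as discouraged.
function isDiscouraged(cp) {
    if (cp >= 0x7F && cp <= 0x84) return true;
    if (cp >= 0x86 && cp <= 0x9F) return true;
    if (cp >= 0xFDD0 && cp <= 0xFDEF) return true;
    // U+1FFFE, U+1FFFF, ..., U+10FFFE, U+10FFFF: the low 16 bits are
    // 0xFFFE or 0xFFFF in every plane from 1 to 16.
    return cp >= 0x10000 && cp <= 0x10FFFF && (cp & 0xFFFE) === 0xFFFE;
}
```

For example, `isDiscouraged(0x1FFFE)` is true, while `isDiscouraged(0x85)` (NEL, which the note leaves out of its ranges) and `isDiscouraged(0x1F600)` are false.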
