[cssom][all specs defining IDL] Consider USVString instead of DOMString, replacing surrogates with U+FFFD #1217
Comments
Shouldn't the test be |
Additional note about Since Preserving unpaired surrogates would mean giving up on the Rust native way of representing in-memory textual data and fighting the Rust ecosystem when it comes to calling into other code, etc. |
@zcorpan, hmm you’re right. New test case: http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=5022
<!DOCTYPE html>
<style></style>
<script>
document.documentElement.classList.add('\uD800');
document.styleSheets[0].insertRule('@media all { \
:root.\uD800:not(.\uD801):not(.\uFFFD):before { content: "Surrogates seem to be preserved." } \
:root:not(.\uD800):not(.\uFFFD):before { content: "Surrogates seem to be mapped to U+FFFD in CSSOM but not in DOM." } \
:root.\uD801:before { content: "Surrogates seem to be mapped to U+FFFD in both CSSOM and DOM." } \
}', 0);
</script>
(The "both" case is Servo.) |
Unpaired surrogates can only happen when injected by badly-written (or, maybe, malicious?) scripts. There is no actual use case for having them, and being required to preserve them is unfortunate. We should take this opportunity to dump them. |
Per https://heycam.github.io/webidl/#idl-USVString, DOMString is meant to be preferred unless there's a good reason to need only scalar values. In the previous discussion, the major argument against making CSS Unicode-clean was the perf impact of the extra scanning step required. Is that still valid? |
That advice is debatable: whatwg/webidl#84 The performance impact is non-zero for implementations that use WTF-16 internally. (Though I don’t know how significant it is.) In 2014, when all vendors at the table were in that case, “Unicode is nicer” by itself was not a good enough reason. However, Stylo is now coming to Firefox, hopefully shipping in November 2017. It currently uses UTF-8 without surrogates internally, and given the trade-off (no use case and low estimated web-compat risk vs. internal API compat issues of using a different string type) this is unlikely to change before shipping. I want at least to give a heads-up to the WG that Firefox will likely deviate from currently-interoperable behavior. |
The CSS Working Group just discussed Consider using USVString instead of DOMString, and agreed to the following resolutions:
|
The observable behavior difference is whether surrogate code units are preserved or replaced with U+FFFD REPLACEMENT CHARACTER. CSSWG resolved to allow either in CSSOM: w3c/csswg-drafts#1217 (comment)
|
TIL WebIDL has |
We can't use a union type of But we can typedef to only |
Implementations can choose one or the other. CSSWG resolution: w3c#1217 (comment)
CSSWG resolution: w3c#1217 (comment)
Fix w3c#1217.
Each occurrence is one of:
* CSS syntax
* A name (for example a property name) that also occurs in CSS syntax
* `Stylesheet::type`, which is always `text/css`.
* `Stylesheet::title`, which is set from the eponymous HTML content attribute of [`<style>`](https://html.spec.whatwg.org/multipage/semantics.html#attr-style-title) and [`<link>`](https://html.spec.whatwg.org/multipage/semantics.html#attr-link-title) elements.
CSSWG resolution: w3c#1217 (comment)
Fix w3c#1217.
Each occurrence is one of:
* CSS syntax
* A name (for example a property name) that also occurs in CSS syntax
* `Stylesheet::type`, which is always `text/css`.
Not replaced:
* `Stylesheet::title`, which is set from the eponymous HTML content attribute of [`<style>`](https://html.spec.whatwg.org/multipage/semantics.html#attr-style-title) and [`<link>`](https://html.spec.whatwg.org/multipage/semantics.html#attr-link-title) elements. These content attributes are reflected as `HTMLElement::title` DOM attributes, where they are `DOMString`.
As an FYI, And this does mean that tests like https://github.com/w3c/web-platform-tests/blob/master/html/dom/usvstring-reflection.html will be impossible to add for CSSOM. |
At least tests could check that an implementation is consistently either DOMString or USVString for all CSSOMString things. |
CSSOM uses WebIDL’s `DOMString` type for all string parameters and return values. It corresponds to JavaScript strings: arbitrary sequences of 16-bit code units. These are usually interpreted as UTF-16, but they are not necessarily well-formed UTF-16: they can contain unpaired surrogate code units. I sometimes call this encoding WTF-16.
(Character encoding decoders never emit surrogates when decoding bytes from the network, even when decoding UTF-16BE or UTF-16LE. So surrogates can’t end up in a string that way, only through JS.)
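To make "unpaired surrogate" concrete, here is a minimal sketch that scans a JavaScript string for surrogate code units that are not part of a well-formed high/low pair. The helper name `hasLoneSurrogate` is hypothetical and is not part of any specification or browser API.

```javascript
// Hypothetical helper: returns true if `s` contains a surrogate code unit
// that is not part of a well-formed (high, low) pair.
function hasLoneSurrogate(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: it must be followed by a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low half of a valid pair
        continue;
      }
      return true;
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return true; // low surrogate with no preceding high surrogate
    }
  }
  return false;
}

const paired = '\uD83D\uDE00'; // U+1F600 as a well-formed surrogate pair
const lone = 'a\uD800b';       // unpaired high surrogate between two letters

console.log(paired.length, hasLoneSurrogate(paired)); // 2 false
console.log(lone.length, hasLoneSurrogate(lone));     // 3 true
```

Both strings are valid `DOMString` values; only the first is well-formed UTF-16.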
WebIDL also defines `USVString`, which is a Unicode string (a sequence of Unicode scalar values, which excludes surrogate code points). When converting to it from a JavaScript string, unpaired surrogates are replaced with the replacement character U+FFFD.
As far as I know, all major browser engines currently use WTF-16 internally, so they preserve unpaired surrogates "by default" when strings go through various browser components where no code is actively looking for them.
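The `DOMString` to `USVString` conversion described above can be sketched in plain JavaScript as follows. This is an illustration of the observable behavior, not the WebIDL algorithm text or any engine's implementation, and the function name `toUSVStringSketch` is made up for this example.

```javascript
// Hypothetical sketch: copy `s`, replacing every unpaired surrogate code unit
// with U+FFFD REPLACEMENT CHARACTER and keeping well-formed pairs intact.
function toUSVStringSketch(s) {
  let out = '';
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    const isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    const isLow = unit >= 0xDC00 && unit <= 0xDFFF;
    if (isHigh && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        out += s[i] + s[i + 1]; // well-formed pair: keep both code units
        i++;
        continue;
      }
    }
    // Any surrogate that reaches this point is unpaired.
    out += (isHigh || isLow) ? '\uFFFD' : s[i];
  }
  return out;
}

console.log(toUSVStringSketch('a\uD800b') === 'a\uFFFDb');             // true
console.log(toUSVStringSketch('\uD83D\uDE00') === '\uD83D\uDE00');     // true
```

An engine that stores strings as WTF-16 skips this scan entirely, which is where the performance argument in the thread comes from.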
In Firefox, we’re working on a new style system (Stylo, a.k.a. Quantum CSS) where strings are represented with Rust’s native `&str` type. `&str` uses UTF-8 bytes for its in-memory representation of Unicode and guarantees (as part of the type’s contract) that these bytes are well-formed UTF-8. Unicode designed UTF-8 to specifically exclude surrogate code points, in order to be compatible with (well-formed) UTF-16. As a consequence, well-formed UTF-8 (and `&str`) cannot represent all JavaScript strings without some sort of escape sequence mechanism.
Stylo currently replaces unpaired surrogates with U+FFFD when converting JS strings to UTF-8. This is equivalent to defining WebIDL interfaces with `USVString` instead of `DOMString`. This is a deviation from specified and currently-interoperable behavior.
It would be possible to make Stylo preserve surrogates (for example by moving everything to WTF-8). However, we’re inclined not to. Preserving surrogates is a historical accident, not a feature. I argue that any occurrence of unpaired surrogates in a JS string is likely an error, and that an example where not preserving them in CSSOM makes an observable difference has to be extremely convoluted. For example:
http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=5012
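The same lossy step can be observed from script with the Encoding Standard's `TextEncoder`, whose `encode()` input is itself a `USVString`. The sketch below is not Stylo's code; it only shows that encoding a lone surrogate to UTF-8 yields the bytes of U+FFFD, so the original code unit cannot be recovered afterwards. The `hex` helper is a local convenience introduced for this example.

```javascript
const encoder = new TextEncoder();

// Local helper (not a platform API): format bytes as space-separated hex.
const hex = (bytes) =>
  Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join(' ');

const loneBytes = encoder.encode('\uD800');        // unpaired high surrogate
const replacementBytes = encoder.encode('\uFFFD'); // U+FFFD itself

console.log(hex(loneBytes));        // ef bf bd
console.log(hex(replacementBytes)); // ef bf bd

// Round-tripping through UTF-8 gives back U+FFFD, not the surrogate.
const decoded = new TextDecoder().decode(loneBytes);
console.log(decoded === '\uFFFD');  // true
```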
So I would like to propose changing CSSOM and other CSS specifications that declare WebIDL interfaces to use `USVString` instead of `DOMString`. This makes CSS syntax “Unicode-clean” and enables implementations to use UTF-8 internally.
In 2014, CSSWG discussed and rejected a proposal that was effectively the same. However, neither `USVString` nor Stylo existed at the time. What has changed is that WebIDL now gives us the tool to easily specify this change, and one major implementation is likely on a path to making this change.