Skip to content

[cssom][all specs defining IDL] Consider USVString instead of DOMString, replacing surrogates with U+FFFD #1217

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
SimonSapin opened this issue Apr 13, 2017 · 12 comments
Labels
cssom-1 Current Work

Comments

@SimonSapin
Copy link
Contributor

CSSOM uses WebIDL’s DOMString type for all string parameters and return values. It corresponds to JavaScript strings: arbitrary sequences of 16-bit code units. These are usually interpreted as UTF-16, but they’re not necessarily well-formed in UTF-16: they can contain unpaired surrogate code units. I sometimes call this encoding WTF-16.

(Character encoding decoders never emit surrogates when decoding bytes from the network, even when decoding UTF-16BE or UTF-16LE. So surrogates can’t end up in a string that way, only through JS.)

WebIDL also defines USVString which is a Unicode string. (A sequence of Unicode scalar values, excluding surrogate code points.) When converting to it from a JavaScript string, unpaired surrogate are replaced with the replacement character U+FFFD.

As far as I know all major browser engines currently use WTF-16 internally, so they preserve unpaired surrogates "by default" when strings go through various browser components where no code is actively looking for those.

In Firefox, we’re working on a new style system (Stylo, a.k.a. Quantum CSS) where strings are represented with Rust’s native &str type. &str uses UTF-8 bytes for its in-memory representation of Unicode and guarantees (as part of the type’s contract) that these bytes are well-formed in UTF-8. Unicode designed UTF-8 to specifically exclude surrogate code points, in order to be compatible with (well-formed) UTF-16. As a consequence, well-formed UTF-8 (and &str) can not represent all JavaScript strings without some sort of escape sequence mechanism.

Stylo currently replaces unpaired surrogates with U+FFFD when converting JS strings to UTF-8. This is equivalent to defining WebIDL interfaces with USVString instead of DOMString. This is a deviation from specified and currently-interoperable behavior.

It would be possible to make Stylo preserve surrogates (for example by moving everything to WTF-8). However we’re inclined not to. Preserving surrogates is an historical accident, not a feature. I argue that any occurrence of surrogates in a JS string is likely an error, and coming up with an example where not preserving them in CSSOM makes an observable difference is extremely convoluted. For example:

http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=5012

<!DOCTYPE html>
<style></style>
<script>
document.documentElement.classList.add('\uD800');
document.styleSheets[0].insertRule('.\uD800:before { content: "Surrogates can be used in class names." }', 0);
document.styleSheets[0].insertRule('.\uD801:before { content: "Surrogates seem to be mapped to U+FFFD." }', 1);
</script>

So I would like to propose changing CSSOM and other CSS specifications that declare WebIDL interfaces to use USVString instead of DOMString. This makes CSS syntax “Unicode-clean”, and enable implementations to use UTF-8 internally.

CSSWG discussed and rejected in 2014 a proposal that was effectively the same. However neither USVString nor Stylo existed at the time. What has changed is that WebIDL now gives us the tool to easily specify this change, and one major implementation is on a path to likely to make this change.

@SimonSapin SimonSapin added the cssom-1 Current Work label Apr 13, 2017
@zcorpan
Copy link
Member

zcorpan commented Apr 13, 2017

Shouldn't the test be \uFFFD:before?

@hsivonen
Copy link
Member

Additional note about &str for CSS WG members who don't program in Rust yet:

Since &str is the Rust native way of looking at text, being able to use it is very valuable. It means that the style system can call into other Rust code built with the usual Rust assumptions about in-memory data that has textual interpretation.

Preserving unpaired surrogates would mean giving up on the Rust native way of representing in-memory textual data and fighting the Rust ecosystem when it comes to calling into other code, etc.

@SimonSapin
Copy link
Contributor Author

@zcorpan, hmm you’re right. New test case:

http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=5022

<!DOCTYPE html>
<style></style>
<script>
document.documentElement.classList.add('\uD800');
document.styleSheets[0].insertRule('@media all { \
:root.\uD800:not(.\D801):not(.\uFFFD):before { content: "Surrogates seem to be preserved." } \
:root:not(.\uD800):not(.\uFFFD):before { content: "Surrogates seem to be mapped to U+FFFD in CSSOM but not in DOM." } \
:root.\uD801:before { content: "Surrogates seem to be mapped to U+FFFD in both CSSOM and DOM." } \
}', 0);
</script>

(The "both" case is Servo.)

@svgeesus
Copy link
Contributor

Unpaired surrogates can only happen when injected by badly-written (or, maybe, malicious?) scripts. There is no actual use case for having them, and being required to preserve them is unfortunate. We should take this opportunity to dump them.

@tabatkins
Copy link
Member

Per https://heycam.github.io/webidl/#idl-USVString, DOMString is meant to be preferred unless there's a good reason to need only scalar values.

In the previous discussion, the major argument against making CSS Unicode-clean was the perf impact of the extra scanning step required. Is that still valid?

@SimonSapin
Copy link
Contributor Author

That advice is debatable: whatwg/webidl#84

The performance impact is non-zero for implementations that use WTF-16 internally. (Though I don’t know how significant it is.) In 2014 when all vendors at the table were in that case, “Unicode is nicer” by itself was not a good enough reason. However Stylo is now coming to Firefox, hopefully shipping in November 2017. It currently uses UTF-8 without surrogates internally, and given the trade-off (no use case & low estimated web-compat risk v.s. internal API compat issues of using a different string type) this is unlikely to change before shipping. I want at least to give heads up to the WG that Firefox will likely deviate from currently-interoperable behavior.

@css-meeting-bot
Copy link
Member

The CSS Working Group just discussed Consider using USVString instead of DOMString, and agreed to the following resolutions:

RESOLVED: CSSOM can use either USVString or DOMString
The full IRC log of that discussion
<TabAtkins> Topic: Consider using USVString instead of DOMString
<fantasai> ScribeNick: fantasai
<fantasai> SimonSapin: In JS, strings are made of a sequence of 16-bit integers
<fantasai> SimonSapin: Can be arbitrary sequence
<fantasai> SimonSapin: Usually interpreted as UTF-16
<fantasai> SimonSapin: But don't have to be well-formed UTF-16
<fantasai> SimonSapin: In particular, range of values that are called surrogates
<fantasai> SimonSapin: If you have a leading surrogate plus trailing surrogate, i.e. 2 UTF-16 ints, that forms a single Unicode codepoint
<fantasai> SimonSapin: But in JS, nothing stops surrogates from appearing in the wrong order, or a single one by itself
<fantasai> SimonSapin: This is invalid Unicode
<fantasai> SimonSapin: But you can do it in JS
<fantasai> SimonSapin: If you want to convert that string to UTF-8, UTF-8 is designed to exclude surrogate codepoints to align with set of valid UTF-16 strings
<fantasai> SimonSapin: So not all JS strings can be represented in UTF-8 without losing data or using escaping mechanism
<fantasai> SimonSapin: Wasn't an issue, because every browser internally uses same type of string as JS
<dbaron> Github topic: https://github.com/w3c/csswg-drafts/issues/1217
<fantasai> SimonSapin: So if ou have CSSOM string that has unpaired surrogate, e.g. in an ident or content property string
<fantasai> SimonSapin: it's ok
<fantasai> SimonSapin: What's changing now is that in Firefox, we have a project called Stylo, which is to import Servo style system into Gecko
<fantasai> SimonSapin: That style system is using Rust str type for all strings
<fantasai> SimonSapin: which is based on UTF-8, so it cannot represent unpaired surrogates
<fantasai> SimonSapin: what that means is in practice, whenever a string comes from CSSOM and goes into the style system in Servo and in the future in Firefox, we convert to UTF-8 and in that process, any unpaired surrogate is replaced with U+FFFD REPLACEMENT CHARACTER
<fantasai> SimonSapin: So there is some data loss
<fantasai> SimonSapin: However, I think this kind of situation only happens accidentally
<fantasai> SimonSapin: Fact that JS strings are this way is not a feature, it's a historical accident
<fantasai> SimonSapin: I don't think there is a real compat risk with shipping Firefox this way
<fantasai> SimonSapin: Still, it's a deviation from current interoperable behavior, so wanted to bring it up
<fantasai> Florian: Proposal?
<fantasai> SimonSapin: In WebIDL which we use to define itnerfaces for JS, there is two string types. DOMString corresponds to JS strings with aribtrary 16-bit
<fantasai> SimonSapin: There is USVString, Unicode Scalara Value String, which has no unpaired surrogates, only well-formed unicode
<fantasai> SimonSapin: When you convert DOMString to that, you get the same behavior as in Servo, replacing lone surrogates with UFFFD
<fantasai> SimonSapin: If we want to keep this interoperable, then I propse to use USVString for all of CSSOM
<fantasai> ChrisL_: Seems like a good idea, since unpaired surrogates are only an error
<fantasai> ChrisL_: Only used for binary data, and cna't imagine that in CSSOM
<fantasai> TabAtkins: USVString is supposed to be avoided in WebIDL
<fantasai> TabAtkins: Currently only used in networking protocols that use scalar values
<fantasai> TabAtkins: Requires extra processing compared to UTF-16 strings
<fremy> ChrisL_: maybe in custom properties though, people would want to store binary data; they should encode it to avoid syntax issues though so no big deal
<fantasai> dbaron: Anne disagrees with advice in WebIDl spec, btw
<tantek> s/Anne/Annevk
<fantasai> dbaron: There's a github issue against WebIDL spec to give coherent advice, but ppl disagree on what that should be
<fantasai> dino: Appreciate that you want to use rust string type, but we all have to use our own string types
<fantasai> dino: Maybe resolution is all DOM strings should be that way
<fantasai> TabAtkins: No, that would break a lot of things
<fantasai> TabAtkins: ppl smuggle binary date in JS strings
<fantasai> TabAtkins: But for things that talk text, coudl do it
<fantasai> dino: Everything, not just CSSOM
<fantasai> Florian: Would it be reasonable for implementations that don't do rust strings internally
<fantasai> myles_: If we don't know perf impact, can't agree to do this
<dbaron> myles_: so somebody somewhere has to try it first before we agree to it
<fantasai> SimonSapin: Tab, ?
<iank_> q+
<fantasai> TabAtkins: Some DOM Apis have to be 16-bit, e.g. Fetch ...
<fantasai> rbyers: It's not in Chrome
<fantasai> TabAtkins muses
<SimonSapin> s/?/did you mean changing JS or DOM would break things?/
<fantasai> till: It's not entirely out of the question that it would be Web-compatible enough that we could change it in JS itself
<fantasai> fantasai: For JS itself, Tab was saying it's not doable, but for DOM Apis more likely to be possible
<fantasai> iank_: Need to check with architecture folks about this
<fantasai> iank_: our architectue folks in charge of bindings and stirng types and stuff
<fantasai> iank_: Looked for code where we switch to USVStrings, and that's very expensive for us it looks like
<fantasai> iank_: Might be perf problems
<fantasai> fantasai: My take is that this is a veyr weird edge case with no real use... lone surrogates in the CSSOM.
<fantasai> fantasai: So I would say, let's spec you can use either, and we don't care.
<fantasai> myles_: Every string would have to get transcoded, that's crazy
<fantasai> TabAtkins: ....
<fantasai> iank_: Would have to guarantee that htat block internally is clean
<fantasai> TabAtkins: Move to UTF-8 clean internally
<fantasai> iank_: Sounds non-triial
<fantasai> Florian: Spec that either is Okay then it's not any work
<fantasai> Florian: If we can't spec that, then it means web depends on it, so Servo will have to bite the bullet
<fantasai> TabAtkins: I'm okay with doing that, put a note that we'd like to move th USVString
<fantasai> shane: If there's a webb compat problem, then it's a problem
<fantasai> TabAtkins: That means someone is injecting lone surrogates into the CSSOM. Can't come out of the parser
<fantasai> TabAtkins: In that case probably buggy anyway
<fantasai> shane: If ppl notice a problem, they'll file bugs
<fantasai> eae: It's very hard to get into the situation except intentionally
<fantasai> myles_: Does Servo have to translate between JS string and USVString all the time?
<fantasai> SimonSapin: Yes
<fantasai> SimonSapin: We have optimizations, e.g. if ascii then stord in one byte per uit, skip UTF-8 conversion
<fantasai> s/USVString/UTF-8 String/
<fantasai> Florian: Can we just resolve on both and if it's a problem, come back and we'll change hte spec?
<fantasai> fantasai: I think interop in this very very weird case is not worth any effort, so it should allow both
<fantasai> till: It's not servo-specific, others might want UTF-8 codepaths
<fantasai> myles_: I believe that you believe that.
<fantasai> Rossen: Anyy objections?
<fantasai> rbyers: We should rediscuss if we find web compat issues
<fantasai> RESOLVED: CSSOM can use either USVString or DOMString
<fantasai> fantasai: We can alwasy raise issues if they're found later.
<fantasai> SimonSapin: This also affects other specs with WebIDL interfaces, e.g. CSS Fonts defines @font-face interfaces
<fantasai> Florian: Should we define a CSSString?
<fantasai> ...
<fantasai> iank_: But if we do this later..
<fantasai> fantasai: We are literally deciding that you can do either, forever. Unless someone comes back and says "lone surrogates in CSSOm are an important use case and I need them"
<fantasai> Rossen discusses agenda items

SimonSapin added a commit to SimonSapin/webidl that referenced this issue Apr 19, 2017
The observable behavior difference is whether surrogate code units
are preserved or replaced with U+FFFD REPLACEMENT CHARACTER.

CSSWG resolved to allow either in CSSOM:
w3c/csswg-drafts#1217 (comment)
@SimonSapin
Copy link
Contributor Author

SimonSapin commented Apr 19, 2017

As far as I can tell a new WebIDL type needs to be defined in WebIDL, so I filed whatwg/webidl#347

@SimonSapin
Copy link
Contributor Author

TIL WebIDL has typedef.

@zcorpan
Copy link
Member

zcorpan commented Apr 19, 2017

We can't use a union type of DOMString or USVString because they are both string types and union types must be distinguishable.

But we can typedef to only USVString and say in prose that it is conforming to use DOMString instead (or vice versa).

SimonSapin added a commit to SimonSapin/csswg-drafts that referenced this issue Apr 21, 2017
Implementations can choose one or the other.

CSSWG resolution:
w3c#1217 (comment)
SimonSapin added a commit to SimonSapin/csswg-drafts that referenced this issue Apr 21, 2017
CSSWG resolution:
w3c#1217 (comment)

Fix w3c#1217.

Each occurrence is one of:

* CSS syntax
* A name (for example a property name) that also occurs in CSS syntax
* `Stylesheet::type`, which is always `text/css`.
* `Stylesheet::title`, which is set from the eponymous HTML content attribute
  of [`<style>`](https://html.spec.whatwg.org/multipage/semantics.html#attr-style-title)
  and [`<link>`>](https://html.spec.whatwg.org/multipage/semantics.html#attr-link-title)
  elements.
SimonSapin added a commit to SimonSapin/csswg-drafts that referenced this issue Apr 21, 2017
CSSWG resolution:
w3c#1217 (comment)

Fix w3c#1217.

Each occurrence is one of:

* CSS syntax
* A name (for example a property name) that also occurs in CSS syntax
* `Stylesheet::type`, which is always `text/css`.
* `Stylesheet::title`, which is set from the eponymous HTML content attribute
  of [`<style>`](https://html.spec.whatwg.org/multipage/semantics.html#attr-style-title)
  and [`<link>`](https://html.spec.whatwg.org/multipage/semantics.html#attr-link-title)
  elements.
SimonSapin added a commit to SimonSapin/csswg-drafts that referenced this issue Apr 21, 2017
CSSWG resolution:
w3c#1217 (comment)

Fix w3c#1217.

Each occurrence is one of:

* CSS syntax
* A name (for example a property name) that also occurs in CSS syntax
* `Stylesheet::type`, which is always `text/css`.
* `Stylesheet::title`, which is set from the eponymous HTML content attribute
  of [`<style>`](https://html.spec.whatwg.org/multipage/semantics.html#attr-style-title)
  and [`<link>`](https://html.spec.whatwg.org/multipage/semantics.html#attr-link-title)
  elements.
SimonSapin added a commit to SimonSapin/csswg-drafts that referenced this issue Apr 21, 2017
CSSWG resolution:
w3c#1217 (comment)

Fix w3c#1217.

Each occurrence is one of:

* CSS syntax
* A name (for example a property name) that also occurs in CSS syntax
* `Stylesheet::type`, which is always `text/css`.
* `Stylesheet::title`, which is set from the eponymous HTML content attribute
  of [`<style>`](https://html.spec.whatwg.org/multipage/semantics.html#attr-style-title)
  and [`<link>`](https://html.spec.whatwg.org/multipage/semantics.html#attr-link-title)
  elements.
SimonSapin added a commit to SimonSapin/csswg-drafts that referenced this issue Apr 21, 2017
CSSWG resolution:
w3c#1217 (comment)

Fix w3c#1217.

Each occurrence is one of:

* CSS syntax
* A name (for example a property name) that also occurs in CSS syntax
* `Stylesheet::type`, which is always `text/css`.

Not replaced:

* `Stylesheet::title`, which is set from the eponymous HTML content attribute
  of [`<style>`](https://html.spec.whatwg.org/multipage/semantics.html#attr-style-title)
  and [`<link>`](https://html.spec.whatwg.org/multipage/semantics.html#attr-link-title)
  elements.
  These contant attributes are reflected as `HTMLElement::title` DOM attributes,
  where they are `DOMString`.
zcorpan pushed a commit that referenced this issue Apr 21, 2017
Implementations can choose one or the other.

CSSWG resolution:
#1217 (comment)
zcorpan pushed a commit that referenced this issue Apr 21, 2017
CSSWG resolution:
#1217 (comment)

Fix #1217.

Each occurrence is one of:

* CSS syntax
* A name (for example a property name) that also occurs in CSS syntax
* `Stylesheet::type`, which is always `text/css`.

Not replaced:

* `Stylesheet::title`, which is set from the eponymous HTML content attribute
  of [`<style>`](https://html.spec.whatwg.org/multipage/semantics.html#attr-style-title)
  and [`<link>`](https://html.spec.whatwg.org/multipage/semantics.html#attr-link-title)
  elements.
  These contant attributes are reflected as `HTMLElement::title` DOM attributes,
  where they are `DOMString`.
@foolip
Copy link
Member

foolip commented May 7, 2018

As an FYI, typedef USVString CSSOMString made its way into web-platform-tests in web-platform-tests/wpt#6982, and just like an implementation the IDL can only say one thing here, not "this or that." Fortunately idlharness.js doesn't automatically exercise the difference between DOMString and USVString and actually can't because one needs to understand the API under test to know where the converted string ends up to read it back.

And this does mean that tests like https://github.com/w3c/web-platform-tests/blob/master/html/dom/usvstring-reflection.html will be impossible to add for CSSOM.

@zcorpan
Copy link
Member

zcorpan commented May 7, 2018

At least tests could check that an implementation is consistently either DOMString or USVString for all CSSOMString things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cssom-1 Current Work
Projects
None yet
Development

No branches or pull requests

7 participants