Skip to content

Restrictions on what can go in inline comments #1

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
bakkot opened this issue Oct 27, 2021 · 5 comments
Open

Restrictions on what can go in inline comments #1

bakkot opened this issue Oct 27, 2021 · 5 comments

Comments

@bakkot
Copy link

bakkot commented Oct 27, 2021

There is a base RegularExpressionLiteral grammar which we almost certainly can't change, and I think that will impose some limitations on what characters can be legal unescaped in comments. In particular, I think [ and] might need to be escaped, since the lexer thinks those represent character classes which must balance.

@rbuckton
Copy link
Collaborator

I imagine that comments are more likely to be used via the RegExp constructor than in a RegularExpressionLiteral, though I would like to support (?#) in literals. This might require [] be escaped in addition to ()/ in a literal, though escaping for anything other than () is unnecessary in a raw string passed to the RegExp constructor.

It also doesn't seem like it would be difficult to modify RegularExpressionBody to have an additional balancing construct for (?# and ), though it seemed there was reticence to such a change that I don't fully understand.

@bakkot
Copy link
Author

bakkot commented Oct 27, 2021

I suspect @waldemarhorwat and @erights are the best people to speak to any potential issues with modifying RegularExpressionBody.

@erights
Copy link

erights commented Dec 15, 2021

Every time there is ambiguity in how to lex/parse JS, we have seen that enable attacks. For example, the ambiguity around html-like comments enabled an attack on Caja. Some approaches to filtering code for some safety relies on an accurate tokenization without requiring a parse. For example, Caja did a conservative scan for all identifiers that might be a use occurrence of a variable. Given accurate tokenization, this is guaranteed to include all use occurrences of free variables. (Caja was actually even more conservative, picking up all identifiers, even in comments and literal strings. But this was expensive. If we had a tokenizers we were confident in, we could have used that.)

There is other old software that relies on being able to tokenize accurately, even if it can no longer parse. For example, some simple syntax highlighters. For example, there is an old macro library for JS that relies on being able to tokenize, followed by being able to bracket match. Bracket matching is a parsing level constraint that has also not been broken by the continuing evolution of JS. In fact, tokenizing template literals requires bracket matching, so stability of tokenization does seem to require the bracket-matching level of parsing as well.

@rbuckton
Copy link
Collaborator

In plenary today I mistakenly said, "we should ban x-mode for RegExp literals". However, that would not satisfy the constraint here as x-mode does not affect the behavior of (?#). We could merely require that [ and ] be escaped when in a comment in RegularExpressionBody as is proposed in the OP, but they don't necessarily need to be escaped when passed as a string to the RegExp constructor, just as / must be escaped in a RegularExpressionLiteral but does not need to be escaped when passed as a string to the constructor.

@waldemarhorwat
Copy link

Having the RegularExpressionBody grammar not be a cover grammar (modulo /) for the real regular expression grammar would be terrible.

The invariant is that any valid regular expression according to the RegExp grammar that does not contain an internal / must also be a valid RegularExpressionBody expansion. We carefully engineered the v mode to retain this invariant.

Introducing (?#) comments without the requirement of escaping [ and ] will break the invariant. In /ab(?#[)/i]/m, the regular expression should end at the first / but RegularExpressionBody will also swallow up some of the following characters.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants