-
Notifications
You must be signed in to change notification settings - Fork 2k
Fix #3410, #3182: Allow regex to start with space or = #3782
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
A regex may not follow a specific set of tokens. These were already known before in the `NOT_REGEX` and `NOT_SPACED_REGEX` arrays. (However, I've refactored them to be more correct and to add a few missing tokens). In all other cases (except after a spaced callable) a slash is the start of a regex, and may now start with a space or an equals sign. It’s really that simple! A slash after a spaced callable is the only ambigous case. We cannot know if that's division or function application with a regex as the argument. The spacing determines which is which: Space on both sides: - `a / b/i` -> `a / b / i` - `a /= b/i` -> `a /= b / i` No spaces: - `a/b/i` -> `a / b / i` - `a/=b/i` -> `a /= b / i` Space on the right side: - `a/ b/i` -> `a / b / i` - `a/= b/i` -> `a /= b / i` Space on the left side: - `a /b/i` -> `a(/b/i)` - `a /=b/i` -> `a(/=b/i)` The last case used to compile to `a /= b / i`, but that has been changed to be consistent with the `/` operator. The last case really looks like a regex, so it should be parsed as one. Moreover, you may now also space the `/` and `/=` operators with other whitespace characters than a space (such as tabs and non-breaking spaces) for consistency. Lastly, unclosed regexes are now reported as such, instead of generating some other confusing error message. It should perhaps also be noted that apart from escaping (such as `a /\ b/`) you may now also use parentheses to disambiguate division and regex: `a (/ b/)`. See jashkenas#3182 (comment).
Note that this PR contains nothing controversial; it is merely a refinement of the current semantics (IMO). |
👏 It's awesome that you're tackling these hairy issues, @lydell! On this issue in particular, i really wish the tokenization could be described with simpler rules, mainly for people to understand what something is just by looking at it, but also for tools other than the Coffee compiler to work well with the language. E.g., syntax highlighters will probably have a hard time getting regex highlighting right for CS. The tests for this PR, for instance, seem like a random mix of regexes and division symbols under GitHub's syntax highlighting. But, given that that's probably not possible given Coffee's "relaxed" syntax, i think it's best for the compiler to work harder to match what the user intended to write, so i very much welcome this PR! |
That’s because GitHub’s syntax highlighting uses the current rule that a regex may never start with a space or equals sign. It needs to be updated. |
If you look at the spacing examples above, don’t you think that you can tell what’s division and what’s regex just by looking? It’s all about spacing: Regexes are “tight”, while division is “loose” (unless you write unreadable code without whitespace; then division could be described as “too tight”). |
The problem is mainly that even in JavaScript it is difficult to tell division and regex apart. Many syntax highlighters fail on edge cases there, too (that’s my experience at least). See http://www-archive.mozilla.org/js/language/js20-2002-04/rationale/syntax.html#regular-expressions. On top of that, allowing parenthesis-less function calls in CoffeeScript is what makes it harder. When designing the language I guess nobody thought that the parenthesis-less feature and copying JavaScript’s regex syntax would clash; that’s something found later when things were too late to change I’d guess. So yes, I don’t think it is possible to do this in a simpler way. |
I can't find a test that |
Yes, don't get me wrong, i think this PR improves the "WYMIWYGability" of the regex tokens. But, some of the "corner" cases, like
Totally agreed there! That's why coding guidelines and linting/formatting tools are so valuable.
Thanks for the links, i didn't know about that. My personal take on this is: if you're designing a language, avoid "overriding" the semantics of symbols as much as possible. E.g., parentheses only for grouping expressions -> good; parentheses for grouping expressions, calling functions, and declaring tuples -> not so good (that's why a one-element tuple in Python needs a trailing comma). Sadly, ASCII seems to fall short for most PLs, except the ones designed to have a minimalistic syntax. |
Thanks, @lydell. LGTM, but I'd like to get a few more eyes on it before merging. |
LGTM as well, glad to see better error messages here (and more tests).
Or maybe with some comments :P. I agree that the lexer is a bit hairy. |
Not a fan of supporting leading space in some cases (I write |
@xixixao I thought about exactly that case too. But @jashkenas made a point that you should be able to use parentheses to disambiguate. #3182 (comment) |
A regex may not follow a specific set of tokens. These were already known before
in the
NOT_REGEX
andNOT_SPACED_REGEX
arrays. (However, I've refactored themto be more correct and to add a few missing tokens). In all other cases (except
after a spaced callable) a slash is the start of a regex, and may now start with
a space or an equals sign. It’s really that simple!
A slash after a spaced callable is the only ambigous case. We cannot know if
that's division or function application with a regex as the argument. The
spacing determines which is which:
Space on both sides:
a / b/i
->a / b / i
a /= b/i
->a /= b / i
No spaces:
a/b/i
->a / b / i
a/=b/i
->a /= b / i
Space on the right side:
a/ b/i
->a / b / i
a/= b/i
->a /= b / i
Space on the left side:
a /b/i
->a(/b/i)
a /=b/i
->a(/=b/i)
The last case used to compile to
a /= b / i
, but that has been changed to beconsistent with the
/
operator. The last case really looks like a regex, so itshould be parsed as one.
Moreover, you may now also space the
/
and/=
operators with otherwhitespace characters than a space (such as tabs and non-breaking spaces) for
consistency.
Lastly, unclosed regexes are now reported as such, instead of generating some
other confusing error message.
It should perhaps also be noted that apart from escaping (such as
a /\ b/
) youmay now also use parentheses to disambiguate division and regex:
a (/ b/)
. See#3182 (comment).