Skip to content

Fix #3410, #3182: Allow regex to start with space or = #3782

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 10, 2015

Conversation

lydell
Copy link
Collaborator

@lydell lydell commented Jan 10, 2015

A regex may not follow a specific set of tokens. These were already known before
in the NOT_REGEX and NOT_SPACED_REGEX arrays. (However, I've refactored them
to be more correct and to add a few missing tokens). In all other cases (except
after a spaced callable) a slash is the start of a regex, and may now start with
a space or an equals sign. It’s really that simple!

A slash after a spaced callable is the only ambigous case. We cannot know if
that's division or function application with a regex as the argument. The
spacing determines which is which:

Space on both sides:

  • a / b/i -> a / b / i
  • a /= b/i -> a /= b / i

No spaces:

  • a/b/i -> a / b / i
  • a/=b/i -> a /= b / i

Space on the right side:

  • a/ b/i -> a / b / i
  • a/= b/i -> a /= b / i

Space on the left side:

  • a /b/i -> a(/b/i)
  • a /=b/i -> a(/=b/i)

The last case used to compile to a /= b / i, but that has been changed to be
consistent with the / operator. The last case really looks like a regex, so it
should be parsed as one.

Moreover, you may now also space the / and /= operators with other
whitespace characters than a space (such as tabs and non-breaking spaces) for
consistency.

Lastly, unclosed regexes are now reported as such, instead of generating some
other confusing error message.

It should perhaps also be noted that apart from escaping (such as a /\ b/) you
may now also use parentheses to disambiguate division and regex: a (/ b/). See
#3182 (comment).

A regex may not follow a specific set of tokens. These were already known before
in the `NOT_REGEX` and `NOT_SPACED_REGEX` arrays. (However, I've refactored them
to be more correct and to add a few missing tokens). In all other cases (except
after a spaced callable) a slash is the start of a regex, and may now start with
a space or an equals sign. It’s really that simple!

A slash after a spaced callable is the only ambigous case. We cannot know if
that's division or function application with a regex as the argument. The
spacing determines which is which:

Space on both sides:
- `a / b/i`  -> `a / b / i`
- `a /= b/i` -> `a /= b / i`

No spaces:
- `a/b/i`    -> `a / b / i`
- `a/=b/i`   -> `a /= b / i`

Space on the right side:
- `a/ b/i`   -> `a / b / i`
- `a/= b/i`  -> `a /= b / i`

Space on the left side:
- `a /b/i`   -> `a(/b/i)`
- `a /=b/i`  -> `a(/=b/i)`

The last case used to compile to `a /= b / i`, but that has been changed to be
consistent with the `/` operator. The last case really looks like a regex, so it
should be parsed as one.

Moreover, you may now also space the `/` and `/=` operators with other
whitespace characters than a space (such as tabs and non-breaking spaces) for
consistency.

Lastly, unclosed regexes are now reported as such, instead of generating some
other confusing error message.

It should perhaps also be noted that apart from escaping (such as `a /\ b/`) you
may now also use parentheses to disambiguate division and regex: `a (/ b/)`. See
jashkenas#3182 (comment).
@lydell
Copy link
Collaborator Author

lydell commented Jan 10, 2015

Note that this PR contains nothing controversial; it is merely a refinement of the current semantics (IMO).

@epidemian
Copy link
Contributor

👏

It's awesome that you're tackling these hairy issues, @lydell!

On this issue in particular, i really wish the tokenization could be described with simpler rules, mainly for people to understand what something is just by looking at it, but also for tools other than the Coffee compiler to work well with the language. E.g., syntax highlighters will probably have a hard time getting regex highlighting right for CS. The tests for this PR, for instance, seem like a random mix of regexes and division symbols under GitHub's syntax highlighting.

But, given that that's probably not possible given Coffee's "relaxed" syntax, i think it's best for the compiler to work harder to match what the user intended to write, so i very much welcome this PR!

@lydell
Copy link
Collaborator Author

lydell commented Jan 10, 2015

The tests for this PR, for instance, seem like a random mix of regexes and division symbols under GitHub's syntax highlighting.

That’s because GitHub’s syntax highlighting uses the current rule that a regex may never start with a space or equals sign. It needs to be updated.

@lydell
Copy link
Collaborator Author

lydell commented Jan 10, 2015

understand what something is just by looking at it

If you look at the spacing examples above, don’t you think that you can tell what’s division and what’s regex just by looking? It’s all about spacing: Regexes are “tight”, while division is “loose” (unless you write unreadable code without whitespace; then division could be described as “too tight”).

@lydell
Copy link
Collaborator Author

lydell commented Jan 10, 2015

i really wish the tokenization could be described with simpler rules
...
But, given that that's probably not possible given Coffee's "relaxed" syntax

The problem is mainly that even in JavaScript it is difficult to tell division and regex apart. Many syntax highlighters fail on edge cases there, too (that’s my experience at least). See http://www-archive.mozilla.org/js/language/js20-2002-04/rationale/syntax.html#regular-expressions.

On top of that, allowing parenthesis-less function calls in CoffeeScript is what makes it harder. When designing the language I guess nobody thought that the parenthesis-less feature and copying JavaScript’s regex syntax would clash; that’s something found later when things were too late to change I’d guess.

So yes, I don’t think it is possible to do this in a simpler way.

@michaelficarra
Copy link
Collaborator

I can't find a test that maybeFn? /regex/g continues to work.

@epidemian
Copy link
Contributor

If you look at the spacing examples above, don’t you think that you can tell what’s division and what’s regex just by looking?

Yes, don't get me wrong, i think this PR improves the "WYMIWYGability" of the regex tokens. But, some of the "corner" cases, like a/=b/i, don't seem so immediately obvious to me, and understanding/explaining what's going on would require more complexity that i think it should.

unless you write unreadable code without whitespace; then division could be described as “too tight”

Totally agreed there! That's why coding guidelines and linting/formatting tools are so valuable.

The problem is mainly that even in JavaScript it is difficult to tell division and regex apart. Many syntax highlighters fail on edge cases there, too (that’s my experience at least).

Thanks for the links, i didn't know about that. My personal take on this is: if you're designing a language, avoid "overriding" the semantics of symbols as much as possible. E.g., parentheses only for grouping expressions -> good; parentheses for grouping expressions, calling functions, and declaring tuples -> not so good (that's why a one-element tuple in Python needs a trailing comma). Sadly, ASCII seems to fall short for most PLs, except the ones designed to have a minimalistic syntax.

@michaelficarra
Copy link
Collaborator

Thanks, @lydell. LGTM, but I'd like to get a few more eyes on it before merging.

@vendethiel
Copy link
Collaborator

LGTM as well, glad to see better error messages here (and more tests).

i really wish the tokenization could be described with simpler rules

Or maybe with some comments :P. I agree that the lexer is a bit hairy.

michaelficarra added a commit that referenced this pull request Jan 10, 2015
Fix #3410, #3182: Allow regex to start with space or =
@michaelficarra michaelficarra merged commit bec8f27 into jashkenas:master Jan 10, 2015
@lydell lydell deleted the regex branch January 10, 2015 17:34
@xixixao
Copy link
Contributor

xixixao commented Jan 10, 2015

Not a fan of supporting leading space in some cases (I write foo(/ z/), everything works, someone comes along months later, removes the "spurious" parens and bing, error), but the better error messages are really the crucial part of the solution.

@lydell
Copy link
Collaborator Author

lydell commented Jan 10, 2015

@xixixao I thought about exactly that case too. But @jashkenas made a point that you should be able to use parentheses to disambiguate. #3182 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants