Change OUTDENT tokens to be positioned at the end of the previous token #4296

Merged: 2 commits, Oct 10, 2016

Conversation

@alangpierce (Contributor)

This is a fix in the CoffeeScript lexer that solved an issue in decaffeinate: decaffeinate/decaffeinate#371. The original PR was here: decaffeinate#4.

As far as I know, it has no effect on the actual CoffeeScript compiler (since I think
this location data is ignored), but you're welcome to incorporate it to make the
AST a little more sane if that's useful.

The change seems a little risky, but we've been using it for a while without any
trouble, and it makes the location data much nicer in some complicated cases.

Also note that this has a minor merge conflict with #4291: both PRs add a test in the
same place. Both tests probably make sense.

Here's the original commit message:

This commit adds another post-processing step after normal lexing that sets the
locationData on all OUTDENT tokens to be at the last character of the previous
token. This does feel like a little bit of a hack. Ideally the location data
would be set correctly in the first place and not in a post-processing step, but
I tried that and some temporary intermediate tokens were causing problems, so I
decided to set the location data once those intermediate tokens were removed.
Also, having this as a separate processing step makes it more robust and
isolated.
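
A minimal sketch of that idea (the helper name and exact shape are illustrative, not the actual diff in this PR), assuming tokens are the lexer's `[tag, value, locationData]` triples:

```coffee
# Walk the finished token stream and pin every OUTDENT to the last character
# of the token that precedes it. Both ends of the OUTDENT collapse onto that
# single position, since only last_line/last_column matter for node bounds.
fixOutdentLocationData = (tokens) ->
  for token, i in tokens when token[0] is 'OUTDENT' and i > 0
    prevLocationData = tokens[i - 1][2]
    token[2] =
      first_line:   prevLocationData.last_line
      first_column: prevLocationData.last_column
      last_line:    prevLocationData.last_line
      last_column:  prevLocationData.last_column
  tokens
```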

This fixes the problem in decaffeinate/decaffeinate#371.
In that issue, the CoffeeScript tokens had three OUTDENT tokens in a row, and
the last two overlapped with the `]`. Since at least one of those OUTDENT tokens
was considered part of the function body, the function expression had an ending
position just after the end of the `]`.

OUTDENT tokens are sort of a weird case in the lexer anyway, since they often
don't correspond to an actual location in the source code. It seems like the
code in lexer.coffee makes an attempt at finding a good place for them, but in
some cases, it has a bad result. This seems hard to avoid in the general case.
For example, in this code:

```coffee
[->
  a]
```

There must be an OUTDENT between the `a` and the `]`, but CoffeeScript tokens
have an inclusive start and end, so they must always be at least one character
wide (I think). In this case, the lexer was choosing the `]` as the location,
and the parser ended up generating correct location data, I believe because
it ignores the outermost INDENT and OUTDENT tokens. However, with multiple
OUTDENT tokens in a row, the parser ends up producing location data that is
wrong.
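
One way to see where the lexer places these tokens is to dump the token stream for the snippet above (the exact tags and positions depend on the compiler version; this is only an inspection aid):

```coffee
CoffeeScript = require 'coffee-script'

# Print each token's tag together with its (inclusive) location range.
for [tag, value, locationData] in CoffeeScript.tokens '[->\n  a]'
  {first_line, first_column, last_line, last_column} = locationData
  console.log tag, "#{first_line}:#{first_column}-#{last_line}:#{last_column}"
```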

It seems to me like there isn't a solid answer to "what location do OUTDENT
tokens have", since it hasn't mattered much, but for this commit, I'm defining
it: they always have the location of the last character of the previous token.
This should hopefully be fairly safe because tokens are still in the same order
relative to each other. Also, it's worth noting that this makes the start
location for OUTDENT tokens awkward. However, OUTDENT tokens are always used to
mark the end of something, so their `last_line` and `last_column` values are
always what matter when determining AST node bounds, so it is most important for
those to be correct.

@jashkenas (Owner)

Instead of adding a new Rewriter pass, can this be fixed by simply providing the OUTDENT tokens with the desired and correct position when they're first generated?

The follow-up commit's message reads:

Fixes decaffeinate/decaffeinate#446

In addition to OUTDENT tokens, CALL_END tokens can also be virtual tokens
without a real location, and sometimes they end up with a location that's
incorrect.
@alangpierce (Contributor, Author)

Sorry for taking a long time to respond.

I tried setting the OUTDENT locations to be correct when they're first made, but I ran into some issues:

- Currently it looks like token locations are always specified using the `offsetInChunk` parameter to `makeToken`. However, sometimes we would effectively need a negative offset. In one case I saw, the chunk itself was just `}`, but the end of the token needs to be at some character before the close-brace (since it's an inclusive end). This ran into some issues (see the toy sketch below):
  - The `getLineAndColumnFromChunk` method can't handle negative values, and changing it to do so looks really difficult, because it would need to find line and column information for code that has already been processed.
  - Even if it could handle negative values, finding the proper negative offset would be nontrivial, since the token list only holds line and column information, so it's hard to work out how many characters to move back.
- I tried bypassing the `makeToken` method, but I still ran into problems: intermediate TERMINATOR tokens that are later removed in the rewriter's `removeLeadingNewLines` step were getting in the way, so the "OUTDENTs always have the location of the previous token" rule wasn't working.
- I also thought about passing additional information to the `outdentToken` method, but the fact that it is called from three different places makes that harder and uglier.

It seems like with enough effort, it could be made to work, but the rewriter approach seemed a lot more straightforward. I agree that the rewriter approach feels ugly in that it's fixing up information that was generated incorrectly, but it's also nice in that it's isolated and doesn't add as much complexity to the rest of the code.
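
To illustrate the first bullet above in code (a toy model, not the real lexer): location data is derived from a non-negative offset into the chunk currently being lexed, so there is no way to express "one character before this chunk".

```coffee
# Toy stand-in for the lexer's offset-based location scheme. The real
# getLineAndColumnFromChunk works the same way in spirit: it only knows how to
# look forward from the start of the current chunk.
locationInChunk = (chunkStartLine, chunkStartColumn, chunk, offset) ->
  throw new Error "offset must be non-negative" if offset < 0
  before = chunk[...offset]
  newlines = (before.split '\n').length - 1
  if newlines is 0
    [chunkStartLine, chunkStartColumn + offset]
  else
    [chunkStartLine + newlines, before.length - before.lastIndexOf('\n') - 1]

# The chunk is just "}", but an OUTDENT should end one character earlier:
# locationInChunk 5, 2, '}', -1   # throws; the needed position lies outside the chunk
```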

Another thing I noticed recently is that CALL_END tokens have the same issue. I wrote up the details at decaffeinate/decaffeinate#446. I have a PR open against decaffeinate/coffeescript with just that fix: decaffeinate#6.

I'll rebase and update the diff so that it also moves implicit CALL_END tokens (ones that don't correspond to a close-paren in the source code) to the location of the previous token. Those aren't generated until the rewriter stage, which is another reason to fix up the location data in the rewriter.
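
A rough sketch of how the same kind of pass could cover both cases; the `generated` flag used below to detect implicit CALL_END tokens is an assumption for illustration, not necessarily how the rewriter actually marks them:

```coffee
# Snap every virtual token (OUTDENTs and implicit CALL_ENDs) onto the last
# character of the previous token, leaving explicit tokens untouched.
fixVirtualTokenLocationData = (tokens) ->
  for token, i in tokens when i > 0
    isVirtual = token[0] is 'OUTDENT' or
      (token[0] is 'CALL_END' and token.generated)  # assumed flag for implicit tokens
    continue unless isVirtual
    prev = tokens[i - 1][2]
    token[2] =
      first_line:   prev.last_line
      first_column: prev.last_column
      last_line:    prev.last_line
      last_column:  prev.last_column
  tokens
```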

@alangpierce force-pushed the move-outdents-to-previous-token branch from 14f0939 to 88693e4 on October 7, 2016, 04:37
@lydell (Collaborator) commented Oct 10, 2016

This is good stuff, and I think @alangpierce has made a good case for why this needs to be done as a rewriter pass. So I'm in favor of merging this.

@alangpierce Was there anything else you planned to do before merging, or are we good to go?

@alangpierce (Contributor, Author)

Nope, this should be good to go (my new commits had older timestamps, I guess, which is why they're showing up before my comment). Thanks!
