Change OUTDENT tokens to be positioned at the end of the previous token #4296

Merged: 2 commits, Oct 10, 2016

Conversation

@alangpierce (Contributor)

This is a fix in the CoffeeScript lexer that solved an issue in decaffeinate: decaffeinate/decaffeinate#371. The original PR was here: decaffeinate#4.

As far as I know, it has no effect on the actual CoffeeScript compiler (since I think
this location data is ignored), but you're welcome to incorporate it to make the
AST a little more sane if that's useful.

The change seems a little risky, but we've been using it for a while without any
trouble, and it makes the location data much nicer in some complicated cases.

Also note that this has a minor merge conflict with #4291: both PRs add a test in the
same place. Both tests probably make sense.

Here's the original commit message:

This commit adds another post-processing step after normal lexing that sets the
locationData on all OUTDENT tokens to be at the last character of the previous
token. This does feel like a little bit of a hack. Ideally the location data
would be set correctly in the first place and not in a post-processing step, but
I tried that and some temporary intermediate tokens were causing problems, so I
decided to set the location data once those intermediate tokens were removed.
Also, having this as a separate processing step makes it more robust and
isolated.
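
A minimal sketch of that idea (the helper name and exact shape are illustrative, not the actual diff in this PR), assuming tokens are the lexer's `[tag, value, locationData]` triples:

```coffee
# Walk the finished token stream and pin every OUTDENT to the last character
# of the token that precedes it. Both ends of the OUTDENT collapse onto that
# single position, since only last_line/last_column matter for node bounds.
fixOutdentLocationData = (tokens) ->
  for token, i in tokens when token[0] is 'OUTDENT' and i > 0
    prevLocationData = tokens[i - 1][2]
    token[2] =
      first_line:   prevLocationData.last_line
      first_column: prevLocationData.last_column
      last_line:    prevLocationData.last_line
      last_column:  prevLocationData.last_column
  tokens
```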

This fixes the problem in decaffeinate/decaffeinate#371.
In that issue, the CoffeeScript tokens had three OUTDENT tokens in a row, and
the last two overlapped with the `]`. Since at least one of those OUTDENT tokens
was considered part of the function body, the function expression had an ending
position just after the end of the `]`.

OUTDENT tokens are sort of a weird case in the lexer anyway, since they often
don't correspond to an actual location in the source code. It seems like the
code in lexer.coffee makes an attempt at finding a good place for them, but in
some cases, it has a bad result. This seems hard to avoid in the general case.
For example, in this code:

```coffee
[->
  a]
```

There must be an OUTDENT between the `a` and the `]`, but CoffeeScript tokens
have an inclusive start and end, so they must always be at least one character
wide (I think). In this case, the lexer was choosing the `]` as the location,
and the parser ended up generating correct location data, I believe because
it ignores the outermost INDENT and OUTDENT tokens. However, with multiple
OUTDENT tokens in a row, the parser ends up producing location data that is
wrong.
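
One way to see where the lexer places these tokens is to dump the token stream for the snippet above (the exact tags and positions depend on the compiler version; this is only an inspection aid):

```coffee
CoffeeScript = require 'coffee-script'

# Print each token's tag together with its (inclusive) location range.
for [tag, value, locationData] in CoffeeScript.tokens '[->\n  a]'
  {first_line, first_column, last_line, last_column} = locationData
  console.log tag, "#{first_line}:#{first_column}-#{last_line}:#{last_column}"
```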

It seems to me like there isn't a solid answer to "what location do OUTDENT
tokens have", since it hasn't mattered much, but for this commit, I'm defining
it: they always have the location of the last character of the previous token.
This should hopefully be fairly safe because tokens are still in the same order
relative to each other. Also, it's worth noting that this makes the start
location for OUTDENT tokens awkward. However, OUTDENT tokens are always used to
mark the end of something, so their `last_line` and `last_column` values are
always what matter when determining AST node bounds, so it is most important for
those to be correct.

@jashkenas (Owner)

Instead of adding a new Rewriter pass, can this be fixed by simply providing the OUTDENT tokens with the desired and correct position when they're first generated?

The follow-up commit's message reads:

Fixes decaffeinate/decaffeinate#446

In addition to OUTDENT tokens, CALL_END tokens can also be virtual tokens
without a real location, and sometimes they end up with a location that's
incorrect.
@alangpierce (Contributor, Author)

Sorry for taking a long time to respond.

I tried setting the OUTDENT locations to be correct when they're first made, but I ran into some issues:

- Currently it looks like token locations are always specified using the `offsetInChunk` parameter to `makeToken`. However, sometimes we would effectively need a negative offset. In one case I saw, the chunk itself was just `}`, but the end of the token needs to be at some character before the close-brace (since it's an inclusive end). This ran into some issues (see the toy sketch below):
  - The `getLineAndColumnFromChunk` method can't handle negative values, and changing it to do so looks really difficult, because it would need to find line and column information for code that has already been processed.
  - Even if it could handle negative values, finding the proper negative offset would be nontrivial, since the token list only holds line and column information, so it's hard to work out how many characters to move back.
- I tried bypassing the `makeToken` method, but I still ran into problems: intermediate TERMINATOR tokens that are later removed in the rewriter's `removeLeadingNewLines` step were getting in the way, so the "OUTDENTs always have the location of the previous token" rule wasn't working.
- I also thought about passing additional information to the `outdentToken` method, but the fact that it is called from three different places makes that harder and uglier.

It seems like with enough effort, it could be made to work, but the rewriter approach seemed a lot more straightforward. I agree that the rewriter approach feels ugly in that it's fixing up information that was generated incorrectly, but it's also nice in that it's isolated and doesn't add as much complexity to the rest of the code.
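
To illustrate the first bullet above in code (a toy model, not the real lexer): location data is derived from a non-negative offset into the chunk currently being lexed, so there is no way to express "one character before this chunk".

```coffee
# Toy stand-in for the lexer's offset-based location scheme. The real
# getLineAndColumnFromChunk works the same way in spirit: it only knows how to
# look forward from the start of the current chunk.
locationInChunk = (chunkStartLine, chunkStartColumn, chunk, offset) ->
  throw new Error "offset must be non-negative" if offset < 0
  before = chunk[...offset]
  newlines = (before.split '\n').length - 1
  if newlines is 0
    [chunkStartLine, chunkStartColumn + offset]
  else
    [chunkStartLine + newlines, before.length - before.lastIndexOf('\n') - 1]

# The chunk is just "}", but an OUTDENT should end one character earlier:
# locationInChunk 5, 2, '}', -1   # throws; the needed position lies outside the chunk
```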

Another thing I noticed recently is that CALL_END tokens have the same issue. I wrote up the details at decaffeinate/decaffeinate#446. I have a PR open against decaffeinate/coffeescript with just that fix: decaffeinate#6.

I'll rebase and update the diff so that it also moves implicit CALL_END tokens (ones that don't correspond to a close-paren in the source code) to the location of the previous token. Those aren't generated until the rewriter stage, which is another reason to fix up the location data in the rewriter.
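
A rough sketch of how the same kind of pass could cover both cases; the `generated` flag used below to detect implicit CALL_END tokens is an assumption for illustration, not necessarily how the rewriter actually marks them:

```coffee
# Snap every virtual token (OUTDENTs and implicit CALL_ENDs) onto the last
# character of the previous token, leaving explicit tokens untouched.
fixVirtualTokenLocationData = (tokens) ->
  for token, i in tokens when i > 0
    isVirtual = token[0] is 'OUTDENT' or
      (token[0] is 'CALL_END' and token.generated)  # assumed flag for implicit tokens
    continue unless isVirtual
    prev = tokens[i - 1][2]
    token[2] =
      first_line:   prev.last_line
      first_column: prev.last_column
      last_line:    prev.last_line
      last_column:  prev.last_column
  tokens
```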

@alangpierce force-pushed the move-outdents-to-previous-token branch from 14f0939 to 88693e4 on October 7, 2016, 04:37
@lydell (Collaborator) commented Oct 10, 2016

This is good stuff, and I think @alangpierce has made a good case for why this needs to be done as a rewriter pass. So I'm in favor of merging this.

@alangpierce Was there anything else you planned to do before merging, or are we good to go?

@alangpierce (Contributor, Author)

Nope, this should be good to go (my new commits had older timestamps, I guess, which is why they're showing up before my comment). Thanks!
