-
Notifications
You must be signed in to change notification settings - Fork 2k
Change OUTDENT tokens to be positioned at the end of the previous token #4296
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change OUTDENT tokens to be positioned at the end of the previous token #4296
Conversation
Instead of adding a new Rewriter pass, can this be fixed by simply providing the OUTDENT tokens with the desired and correct position when they're first generated? |
This commit adds another post-processing step after normal lexing that sets the locationData on all OUTDENT tokens to be at the last character of the previous token. This does feel like a little bit of a hack. Ideally the location data would be set correctly in the first place and not in a post-processing step, but I tried that and some temporary intermediate tokens were causing problems, so I decided to set the location data once those intermediate tokens were removed. Also, having this as a separate processing step makes it more robust and isolated. This fixes the problem in decaffeinate/decaffeinate#371 . In that issue, the CoffeeScript tokens had three OUTDENT tokens in a row, and the last two overlapped with the `]`. Since at least one of those OUTDENT tokens was considered part of the function body, the function expression had an ending position just after the end of the `]`. OUTDENT tokens are sort of a weird case in the lexer anyway, since they often don't correspond to an actual location in the source code. It seems like the code in `lexer.coffee` makes an attempt at finding a good place for them, but in some cases, it has a bad result. This seems hard to avoid in the general case. For example, in this code: ```coffee [-> a] ``` There must be an OUTDENT between the `a` and the `]`, but CoffeeScript tokens have an inclusive start and end, so they must always be at least one character wide (I think). In this case, the lexer was choosing the `]` as the location, and the parser ended up generating correct location data, I believe because it ignores the outermost INDENT and OUTDENT tokens. However, with multiple OUTDENT tokens in a row, the parser ends up producing location data that is wrong. It seems to me like there isn't a solid answer to "what location do OUTDENT tokens have", since it hasn't mattered much, but for this commit, I'm defining it: they always have the location of the last character of the previous token. This should hopefully be fairly safe because tokens are still in the same order relative to each other. Also, it's worth noting that this makes the start location for OUTDENT tokens awkward. However, OUTDENT tokens are always used to mark the end of something, so their `last_line` and `last_column` values are always what matter when determining AST node bounds, so it is most important for those to be correct.
Fixes decaffeinate/decaffeinate#446 In addition to OUTDENT tokens, CALL_END tokens can also be virtual tokens without a real location, and sometimes they end up with a location that's incorrect.
Sorry for taking a long time to respond. I tried setting the OUTDENT locations to be correct when they're first made, but I ran into some issues:
It seems like with enough effort, it could be made to work, but the rewriter approach seemed a lot more straightforward. I agree that the rewriter approach feels ugly in that it's fixing up information that was generated incorrectly, but it's also nice in that it's isolated and doesn't add as much complexity to the rest of the code. Another thing I noticed recently is that CALL_END tokens have the same issue. I wrote up the details at decaffeinate/decaffeinate#446 . I have a PR to decaffeinate/coffeescript open with just that fix: decaffeinate#6. I'll rebase and update the diff that also moves implicit CALL_END tokens (ones that don't correspond to a close-paren in the source code) to the location of the previous token. Those aren't generated until the rewriter stage, so that's another reason to fix up the location data in the rewriter stage. |
14f0939
to
88693e4
Compare
This is good stuff, and I think @alangpierce has motivated well why this needs to be done as a rewriter pass. So I'm in favor of merging this. @alangpierce Was there anything else you planned to do before merging, or are we good to go? |
Nope, this should be good to go (my new commits had older timestamps, I guess, which is why they're showing up before my comment). Thanks! |
This is a fix in the CoffeeScript lexer that solved an issue in decaffeinate: decaffeinate/decaffeinate#371. The original PR was here: decaffeinate#4.
As far as I know, it has no effect on the actual CoffeeScript compiler (since I think
this location data is ignored), but you're welcome to incorporate it to make the
AST a little more sane if that's useful.
The change seems a little risky, but we've been using it for a while without any
trouble, and it makes the location data much nicer in some complicated cases.
Also note that this has a minor merge conflict with #4291: this adds a test in the
same place that that other PR adds a test. Probably both tests make sense.
Here's the original commit message:
This commit adds another post-processing step after normal lexing that sets the
locationData on all OUTDENT tokens to be at the last character of the previous
token. This does feel like a little bit of a hack. Ideally the location data
would be set correctly in the first place and not in a post-processing step, but
I tried that and some temporary intermediate tokens were causing problems, so I
decided to set the location data once those intermediate tokens were removed.
Also, having this as a separate processing step makes it more robust and
isolated.
This fixes the problem in decaffeinate/decaffeinate#371 .
In that issue, the CoffeeScript tokens had three OUTDENT tokens in a row, and
the last two overlapped with the
]
. Since at least one of those OUTDENT tokenswas considered part of the function body, the function expression had an ending
position just after the end of the
]
.OUTDENT tokens are sort of a weird case in the lexer anyway, since they often
don't correspond to an actual location in the source code. It seems like the
code in
lexer.coffee
makes an attempt at finding a good place for them, but insome cases, it has a bad result. This seems hard to avoid in the general case.
For example, in this code:
[-> a]
There must be an OUTDENT between the
a
and the]
, but CoffeeScript tokenshave an inclusive start and end, so they must always be at least one character
wide (I think). In this case, the lexer was choosing the
]
as the location,and the parser ended up generating correct location data, I believe because
it ignores the outermost INDENT and OUTDENT tokens. However, with multiple
OUTDENT tokens in a row, the parser ends up producing location data that is
wrong.
It seems to me like there isn't a solid answer to "what location do OUTDENT
tokens have", since it hasn't mattered much, but for this commit, I'm defining
it: they always have the location of the last character of the previous token.
This should hopefully be fairly safe because tokens are still in the same order
relative to each other. Also, it's worth noting that this makes the start
location for OUTDENT tokens awkward. However, OUTDENT tokens are always used to
mark the end of something, so their
last_line
andlast_column
values arealways what matter when determining AST node bounds, so it is most important for
those to be correct.