Discuss: Keyword boundaries #2429

joshgoebel · 2020-02-25T22:38:06Z

Is your request related to a specific problem you're having?

The solution you'd prefer / feature you'd like to see added...

I think perhaps languages need to be able to define their own keyword boundaries... we are considering adding \b|[space] to fix an issue with Clojure (because it allows - in keywords)... and then really (most of the time) keywords shouldn't be followed by . (then it's a method dispatch, not a keyword). Also, they really shouldn't be proceeded by . either (method name), but that would require negative look-behind [though eventually we'll get that.

So as I'm considering adding these things into the CORE highlighter as keyword boundaries it strikes me that I'm making some big assumptions and quite possibly adding grammar specific things to the core, which seems quite bad. If we support languages as diverse as Brainfuck then we should have as little "generic" knowledge of languages as possible baked into the core highlighter.

Even \b as a separator for ALL languages is wrong as we're starting to see cracks with that already with Clojure, etc.

As languages can already configure lexemes I think they also should be able to configure boundaries.

return {
  lexemes: /[a-z-]+/, // allow `-`
  lexemesBoundary: /\b/,
  // or
  lexemesBoundaryStart: /\b/,
  lexemesBoundaryEnd: /(\b|(?=\s))(?!\.)/,
};

The text was updated successfully, but these errors were encountered:

joshgoebel · 2020-02-25T22:42:40Z

In the recent past the parser assumed any expression like .keyword or keyword. could not be a keyword... recent parser changes broke this behavior (see #2428) but I'm not sure that was correct core behavior to begin with (though it might be a reasonable default). Definitely there were issues with how it was implemented and the fact that it was completely undocumented.

joshgoebel · 2020-02-25T23:00:51Z

Here is my issue with the prior behavior . keyword behavior:

You write:

contains: [
  {beginKeywords: "mud dirt gook"}
]

But what you really get [behavior] is essentially:

contains: [
  { begin: /(mud|dirt|gook)\./ }, // first eat up any keywords.
  { begin: /\.(mud|dirt|gook)/ }, // first eat up any .keywords
  {beginKeywords: "mud dirt gook"
]

That solves the problem but it's implicit vs explicit and I don't think the behavior is expected/predicable. There is no obvious reason why this shouldn't work:

contains: [
  { beginKeywords: "mud dirt gook" },
  { begin: ".dirt", className: meta" } 
]

But it will fail, because the first mode will eat the . and advance the cursor 1 forward when it fails to find a match. Trial and error might teach you to put it first, but it's not clear at all WHY that would be so from reading the grammar itself. Actually the behavior may well have worked with the prior engine because of how it had to retry every regex after a match was found.

So the first rule matches via the terminate regex (which includes the .), but when retrying the first rule fails because this time the regex contains no dot (that's how the trick worked). So it would move on to the next rule, which would result in a match. This same behavior isn't available with the current engine though.

Vs say:

contains: [
  hljs.METHOD_GUARD,
  { beginKeywords: "mud dirt gook" },
  { begin: ".dirt", className: meta" } // the .dirt macro has special meaning
]

Now at least the problem is visible when one looks at what METHOD_GUARD does. I think what I'm getting at here is that hidden rules should never eat up content. And this was the only place in the whole parser where that was the case.

joshgoebel · 2020-02-25T23:41:15Z

I'm not suggesting METHOD_GUARD is how we should solve this. Both solutions are kludgy. Honestly you'd want this to 'just work'. But for the prefix case we really need negative look-behinds in order to do this all with regex. Otherwise we're left building our own custom negative look-behind stuff - and that makes letting users override it with regex (as the original idea suggests) much harder.

For example you might imagine:

contains: [
  {beginKeywords: "mud dirt gook", notAfterChar: "." }
]

And of course beginKeywords could default notAfterChar... but I hate to add more options to the grammars that we know are just going to be killed in a few years by Javascript moving forward. And actually this idea is broken too because of how we aggregate the rules into a single regex... there isn't a great way to dynamically filter out an given rule (to have it match but then not match and continue checking other rules)... the whole engine is based on the premise that regex is enough to get the job done.

joshgoebel · 2020-02-26T00:03:10Z

@egor-rogov @allejo Any thoughts? CC @marcoscaceres

I'm surprised we have no tests for this behavior (when we sort it out we should add some)... but it wouldn't just be Java that is broken... any grammar relying on the previous magic . beginKeywords behavior could now be broken (to varying degrees). The fact that no one has reported it until now is either a really good or really bad sign. :-/

We can fix the . postfix behavior with a simple negative lookahead if we want, but the sticky problem is handling the prefix case (. out front of a keyword).

We could probably (short-term) add the previously problematic behavior (or some variant) back if we had to, but ugh. I'd love if there were some alternative suggestions.

MANY MANY grammars use beginKeywords, only 2 use METHOD_GUARD presently.

joshgoebel · 2020-02-26T01:29:26Z

Another note, these boundaries (speaking of the magic . boundaries) have only ever been applied to beginKeywords... in fact I think that's a key differentiator from the normal keywords, but I wonder if that actually makes sense or if most people are aware of it. IE, keywords: "false" would highlight false in all the following lines:

.false.
.false
false.

From the docs:

Used instead of ``begin`` for modes starting with keywords to avoid needless repetition:

::
  {
    begin: '\\b(extends|implements) ',
    keywords: 'extends implements'
  }

Where-as I'd wager that 99%+ of the time what you really want in ALL cases is actually [boundary]keyword[boundary]. Where boundary included the "not dot" provision. The exception would be using keywords to match method dispatch, but you could do this with a custom boundary - and it would probably be better to specify which keywords were actual keywords and which were method names.

Slight change of topic. The existing keyword system already matches word boundaries since the default lexeme is \w+ and it's implied that the "character" immediately to either side of \w+ is going to be \b... When I made this explicit vs implicit (ie, applying it to ALL lexemes) only 5 tests broke.

So actually, with keywords you already have an opportunity to specify custom boundaries (by including them in the lexeme regex), so maybe that is a clue here... hmmm... Although that only works if other modes haven't eaten the before and after content. (since keywords are only processed inside of a rule's un-matched content - or between matches).

All of them weird exceptions that could be thought about and handled if we were going to much about with what keywords and boundaries meant on a deeper level.

.keyword
%keyword
%% (maxima)
@keyword
+ (scheme, it's debatable if this should even be a keyword)

IE, cases where a custom boundary would need to be defined since a word boundary doesn't exist between these symbols and whitespace (or other symbols).

joshgoebel · 2020-02-26T02:51:47Z

Further notes:

The original \b boundaries for beginKeywords were only added as a result of deficiencies in the original parsing engine (that we have since fixed): 9dca4cc

Wrap "begin" derived from "beginKeywords" in \b instead of doing it only for terminators. If "begin" doesn't have it testRe doesn't see it and can recognize a shorter "begin" (e.g. beginKeywords: 'def' passes testRe when beginKeywords: 'defmodule' is found by terminators).

And the original . fixes context:

beginKeywords picks up keywords after dots #345 beginKeywords picks up keywords after dots

Then a series of commits to fix:

Original fix: d83af73 and then revert cb87725
d794611
d510ac8

The last one being when the \. pinning on either side of the regex was added. So the behavior was definitely intentional, but still very much undocumented anywhere (that I can see).

egor-rogov · 2020-02-26T21:50:48Z

I'm afraid it's all too deep for me to understand without spending days digging through the parser...
I just want to say that most of the languages use simple alphabetic keywords. Of course it would be great to have a solution that "just works" all the time, but on the other hand It doesn't look totally unfair if you have to use some "tricks" for several fancy languages.

joshgoebel · 2020-02-27T13:25:16Z

In this case I'm not sure Java is a "fancy language"... I'll presume you're referring to the complexities of Clojure when you say "fancy".

So if we want to try and start breaking this down to simpler/smaller questions:

Should the prior behavior of beginKeywords excluding keywords starting with . and ending with . be preserved?
IF SO, isn't it strange that beginKeywords has entirely different behavior than keywords?
I think one problem is in many languages we use keywords specifically for method calls. But I'm not sure this is purely a docs issue. Though perhaps now this stuff is so "legacy" that it's too late to change it.

One proposal:

Make keywords and beginKeyword more similar, ie follow the same . rules.
Add a way to purposely say you're adding method keywords (don't get hung up on the syntax):

keywordRules: [
 { name: "built_ins", begin: '\.', excludeBegin: true, list: "toUpperCase toLowerCase map filter ..." }

This would allow keyword matchers to be mini-modes, rather than just a single regex, which is actually a useful idea in it's own right. This is also slightly annoying though in that it's ALMOST solvable with just lexemes if we had regex look-behind. IE, you could just do:

{ lexemes: '\.\w+` }

Though you still need a way to say "don't highlight the .".

This is probably a lot of effort though regarding legacy understanding and updating all existing grammars, etc.

Another idea:

deprecate beginKeyword and come up with a BETTER name and document how it's significantly different from keywords

This becomes a simple search and replace type operation, though I'm not 100% sure what to propose as better naming.

joshgoebel · 2020-02-27T13:28:35Z

I think I'd first like to think about what makes the most sense (trying to ignore legacy) if we weren't limited technically - and then see what that type of solution might look like. To me that means keywords need to be somehow more configurable in general.

joshgoebel · 2020-02-27T13:37:53Z

Actually this is a pretty great idea:

{ lexemes: '\.(\w+)` }

First capture group gets the highlight. Although most languages would need multiple lexemes expressions, so you could specify different matches for different types of keywords.

joshgoebel · 2020-02-27T13:42:47Z

This thread might be one of the longer lived ones. I'm going to circle back and see about finding some temp fix to re-add the old . behavior to beginKeyword for now, but long-term still think there is a lot of room for improvement with regard to keywords.

joshgoebel added enhancement An enhancement or new feature parser labels Feb 25, 2020

joshgoebel mentioned this issue Feb 27, 2020

(bug) Fix beginKeywords to ignore . matches #2434

Merged

joshgoebel mentioned this issue Sep 5, 2020

bug(bash) Options (like -lt etc) just don't work and it looks like they never did. #2673

Closed

joshgoebel added the discuss/propose Proposal for a new feature/direction label Jan 7, 2021

joshgoebel added the autoclose Flag things to future autoclose. label Mar 25, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Discuss: Keyword boundaries #2429

Discuss: Keyword boundaries #2429

joshgoebel commented Feb 25, 2020 •

edited

Loading

joshgoebel commented Feb 25, 2020 •

edited

Loading

joshgoebel commented Feb 25, 2020 •

edited

Loading

joshgoebel commented Feb 25, 2020 •

edited

Loading

joshgoebel commented Feb 26, 2020 •

edited

Loading

joshgoebel commented Feb 26, 2020 •

edited

Loading

joshgoebel commented Feb 26, 2020 •

edited

Loading

egor-rogov commented Feb 26, 2020

joshgoebel commented Feb 27, 2020 •

edited

Loading

joshgoebel commented Feb 27, 2020

joshgoebel commented Feb 27, 2020 •

edited

Loading

joshgoebel commented Feb 27, 2020

Discuss: Keyword boundaries #2429

Discuss: Keyword boundaries #2429

Comments

joshgoebel commented Feb 25, 2020 • edited Loading

joshgoebel commented Feb 25, 2020 • edited Loading

joshgoebel commented Feb 25, 2020 • edited Loading

joshgoebel commented Feb 25, 2020 • edited Loading

joshgoebel commented Feb 26, 2020 • edited Loading

joshgoebel commented Feb 26, 2020 • edited Loading

joshgoebel commented Feb 26, 2020 • edited Loading

egor-rogov commented Feb 26, 2020

joshgoebel commented Feb 27, 2020 • edited Loading

joshgoebel commented Feb 27, 2020

joshgoebel commented Feb 27, 2020 • edited Loading

joshgoebel commented Feb 27, 2020

joshgoebel commented Feb 25, 2020 •

edited

Loading

joshgoebel commented Feb 25, 2020 •

edited

Loading

joshgoebel commented Feb 25, 2020 •

edited

Loading

joshgoebel commented Feb 25, 2020 •

edited

Loading

joshgoebel commented Feb 26, 2020 •

edited

Loading

joshgoebel commented Feb 26, 2020 •

edited

Loading

joshgoebel commented Feb 26, 2020 •

edited

Loading

joshgoebel commented Feb 27, 2020 •

edited

Loading

joshgoebel commented Feb 27, 2020 •

edited

Loading