Discuss: Keyword boundaries #2429
Comments
In the recent past the parser assumed any expression like |
Here is my issue with the prior behavior. You write:
contains: [
  { beginKeywords: "mud dirt gook" }
]
But what you really get [behavior] is essentially:
contains: [
  { begin: /(mud|dirt|gook)\./ }, // first eat up any keywords.
  { begin: /\.(mud|dirt|gook)/ }, // first eat up any .keywords
  { beginKeywords: "mud dirt gook" }
]
That solves the problem, but it's implicit rather than explicit, and I don't think the behavior is expected or predictable. There is no obvious reason why this shouldn't work:
contains: [
  { beginKeywords: "mud dirt gook" },
  { begin: ".dirt", className: "meta" }
]
So the first rule matches via the
Vs say:
contains: [
  hljs.METHOD_GUARD,
  { beginKeywords: "mud dirt gook" },
  { begin: ".dirt", className: "meta" } // the .dirt macro has special meaning
]
Now at least the problem is visible when one looks at what METHOD_GUARD does. I think what I'm getting at here is that hidden rules should never eat up content. And this was the only place in the whole parser where that was the case.
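To make the "hidden rules eat content" point concrete, here is a minimal sketch in plain JavaScript. It does not use highlight.js internals; the names `plainMatches` and `guardedMatches` are made up for illustration, and the guarded regex only approximates the implicit behavior described above.

```javascript
// Hypothetical stand-in for keyword matching; NOT the real highlight.js parser.
const KEYWORDS = ["mud", "dirt", "gook"];
const alternation = KEYWORDS.join("|");

// What `beginKeywords` looks like on paper: keywords between word boundaries.
const plain = new RegExp(`\\b(?:${alternation})\\b`, "g");

// Approximation of the implicit guards: `keyword.` and `.keyword` are
// swallowed first, and only the third alternative captures a real keyword.
const guarded = new RegExp(
  `(?:${alternation})\\.|\\.(?:${alternation})|\\b(${alternation})\\b`,
  "g"
);

function plainMatches(text) {
  return [...text.matchAll(plain)].map((m) => m[0]);
}

function guardedMatches(text) {
  // Matches without capture group 1 are content eaten by a guard.
  return [...text.matchAll(guarded)]
    .filter((m) => m[1] !== undefined)
    .map((m) => m[1]);
}

const sample = "dirt x.dirt mud.y gook";
console.log(plainMatches(sample));   // [ 'dirt', 'dirt', 'mud', 'gook' ]
console.log(guardedMatches(sample)); // [ 'dirt', 'gook' ]
```

In the guarded version the text `.dirt` has already been consumed by the time any later rule (such as the `.dirt` macro rule) gets a chance to look at it, which is the surprise being described.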
I'm not suggesting For example you might imagine:
And of course beginKeywords could default |
@egor-rogov @allejo Any thoughts? CC @marcoscaceres I'm surprised we have no tests for this behavior (when we sort it out we should add some)... but it wouldn't just be Java that is broken... any grammar relying on the previous magic We can fix the We could probably (short-term) add the previously problematic behavior (or some variant) back if we had to, but ugh. I'd love if there were some alternative suggestions. MANY MANY grammars use |
Another note, these boundaries (speaking of the magic
From the docs:
Whereas I'd wager that 99%+ of the time what you really want in ALL cases is actually Slight change of topic. The existing So actually, with All of them weird exceptions that could be thought about and handled if we were going to muck about with what keywords and boundaries meant on a deeper level.
IE, cases where a custom boundary would need to be defined since a word boundary doesn't exist between these symbols and whitespace (or other symbols). |
Further notes: The original
And the original
Then a series of commits to fix: The last one being when the |
I'm afraid it's all too deep for me to understand without spending days digging through the parser... |
In this case I'm not sure Java is a "fancy language"... I'll presume you're referring to the complexities of Clojure when you say "fancy". So if we want to try and start breaking this down to simpler/smaller questions:
One proposal:
keywordRules: [
  { name: "built_ins", begin: '\.', excludeBegin: true, list: "toUpperCase toLowerCase map filter ..." }
]
This would allow keyword matchers to be mini-modes, rather than just a single regex, which is actually a useful idea in its own right. This is also slightly annoying, though, in that it's ALMOST solvable with just lexemes if we had regex look-behind. I.e., you could just do:
{ lexemes: '\.\w+' }
Though you still need a way to say "don't highlight the .". This is probably a lot of effort, though, regarding legacy understanding and updating all existing grammars, etc.
Another idea:
This becomes a simple search and replace type operation, though I'm not 100% sure what to propose as better naming. |
I think I'd first like to think about what makes the most sense (trying to ignore legacy) if we weren't limited technically - and then see what that type of solution might look like. To me that means keywords need to be somehow more configurable in general. |
Actually this is a pretty great idea:
{ lexemes: '\.(\w+)' }
First capture group gets the highlight. Although most languages would need multiple
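A rough sketch of the "first capture group gets the highlight" idea, in plain JavaScript. Nothing here is an existing highlight.js API: `BUILT_INS`, `lexemes` and `highlightSpans` are hypothetical names, and the offset arithmetic assumes the capture group sits at the very end of the match, as it does for `\.(\w+)`.

```javascript
// Hypothetical illustration only; not how highlight.js handles lexemes today.
const BUILT_INS = new Set(["toUpperCase", "toLowerCase", "map", "filter"]);

// The lexeme pattern matches ".word" but only group 1 (the word) is highlighted.
const lexemes = /\.(\w+)/g;

function highlightSpans(text) {
  const spans = [];
  for (const m of text.matchAll(lexemes)) {
    const word = m[1];
    if (!BUILT_INS.has(word)) continue;
    // Assumes the capture group is the tail of the match, so the "." is skipped.
    const start = m.index + m[0].length - word.length;
    spans.push({ word, start, end: start + word.length });
  }
  return spans;
}

console.log(highlightSpans("map(xs).filter(f).toUpperCase.foo"));
// [ { word: 'filter', start: 8, end: 14 },
//   { word: 'toUpperCase', start: 18, end: 29 } ]
```

The leading `map` is not reported because it is not preceded by `.`, and the `.` itself never falls inside a span, so no look-behind is needed.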
This thread might be one of the longer lived ones. I'm going to circle back and see about finding some temp fix to re-add the old |
Is your request related to a specific problem you're having?
#2420
#2428
The solution you'd prefer / feature you'd like to see added...
I think perhaps languages need to be able to define their own keyword boundaries... we are considering adding `\b|[space]` to fix an issue with Clojure (because it allows `-` in keywords)... and then really (most of the time) keywords shouldn't be followed by `.` (then it's a method dispatch, not a keyword). Also, they really shouldn't be preceded by `.` either (method name), but that would require negative look-behind [though eventually we'll get that].
So as I'm considering adding these things into the CORE highlighter as keyword boundaries it strikes me that I'm making some big assumptions and quite possibly adding grammar-specific things to the core, which seems quite bad. If we support languages as diverse as Brainfuck then we should have as little "generic" knowledge of languages as possible baked into the core highlighter.
Even `\b` as a separator for ALL languages is wrong, as we're starting to see cracks with that already with Clojure, etc.
As languages can already configure `lexemes`, I think they also should be able to configure boundaries.
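As a sketch of what per-language boundaries could look like, assuming a grammar could hand the core a "before" and "after" pattern: the helper `keywordRegex` and its `before`/`after` options below are invented for illustration and are not an existing highlight.js option. The "after" boundary uses a look-ahead and the "before" boundary simply consumes one character, so no look-behind is required.

```javascript
// Hypothetical helper; `before` and `after` are NOT real highlight.js options.
function escapeRegex(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function keywordRegex(keywords, { before = "\\b", after = "\\b" } = {}) {
  const alternation = keywords.map(escapeRegex).join("|");
  // Group 1 is always the keyword itself, whatever the boundaries consume.
  return new RegExp(`${before}(${alternation})${after}`, "g");
}

function findKeywords(regex, text) {
  return [...text.matchAll(regex)].map((m) => m[1]);
}

const clojureKeywords = ["if", "if-let"];
const code = "(if-let [x (my-if y)] (if x 1 2))";

// Default \b boundaries: `-` counts as a boundary, so `if` is found
// inside both `if-let` and `my-if`.
console.log(findKeywords(keywordRegex(clojureKeywords), code));
// [ 'if', 'if', 'if' ]

// Language-defined boundaries: whitespace or brackets on either side.
const lispBoundaries = {
  before: "(?:^|[\\s()\\[\\]{}])",
  after: "(?=$|[\\s()\\[\\]{}])",
};
console.log(findKeywords(keywordRegex(clojureKeywords, lispBoundaries), code));
// [ 'if-let', 'if' ]
```

With the boundaries owned by the grammar, the core needs no built-in opinion about `-` or `.`; a language with method dispatch could supply different patterns.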