Musings on regex literals #4


Merged: 3 commits into swiftlang:main on Oct 7, 2021

Conversation

@milseman (Member) commented Oct 2, 2021

No description provided.

@milseman changed the title from "Musings on library-extensible regex literals" to "Musings on regex literals" on Oct 4, 2021
@milseman merged commit 6eefeae into swiftlang:main on Oct 7, 2021
@milseman deleted the literally_regex branch on October 7, 2021 at 23:14

- Go with a typical regex literal instead of something custom/nicer
+ Main reason for regex is familiarity and broad appeal
+ If we're building something custom, let's not do it on top of the shaky technical foundation that is regex

I'm not sure I'd describe regular expressions as having a "shaky technical foundation", FWIW. Sure, the common backtracking implementations possibly do, as they've added all kinds of ad-hoc extensions, but regular expressions per se are a compact notation for regular grammars and have a very solid foundation in formal language theory.

@milseman (Member, Author) replied:

I actually am referring to formal regular expressions and languages as the shaky foundation for this work. They're fine for computational linguistics and of course they have a rich history in formal languages, but there are 2 main ways in which they suck for software systems:

  1. They're generative
  2. Regular languages are not that interesting

Generative systems tend towards ambiguity and non-determinism. They are ill suited for writing parsers, as they don't describe how parsing happens but rather the set of acceptable parse inputs. They typically will need to be compiled and converted into a recognition system, and through this conversion a lot of complexity gets thrust on the programmer in practice (e.g. rules for disambiguation, how to plumb error handling and recovery through, how the thing actually runs, etc).
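
As a concrete illustration (a minimal sketch using Foundation's NSRegularExpression, not anything from this work): the pattern (a*)(a*) accepts the same strings no matter how the input is divided between the two groups, so the engine has to layer a rule on top, greediness in this case, to pick a single parse.

```swift
import Foundation

// "(a*)(a*)" can split "aaa" between its two groups in four different
// ways; the grammar alone doesn't say which. A backtracking engine
// resolves the ambiguity with a greediness rule: group 1 takes everything.
let regex = try! NSRegularExpression(pattern: "(a*)(a*)")
let input = "aaa"
let range = NSRange(input.startIndex..., in: input)
if let match = regex.firstMatch(in: input, range: range) {
    let g1 = Range(match.range(at: 1), in: input).map { String(input[$0]) } ?? ""
    let g2 = Range(match.range(at: 2), in: input).map { String(input[$0]) } ?? ""
    print("group 1: \"\(g1)\", group 2: \"\(g2)\"")  // group 1: "aaa", group 2: ""
}
```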

Regular languages aren't that interesting. They are too powerful to compose well: e.g. in a + b, any * in b interacts non-linearly with any * in a. Contrast this with an algebra defined over star-free languages (i.e. star-height 0), which allows for straightforward composition. And yet they're not powerful enough to discern the kinds of structures we commonly have in textual formats. Regular expressions as a formulation of regular languages are additionally problematic, as they can't sanely express some very simple finite automata that could be expressed as right-linear grammars. Also, you either don't have a complement operator, or complement doesn't do what one would normally think it does.

It makes sense that Unix pulled in regular expressions from formal language theory: they were powerful enough for the kinds of search over minimally-structured text that was common then, and they're terse. There also weren't really any non-generative systems available at the time. Though as soon as they were applied, how a match happens became important, so determinism had to be provided, and providing it can break the very properties of regular expressions. E.g., IIRC, POSIX's disambiguation rules make concatenation non-associative.

And of course it's not long before you want to recognize slightly-less-minimal structure, and you want to extend regular expressions for a role they very much were not designed for. Contrast with something like PEGs, which are already powerful enough to recognize context-free and some context-sensitive languages, and yet you can extend them with backreferences without making matching NP-complete.


Every single "feature" like a character class or some other meta-thingy, looks for a corresponding function definition on a conforming type. That is, we parse the regex, even providing intended semantics, while the conformer implements this. If the conformer doesn't provide a function definition, we generate a compilation error. Thus, conformers encode their feature set through ad-hoc function declarations, just like custom string interpolations.
Concern #2 is... **TBD**. One example is if you're trying to run with grapheme cluster semantics, scalar properties aren't available (at least, beyond the subset that Swift can meaningfully prescribe grapheme cluster semantics for). APIs probably need some way to enforce this statically (and/or dynamically with traps).

Another fun example where Unicode can be painful is if you have a character class like [A-Zß] and want to do case-insensitive matching. In an implementation I wrote previously, character classes like that were mapped to subexpressions like (?:[a-zß]|ss) when in case insensitive mode. Thus (?i)[a-zß]+ would match "straße", "STRASSE", "strasse" and even "StraSse".

This also creates fun with captures, e.g. consider (?i)fu(s)(s)ball matching against fußball. I made that work by having it match and then capturing the "ß" in group 1, with group 2 empty.
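
A toy sketch of that rewrite, with a hypothetical helper just to show the shape of the transformation; the folding table here only covers ß:

```swift
// Toy sketch of the rewrite described above; the helper is hypothetical
// and only shows the shape of the transformation.
func caseInsensitiveExpansion(ofClassBody body: String) -> String {
    // Full case folding can map one character to several ("ß" -> "ss"),
    // which a character class alone cannot express, so those foldings
    // get hoisted into an alternation around the class.
    let multiCharacterFoldings: [Character: String] = ["ß": "ss"]
    let extras = body.compactMap { multiCharacterFoldings[$0] }
    guard !extras.isEmpty else { return "[\(body)]" }
    return "(?:[\(body)]|\(extras.joined(separator: "|")))"
}

print(caseInsensitiveExpansion(ofClassBody: "a-zß"))  // (?:[a-zß]|ss)
```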

@milseman (Member, Author) replied:

Yup, the whole Unicode-is-universal or Unicode-is-algebraic thing kinda falls apart pretty quickly. The stdlib can provide some case conversion and case insensitivity, but we're starting to more formally delineate it as something that can only provide universal semantics and doesn't incorporate locale or large data such as language dictionaries for word breaking. Case conversion can change the count of a String for ß. But, we very much want to allow a higher level linguistic framework to use regex literals (with their specified feature subset) to do better matching.
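
For example, with nothing beyond the stdlib: uppercasing applies Unicode full case mapping, so ß becomes SS and the element count changes.

```swift
let street = "straße"
print(street.count)               // 6
print(street.uppercased())        // STRASSE
print(street.uppercased().count)  // 7
```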

We might go so far as to exclude anything that isn't ASCII from a regex literal character class range for the stdlib's conformance. Stdlib ordering is not particularly meaningful outside of use for programmer invariants. But you could imagine a framework conformer that does incorporate both locale and application domain (e.g. is this a German phonebook or a German dictionary? ... because it actually matters).
