Musings on regex literals #4


Merged: 3 commits into swiftlang:main on Oct 7, 2021

Conversation

@milseman (Member) commented Oct 2, 2021

No description provided.

@milseman changed the title from "Musings on library-extensible regex literals" to "Musings on regex literals" on Oct 4, 2021
@milseman merged commit 6eefeae into swiftlang:main on Oct 7, 2021
@milseman deleted the literally_regex branch on October 7, 2021 at 23:14

- Go with a typical regex literal instead of something custom/nicer
+ Main reason for regex is familiarity and broad appeal
+ If we're building something custom, let's not do it on top of the shaky technical foundation that is regex

I'm not sure I'd describe regular expressions as having a "shaky technical foundation", FWIW. Sure, the common backtracking implementations possibly do, as they've added all kinds of ad-hoc extensions, but regular expressions per se are a compact notation for regular grammars and have a very solid foundation in formal language theory.

@milseman (Member, Author) replied:

I actually am referring to formal regular expressions and languages as the shaky foundation for this work. They're fine for computational linguistics and of course they have a rich history in formal languages, but there are 2 main ways in which they suck for software systems:

  1. They're generative
  2. Regular languages are not that interesting

Generative systems tend towards ambiguity and non-determinism. They are ill suited for writing parsers, as they don't describe how parsing happens but rather the set of acceptable parse inputs. They typically will need to be compiled and converted into a recognition system, and through this conversion a lot of complexity gets thrust on the programmer in practice (e.g. rules for disambiguation, how to plumb error handling and recovery through, how the thing actually runs, etc).
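
As a concrete illustration (a minimal sketch using Foundation's NSRegularExpression, not anything from this work): the pattern (a*)(a*) accepts the same strings no matter how the input is divided between the two groups, so the engine has to layer a rule on top, greediness in this case, to pick a single parse.

```swift
import Foundation

// "(a*)(a*)" can split "aaa" between its two groups in four different
// ways; the grammar alone doesn't say which. A backtracking engine
// resolves the ambiguity with a greediness rule: group 1 takes everything.
let regex = try! NSRegularExpression(pattern: "(a*)(a*)")
let input = "aaa"
let range = NSRange(input.startIndex..., in: input)
if let match = regex.firstMatch(in: input, range: range) {
    let g1 = Range(match.range(at: 1), in: input).map { String(input[$0]) } ?? ""
    let g2 = Range(match.range(at: 2), in: input).map { String(input[$0]) } ?? ""
    print("group 1: \"\(g1)\", group 2: \"\(g2)\"")  // group 1: "aaa", group 2: ""
}
```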

Regular languages aren't that interesting. They are too powerful to compose well: e.g. in a + b, any * in b interacts non-linearly with any * in a. Contrast this with an algebra defined over star-free languages (i.e. star-height 0), which allows for straightforward composition. And yet they're not powerful enough to discern the kinds of structures we commonly have in textual formats. Regular expressions as a formulation of regular languages are additionally problematic, as they can't sanely express some very simple finite automata that could be expressed as right-linear grammars. Also, you either don't have a complement operator, or complement doesn't do what one would normally think it does.

It makes sense that Unix pulled in regular expressions from formal language theory: they were powerful enough for the kinds of search over minimally-structured text that was common then, and they're terse. There also weren't really any non-generative systems available at the time. Though as soon as they were applied, how a match happens became important, so determinism had to be provided, and providing it can break the very properties of regular expressions. E.g., IIRC, POSIX's disambiguation rules make concatenation non-associative.

And of course it's not long before you want to recognize slightly-less-minimal structure, and you want to extend regular expressions for a role they very much were not designed for. Contrast with something like PEGs, which are already powerful enough to recognize context-free and some context-sensitive languages, and yet you can extend them with backreferences without making matching NP-complete.


Every single "feature" like a character class or some other meta-thingy, looks for a corresponding function definition on a conforming type. That is, we parse the regex, even providing intended semantics, while the conformer implements this. If the conformer doesn't provide a function definition, we generate a compilation error. Thus, conformers encode their feature set through ad-hoc function declarations, just like custom string interpolations.
Concern #2 is... **TBD**. One example is if you're trying to run with grapheme cluster semantics, scalar properties aren't available (at least, beyond the subset that Swift can meaningfully prescribe grapheme cluster semantics for). APIs probably need some way to enforce this statically (and/or dynamically with traps).

Another fun example where Unicode can be painful is if you have a character class like [A-Zß] and want to do case-insensitive matching. In an implementation I wrote previously, character classes like that were mapped to subexpressions like (?:[a-zß]|ss) when in case insensitive mode. Thus (?i)[a-zß]+ would match "straße", "STRASSE", "strasse" and even "StraSse".

This also creates fun with captures, e.g. consider (?i)fu(s)(s)ball matching against fußball. I made that work by having it match and then capturing the "ß" in group 1, with group 2 empty.
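
A toy sketch of that rewrite, with a hypothetical helper just to show the shape of the transformation; the folding table here only covers ß:

```swift
// Toy sketch of the rewrite described above; the helper is hypothetical
// and only shows the shape of the transformation.
func caseInsensitiveExpansion(ofClassBody body: String) -> String {
    // Full case folding can map one character to several ("ß" -> "ss"),
    // which a character class alone cannot express, so those foldings
    // get hoisted into an alternation around the class.
    let multiCharacterFoldings: [Character: String] = ["ß": "ss"]
    let extras = body.compactMap { multiCharacterFoldings[$0] }
    guard !extras.isEmpty else { return "[\(body)]" }
    return "(?:[\(body)]|\(extras.joined(separator: "|")))"
}

print(caseInsensitiveExpansion(ofClassBody: "a-zß"))  // (?:[a-zß]|ss)
```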

@milseman (Member, Author) replied:

Yup, the whole Unicode-is-universal or Unicode-is-algebraic thing kinda falls apart pretty quickly. The stdlib can provide some case conversion and case insensitivity, but we're starting to more formally delineate it as something that can only provide universal semantics and doesn't incorporate locale or large data such as language dictionaries for word breaking. Case conversion can change the count of a String for ß. But, we very much want to allow a higher level linguistic framework to use regex literals (with their specified feature subset) to do better matching.
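
For example, with nothing beyond the stdlib: uppercasing applies Unicode full case mapping, so ß becomes SS and the element count changes.

```swift
let street = "straße"
print(street.count)               // 6
print(street.uppercased())        // STRASSE
print(street.uppercased().count)  // 7
```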

We might go so far as to exclude anything that isn't ASCII from a regex literal character class range for the stdlib's conformance. Stdlib ordering is not particularly meaningful outside of use for programmer invariants. But you could imagine a framework conformer that does incorporate both locale and application domain (e.g. is this a German phonebook or a German dictionary? ... because it actually matters).
