This repository was archived by the owner on May 29, 2023. It is now read-only.

Commit 783f57a

raphlinus authored and davelab6 committed

Moved to new location (#52)

1 parent 8b6ea5c commit 783f57a

File tree

1 file changed: +1 -147 lines changed

README.md (+1 -147)
@@ -1,154 +1,8 @@
# fancy-regex

A Rust library for compiling and matching regular expressions. It uses a hybrid
regex implementation designed to support a relatively rich set of features.
In particular, it uses backtracking to implement "fancy" features such as
look-around and backreferences, which are not supported in purely
NFA-based implementations (exemplified by
[RE2](https://github.com/google/re2), and implemented in Rust in the
[regex](https://crates.io/crates/regex) crate).

[![crate](https://img.shields.io/crates/v/fancy-regex.svg)](https://crates.io/crates/fancy-regex)
[![build status](https://travis-ci.org/google/fancy-regex.svg?branch=master)](https://travis-ci.org/google/fancy-regex)
[![codecov](https://codecov.io/gh/google/fancy-regex/branch/master/graph/badge.svg)](https://codecov.io/gh/google/fancy-regex)

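As a quick taste (a minimal sketch, assuming the `Regex::new` and fallible
`is_match` of the published fancy-regex crate; names may differ slightly in
this snapshot), a look-behind pattern that the pure-NFA regex crate rejects
compiles and matches directly:

```rust
use fancy_regex::Regex;

fn main() {
    // Look-behind is a "fancy" feature: match "world" only when it is
    // preceded by "hello ".
    let re = Regex::new(r"(?<=hello )world").unwrap();

    // Matching returns a Result rather than a bare bool, because errors
    // can also occur at match time.
    assert!(re.is_match("hello world").unwrap());
    assert!(!re.is_match("goodbye world").unwrap());
}
```
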
A goal is to be as efficient as possible. For a given regex, the NFA
implementation has asymptotic running time linear in the length of the
input, while in the general case a backtracking implementation has
exponential blowup. An example given in [Static Analysis for Regular
Expression Exponential Runtime via Substructural
Logics](https://www.cs.bham.ac.uk/~hxt/research/redos_full.pdf) is:

```python
import re
re.compile('(a|b|ab)*bc').match('ab' * 28 + 'ac')
```

In Python (tested on both 2.7 and 3.5), this match takes 91s, and
doubles for each additional repeat of 'ab'.

Thus, many proponents
[advocate](https://swtch.com/~rsc/regexp/regexp1.html) a purely NFA
(nondeterministic finite automaton) based approach. Even so,
backreferences and look-around do add richness to regexes, and they
are commonly used in applications such as syntax highlighting for text
editors. In particular, TextMate's [syntax
definitions](https://manual.macromates.com/en/language_grammars),
based on the [Oniguruma](https://github.com/kkos/oniguruma)
backtracking engine, are now used in a number of other popular
editors, including Sublime Text and Atom. These syntax definitions
routinely use backreferences and look-around. For example, the
following regex captures a single-line Rust raw string:

```
r(#*)".*?"\1
```

There is no NFA that can express this simple and useful pattern. Yet,
a backtracking implementation handles it efficiently.

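As an illustration (a sketch against the current crates.io fancy-regex API,
which postdates this snapshot; `find` and `Match::as_str` are assumed here),
the pattern picks a raw string out of a line of Rust source:

```rust
use fancy_regex::Regex;

fn main() {
    // `\1` refers back to the captured run of '#' signs, so the closing
    // delimiter must repeat the opening one exactly.
    let re = Regex::new(r#"r(#*)".*?"\1"#).unwrap();

    let line = r###"let s = r##"a "raw" string"##;"###;
    let m = re.find(line).unwrap().expect("raw string should be found");
    assert_eq!(m.as_str(), r###"r##"a "raw" string"##"###);
}
```
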
This package is one of the first that handles both cases well. The
exponential blowup case above runs in 258ns. Thus, it should be a
very appealing alternative for applications that require both richness
and performance.

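For comparison with the Python snippet above (a sketch assuming the published
crate's `Regex::new`/`is_match`; exact numbers will vary by machine and
version), the same pathological input can be tried directly:

```rust
use fancy_regex::Regex;
use std::time::Instant;

fn main() {
    // No backreferences or look-around here, so the whole pattern can be
    // delegated to the inner NFA engine and checked in linear time.
    let re = Regex::new("(a|b|ab)*bc").unwrap();
    let haystack = "ab".repeat(28) + "ac";

    let start = Instant::now();
    let matched = re.is_match(&haystack).unwrap();
    println!("matched: {} in {:?}", matched, start.elapsed());
}
```
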
## A warning about worst-case performance

NFA-based approaches give strong guarantees about worst-case
performance. For regexes that contain "fancy" features such as
backreferences and look-around, this module gives no corresponding
guarantee. If an attacker can control the regular expressions that
will be matched against, they will be able to successfully mount a
denial-of-service attack. Be warned.

See [PERFORMANCE.md](PERFORMANCE.md) for some examples.

## A hybrid approach

One workable approach is to detect the presence of "fancy" features,
and choose either an NFA implementation or a backtracker depending on
whether they are used.

However, this module attempts to be more fine-grained. Instead, it
implements a true hybrid approach. In essence, it is a backtracking VM
(as is well explained in [Regular Expression Matching: the Virtual
Machine Approach](https://swtch.com/~rsc/regexp/regexp2.html)) in
which one of the "instructions" in the VM delegates to an inner NFA
implementation (in Rust, the regex crate, though a similar approach
would certainly be possible using RE2 or the Go
[regexp](https://golang.org/pkg/regexp/) package). Then there is an
analysis which decides, for each subexpression, whether it is "hard" or
whether it can be delegated to the NFA matcher. At the moment, the
analysis is eager and delegates as much as possible to the NFA engine.

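To make the shape of that concrete, here is a hypothetical, heavily simplified
instruction set for such a VM (illustrative only; these are not the crate's
actual types, and the `Delegate` variant assumes the regex crate as the inner
engine):

```rust
// Hypothetical instruction set for a backtracking regex VM in which one
// instruction hands an "easy" subexpression to an inner NFA engine
// (the `regex` crate here, so this sketch needs `regex` as a dependency).
enum Insn {
    Char(char),             // match one literal character
    Split(usize, usize),    // try the first target, backtrack to the second
    Jmp(usize),             // unconditional jump
    Save(usize),            // record a capture-group boundary
    Backref(usize),         // re-match the text captured by a group
    Delegate(regex::Regex), // run the inner NFA engine on a delegated piece
    End,                    // report a successful match
}
```
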
## Theory

**(This section is written in a somewhat informal style; I hope to
expand on it)**

The fundamental idea is that it's a backtracking VM like PCRE, but as
much as possible it delegates to an "inner" RE engine like RE2 (in
this case, the Rust one). For the sublanguage not using fancy
features, the library becomes a thin wrapper.

Otherwise, you do an analysis to figure out what you can delegate and
what you have to backtrack. I was thinking it might be tricky, but
it's actually quite simple. In the first phase, you just label each
subexpression as "hard" (groups that get referenced in a backref,
look-around, etc), and bubble that up. You also do a little extra
analysis, mostly determining whether an expression has constant match
length, and the minimum length.

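In code, that first pass might look roughly like this (an illustrative sketch
over a toy syntax tree; the crate's real expression type and analysis are more
detailed):

```rust
// Toy syntax tree, just enough to show the analysis.
enum Expr {
    Literal(String),
    Concat(Vec<Expr>),
    Alt(Vec<Expr>),
    Group(Box<Expr>, bool), // bool: is this group referenced by a backref?
    Backref(usize),
    LookAround(Box<Expr>),
}

/// Bottom-up labeling: a node is "hard" if it uses a fancy feature itself
/// or contains a hard child, so hardness bubbles up the tree.
fn is_hard(e: &Expr) -> bool {
    match e {
        Expr::Literal(_) => false,
        Expr::Concat(children) | Expr::Alt(children) => children.iter().any(is_hard),
        Expr::Group(inner, referenced) => *referenced || is_hard(inner),
        Expr::Backref(_) | Expr::LookAround(_) => true,
    }
}

/// The extra analysis: the match length of a node if it is constant, else None.
fn const_match_len(e: &Expr) -> Option<usize> {
    match e {
        Expr::Literal(s) => Some(s.chars().count()),
        Expr::Concat(children) => children.iter().map(const_match_len).sum(),
        Expr::Alt(children) => {
            let first = const_match_len(children.first()?)?;
            children[1..]
                .iter()
                .all(|c| const_match_len(c) == Some(first))
                .then_some(first)
        }
        Expr::Group(inner, _) => const_match_len(inner),
        Expr::LookAround(_) => Some(0), // zero-width
        Expr::Backref(_) => None,       // depends on what the group matched
    }
}
```
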
The second phase is top-down, and you carry a context, which is also a
boolean indicating whether it's "hard" or not. Intuitively, a hard
context is one in which the match length will affect future
backtracking.

If the subexpression is easy and the context is easy, generate an
instruction in the VM that delegates to the inner NFA implementation.
Otherwise, generate VM code as in a backtracking engine. Most
expression nodes are pretty straightforward; the only interesting case
is concat (a sequence of subexpressions).

Even that one is not terribly complex. First, determine a prefix of
easy nodes of constant match length (this won't affect backtracking,
so it's safe to delegate to NFA). Then, if your context is easy,
determine a suffix of easy nodes. Both of these delegate to NFA. For
the ones in between, recursively compile. In an easy context, the last
of these also gets an easy context; everything else is generated in a
hard context. So, conceptually, hard context flows from right to left,
and from parents to children.

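A sketch of that splitting decision for concat, building on the toy `Expr`,
`is_hard`, and `const_match_len` from the sketch above (again illustrative;
`Plan` and `plan_concat` are hypothetical names, not the crate's compiler):

```rust
// What to do with each piece of a concat node.
enum Plan<'a> {
    Delegate(&'a [Expr]),      // hand this run of children to the inner NFA engine
    Backtrack(&'a Expr, bool), // compile with the backtracking VM; bool = hard context
}

fn plan_concat(children: &[Expr], hard_context: bool) -> Vec<Plan<'_>> {
    let mut plan = Vec::new();

    // 1. A prefix of easy children with constant match length cannot affect
    //    backtracking, so it is safe to delegate.
    let mut i = 0;
    while i < children.len()
        && !is_hard(&children[i])
        && const_match_len(&children[i]).is_some()
    {
        i += 1;
    }

    // 2. In an easy context, an easy suffix can be delegated as well.
    let mut j = children.len();
    while !hard_context && j > i && !is_hard(&children[j - 1]) {
        j -= 1;
    }

    if i > 0 {
        plan.push(Plan::Delegate(&children[..i]));
    }
    // 3. The children in between are compiled recursively. Hard context flows
    //    from right to left: only the last of them may keep an easy context.
    for (k, child) in children[i..j].iter().enumerate() {
        let last = i + k == j - 1;
        plan.push(Plan::Backtrack(child, hard_context || !last));
    }
    if j < children.len() {
        plan.push(Plan::Delegate(&children[j..]));
    }
    plan
}
```
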
## Current status

Still in development, though the basic ideas are in place. Currently,
the following features are missing:

* Support for named captures (including in the API)

* The following regex language features are not yet implemented:

  * Atomic groups

  * Procedure calls and recursive expressions

## Acknowledgements

Many thanks to [Andrew Gallant](http://blog.burntsushi.net/about/) for
stimulating conversations that inspired this approach, as well as for
creating the excellent regex crate.

## Authors

The main author is Raph Levien.

## Contributions

We gladly accept contributions via GitHub pull requests, as long as the author
has signed the Google Contributor License. Please see CONTRIBUTIONS.md for
more details.

Development of this project has moved to [fancy-regex/fancy-regex](https://github.com/fancy-regex/fancy-regex).

### Disclaimer

This is not an official Google product (experimental or otherwise), it
is just code that happens to be owned by Google.
