This patch adds a onepass matcher, which is a DFA that
has all the abilities of an NFA! There are lots
of expressions that a onepass matcher can't handle, namely
those cases where a regex contains non-determinism.
The general approach we take is as follows:
1. Check if a regex is onepass using `src/onepass.rs::is_onepass`.
2. Compile a new regex program using the compiler with the bytes
   flag set.
3. Compile a onepass DFA from the program produced in step 2. We
   will roughly map each instruction to a state in the DFA, though
   instructions like `split` don't get states.
    a. Make a new transition table for the first instruction.
    b. For each child of the first instruction:
        - If it is a bytes instruction, add a transition to
          the table for every byte class in the instruction.
        - If it is an instruction which consumes zero input
          (like `EmptyLook` or `Save`), emit a job to a DAG asking to
          forward the first instruction state to the state for
          the non-consuming instruction.
        - Push the child instruction to a queue of instructions to
          process.
    c. Peel off an instruction from the queue and go back to
       step a, processing the instruction as if it was the
       first instruction. If the queue is empty, continue with
       step d.
    d. Topologically sort the forwarding jobs, and shuffle
       the transitions from the forwarding targets to the
       forwarding sources in topological order.
    e. Bake the intermediary transition tables down into a single
       flat vector. States which require some action (`EmptyLook`
       and `Save`) get an extra entry in the baked transition table
       that contains metadata instructing them on how to perform
       their actions.
4. Wait for the user to give us some input.
5. Execute the DFA:
    - The inner loop is basically:
        while at < text.len():
            state_ptr = baked_table[state_ptr + text[at]]
            at += 1
    - There is a lot of window dressing to handle special states.
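As a rough illustration of step 5, here is a minimal sketch of a
flat-table walk. It is not the code in `src/onepass.rs`: the
`BakedDfa` type, the `DEAD` sentinel, the 256-entries-per-state row
layout, and the separate `accepting` vector are all assumptions made
for the example, and it ignores byte classes, captures, and the
`EmptyLook`/`Save` metadata entries entirely.

```rust
/// Sentinel meaning "no transition". Hypothetical; the patch uses its own encoding.
const DEAD: usize = usize::MAX;
/// Assumed row width: one entry per possible input byte.
const ROW: usize = 256;

/// A toy baked DFA: every state's row lives in one flat vector, and a
/// state is identified by the offset of its row (the `state_ptr`).
struct BakedDfa {
    table: Vec<usize>,
    accepting: Vec<bool>, // indexed by state number, i.e. state_ptr / ROW
}

impl BakedDfa {
    /// Build the flat table from (from_state, byte, to_state) triples.
    fn new(
        num_states: usize,
        transitions: &[(usize, u8, usize)],
        accept_states: &[usize],
    ) -> BakedDfa {
        let mut table = vec![DEAD; num_states * ROW];
        for &(from, byte, to) in transitions {
            table[from * ROW + byte as usize] = to * ROW;
        }
        let mut accepting = vec![false; num_states];
        for &s in accept_states {
            accepting[s] = true;
        }
        BakedDfa { table, accepting }
    }

    /// Anchored match over the whole input, one table lookup per byte.
    fn is_match(&self, text: &[u8]) -> bool {
        let mut state_ptr = 0; // state 0 is the start state
        let mut at = 0;
        while at < text.len() {
            state_ptr = self.table[state_ptr + text[at] as usize];
            if state_ptr == DEAD {
                return false;
            }
            at += 1;
        }
        self.accepting[state_ptr / ROW]
    }
}
```

For example, `BakedDfa::new(3, &[(0, b'a', 1), (1, b'b', 1), (1, b'c', 2)], &[2])`
builds a table for `ab*c`, and `is_match(b"abbc")` walks it with four lookups.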
The idea of a onepass matcher comes from Russ Cox and
his RE2 library. I haven't been as good about reading
the RE2 source as I should have, but I've gotten the
impression that the RE2 onepass matcher is more in the
spirit of an NFA simulation without threads than a DFA.
Squashed Patch Notes
====================
There were a few issues and burrs that needed to be sanded down
in the original impl. They were fixed in a series of small patches
that are described below.
Fix bogus doctest.
The list formatting in the module comment for `src/onepass.rs`
was triggering a doctest. English != Rust, so this made `cargo test`
grumpy.
Thread only_utf8 through onepass to byte input.
word_boundary_unicode::ascii3 was failing because I wasn't
threading the correct only_utf8 value through to the actual
input object. This patch fixes that.
Drop empty branch restriction.
When I first noticed the problem with empty branches in
alternatives, I added in a special case in the fset
intersection code to close the loophole. Since then
I've implemented a more principled notion of a regex
accepting the empty string, so the special case is no
longer needed. This patch removes that restriction.
Fix documentation and style issues.
This patch just has a bunch of style and doc fixes
that I noticed when going over the diff on GitHub.
Flatten `onepass` member of the OnePassCompiler.
Embedding the OnePass DFA to be compiled in the OnePassCompiler
caused a few values to be unnecessarily duplicated and added an
extra level of indirection. This patch resolves that issue and
takes advantage of these move semantics I'm always hearing about.
Factor OnePassCompiler::forwards into local var.
Iteration of a `Forwards` object is destructive,
which previously meant that we had to clone it in order
to iterate over it. Once the compiler iterates over the
forwarding jobs, it never touches them again, so this
was an extra copy. This patch plays a little type tetris
to get rid of that extra copy.
Filter out STATE_DEAD in eof single step.
STATE_DEAD has the STATE_MATCH flag set even though it does
not semantically indicate a match. This means that we have to
be very careful about when we check for the STATE_MATCH flag.
In the eof single step, just before the eof action drain loop,
I was forgetting to filter out the STATE_DEAD case, with predictably
bad results. This patch fixes that.
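A minimal sketch of the kind of check this implies. The bit layout
below is an assumption made purely for illustration (the real
constants in the patch may be encoded differently), and
`naive_is_match` and `is_real_match` are hypothetical helper names.

```rust
/// Assumed encoding: the high bit of a state pointer flags a match.
const STATE_MATCH: u32 = 1 << 31;
/// Assumed encoding: all bits set, so the match bit is set as a side effect.
const STATE_DEAD: u32 = u32::MAX;

/// The buggy check: looks only at the flag, so STATE_DEAD passes.
fn naive_is_match(state: u32) -> bool {
    (state & STATE_MATCH) != 0
}

/// The careful check: rule out STATE_DEAD before trusting the flag.
fn is_real_match(state: u32) -> bool {
    state != STATE_DEAD && (state & STATE_MATCH) != 0
}
```

Under these assumed values `naive_is_match(STATE_DEAD)` is true while
`is_real_match(STATE_DEAD)` is false, which is the distinction the
eof single step was missing.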
Add an unrollable inner loop.
This patch adds an inner loop to the onepass DFA execution
which is set up for unrolling. Right now it is unrolled
once, which isn't that interesting, but benchmarks will be
required to determine the right number of times to unroll
the loop. The inner loop does manage to avoid the extra
branch on when to increment `at` that the drain loop
requires.
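To make the inner loop / drain loop split concrete, here is a sketch
of a hand-unrolled walk over a flat table. This is plain 2x manual
unrolling with a scalar drain loop, not the loop from the patch: the
table layout, the `DEAD` sentinel, and the `build_table`, `step`, and
`run_unrolled` helpers are all assumptions for the example, and no
special states are handled.

```rust
/// Sentinel meaning "no transition" (hypothetical encoding).
const DEAD: usize = usize::MAX;
/// Assumed row width: one entry per possible input byte.
const ROW: usize = 256;

/// Build a flat table from (from_state, byte, to_state) triples.
fn build_table(num_states: usize, transitions: &[(usize, u8, usize)]) -> Vec<usize> {
    let mut table = vec![DEAD; num_states * ROW];
    for &(from, byte, to) in transitions {
        table[from * ROW + byte as usize] = to * ROW;
    }
    table
}

/// One transition. DEAD is sticky, so callers may check it lazily.
fn step(table: &[usize], state_ptr: usize, byte: u8) -> usize {
    if state_ptr == DEAD {
        DEAD
    } else {
        table[state_ptr + byte as usize]
    }
}

/// Returns the final state pointer, or DEAD if the walk failed.
fn run_unrolled(table: &[usize], text: &[u8]) -> usize {
    let mut state_ptr = 0;
    let mut at = 0;
    // Inner loop: two steps per iteration while two bytes remain,
    // with a single bounds test and a single bump of `at`.
    while at + 2 <= text.len() {
        state_ptr = step(table, state_ptr, text[at]);
        state_ptr = step(table, state_ptr, text[at + 1]);
        at += 2;
        if state_ptr == DEAD {
            return DEAD;
        }
    }
    // Drain loop: at most one leftover byte, handled one step at a time.
    while at < text.len() {
        state_ptr = step(table, state_ptr, text[at]);
        at += 1;
        if state_ptr == DEAD {
            return DEAD;
        }
    }
    state_ptr
}
```

Whether a larger unroll factor actually pays off is exactly the
benchmarking question left open above.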
Clarify forwarding DAG edge situation.
Previously, the forwarding DAG was talked about both in terms
of states which need to be forwarded to other states, and in more
conventional graph theory terms. Forwarding one state to another
makes sense in terms of the DFA, but unfortunately the directionality
is exactly opposite the directionality present in the DAG we were
dealing with. This patch tries to cut down on the confusion that
this might have caused by renaming some variables and adding in
more comments.
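The direction mismatch is easier to see in code. In the sketch below
a job "forward `from` to `to`" means `from` should inherit the
transitions of `to`, so the dependency edge runs the other way: `to`
has to be finished before `from` can copy from it. The `Forward`
struct, the per-state `Vec<(u8, usize)>` tables, and `apply_forwards`
are hypothetical stand-ins rather than the types in the patch, and
the ordering is a plain Kahn-style topological sort.

```rust
use std::collections::VecDeque;

/// A forwarding job: state `from` should inherit the transitions of `to`.
struct Forward {
    from: usize,
    to: usize,
}

/// Apply every job in dependency order. Returns false if the jobs
/// contain a cycle, in which case some jobs were never applied.
fn apply_forwards(tables: &mut [Vec<(u8, usize)>], jobs: &[Forward]) -> bool {
    let n = tables.len();
    // pending[s] = number of targets that s is still waiting on.
    let mut pending = vec![0usize; n];
    for job in jobs {
        pending[job.from] += 1;
    }
    // States that forward to nothing are complete from the start.
    let mut ready: VecDeque<usize> = (0..n).filter(|&s| pending[s] == 0).collect();
    let mut done = 0;
    while let Some(target) = ready.pop_front() {
        done += 1;
        // Graph edge target -> source: the opposite of the forwarding direction.
        for job in jobs.iter().filter(|j| j.to == target) {
            let copied = tables[target].clone();
            tables[job.from].extend(copied);
            pending[job.from] -= 1;
            if pending[job.from] == 0 {
                ready.push_back(job.from);
            }
        }
    }
    done == n
}
```

With jobs `0 -> 1` and `1 -> 2`, state 2 is processed first, then 1,
then 0, so state 0 ends up with state 2's transitions even though the
jobs were listed in the opposite order.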
Factor accepts_empty out of fset_of.
Previously the only way to determine if a given expression
accepts the empty string was to compute the whole first set
and then check the flag on the fset. This resulted in a little
bit of wasted work because the set of accepting chars was also
computed. It is unlikely that there was much of a perf impact, so
this patch is mostly just unnecessary gardening. Nevertheless,
this patch removes that tiny bit of wasted work.
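For illustration, a standalone emptiness check can be a short
structural recursion that never touches the set of accepting chars.
The `Expr` enum below is a toy AST invented for this sketch, not the
syntax tree the regex crate uses, and this `accepts_empty` only
mirrors the name of the real function.

```rust
/// Toy regex AST (hypothetical; the real syntax tree is richer).
enum Expr {
    Empty,
    Literal(u8),
    Concat(Vec<Expr>),
    Alternate(Vec<Expr>),
    /// `inner` repeated at least `min` times.
    Repeat { inner: Box<Expr>, min: u32 },
}

/// Does the expression match the empty string?
fn accepts_empty(e: &Expr) -> bool {
    match e {
        Expr::Empty => true,
        Expr::Literal(_) => false,
        // A concatenation is empty only if every piece can be empty.
        Expr::Concat(es) => es.iter().all(accepts_empty),
        // An alternation is empty if any branch can be empty.
        Expr::Alternate(es) => es.iter().any(accepts_empty),
        // Zero repetitions are always empty; otherwise defer to the inner expr.
        Expr::Repeat { inner, min } => *min == 0 || accepts_empty(inner),
    }
}
```

So `a*` and `a|` (an alternation with an empty branch) accept the
empty string, while `ab` and `a+` do not.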
Update utf8 encoding to use new post regex-1.0 style!