
Commit 12147f7

Rewrite regexes where common prefix can be pulled out from alternation branches
Thanks to Michael Voříšek for suggesting this optimization.

A new "rewrite" pass has been added to the regex compilation process. For now, the rewrite pass only optimizes one type of regex: those where every branch of an alternation construct has a common prefix. In such cases, we rewrite the regex like so (for example):

    (abc|abd|abe) ⇒ (ab(?:c|d|e))

An extra non-capturing group is not introduced if the alternation is within a non-capturing group (which is not quantified using ?, *, or a similar suffix). In that case we simply do something like:

    (?:abc|abd|abe) ⇒ ab(?:c|d|e)

In some edge cases, rewriting a group with a common alternation prefix might open up the opportunity to pull out more common prefixes. For example:

    (a(b|c)d|(ab|ac)e)

In that case, if the group '(ab|ac)' were rewritten to pull out the common prefix, it would then become possible to pull out a common prefix from the top-level group. However, we do not take advantage of that opportunity. Further, we do not perform the rewrite in cases where the prefixes are semantically equivalent, but parse to a different parsed_pattern sequence.

Groups which the regex engine might need to backtrack into are never pulled out, since this could change the order in which the regex engine considers possible ways of matching the pattern against the subject string, and could thus change the returned match. For example, this pattern will not be rewritten:

    ((?:a|b)c|(?:a|b)d)

Also, callouts are never extracted even if they form a common prefix to an alternation. Some backtracking control verbs, like (*SKIP) and (*COMMIT), are never extracted either.

A different type of rewrite is performed if an alternation construct matches only single, literal characters:

    (a|b|c) ⇒ ([a-c])

A new compile option, PCRE2_NO_PATTERN_REWRITE, has been added to skip the pattern rewrite phase when compiling a pattern.
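As a rough illustration of the common-prefix idea only, the C sketch below factors a shared literal prefix out of plain string branches. It is not the code from this commit: the real pass operates on PCRE2's parsed_pattern sequence, which this page does not show. The function names `common_prefix_len` and `factor_alternation` are hypothetical, and the sketch assumes every branch is a plain literal with no regex metacharacters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: length of the longest prefix shared by all n
   branches. Assumes n >= 1 and that each branch is a plain literal. */
size_t common_prefix_len(const char *const *branches, size_t n)
{
    assert(n > 0);
    size_t len = strlen(branches[0]);
    for (size_t i = 1; i < n; i++) {
        size_t j = 0;
        /* Stops at the first mismatch, including the end of a shorter branch. */
        while (j < len && branches[i][j] == branches[0][j])
            j++;
        len = j;
    }
    return len;
}

/* Hypothetical helper: writes "prefix(?:s1|s2|...)" into out, where each
   s_i is branch i with the shared prefix removed. Branch order is kept.
   Returns 0 on success, -1 if out is too small. */
int factor_alternation(const char *const *branches, size_t n,
                       char *out, size_t outsize)
{
    size_t plen = common_prefix_len(branches, n);
    int w = snprintf(out, outsize, "%.*s(?:", (int)plen, branches[0]);
    if (w < 0 || (size_t)w >= outsize)
        return -1;
    size_t used = (size_t)w;

    for (size_t i = 0; i < n; i++) {
        w = snprintf(out + used, outsize - used, "%s%s",
                     i ? "|" : "", branches[i] + plen);
        if (w < 0 || (size_t)w >= outsize - used)
            return -1;
        used += (size_t)w;
    }

    w = snprintf(out + used, outsize - used, ")");
    if (w < 0 || (size_t)w >= outsize - used)
        return -1;
    return 0;
}
```

For the branches abc, abd and abe, this produces ab(?:c|d|e), the same shape as the non-capturing example in the commit message. Keeping the branches in their original order matters because the engine tries alternatives left to right.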
1 parent ef218fb commit 12147f7

28 files changed: +3837 -525 lines changed

RunTest

+19 -2
@@ -89,7 +89,8 @@ title23="Test 23: \C disabled test"
 title24="Test 24: Non-UTF pattern conversion tests"
 title25="Test 25: UTF pattern conversion tests"
 title26="Test 26: Auto-generated unicode property tests"
-maxtest=26
+title27="Test 27: Pattern rewriter tests"
+maxtest=27
 titleheap="Test 'heap': Environment-specific heap tests"

 if [ $# -eq 1 -a "$1" = "list" ]; then
@@ -120,6 +121,7 @@ if [ $# -eq 1 -a "$1" = "list" ]; then
 echo $title24
 echo $title25
 echo $title26
+echo $title27
 echo ""
 echo $titleheap
 echo ""
@@ -255,6 +257,7 @@ do23=no
 do24=no
 do25=no
 do26=no
+do27=no
 doheap=no

 while [ $# -gt 0 ] ; do
@@ -286,6 +289,7 @@ while [ $# -gt 0 ] ; do
 24) do24=yes;;
 25) do25=yes;;
 26) do26=yes;;
+27) do27=yes;;
 heap) doheap=yes;;
 -8) arg8=yes;;
 -16) arg16=yes;;
@@ -437,7 +441,7 @@ if [ $do0 = no -a $do1 = no -a $do2 = no -a $do3 = no -a \
 $do12 = no -a $do13 = no -a $do14 = no -a $do15 = no -a \
 $do16 = no -a $do17 = no -a $do18 = no -a $do19 = no -a \
 $do20 = no -a $do21 = no -a $do22 = no -a $do23 = no -a \
-$do24 = no -a $do25 = no -a $do26 = no -a $doheap = no \
+$do24 = no -a $do25 = no -a $do26 = no -a $do27 = no -a $doheap = no \
 ]; then
 do0=yes
 do1=yes
@@ -466,6 +470,7 @@ if [ $do0 = no -a $do1 = no -a $do2 = no -a $do3 = no -a \
 do24=yes
 do25=yes
 do26=yes
+do27=yes
 fi

 # Handle any explicit skips at this stage, so that an argument list may consist
@@ -898,6 +903,18 @@ for bmode in "$test8" "$test16" "$test32"; do
 fi
 fi

+# Pattern rewriter tests
+
+if [ $do27 = yes ] ; then
+echo $title27
+if [ $utf -eq 0 ] ; then
+echo " Skipped because UTF-$bits support is not available"
+else
+$sim $valgrind ./pcre2test -q $setstack $bmode $testdata/testinput27 testtry
+checkresult $? 27 ""
+fi
+fi
+
 # Manually selected heap tests - output may vary in different environments,
 # which is why that are not automatically run.

doc/pcre2_compile.3

+1
@@ -65,6 +65,7 @@ The primary option bits are:
 theses (named ones available)
 PCRE2_NO_AUTO_POSSESS Disable auto-possessification
 PCRE2_NO_DOTSTAR_ANCHOR Disable automatic anchoring for .*
+PCRE2_NO_PATTERN_REWRITE Disable pattern rewriting optimizations
 PCRE2_NO_START_OPTIMIZE Disable match-time start optimizations
 PCRE2_NO_UTF_CHECK Do not check the pattern for UTF validity
 (only relevant if PCRE2_UTF is set)

doc/pcre2api.3

+12
@@ -1768,6 +1768,18 @@ automatically anchored if PCRE2_DOTALL is set for all the .* items and
 PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
 must start either at the start of the subject or following a newline is
 remembered. Like other optimizations, this can cause callouts to be skipped.
+.sp
+PCRE2_NO_PATTERN_REWRITE
+.sp
+This option disables all optimizations which occur during the pattern rewriting
+phase (after parsing but before compilation). Pattern rewriting may remove
+redundant items, coalesce items, adjust group structure, or replace some
+constructs with an equivalent construct. Pattern rewriting will never affect
+which strings are and are not matched, or what substrings are captured by
+capture groups. However, since it may change the structure of a pattern,
+if you are tracing the matching process, you might prefer PCRE2 to use the
+original pattern without rewriting. This option is also useful for testing.
+Pattern rewriting is also disabled if PCRE2_AUTO_CALLOUT is set.
 .sp
 PCRE2_NO_START_OPTIMIZE
 .sp
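The documentation text above says rewriting is skipped when either PCRE2_NO_PATTERN_REWRITE or PCRE2_AUTO_CALLOUT is set. A minimal C sketch of that rule as a bit test follows. The helper `pattern_rewrite_enabled` is hypothetical and is not part of the PCRE2 API. How the library itself performs the check is not shown on this page. The option values are defined locally (guarded, so the sketch also compiles alongside a pcre2.h that already provides them).

```c
#include <assert.h>
#include <stdint.h>

/* Option bits as they appear in pcre2.h. The guards keep this sketch
   compilable with or without the real header. */
#ifndef PCRE2_AUTO_CALLOUT
#define PCRE2_AUTO_CALLOUT        0x00000004u
#endif
#ifndef PCRE2_NO_PATTERN_REWRITE
#define PCRE2_NO_PATTERN_REWRITE  0x08000000u  /* value added by this commit */
#endif

/* The two options must occupy distinct bits for the test below to work. */
static_assert((PCRE2_AUTO_CALLOUT & PCRE2_NO_PATTERN_REWRITE) == 0,
              "option bits must not overlap");

/* Hypothetical helper (not a PCRE2 function): returns nonzero when the
   documented rule would allow the rewrite pass to run for these options. */
int pattern_rewrite_enabled(uint32_t options)
{
    return (options & (PCRE2_NO_PATTERN_REWRITE | PCRE2_AUTO_CALLOUT)) == 0;
}
```

A caller that wants to trace matching against the original pattern structure would pass PCRE2_NO_PATTERN_REWRITE in the options word given to pcre2_compile(). With automatic callouts enabled, the documentation says rewriting is already off.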

doc/pcre2callout.3

+3 -1
@@ -83,7 +83,9 @@ Callouts can be useful for tracking the progress of pattern matching. The
 program has a pattern qualifier (/auto_callout) that sets automatic callouts.
 When any callouts are present, the output from \fBpcre2test\fP indicates how
 the pattern is being matched. This is useful information when you are trying to
-optimize the performance of a particular pattern.
+optimize the performance of a particular pattern. However, note that some
+optimizations which adjust the structure of the pattern are disabled when
+automatic callouts are enabled.
 .
 .
 .SH "MISSING CALLOUTS"

doc/pcre2syntax.3

+1
@@ -414,6 +414,7 @@ appear. For the first three, d is a decimal number.
 (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
 (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
 (*NO_JIT) disable JIT optimization
+(*NO_PATTERN_REWRITE) disable pattern rewriting optimizations (PCRE2_NO_PATTERN_REWRITE)
 (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
 (*UTF) set appropriate UTF mode for the library in use
 (*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)

doc/pcre2test.1

+1
@@ -623,6 +623,7 @@ for a description of the effects of these options.
 /n no_auto_capture set PCRE2_NO_AUTO_CAPTURE
 no_auto_possess set PCRE2_NO_AUTO_POSSESS
 no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR
+no_pattern_rewrite set PCRE2_NO_PATTERN_REWRITE
 no_start_optimize set PCRE2_NO_START_OPTIMIZE
 no_utf_check set PCRE2_NO_UTF_CHECK
 ucp set PCRE2_UCP

doc/pcre2test.txt

+1
@@ -604,6 +604,7 @@ PATTERN MODIFIERS
 /n no_auto_capture set PCRE2_NO_AUTO_CAPTURE
 no_auto_possess set PCRE2_NO_AUTO_POSSESS
 no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR
+no_pattern_rewrite set PCRE2_NO_PATTERN_REWRITE
 no_start_optimize set PCRE2_NO_START_OPTIMIZE
 no_utf_check set PCRE2_NO_UTF_CHECK
 ucp set PCRE2_UCP

src/pcre2.h.generic

+1
@@ -143,6 +143,7 @@ D is inspected during pcre2_dfa_match() execution
 #define PCRE2_EXTENDED_MORE 0x01000000u /* C */
 #define PCRE2_LITERAL 0x02000000u /* C */
 #define PCRE2_MATCH_INVALID_UTF 0x04000000u /* J M D */
+#define PCRE2_NO_PATTERN_REWRITE 0x08000000u /* C */

 /* An additional compile options word is available in the compile context. */
148149

src/pcre2.h.in

+1
@@ -143,6 +143,7 @@ D is inspected during pcre2_dfa_match() execution
 #define PCRE2_EXTENDED_MORE 0x01000000u /* C */
 #define PCRE2_LITERAL 0x02000000u /* C */
 #define PCRE2_MATCH_INVALID_UTF 0x04000000u /* J M D */
+#define PCRE2_NO_PATTERN_REWRITE 0x08000000u /* C */

 /* An additional compile options word is available in the compile context. */
