Rewrite regexes where common prefix can be pulled out from alternation branches
Thanks to Michael Voříšek for suggesting this optimization.
A new "rewrite" pass has been added to the regex compilation process.
For now, the rewrite pass only optimizes one type of regex: those where
every branch of an alternation construct has a common prefix.
In such cases, we rewrite the regex like so (for example):
(abc|abd|abe) ⇒ (ab(?:c|d|e))
An extra non-capturing group is not introduced if the alternation
is already within a non-capturing group that is not itself quantified
with ?, *, or a similar suffix. In that case we simply do something like:
(?:abc|abd|abe) ⇒ ab(?:c|d|e)
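
The idea can be sketched in a few lines of Python. This is only a toy illustration, not the PCRE2 implementation: the helper name factor_common_prefix is hypothetical, it handles plain literal branches only, and it works on pattern strings rather than on PCRE2's parsed_pattern sequence.

```python
import os
import re

def factor_common_prefix(branches):
    """Toy rewrite: pull the longest common literal prefix out of an alternation.

    `branches` is a list of plain literal strings (hypothetical input format).
    Returns a pattern such as "ab(?:c|d|e)". Branch order is preserved, so the
    order in which alternatives are tried does not change.
    """
    prefix = os.path.commonprefix(branches)
    tails = [re.escape(b[len(prefix):]) for b in branches]
    return re.escape(prefix) + "(?:" + "|".join(tails) + ")"

rewritten = factor_common_prefix(["abc", "abd", "abe"])
print(rewritten)

# Sanity check: the rewritten pattern agrees with the original on some subjects.
original = "(?:abc|abd|abe)"
for subject in ["abc", "abd", "abe", "abf", "ab", "xabd"]:
    assert bool(re.fullmatch(original, subject)) == bool(re.fullmatch(rewritten, subject))
```

With literal-only branches the common prefix can match in exactly one way, which is why extracting it is safe here.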
In some edge cases, rewriting a group with a common alternation prefix
might open up the opportunity to pull out further common prefixes.
For example:
(a(b|c)d|(ab|ac)e)
In that case, if the group '(ab|ac)' was rewritten to pull out the
common prefix, it would then become possible to pull out a common
prefix from the top-level group. However, we do not take advantage of
that opportunity.
Further, we do not perform the rewrite in cases where the prefixes are
semantically equivalent, but parse to a different parsed_pattern
sequence.
Groups which the regex engine might need to backtrack into are never
pulled out, since this could change the order in which the regex
engine considers possible ways of matching the pattern against the
subject string, and could thus change the returned match. For
example, this pattern will not be rewritten:
((?:a|b)c|(?:a|b)d)
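
To see why extracting a group can change the result, here is a small demonstration using Python's re module, which is also a backtracking engine that tries alternatives left to right. The patterns are hypothetical and chosen for illustration; they are not taken from the PCRE2 test suite, and PCRE2 itself is not involved.

```python
import re

# The shared prefix group (?:a|ab) can match in two different ways, so the
# engine may need to backtrack into it.
original = r"(?:a|ab)bc|(?:a|ab)b"
# What a naive extraction of the common prefix group would produce:
factored = r"(?:a|ab)(?:bc|b)"

subject = "abbc"
m_orig = re.match(original, subject).group()
m_fact = re.match(factored, subject).group()

# original: the first branch is fully explored first; 'ab' + 'bc' matches "abbc"
# factored: prefix 'a' is tried with every tail before 'ab'; 'a' + 'b' matches "ab"
print(m_orig, m_fact)  # abbc ab
```

Both patterns accept the same set of strings, but they return different matches for the same subject, which is the kind of behavior change the rewrite pass avoids.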
Also, callouts are never extracted even if they form a common prefix
to an alternation. Some backtracking control verbs, like (*SKIP) and
(*COMMIT), are never extracted either.
A different type of rewrite is performed if an alternation construct
matches only single, literal characters:
(a|b|c) ⇒ ([a-c])
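
A rough sketch of this second rewrite, again in Python and again only as an illustration: the helper name single_chars_to_class is hypothetical, and the range-collapsing rule (three or more consecutive code points become a range) is an assumption of this sketch, not something the commit specifies.

```python
import re

def single_chars_to_class(chars):
    """Toy rewrite of an alternation of single literal characters into a class."""
    codes = sorted(set(ord(c) for c in chars))
    parts = []
    i = 0
    while i < len(codes):
        # Extend j to the end of the run of consecutive code points starting at i.
        j = i
        while j + 1 < len(codes) and codes[j + 1] == codes[j] + 1:
            j += 1
        if j - i >= 2:
            # Three or more consecutive characters: emit a range such as a-c.
            parts.append(re.escape(chr(codes[i])) + "-" + re.escape(chr(codes[j])))
        else:
            parts.extend(re.escape(chr(c)) for c in codes[i:j + 1])
        i = j + 1
    return "[" + "".join(parts) + "]"

print(single_chars_to_class(["a", "b", "c"]))
```

Because every branch consumes exactly one character, the order of the branches cannot affect which match is returned, so collapsing them into a class is safe.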
A new compile option, PCRE2_NO_PATTERN_REWRITE, has been added to
skip the pattern rewrite phase when compiling a pattern.