
Commit 29b9195

Center for rewriting regex who can't ASCII good and memory issues too
Per previous commits on the rough comparisons between regex-filtered and re2, while regex-filtered is very competitive indeed on the CPU side it suffers from memory usage issues.

This stems from two issues:

character classes
=================

`re2` uses [ASCII-only perl character classes][1], while regex uses [full-unicode Perl character classes][2] defined in terms of [UTS#18 properties][3]. This leads to *much* larger state graphs for `\d` and `\w` (`\s` seems to cause much less trouble).

While uap-core doesn't actually specify regex semantics, [Javascript perl character classes are ASCII-only][4]. As such, a valid mitigation *for ua-parser* is to convert `\d`, `\D`, `\w`, and `\W` to the corresponding ASCII classes (I used literal enumerations from MDN but POSIX-style classes [would have worked too][5]).

This was helped a lot by regex supporting [nesting enumerated character classes][6], as it means I don't need to special-case expanding perl-style character classes inside enumerations.

Because capture amplifies the issue, this conversion reduces memory consumption by between 30% for non-captured digits:

    > echo -n "\d+" | cargo r -qr -- -q
    13496 8826
    > echo -n "[0-9]+" | cargo r -qr -- -q
    10946 1322

and *two orders of magnitude* for captured word characters:

    > echo -n "(\w+)" | cargo r -qr -- -q
    605008 73786
    > echo -n "([a-zA-Z0-9_]+)" | cargo r -qr -- -q
    6968 3332

Bounded repetitions
===================

A large number of bounded repetitions (`{a,b}`) were added to regexes.yaml [for catastrophic backtracking mitigation][7]. While this is nice for backtracking-based engines, it's not relevant to regex, which is based on finite automata. However, bounded repetitions *do* cost significantly more than unbounded repetitions:

    > echo -n ".+" | cargo r -qr -- -q
    7784 4838
    > echo -n ".{0,100}" | cargo r -qr -- -q
    140624 118326

And this also compounds with the previous item when bounded repetition is used with a perl character class (although that's not very common in `regexes.yaml`, it's mostly tacked on `.`).

This can be mitigated by converting "large" bounded repetitions (arbitrarily defined as an upper bound of two digits or more) to unbounded repetitions.

Results
=======

The end result of that work is a 22% reduction in peak memory footprint when running ua-parser over the entire sample using core's `regexes.yaml`... and even a ~4% gain in runtime despite doing more work up-front and not optimising for that[^1].

before
------

    > /usr/bin/time -l ../target/release/examples/bench -r 10 ~/sources/thirdparty/uap-core/regexes.yaml ~/sources/thirdparty/uap-python/samples/useragents.txt
    Lines: 751580
    Total time: 9.363202625s
    12µs / line
            9.71 real         9.64 user         0.04 sys
               254590976  maximum resident set size
                       0  average shared memory size
                       0  average unshared data size
                       0  average unshared stack size
                   15647  page reclaims
                      13  page faults
                       0  swaps
                       0  block input operations
                       0  block output operations
                       0  messages sent
                       0  messages received
                       0  signals received
                       0  voluntary context switches
                      33  involuntary context switches
             84520306010  instructions retired
             31154406450  cycles elapsed
               245909184  peak memory footprint

after
-----

    > /usr/bin/time -l ../target/release/examples/bench -r 10 ~/sources/thirdparty/uap-core/regexes.yaml ~/sources/thirdparty/uap-python/samples/useragents.txt
    Lines: 751580
    Total time: 8.754590666s
    11µs / line
            9.37 real         8.95 user         0.03 sys
               196198400  maximum resident set size
                       0  average shared memory size
                       0  average unshared data size
                       0  average unshared stack size
                   12083  page reclaims
                      13  page faults
                       0  swaps
                       0  block input operations
                       0  block output operations
                       0  messages sent
                       0  messages received
                       0  signals received
                      11  voluntary context switches
                      40  involuntary context switches
             80119011397  instructions retired
             28903938853  cycles elapsed
               192169408  peak memory footprint

the world that almost was
-------------------------

Sadly, as it turns out, there are a few large-ish *functional* bounded repetitions, for instance

    ; {0,2}(moto)(.{0,50})(?: Build|\) AppleWebKit)

mis-captures if it's converted to `.*`. This means my original threshold of converting any repetition with a two-digit upper bound was a bust and I had to move up to three digits (there are no upper bounds above 50 but below 100).

Opened ua-parser/uap-core#596 in case this could be improved with a cleaner project-supported signal.

With the original two-digit version, we reached *47%* peak memory footprint reduction and 9% runtime improvement:

    > /usr/bin/time -l ../target/release/examples/bench -r 10 ~/sources/thirdparty/uap-core/regexes.yaml ~/sources/thirdparty/uap-python/samples/useragents.txt
    Lines: 751580
    Total time: 8.541360667s
    11µs / line
            8.75 real         8.70 user         0.02 sys
               135331840  maximum resident set size
                       0  average shared memory size
                       0  average unshared data size
                       0  average unshared stack size
                    8367  page reclaims
                      13  page faults
                       0  swaps
                       0  block input operations
                       0  block output operations
                       0  messages sent
                       0  messages received
                       0  signals received
                       0  voluntary context switches
                      25  involuntary context switches
             78422091147  instructions retired
             28079764502  cycles elapsed
               130106688  peak memory footprint

Fixes #2

[^1]: that surprised me but the gains seem consistent from one run to the next, and we can clearly see a reduction in both cycles elapsed and instructions retired, so I'll take it ¯\_(ツ)_/¯ IPC even increases slightly from 2.7 to 2.8, yippee

[1]: https://github.com/google/re2/wiki/Syntax
[2]: https://docs.rs/regex/latest/regex/#perl-character-classes-unicode-friendly
[3]: https://www.unicode.org/reports/tr18/#Compatibility_Properties
[4]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions/Character_classes
[5]: https://docs.rs/regex/latest/regex/#ascii-character-classes
[6]: https://docs.rs/regex/latest/regex/#character-classes
[7]: ua-parser/uap-core@6e65445
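
The percentages quoted in the commit message can be re-derived from the raw `time -l` figures. The sketch below is not part of the commit: `reduction_pct` and `ipc` are hypothetical helper names, and the constants are simply copied from the three benchmark runs above.

```rust
/// Relative reduction from `before` to `after`, as a percentage.
/// (Hypothetical helper, for illustration only.)
fn reduction_pct(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// Instructions per cycle. (Hypothetical helper, for illustration only.)
fn ipc(instructions: f64, cycles: f64) -> f64 {
    instructions / cycles
}

// "peak memory footprint" of the three runs, in bytes.
const PEAK_BEFORE: f64 = 245_909_184.0;
const PEAK_AFTER: f64 = 192_169_408.0;
const PEAK_TWO_DIGIT: f64 = 130_106_688.0;

// "instructions retired" and "cycles elapsed" of the before and after runs.
const INSTR_BEFORE: f64 = 84_520_306_010.0;
const CYCLES_BEFORE: f64 = 31_154_406_450.0;
const INSTR_AFTER: f64 = 80_119_011_397.0;
const CYCLES_AFTER: f64 = 28_903_938_853.0;

/// Peak memory reduction of the committed (three-digit threshold) version.
fn committed_reduction() -> f64 {
    reduction_pct(PEAK_BEFORE, PEAK_AFTER)
}

/// Peak memory reduction of the abandoned two-digit threshold version.
fn two_digit_reduction() -> f64 {
    reduction_pct(PEAK_BEFORE, PEAK_TWO_DIGIT)
}
```

Rounded to whole percent, `committed_reduction()` and `two_digit_reduction()` give the 22% and 47% figures, and `ipc` on the before and after runs rounds to the 2.7 and 2.8 mentioned in the footnote.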
1 parent ba267cf commit 29b9195


1 file changed

+135
-3
lines changed


ua-parser/src/lib.rs

@@ -192,7 +192,7 @@ pub mod user_agent {
         /// Pushes a parser into the builder, may fail if the
         /// [`Parser::regex`] is invalid.
         pub fn push(mut self, ua: Parser<'a>) -> Result<Self, super::Error> {
-            self.builder = self.builder.push(&ua.regex)?;
+            self.builder = self.builder.push(&super::rewrite_regex(&ua.regex))?;
             let r = &self.builder.regexes()[self.builder.regexes().len() - 1];
             // number of groups in regex, excluding implicit entire match group
             let groups = r.captures_len() - 1;
@@ -357,7 +357,7 @@ pub mod os {
         /// be parsed, or if [`Parser::os_replacement`] is missing and
         /// the regex has no groups.
         pub fn push(mut self, os: Parser<'a>) -> Result<Self, ParseError> {
-            self.builder = self.builder.push(&os.regex)?;
+            self.builder = self.builder.push(&super::rewrite_regex(&os.regex))?;
             let r = &self.builder.regexes()[self.builder.regexes().len() - 1];
             // number of groups in regex, excluding implicit entire match group
             let groups = r.captures_len() - 1;
@@ -523,7 +523,7 @@ pub mod device {
         /// which [`Parser::regex`] is missing.
         pub fn push(mut self, device: Parser<'a>) -> Result<Self, ParseError> {
             self.builder = self.builder.push_opt(
-                &device.regex,
+                &super::rewrite_regex(&device.regex),
                 regex_filtered::Options::new()
                     .case_insensitive(device.regex_flag == Some(Flag::IgnoreCase)),
             )?;
@@ -607,3 +607,135 @@ pub mod device {
         pub model: Option<String>,
     }
 }
+
+/// Rewrites a regex's character classes to ascii and bounded
+/// repetitions to unbounded, the second to reduce regex memory
+/// requirements, and the first for both that and to better match the
+/// (inferred) semantics intended for ua-parser.
+fn rewrite_regex(re: &str) -> std::borrow::Cow<'_, str> {
+    let mut from = 0;
+    let mut out = String::new();
+
+    let mut it = re.char_indices();
+    let mut escape = false;
+    let mut inclass = 0;
+    'main: while let Some((idx, c)) = it.next() {
+        match c {
+            '\\' if !escape => {
+                escape = true;
+                continue
+            }
+            '{' if !escape && inclass == 0 => {
+                if idx == 0 {
+                    // we're repeating nothing, this regex is broken, bail
+                    return re.into()
+                }
+                // we don't need to loop, we only want to replace {0, ...} and {1, ...}
+                let Some((_, start)) = it.next() else {
+                    continue;
+                };
+                if start != '0' && start != '1' {
+                    continue;
+                }
+
+                if !matches!(it.next(), Some((_, ','))) {
+                    continue;
+                }
+
+                let mut digits = 0;
+                for (ri, rc) in it.by_ref() {
+                    match rc {
+                        '}' if digits > 2 => {
+                            // here idx is the index of the start of
+                            // the range and ri is the end of range
+                            out.push_str(&re[from..idx]);
+                            from = ri+1;
+                            out.push_str(if start == '0' { "*" } else { "+" });
+                            break;
+                        }
+                        c if c.is_ascii_digit() => {
+                            digits += 1;
+                        }
+                        _ => {
+                            continue 'main
+                        }
+                    }
+                }
+            }
+            '[' if !escape => { inclass += 1; }
+            ']' if !escape => { inclass -= 1; }
+            // no need for special cases because regex allows nesting
+            // character classes, whereas js or python don't \o/
+            'd' if escape => {
+                // idx is d so idx-1 is \\, and we want to exclude it
+                out.push_str(&re[from..idx-1]);
+                from = idx+1;
+                out.push_str("[0-9]");
+            }
+            'D' if escape => {
+                out.push_str(&re[from..idx-1]);
+                from = idx+1;
+                out.push_str("[^0-9]");
+            }
+            'w' if escape => {
+                out.push_str(&re[from..idx-1]);
+                from = idx+1;
+                out.push_str("[A-Za-z0-9_]");
+            }
+            'W' if escape => {
+                out.push_str(&re[from..idx-1]);
+                from = idx+1;
+                out.push_str("[^A-Za-z0-9_]");
+            }
+            _ => ()
+        }
+        escape = false;
+    }
+
+    if from == 0 {
+        re.into()
+    } else {
+        out.push_str(&re[from..]);
+        out.into()
+    }
+}
+
+#[cfg(test)]
+mod test_rewrite_regex {
+    use super::rewrite_regex as rewrite;
+
+    #[test]
+    fn ignore_small_repetition() {
+        assert_eq!(rewrite(".{0,2}x"), ".{0,2}x");
+        assert_eq!(rewrite(".{0,}"), ".{0,}");
+        assert_eq!(rewrite(".{1,}"), ".{1,}");
+    }
+
+    #[test]
+    fn rewrite_large_repetitions() {
+        assert_eq!(rewrite(".{0,20}x"), ".{0,20}x");
+        assert_eq!(rewrite("(.{0,100})"), "(.*)");
+        assert_eq!(rewrite("(.{1,50})"), "(.{1,50})");
+        assert_eq!(rewrite(".{1,300}x"), ".+x");
+    }
+
+    #[test]
+    fn ignore_non_repetitions() {
+        assert_eq!(
+            rewrite(r"\{1,2}"),
+            r"\{1,2}",
+            "if the opening brace is escaped it's not a repetition");
+        assert_eq!(
+            rewrite("[.{1,100}]"),
+            "[.{1,100}]",
+            "inside a set it's not a repetition"
+        );
+    }
+
+    #[test]
+    fn rewrite_classes() {
+        assert_eq!(rewrite(r"\dx"), "[0-9]x");
+        assert_eq!(rewrite(r"\wx"), "[A-Za-z0-9_]x");
+        assert_eq!(rewrite(r"[\d]x"), r"[[0-9]]x");
+    }
+}
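
The `inclass` counter in `rewrite_regex` exists so that a `{` inside a character class is left alone. As a standalone illustration of that bookkeeping (a sketch under stated assumptions, not the commit's code), the hypothetical helper `braces_outside_classes` below tracks escape state and class nesting depth, and reports the byte offsets of every unescaped `{` found outside any class. It assumes the depth should drop back when a class closes, so that a repetition following a class is still seen.

```rust
/// Hypothetical helper, not part of the commit: returns the byte offsets of
/// every unescaped `{` that sits outside a character class.
fn braces_outside_classes(re: &str) -> Vec<usize> {
    let mut found = Vec::new();
    let mut escape = false;
    // nesting depth of `[...]`; the regex crate allows nested classes
    let mut depth = 0usize;
    for (idx, c) in re.char_indices() {
        if escape {
            // the previous character was a backslash, so `c` is escaped
            // and can neither open nor close anything
            escape = false;
            continue;
        }
        match c {
            '\\' => escape = true,
            '[' => depth += 1,
            // saturating_sub keeps a stray `]` from underflowing the counter
            ']' => depth = depth.saturating_sub(1),
            '{' if depth == 0 => found.push(idx),
            _ => {}
        }
    }
    found
}
```

With this bookkeeping, `"[.{1,100}]"` and `r"\{1,2}"` yield no candidates, while the brace in `"[a-z]x.{0,100}"` is reported because the class before it has already been closed.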
