Skip to content

Mitigate memory consumption #2

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
masklinn opened this issue Jul 7, 2024 · 0 comments
Closed

Mitigate memory consumption #2

masklinn opened this issue Jul 7, 2024 · 0 comments

Comments

@masklinn
Copy link
Collaborator

masklinn commented Jul 7, 2024

As can be seen from the comparison bench (1af15c1) the peak memory footprint of regex-filtered is much higher than that of FilteredRE2. It might not be the worst issue ever as 135MiB is probably not a deal-breaker (especially with the goal of using it in ua-parser/uap-python where we can hopefully get rid of the entire Python-side regex overhead)), but it's still an unfortunate amount compared to 45MiB.

I tried doing basic "obvious" optimisations like switching over to u32 ids as in re2, but that had essentially no effect, which makes sense after some back of the enveloppe calculations: the base dataset has ~600 regex, ~2000 entries and 3000 atoms, tops1, so we're saving about 4B * 3000 = 12KiB in atom_to_entry, and maybe 20 bytes per entry in parents and regexps links if we're super optimistic for ~40KiB, that's out-the-ass but even if every entry had 600 regexps and 2000 parents (which is obviously impossible and way beyond the upper end) that'd still only be 2000 * 600 * 4 + 2000 * 2000 * 4 ~ 20MiB which falls well short of the observed difference.

After doing some comparisons between re2 and regex, it looks like a large part of the issue is likely due to compiled regexes taking a lot more memory in regex than in re2, an order of magnitude is routine.

What do? Possible avenues to investigate:

  • switch to regex::bytes and disable unicode mode, this is not documented but it seems unlikely regexes.yaml regex need unicode semantics given JS regexes have basically none (afaik neither /u nor /v alter the way the basic \d, \w or \s character classes work, those are asci-only) that returns bytes data which is not great...

  • convert bounded repetitions on the fly, they cause 10~100x increase in memory use compared to unbounded repetitions

    > echo -n ".+" | cargo r -qr -- -q
    7784 4838
    > echo -n ".{0,100}" | cargo r -qr -- -q
    140624 118326
    

    regexes.yaml has a bunch of them for redos mitigation purposes (ua-parser/uap-core@6e65445) but regex is not a backtracking engine so they make no difference, the biggest issue there is no strict equivalence, ua-parser/uap-core@6e65445 updated the unbounded repetition to anything from {0,20} to {1,300}... overall it's probably safe to drop anything with a 2-digit upper bound.

    NOTE: for simplicity can just remove the upper bound (rather than convert to * and +) as internally regex desugars * to {0,} and + to {1,}.

  • this could also be applied to the character classes by splatting them to their ASCII definition:

    > echo -n "[0-9]+" | cargo r -qr -- -q
    10946 1322
    > echo -n "\d+" | cargo r -qr -- -q
    13496 8826
    > echo -n "[a-zA-Z0-9_]" | cargo r -qr -- -q
    6104 2660
    > echo -n "\w" | cargo r -qr -- -q
    80656 57486
    

    the effect is especially sensible with capture, like the extremely common capture of a bunch of numbers as part of a version spec:

    > echo -n "([0-9]+)" | cargo r -qr -- -q
    11330 1818
    > echo -n "(\d+)" | cargo r -qr -- -q
    46328 9994
    

    Of note: this is unnecessary for whitespace which apparently doesn't suffer from this issue, whether matched positively or negatively:

    > echo -n "\s" | cargo r -qr -- -q
    7040 4482
    > echo -n "[\f\n\r\t\v\u0020\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]" | cargo r -qr -- -q
    7064 4642
    > echo -n "\S" | cargo r -qr -- -q
    8396 7646
    > echo -n "[^\f\n\r\t\v\u0020\u00a0\u1680\u2000-\u200a\u2028\u2029\u202f\u205f\u3000\ufeff]" | cargo r -qr -- -q
    9996 7914
    

    (all expanded character class definitions from https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions/Character_classes)

The latter two items would let us have our cake and eat it too, as well as reduce the explosion of atoms (and thus possibly their pruning, yielding better discriminatory power). The only drawback is we have to perform the conversion on the fly allocating a translated string from the borrow. So that makes building the extractors a bit more expensive but if it pans out it should significantly reduce memory consumption without affecting runtime costs (and possibly improving them).

Footnotes

  1. IIRC, sadly while I noted it didn't work in a local file I didn't keep detailed notes on the parameters and (lack of) effect

masklinn added a commit that referenced this issue Jul 10, 2024
Per previous commits on the rough comparisons between regex-filtered
and re2, while regex-filtered is very competitive indeed on the CPU
side it suffers from memory usage issues.

This stems from two issues:

character classes
=================

`re2` uses [ASCII-only perl character classes][1], regex uses
[full-unicode Perl character classes][2] defined in terms of [UTS#18
properties][3], this leads to *much* large state graphs for `\d` and
`\w` (`\s` seems to cause much less trouble).

While uap-core doesn't actually specify regex semantics, [Javascript
perl character classes are ASCII-only][4].

As such, a valid mitigation *for ua-parser* is to convert `\d`, `\D`,
`\w`, and `\W` to the corresponding ASCII classes (I used literal
enumerations from MDN but POSIX-style classes [would have worked
too][5]). This was way helped by regex supporting [nesting enumerated
character classes][6] as it means I don't need to special-case
expanding perl-style character classes inside enumerations.

Because capture amplifies the issue, this conversion reduces memory
consumption by between 30% for non-captured digits:

    > echo -n "\d+" | cargo r -qr -- -q
    13496 8826
    > echo -n "[0-9]+" | cargo r -qr -- -q
    10946 1322

and *two orders of magnitude* for captured word characters:

    > echo -n "(\w+)" | cargo r -qr -- -q
    605008 73786
    > echo -n "([a-zA-Z0-9_]+)" | cargo r -qr -- -q
    6968 3332

Bounded repetitions
===================

A large amount of bounded repetitions (`{a,b}`) was added to
regexes.yaml [for catastrophic backtracking migitation][7]. While this
is nice for backracking based engines, it's not relevant to regex
which is based on finite automata, however bounded repetitions *does*
cost significantly more than unbounded repetitions:

    > echo -n ".+" | cargo r -qr -- -q
    7784 4838
    > echo -n ".{0,100}" | cargo r -qr -- -q
    140624 118326

And this also compounds with the previous item when bounded repetition
is used with a perl character class (although that's not very common
in `regexes.yaml`, it's mostly tacked on `.`).

This can be mitigated by converting "large" bounded repetitions
(arbitrarily defined as an upper bound of two digits or more) to
unbounded repetitions.

Results
=======

The end results of that work is a 22% reduction in peak memory
footprint when running ua-parser over the entire sample using core's
`regexes.yaml`... and even a ~4% gain in runtime despite doing more
work up-front and not optimising for that[^1].

before
------

    > /usr/bin/time -l ../target/release/examples/bench -r 10 ~/sources/thirdparty/uap-core/regexes.yaml ~/sources/thirdparty/uap-python/samples/useragents.txt
    Lines: 751580
    Total time: 9.363202625s
    12µs / line
        9.71 real         9.64 user         0.04 sys
           254590976  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               15647  page reclaims
                  13  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  33  involuntary context switches
         84520306010  instructions retired
         31154406450  cycles elapsed
           245909184  peak memory footprint

after
-----

> /usr/bin/time -l ../target/release/examples/bench -r 10 ~/sources/thirdparty/uap-core/regexes.yaml ~/sources/thirdparty/uap-python/samples/useragents.txt
Lines: 751580
Total time: 8.754590666s
11µs / line
        9.37 real         8.95 user         0.03 sys
           196198400  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               12083  page reclaims
                  13  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                  11  voluntary context switches
                  40  involuntary context switches
         80119011397  instructions retired
         28903938853  cycles elapsed
           192169408  peak memory footprint

the world that almost was
-------------------------

Sadly as it turns out there are a few large-ish *functional* bounded
repetitions, for instance

    ; {0,2}(moto)(.{0,50})(?: Build|\) AppleWebKit)

mis-captures if it's converted to `.*`. This means my original
threshold of converting any repetition with two digits upper bound was
a bust and I had to move up to 3 (there are no upper bounds above 50
but below 100). Opened ua-parser/uap-core#596 in case this could be
improved with a cleaner project-supported signal.

With the original two-digit versions, we reached *47%* peak memory
footprint reduction and 9% runtime improvement:

    > /usr/bin/time -l ../target/release/examples/bench -r 10 ~/sources/thirdparty/uap-core/regexes.yaml ~/sources/thirdparty/uap-python/samples/useragents.txt
    Lines: 751580
    Total time: 8.541360667s
    11µs / line
        8.75 real         8.70 user         0.02 sys
           135331840  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                8367  page reclaims
                  13  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  25  involuntary context switches
         78422091147  instructions retired
         28079764502  cycles elapsed
           130106688  peak memory footprint

Fixes #2

[^1]: that surprised me but the gains seem consistent from one run to
    the next and we can clearly see a reduction in both cycles elapsed
    and instructions retired so I'll take it ¯\_(ツ)_/¯ IPC even
    increases slightly from 2.7 to 2.8 yipee

[1]: https://github.com/google/re2/wiki/Syntax
[2]: https://docs.rs/regex/latest/regex/#perl-character-classes-unicode-friendly
[3]: https://www.unicode.org/reports/tr18/#Compatibility_Properties
[4]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions/Character_classes
[5]: https://docs.rs/regex/latest/regex/#ascii-character-classes
[6]: https://docs.rs/regex/latest/regex/#character-classes
[7]: ua-parser/uap-core@6e65445
masklinn added a commit that referenced this issue Jul 15, 2024
Per previous commits on the rough comparisons between regex-filtered
and re2, while regex-filtered is very competitive indeed on the CPU
side it suffers from memory usage issues.

This stems from two issues:

character classes
=================

`re2` uses [ASCII-only perl character classes][1], regex uses
[full-unicode Perl character classes][2] defined in terms of [UTS#18
properties][3], this leads to *much* large state graphs for `\d` and
`\w` (`\s` seems to cause much less trouble).

While uap-core doesn't actually specify regex semantics, [Javascript
perl character classes are ASCII-only][4].

As such, a valid mitigation *for ua-parser* is to convert `\d`, `\D`,
`\w`, and `\W` to the corresponding ASCII classes (I used literal
enumerations from MDN but POSIX-style classes [would have worked
too][5]). This was way helped by regex supporting [nesting enumerated
character classes][6] as it means I don't need to special-case
expanding perl-style character classes inside enumerations.

Because capture amplifies the issue, this conversion reduces memory
consumption by between 30% for non-captured digits:

    > echo -n "\d+" | cargo r -qr -- -q
    13496 8826
    > echo -n "[0-9]+" | cargo r -qr -- -q
    10946 1322

and *two orders of magnitude* for captured word characters:

    > echo -n "(\w+)" | cargo r -qr -- -q
    605008 73786
    > echo -n "([a-zA-Z0-9_]+)" | cargo r -qr -- -q
    6968 3332

Bounded repetitions
===================

A large amount of bounded repetitions (`{a,b}`) was added to
regexes.yaml [for catastrophic backtracking migitation][7]. While this
is nice for backracking based engines, it's not relevant to regex
which is based on finite automata, however bounded repetitions *does*
cost significantly more than unbounded repetitions:

    > echo -n ".+" | cargo r -qr -- -q
    7784 4838
    > echo -n ".{0,100}" | cargo r -qr -- -q
    140624 118326

And this also compounds with the previous item when bounded repetition
is used with a perl character class (although that's not very common
in `regexes.yaml`, it's mostly tacked on `.`).

This can be mitigated by converting "large" bounded repetitions
(arbitrarily defined as an upper bound of two digits or more) to
unbounded repetitions.

Results
=======

The end results of that work is a 22% reduction in peak memory
footprint when running ua-parser over the entire sample using core's
`regexes.yaml`... and even a ~4% gain in runtime despite doing more
work up-front and not optimising for that[^1].

before
------

    > /usr/bin/time -l ../target/release/examples/bench -r 10 ~/sources/thirdparty/uap-core/regexes.yaml ~/sources/thirdparty/uap-python/samples/useragents.txt
    Lines: 751580
    Total time: 9.363202625s
    12µs / line
        9.71 real         9.64 user         0.04 sys
           254590976  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               15647  page reclaims
                  13  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  33  involuntary context switches
         84520306010  instructions retired
         31154406450  cycles elapsed
           245909184  peak memory footprint

after
-----

> /usr/bin/time -l ../target/release/examples/bench -r 10 ~/sources/thirdparty/uap-core/regexes.yaml ~/sources/thirdparty/uap-python/samples/useragents.txt
Lines: 751580
Total time: 8.754590666s
11µs / line
        9.37 real         8.95 user         0.03 sys
           196198400  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
               12083  page reclaims
                  13  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                  11  voluntary context switches
                  40  involuntary context switches
         80119011397  instructions retired
         28903938853  cycles elapsed
           192169408  peak memory footprint

the world that almost was
-------------------------

Sadly as it turns out there are a few large-ish *functional* bounded
repetitions, for instance

    ; {0,2}(moto)(.{0,50})(?: Build|\) AppleWebKit)

mis-captures if it's converted to `.*`. This means my original
threshold of converting any repetition with two digits upper bound was
a bust and I had to move up to 3 (there are no upper bounds above 50
but below 100). Opened ua-parser/uap-core#596 in case this could be
improved with a cleaner project-supported signal.

With the original two-digit versions, we reached *47%* peak memory
footprint reduction and 9% runtime improvement:

    > /usr/bin/time -l ../target/release/examples/bench -r 10 ~/sources/thirdparty/uap-core/regexes.yaml ~/sources/thirdparty/uap-python/samples/useragents.txt
    Lines: 751580
    Total time: 8.541360667s
    11µs / line
        8.75 real         8.70 user         0.02 sys
           135331840  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                8367  page reclaims
                  13  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  25  involuntary context switches
         78422091147  instructions retired
         28079764502  cycles elapsed
           130106688  peak memory footprint

Fixes #2

[^1]: that surprised me but the gains seem consistent from one run to
    the next and we can clearly see a reduction in both cycles elapsed
    and instructions retired so I'll take it ¯\_(ツ)_/¯ IPC even
    increases slightly from 2.7 to 2.8 yipee

[1]: https://github.com/google/re2/wiki/Syntax
[2]: https://docs.rs/regex/latest/regex/#perl-character-classes-unicode-friendly
[3]: https://www.unicode.org/reports/tr18/#Compatibility_Properties
[4]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions/Character_classes
[5]: https://docs.rs/regex/latest/regex/#ascii-character-classes
[6]: https://docs.rs/regex/latest/regex/#character-classes
[7]: ua-parser/uap-core@6e65445
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant