Skip to content

Commit 4f1c7df

Browse files
committed
Reach parity with re2 by being less of an idiot
As hinted to in the previous commit, I completely missed the significance of using IntMap = SparseArray<int>; the first time around and just used whatever worked, but turns out that was the entire source of my trouble: `IntMap` is just two array lookups (and a few checks) to find an entry, not an entire siphash hash, that is a ton cheaper. Eventually we might want to implement a specialised version for our exact needs (as we're still going through an entire hashing and entry lookup machinery we don't need to), but for now get somewhat close using an `IndexSet<NoHashHasher>`: `IndexSet` provides the ordering and fast iteration, and NoHash provides a *nearly* free hashing, and since the inputs are the output of aho and a few other processes it's unlikely to be very adversarial, at least in a problematic way. Also as noted previously, don't even bother for the counts, just use a vec as that's the only bit that's needed. Tried to use a stack-allocated TinyVec but that turns out to consistently be ~5% slower, with ~7% more cycles needed (though only 2% more instructions). Here's the `time(1)` report after this change: ``` 46.63 real 46.25 user 0.03 sys 143327232 maximum resident set size 0 average shared memory size 0 average unshared data size 0 average unshared stack size 8884 page reclaims 0 page faults 0 swaps 0 block input operations 0 block output operations 0 messages sent 0 messages received 0 signals received 0 voluntary context switches 131 involuntary context switches 498048446439 instructions retired 149096223231 cycles elapsed 139855488 peak memory footprint ``` Interestingly while the total runtime is about on par with re2 (even more so if we consider that the rust script takes ~0.1s on loading and compiling the prefilters) rust-regex now retires 10% less instructions and burns 1.5% less cycles than re2. So it likely still has memory access issues, most likely from misusing `IndexSet` Also it's improved from 3.14 IPC to 3.34, which is great, but the re2 script is at 3.67. In all honesty I don't know where that would come from, I assume from having more expensive instructions like modulos for the hashset. At least it doesn't look like the branches are too much of an issue... Also I tried looking at using `get_unchecked` for the counts in case that did something, it does not (at least not consistently), so probably not something I need to investigate in the future even if I will undoubtedly forget.
1 parent 1af15c1 commit 4f1c7df

File tree

2 files changed

+24
-18
lines changed

2 files changed

+24
-18
lines changed

Diff for: regex-filtered/Cargo.toml

+2
Original file line numberDiff line numberDiff line change
@@ -16,10 +16,12 @@ repository = "https://github.com/ua-parser/uap-rust/"
1616
aho-corasick = "1.1.3"
1717
indexmap = "2.2.6"
1818
itertools = "0.13.0"
19+
nohash = "0.2.0"
1920
regex = "1.10.4"
2021
regex-syntax = "0.8.3"
2122

2223
[dev-dependencies]
24+
clap = { version = "4.5.7", features = ["derive"] }
2325
criterion = "0.5.1"
2426

2527
[[bench]]

Diff for: regex-filtered/src/mapper.rs

+22-18
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,3 @@
1-
use std::collections::{HashMap, HashSet};
21
use std::fmt::Display;
32
use std::fmt::Formatter;
43

@@ -192,7 +191,7 @@ impl Display for Mapper {
192191
writeln!(f, "#Unique Atoms: {}", self.atom_to_entry.len())?;
193192
for (i, e) in self.atom_to_entry.iter().copied().enumerate() {
194193
writeln!(f, "\tatom {i} -> entry {e}")?;
195-
for r in self.propagate_match([e].into()) {
194+
for r in self.propagate_match(&mut FromIterator::from_iter([e])) {
196195
writeln!(f, "\t\tregex {r}")?;
197196
}
198197
}
@@ -236,6 +235,8 @@ struct Entry {
236235
/// regexps that are triggered.
237236
regexps: Vec<usize>,
238237
}
238+
239+
type Set = IndexSet<usize, nohash::BuildNoHashHasher<usize>>;
239240
pub struct Mapper {
240241
/// Number of regexes covered by the mapper
241242
regexp_count: usize,
@@ -250,26 +251,28 @@ pub struct Mapper {
250251
}
251252
impl Mapper {
252253
// name is shit and also needs to see if we can generate stuff on the fly
253-
pub fn atom_to_re(&self, atoms: impl IntoIterator<Item = usize>) -> Vec<usize> {
254-
let matched_atom_ids = atoms
255-
.into_iter()
256-
.map(|idx| self.atom_to_entry[idx])
257-
.collect();
258-
let regexps_map = self.propagate_match(matched_atom_ids);
259-
260-
let mut regexps = Vec::with_capacity(regexps_map.len() + self.unfiltered.len());
254+
pub fn atom_to_re(&self, atoms: impl IntoIterator<Item = usize>) -> Set {
255+
let mut matched_atom_ids = IndexSet::with_capacity_and_hasher(
256+
self.entries.len(),
257+
nohash::BuildNoHashHasher::default(),
258+
);
259+
matched_atom_ids.extend(atoms.into_iter().map(|idx| self.atom_to_entry[idx]));
260+
261+
let mut regexps = self.propagate_match(&mut matched_atom_ids);
262+
261263
regexps.extend(&self.unfiltered);
262-
regexps.extend(regexps_map);
263264

264265
regexps.sort_unstable();
265266
regexps
266267
}
267268

268-
fn propagate_match(&self, mut work: IndexSet<usize>) -> HashSet<usize> {
269-
work.reserve(self.entries.len() - work.len());
270-
let mut count = HashMap::with_capacity(self.entries.len());
269+
fn propagate_match(&self, work: &mut Set) -> Set {
270+
let mut count = vec![0;self.entries.len()];
271271

272-
let mut regexps = HashSet::with_capacity(self.regexp_count);
272+
let mut regexps = IndexSet::with_capacity_and_hasher(
273+
self.regexp_count,
274+
nohash::BuildNoHashHasher::default(),
275+
);
273276

274277
let mut i = 0;
275278
while i < work.len() {
@@ -284,7 +287,8 @@ impl Mapper {
284287
let parent = &self.entries[j];
285288
// Delay until all the children have succeeded.
286289
if parent.propagate_up_at_count > 1 {
287-
let c = count.entry(j).and_modify(|e| *e += 1).or_insert(1);
290+
let c = &mut count[j];
291+
*c += 1;
288292
if *c < parent.propagate_up_at_count {
289293
continue;
290294
}
@@ -337,8 +341,8 @@ mod test {
337341

338342
assert_eq!(m.entries.len(), 3);
339343
assert_eq!(&m.atom_to_entry, &[0, 1]);
340-
assert_eq!(m.propagate_match([0].into()), [0].into(),);
341-
assert_eq!(m.propagate_match([1].into()), [0].into(),);
344+
assert_eq!(m.propagate_match(&mut FromIterator::from_iter([0])), [0].into(),);
345+
assert_eq!(m.propagate_match(&mut FromIterator::from_iter([1])), [0].into(),);
342346
}
343347

344348
fn check_patterns(patterns: &'static [&'static str], expected: &'static [&'static str]) {

0 commit comments

Comments
 (0)