Skip to content

Serialization of compiled regular expressions #258

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
DemiMarie opened this issue Jun 21, 2016 · 20 comments
Closed

Serialization of compiled regular expressions #258

DemiMarie opened this issue Jun 21, 2016 · 20 comments

Comments

@DemiMarie
Copy link

Since compiling a regular expression is slow, it would be nice (especially for applications that use lots of them) to be able to serialize a compiled regex to a byte string, which can then be embedded in the application or stored in a resource. Upon startup, the compiled regex can be be loaded. The regex crate would validate the regex prior to use.

@BurntSushi
Copy link
Member

I agree that'd be nice, but it remains to be seen whether this would have a significant performance improvement or not. For example, deserializing whatever format the regex is persisted in could in theory be just as slow. What would be nice is if the regex could be used directly from that binary format, but that seems non-trivial.

This is definitely worth exploring.

@BurntSushi
Copy link
Member

BurntSushi commented Dec 29, 2016

If someone wanted to work on this, I'd be willing to mentor it. It will take a bit of work. Essentially, we'll need to serialize the entire ExecReadOnly structure, which contains 3 compiled regex programs, the original regex strings, the match type and a literal searcher. The regex programs themselves are a bit beefy but mostly just plain old data. The literal searcher is a bit more complex, since it can contain things like an Aho-Corasick automaton or a Teddy matcher. That means you'd either have to A) implement a serialization technique for both of those matches or B) just re-build them when deserializing the regex. (B) sounds like a fine place to start. In particular, building the literal matchers is probably very fast, so whether (A) is a win or not actually isn't clear.

The first thing to do is to figure out the serialization format. I would be somewhat opposed to bringing in another crate for this, which makes this task artificially harder. One possible alternative would be to put this behind a feature and make it an optional dependency.

@glapul
Copy link

glapul commented Jan 24, 2017

That's a very newbie question, but with the online DFA setup does the compilation stage really take significant time compared to using the compiled regex the first few times?

@BurntSushi
Copy link
Member

@glapul Nope. :-) When I speak of "compilation" I'm specifically referring to the point at which the regex's abstract syntax tree is translated to a sequence of instructions. This sequence of instructions represents an NFA, but its representation makes it possible to execute in a VM or even via a backtracking search. The representation is also what's used to build the DFA. In other words, the lazy DFA isn't a transformation from AST to DFA, but rather, a transformation from NFA to DFA.

Finally, even if we could somehow go from AST to DFA, we still need the NFA because the DFA can't answer every question supported by the regex API, such as returning sub-capture locations.

FWIW, compile times are most directly affected by the size of the regex. For example, a regex like [0-9]+ is tiny but a regex like \d+ is quite a bit larger. Why? Unicode.

@RReverser
Copy link
Contributor

@BurntSushi While it's understandable that saving/restoring compiled regular expressions is a hard issue and requires thorough design, I wonder if it would be possible to add at least string-based serde::{Serialize, Deserialize} implementations to Regex structure? I store regex instances as part of bigger structures and it could be quite helpful to have these implemented on Regex itself so that consumer could just derive(Serialize, Deserialize) any outer structs without hacks or worrying about how exactly such (de)serialization is implemented internally (as it still leaves opportunity to swap implementations to compiled regexes in the future).

@BurntSushi
Copy link
Member

@RReverser It's an interesting point. I've done it myself on occasion:

#[derive(Clone, Debug)]
pub struct Regex(regex::Regex);

impl ops::Deref for Regex {
    type Target = regex::Regex;
    fn deref(&self) -> &regex::Regex { &self.0 }
}

impl<'de> serde::Deserialize<'de> for Regex {
    fn deserialize<D>(de: D) -> Result<Regex, D::Error>
    where D: serde::Deserializer<'de>
    {
        use serde::de::{Error, Visitor};

        struct RegexVisitor;

        impl<'de> Visitor<'de> for RegexVisitor {
            type Value = Regex;

            fn expecting(&self, f: &mut fmt::Formatter) -> fmt::Result {
                f.write_str("a regular expression pattern")
            }

            fn visit_str<E: Error>(self, v: &str) -> Result<Regex, E> {
                regex::Regex::new(v).map(Regex).map_err(|err| {
                    E::custom(err.to_string())
                })
            }
        }

        de.deserialize_str(RegexVisitor)
    }
}

I'm not sure if I want to commit to this in the regex API though. Perhaps it could start in a separate crate?

@RReverser
Copy link
Contributor

@BurntSushi Yeah I've done this myself, but newtypes, while widely used as a workaround for such situations in Rust, are still breaking API changes each time you need to introduce them, so if possible to gain support for (de)serializations in base crate, it would be much nicer.

@BurntSushi
Copy link
Member

@RReverser I mean, sure, as a general principle, but I don't see how that applies here. Firstly, one generally doesn't expose a Regex in a public API. Second, having a separate stable crate would solve that issue anyway.

Perhaps you elaborate more specifically on your concrete use case?

@RReverser
Copy link
Contributor

RReverser commented Jan 12, 2018

My use case is parsing/interpreting ASTs where regex is just one of possible values, and, while Regex instance is not exposed directly to the user, it's stored as part of the AST for faster evaluation, and AST itself might be serialized/deserialized to the disk instead of original source.

Sure, newtype is sufficient for this usecase, I'm just saying I usually see it as a last resort since, as soon as you need to implement several traits, nesting newtypes from own and outer crates gets quite dirty.

@RReverser
Copy link
Contributor

RReverser commented Jan 12, 2018

Although if you add Serialize to it and publish as a separate crate, that would be better than nothing :) Just thought I'd suggest adding such support natively.

@RReverser
Copy link
Contributor

Another interesting option could be adding these traits to https://doc.rust-lang.org/regex/regex_syntax/enum.Expr.html and allowing to construct regex out of such Expr, but, as far as I understand, you don't want to make it a public API yet?

@BurntSushi
Copy link
Member

@RReverser The regex crate will never ever have a public dependency on regex-syntax. :-) regex-syntax is basically an internal implementation detail of regexes, and it must be allowed to evolve independently of regex.

Stated differently, the interface boundary is the concrete syntax of a regular expression.

@RReverser
Copy link
Contributor

Hmm, okay, I just thought it's high-level enough to be exposed. Anyway, given that regex might be serialized in various representations, I tend to agree that separate crate should work in this case and would be enough to prevent most issues.

@RReverser
Copy link
Contributor

If someone wanted to work on this, I'd be willing to mentor it.

@BurntSushi Do you think this is still feasible in face of regex internal refactorings you've been planning?

@BurntSushi
Copy link
Member

Maybe after that's done. But definitely not right now.

@BurntSushi
Copy link
Member

For anyone watching this issue, I just released regex-automata, which is a lower level regex library. One of its features is the serialization and deserialization of regexes, and convenient support for this has been added to ucd-generate via its dfa and regex sub-commands.

Before getting too excited, you'll want to check out the list of differences between regex and regex-automata to see whether it's possible to use it.

@Jethril
Copy link

Jethril commented Jan 20, 2023

I don't know the internals of this crate, but would it be possible (in order to pre-compile it in the executable) to dump the memory area used by the compiled regex in an array of bytes at compile time (I mean, with a proc macro)?
Those bytes would be stored in a constant, and could be unsafely casted to a Regex instance. This would require the Regex struct not to contain pointers.

@BurntSushi
Copy link
Member

Is it possible? Yes, pretty much anything is possible.

Is it plausible? No. My estimation is that it would take several months of work for me to do it. It would also very likely increase the amount of unsafe in this crate by a non-trivial number. And it would introduce new compatibility hazards.

regex-automata provides this capacity for fully compiled DFAs today, as mentioned above. That's your best bet right now.

This would require the Regex struct not to contain pointers.

Right. There are pointers everywhere. You are very likely drastically underestimating the complexity of this crate's internals. :)

@BurntSushi
Copy link
Member

I'm going to close this out because I don't see this happening any time soon. Moreover, regex-automata 0.2 (and especially 0.3 once it's released) will support serialization of regex DFAs. DFAs won't work in every use case, but if you absolutely need serialized regexes, then it might be good enough.

The full story here really is not only a monumental amount of work, but it's not a one-time investment. By exposing a stable binary representation, it also makes internal evolution so much harder because any changes need to be reconciled with the binary format. I really just do not know if it will ever happen.

So I'm going to mark this as "not planned" for now.

@BurntSushi BurntSushi closed this as not planned Won't fix, can't repro, duplicate, stale Mar 6, 2023
@denizsincar29
Copy link

I suppose we should have a regex!() macro that takes in a literal regex string and expands into a regex struct already precompiled. However I didn't read the issue till the end. Correct me if it's already there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants