-
Notifications
You must be signed in to change notification settings - Fork 461
Serialization of compiled regular expressions #258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I agree that'd be nice, but it remains to be seen whether this would have a significant performance improvement or not. For example, deserializing whatever format the regex is persisted in could in theory be just as slow. What would be nice is if the regex could be used directly from that binary format, but that seems non-trivial. This is definitely worth exploring. |
If someone wanted to work on this, I'd be willing to mentor it. It will take a bit of work. Essentially, we'll need to serialize the entire The first thing to do is to figure out the serialization format. I would be somewhat opposed to bringing in another crate for this, which makes this task artificially harder. One possible alternative would be to put this behind a feature and make it an optional dependency. |
That's a very newbie question, but with the online DFA setup does the compilation stage really take significant time compared to using the compiled regex the first few times? |
@glapul Nope. :-) When I speak of "compilation" I'm specifically referring to the point at which the regex's abstract syntax tree is translated to a sequence of instructions. This sequence of instructions represents an NFA, but its representation makes it possible to execute in a VM or even via a backtracking search. The representation is also what's used to build the DFA. In other words, the lazy DFA isn't a transformation from AST to DFA, but rather, a transformation from NFA to DFA. Finally, even if we could somehow go from AST to DFA, we still need the NFA because the DFA can't answer every question supported by the regex API, such as returning sub-capture locations. FWIW, compile times are most directly affected by the size of the regex. For example, a regex like |
@BurntSushi While it's understandable that saving/restoring compiled regular expressions is a hard issue and requires thorough design, I wonder if it would be possible to add at least string-based |
@RReverser It's an interesting point. I've done it myself on occasion: #[derive(Clone, Debug)]
pub struct Regex(regex::Regex);
impl ops::Deref for Regex {
type Target = regex::Regex;
fn deref(&self) -> ®ex::Regex { &self.0 }
}
impl<'de> serde::Deserialize<'de> for Regex {
fn deserialize<D>(de: D) -> Result<Regex, D::Error>
where D: serde::Deserializer<'de>
{
use serde::de::{Error, Visitor};
struct RegexVisitor;
impl<'de> Visitor<'de> for RegexVisitor {
type Value = Regex;
fn expecting(&self, f: &mut fmt::Formatter) -> fmt::Result {
f.write_str("a regular expression pattern")
}
fn visit_str<E: Error>(self, v: &str) -> Result<Regex, E> {
regex::Regex::new(v).map(Regex).map_err(|err| {
E::custom(err.to_string())
})
}
}
de.deserialize_str(RegexVisitor)
}
} I'm not sure if I want to commit to this in the regex API though. Perhaps it could start in a separate crate? |
@BurntSushi Yeah I've done this myself, but newtypes, while widely used as a workaround for such situations in Rust, are still breaking API changes each time you need to introduce them, so if possible to gain support for (de)serializations in base crate, it would be much nicer. |
@RReverser I mean, sure, as a general principle, but I don't see how that applies here. Firstly, one generally doesn't expose a Perhaps you elaborate more specifically on your concrete use case? |
My use case is parsing/interpreting ASTs where regex is just one of possible values, and, while Regex instance is not exposed directly to the user, it's stored as part of the AST for faster evaluation, and AST itself might be serialized/deserialized to the disk instead of original source. Sure, newtype is sufficient for this usecase, I'm just saying I usually see it as a last resort since, as soon as you need to implement several traits, nesting newtypes from own and outer crates gets quite dirty. |
Although if you add |
Another interesting option could be adding these traits to https://doc.rust-lang.org/regex/regex_syntax/enum.Expr.html and allowing to construct regex out of such |
@RReverser The Stated differently, the interface boundary is the concrete syntax of a regular expression. |
Hmm, okay, I just thought it's high-level enough to be exposed. Anyway, given that regex might be serialized in various representations, I tend to agree that separate crate should work in this case and would be enough to prevent most issues. |
@BurntSushi Do you think this is still feasible in face of regex internal refactorings you've been planning? |
Maybe after that's done. But definitely not right now. |
For anyone watching this issue, I just released Before getting too excited, you'll want to check out the list of differences between |
I don't know the internals of this crate, but would it be possible (in order to pre-compile it in the executable) to dump the memory area used by the compiled regex in an array of bytes at compile time (I mean, with a proc macro)? |
Is it possible? Yes, pretty much anything is possible. Is it plausible? No. My estimation is that it would take several months of work for me to do it. It would also very likely increase the amount of
Right. There are pointers everywhere. You are very likely drastically underestimating the complexity of this crate's internals. :) |
I'm going to close this out because I don't see this happening any time soon. Moreover, The full story here really is not only a monumental amount of work, but it's not a one-time investment. By exposing a stable binary representation, it also makes internal evolution so much harder because any changes need to be reconciled with the binary format. I really just do not know if it will ever happen. So I'm going to mark this as "not planned" for now. |
I suppose we should have a regex!() macro that takes in a literal regex string and expands into a regex struct already precompiled. However I didn't read the issue till the end. Correct me if it's already there. |
Since compiling a regular expression is slow, it would be nice (especially for applications that use lots of them) to be able to serialize a compiled regex to a byte string, which can then be embedded in the application or stored in a resource. Upon startup, the compiled regex can be be loaded. The
regex
crate would validate the regex prior to use.The text was updated successfully, but these errors were encountered: