Add regex sets. #173

Merged
merged 1 commit into from
Feb 22, 2016

BurntSushi
Member

Regex sets permit matching multiple (possibly overlapping) regular
expressions in a single scan of the search text. This adds a few new
types, with RegexSet being the primary one.

All matching engines support regex sets, including the lazy DFA.

This commit also refactors a lot of the code around handling captures
into a central Search, which now also includes a set of matches that
is used by regex sets to determine which regex has matched.
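To make the idea of a single scan that fills in a set of matches concrete, here is a minimal sketch. It is a toy, not the code in this PR: the patterns are plain byte literals rather than regular expressions, and `LiteralSet` and its methods are hypothetical names chosen for illustration. All patterns advance in lockstep over the text, and one flag per pattern records which ones matched.

```rust
/// Toy stand-in for a regex set. Each "pattern" is a plain literal, not a
/// regex, and every name here is hypothetical (not part of the crate).
struct LiteralSet {
    patterns: Vec<Vec<u8>>,
}

impl LiteralSet {
    fn new(patterns: &[&str]) -> LiteralSet {
        LiteralSet {
            patterns: patterns.iter().map(|p| p.as_bytes().to_vec()).collect(),
        }
    }

    /// Scans `text` exactly once and returns one flag per pattern, in the
    /// order the patterns were given.
    fn matches(&self, text: &str) -> Vec<bool> {
        let mut matched = vec![false; self.patterns.len()];
        // An empty literal matches any text, including the empty string.
        for (i, p) in self.patterns.iter().enumerate() {
            if p.is_empty() {
                matched[i] = true;
            }
        }
        // Each thread is (pattern index, bytes of that pattern matched so far).
        let mut threads: Vec<(usize, usize)> = Vec::new();
        for &b in text.as_bytes() {
            // Every pattern that has not matched yet may start here.
            for i in 0..self.patterns.len() {
                if !matched[i] {
                    threads.push((i, 0));
                }
            }
            let mut next = Vec::with_capacity(threads.len());
            for (i, pos) in threads.drain(..) {
                if matched[i] {
                    continue; // already recorded; drop leftover threads
                }
                let p = &self.patterns[i];
                if p[pos] == b {
                    if pos + 1 == p.len() {
                        matched[i] = true; // record which pattern matched
                    } else {
                        next.push((i, pos + 1));
                    }
                }
            }
            threads = next;
        }
        matched
    }
}
```

With the patterns `foo`, `bar`, `oba`, and `zzz`, calling `matches("foobar")` reports the first three as matched and the last as not matched, even though `oba` overlaps both `foo` and `bar`. The text is still read only once.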

We also merged the Program and Insts type, which were split up when
adding the lazy DFA, but the code seemed more complicated because of it.

Closes #156.

@BurntSushi
Member Author

@alexcrichton This PR includes new additions to the API, but no breaking changes. The additions are completely separate from the primary Regex API, and can be found in src/set.rs. I tried to be conservative when possible. For example, it might be better to have a distinct RegexSetBuilder type, so that callers can incrementally add regexes (where some may fail to parse), but I think that could be added later relatively easily.

@alexcrichton
Member

Nice! I couldn't immediately come up with a use case for these, but sounds plausible to me!

@BurntSushi
Member Author

URL router, user agent matcher. Generally any time you have lots of patterns you need to match. RE2 has it. :)

@alexcrichton
Member

Sounds like a plan to me

BurntSushi added a commit that referenced this pull request Feb 22, 2016
@BurntSushi BurntSushi merged commit 10e9501 into master Feb 22, 2016
/// alternate can match at a time.
///
/// For example, consider regular expressions to match email addresses and
/// domains: `[a-z]+@[a-z]+\.(com|org|net)` and `[a-z]+\.(com|org|net)`. If a

Maybe it would make sense to use a different example here? It seems like there are already plenty of incomplete/inaccurate regexes around the internet that purport to match e-mail addresses without adding more. It seems like the docs should at least call out the fact that these are grossly simplified and will fail to match many valid e-mail addresses, so that nobody copy-pastes them without thinking.

Member Author

I'm happy to indulge alternative examples.

@BurntSushi BurntSushi deleted the multi2 branch April 25, 2016 02:08