Commit 915a154

doc: explain the new word boundary assertions
Closes #469
1 parent 97f0205 commit 915a154

2 files changed
+47 -30 lines changed
CHANGELOG.md

+7
@@ -3,9 +3,16 @@ TBD
 
 New features:
 
+* [FEATURE #469](https://github.com/rust-lang/regex/issues/469):
+Add support for `\<` and `\>` word boundary assertions.
 * [FEATURE(regex-automata) #1031](https://github.com/rust-lang/regex/pull/1031):
 DFAs now have a `start_state` method that doesn't use an `Input`.
 
+Performance improvements:
+
+* [PERF #1051](https://github.com/rust-lang/regex/pull/1051):
+Unicode character class operations have been optimized in `regex-syntax`.
+
 Bug fixes:
 
 * [BUG #1046](https://github.com/rust-lang/regex/issues/1046):

src/lib.rs

+40 -30
@@ -543,8 +543,10 @@ scalar value, even when it is encoded using multiple bytes. When Unicode mode
 is disabled (e.g., `(?-u:.)`), then `.` will match a single byte in all cases.
 * The character classes `\w`, `\d` and `\s` are all Unicode-aware by default.
 Use `(?-u:\w)`, `(?-u:\d)` and `(?-u:\s)` to get their ASCII-only definitions.
-* Similarly, `\b` and `\B` use a Unicode definition of a "word" character. To
-get ASCII-only word boundaries, use `(?-u:\b)` and `(?-u:\B)`.
+* Similarly, `\b` and `\B` use a Unicode definition of a "word" character.
+To get ASCII-only word boundaries, use `(?-u:\b)` and `(?-u:\B)`. This also
+applies to the special word boundary assertions. (That is, `\b{start}`,
+`\b{end}`, `\b{start-half}`, `\b{end-half}`.)
 * `^` and `$` are **not** Unicode-aware in multi-line mode. Namely, they only
 recognize `\n` (assuming CRLF mode is not enabled) and not any of the other
 forms of line terminators defined by Unicode.
@@ -723,12 +725,16 @@ x{n}? exactly n x
 ### Empty matches
 
 <pre class="rust">
-^ the beginning of a haystack (or start-of-line with multi-line mode)
-$ the end of a haystack (or end-of-line with multi-line mode)
-\A only the beginning of a haystack (even with multi-line mode enabled)
-\z only the end of a haystack (even with multi-line mode enabled)
-\b a Unicode word boundary (\w on one side and \W, \A, or \z on other)
-\B not a Unicode word boundary
+^ the beginning of a haystack (or start-of-line with multi-line mode)
+$ the end of a haystack (or end-of-line with multi-line mode)
+\A only the beginning of a haystack (even with multi-line mode enabled)
+\z only the end of a haystack (even with multi-line mode enabled)
+\b a Unicode word boundary (\w on one side and \W, \A, or \z on other)
+\B not a Unicode word boundary
+\b{start}, \< a Unicode start-of-word boundary (\W|\A on the left, \w on the right)
+\b{end}, \> a Unicode end-of-word boundary (\w on the left, \W|\z on the right)
+\b{start-half} half of a Unicode start-of-word boundary (\W|\A on the left)
+\b{end-half} half of a Unicode end-of-word boundary (\W|\z on the right)
 </pre>
 
 The empty regex is valid and matches the empty string. For example, the
@@ -856,28 +862,32 @@ Note that this includes all possible escape sequences, even ones that are
 documented elsewhere.
 
 <pre class="rust">
-\* literal *, applies to all ASCII except [0-9A-Za-z<>]
-\a bell (\x07)
-\f form feed (\x0C)
-\t horizontal tab
-\n new line
-\r carriage return
-\v vertical tab (\x0B)
-\A matches at the beginning of a haystack
-\z matches at the end of a haystack
-\b word boundary assertion
-\B negated word boundary assertion
-\123 octal character code, up to three digits (when enabled)
-\x7F hex character code (exactly two digits)
-\x{10FFFF} any hex character code corresponding to a Unicode code point
-\u007F hex character code (exactly four digits)
-\u{7F} any hex character code corresponding to a Unicode code point
-\U0000007F hex character code (exactly eight digits)
-\U{7F} any hex character code corresponding to a Unicode code point
-\p{Letter} Unicode character class
-\P{Letter} negated Unicode character class
-\d, \s, \w Perl character class
-\D, \S, \W negated Perl character class
+\* literal *, applies to all ASCII except [0-9A-Za-z<>]
+\a bell (\x07)
+\f form feed (\x0C)
+\t horizontal tab
+\n new line
+\r carriage return
+\v vertical tab (\x0B)
+\A matches at the beginning of a haystack
+\z matches at the end of a haystack
+\b word boundary assertion
+\B negated word boundary assertion
+\b{start}, \< start-of-word boundary assertion
+\b{end}, \> end-of-word boundary assertion
+\b{start-half} half of a start-of-word boundary assertion
+\b{end-half} half of an end-of-word boundary assertion
+\123 octal character code, up to three digits (when enabled)
+\x7F hex character code (exactly two digits)
+\x{10FFFF} any hex character code corresponding to a Unicode code point
+\u007F hex character code (exactly four digits)
+\u{7F} any hex character code corresponding to a Unicode code point
+\U0000007F hex character code (exactly eight digits)
+\U{7F} any hex character code corresponding to a Unicode code point
+\p{Letter} Unicode character class
+\P{Letter} negated Unicode character class
+\d, \s, \w Perl character class
+\D, \S, \W negated Perl character class
 </pre>
 
 ### Perl character classes (Unicode friendly)
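
Note on the semantics: the four assertions documented in the "Empty matches" table of this diff can be restated as predicates on a position in the haystack. The following is a minimal sketch of those definitions, not the crate's implementation. It assumes the ASCII-only notion of a word character (as with `(?-u:\w)`), and every function name in it (`is_word`, `is_start`, `is_end`, `is_start_half`, `is_end_half`, `is_boundary`, `positions`) is hypothetical and used only for illustration.

```rust
/// ASCII-only word byte, mirroring the definition of `(?-u:\w)`.
/// (Assumption: the Unicode-aware default is not modeled here.)
fn is_word(b: u8) -> bool {
    b.is_ascii_alphanumeric() || b == b'_'
}

/// True if a word byte sits immediately to the left of position `at`.
fn word_before(h: &[u8], at: usize) -> bool {
    at > 0 && is_word(h[at - 1])
}

/// True if a word byte sits immediately to the right of position `at`.
fn word_after(h: &[u8], at: usize) -> bool {
    at < h.len() && is_word(h[at])
}

/// `\b`: a word byte on exactly one side.
fn is_boundary(h: &[u8], at: usize) -> bool {
    word_before(h, at) != word_after(h, at)
}

/// `\b{start}` or `\<`: \W|\A on the left, \w on the right.
fn is_start(h: &[u8], at: usize) -> bool {
    !word_before(h, at) && word_after(h, at)
}

/// `\b{end}` or `\>`: \w on the left, \W|\z on the right.
fn is_end(h: &[u8], at: usize) -> bool {
    word_before(h, at) && !word_after(h, at)
}

/// `\b{start-half}`: only the left-hand condition of `\b{start}`.
fn is_start_half(h: &[u8], at: usize) -> bool {
    !word_before(h, at)
}

/// `\b{end-half}`: only the right-hand condition of `\b{end}`.
fn is_end_half(h: &[u8], at: usize) -> bool {
    !word_after(h, at)
}

/// Collects every byte offset in `0..=len` where `pred` holds.
fn positions(h: &str, pred: fn(&[u8], usize) -> bool) -> Vec<usize> {
    (0..=h.len()).filter(|&i| pred(h.as_bytes(), i)).collect()
}

fn main() {
    let hay = "a, b";
    println!("\\b             {:?}", positions(hay, is_boundary));   // [0, 1, 3, 4]
    println!("\\b{{start}}      {:?}", positions(hay, is_start));      // [0, 3]
    println!("\\b{{end}}        {:?}", positions(hay, is_end));        // [1, 4]
    println!("\\b{{start-half}} {:?}", positions(hay, is_start_half)); // [0, 2, 3]
    println!("\\b{{end-half}}   {:?}", positions(hay, is_end_half));   // [1, 2, 4]
}
```

On the haystack `a, b`, offset 2 (between the comma and the space) satisfies both half assertions but neither `\b{start}` nor `\b{end}`, since no word character is adjacent on either side. That is the case the half variants exist to distinguish.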
