[5.7] Fix anchor bugs, de-genericize processor, add ranges collection #531

Merged
milseman merged 7 commits into swiftlang:swift/release/5.7 on Jul 1, 2022

Conversation

milseman
Member

No description provided.

natecook1000 and others added 4 commits June 30, 2022 11:32
^ and $ should match the start and end of the callee, even if that
callee is a substring. Right now ^ and $ match the start and end of
the callee's base string, instead. In addition, ^ and $ should only
match the start and end of the callee when replacing a subrange, not
the start and end of the subrange.
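
The Substring case described above can be sketched as follows. This is a minimal illustration, assuming a Swift 5.7 or later toolchain that includes this fix; the strings and the names `base`, `callee`, and `anchored` are made up for the example and do not come from the PR.

```swift
// Illustrative values; not taken from the PR's test suite.
let base = "xxabcxx"
let callee = base.dropFirst(2).dropLast(2)  // Substring "abc"

let anchored = try Regex("^abc$")

// ^ and $ are evaluated against the callee's own bounds, so the
// substring can match even though "abc" sits in the middle of `base`.
let substringMatch = callee.firstMatch(of: anchored)

// The full string neither starts nor ends with "abc".
let baseMatch = base.firstMatch(of: anchored)

print(substringMatch != nil, baseMatch != nil)
```

With the pre-fix behavior described in the commit message, the anchors would have been checked against `base` instead, so the substring search would have failed as well.
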

This prepares for adopting an opaque result type for matches(of:)
and ranges(of:). The old CollectionConsumer-based model moves
index by index and isn't aware of the regex's semantic level,
which produces inaccurate results for regexes that match at a
mid-character index.
* Avoid double execution by avoiding Array init

* De-genericize processor, engine, etc.

Provides only modest performance improvements (it was already getting
specialized), but makes it possible to add String-specific specializations.
@milseman
Member Author

@swift-ci please test

milseman and others added 3 commits June 30, 2022 11:43
* Allow CustomConsuming types to match w/ zero width

We previously asserted if a custom consuming type matches with zero
width, but that isn't necessary or good. A custom type can implement
a lookaround assertion or act as a tracer.

* Rename Processor.advance(to:) to resume(at:)

Since the given index doesn’t need to advance, this name is less
misleading.
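
As a rough sketch of the zero-width case mentioned above, a custom consuming type can act like a hand-written lookahead. `DigitAhead` is a hypothetical type invented for this example; the sketch assumes the `CustomConsumingRegexComponent` protocol as it ships in Swift 5.7 and a toolchain that includes this change.

```swift
// Hypothetical component for illustration; not part of the PR.
// It succeeds only when the next character is a number, and it
// consumes nothing.
struct DigitAhead: CustomConsumingRegexComponent {
  typealias RegexOutput = Substring

  func consuming(
    _ input: String,
    startingAt index: String.Index,
    in bounds: Range<String.Index>
  ) throws -> (upperBound: String.Index, output: Substring)? {
    guard index < bounds.upperBound, input[index].isNumber else {
      return nil
    }
    // Zero-width match: the upper bound equals the starting index.
    return (upperBound: index, output: input[index..<index])
  }
}

let text = "ab12"
let lookahead = text.firstMatch(of: DigitAhead())
```

Before this change, returning an upper bound equal to the starting index would have hit the assertion that the commit message describes.
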

This separates the two different ideas for boundaries in
the base input:

- subjectBounds: These represent the actual subject in the input
  string. For a `String` callee, this will cover the entire bounds,
  while for a `Substring` these will represent the bounds of the
  substring in the base.
- searchBounds: These represent the current search range in the
  subject. These bounds can be the same as `subjectBounds` or a
  subrange when searching for subsequent matches or replacing only
  in a subrange of a string.
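
One way to picture the two kinds of bounds is a toy searcher that keeps them separately. Everything here (`Anchor`, `ToySearch`, `findFirst`, `findAll`) is hypothetical and only models the idea; it is not the Processor or engine code touched by this PR. The model matches a literal that is anchored either to the start of the subject (like `^`) or to the start of the current search range (like `\G` in other regex engines).

```swift
// Toy model for illustration only; these types are hypothetical.
enum Anchor {
  case startOfSubject  // like ^: tied to subjectBounds
  case startOfSearch   // like \G in other engines: tied to searchBounds
}

struct ToySearch {
  let input: String
  let subjectBounds: Range<String.Index>
  var searchBounds: Range<String.Index>

  // Attempts an anchored literal match at exactly `position`.
  func match(
    _ literal: String, anchor: Anchor, at position: String.Index
  ) -> Range<String.Index>? {
    let required = anchor == .startOfSubject
      ? subjectBounds.lowerBound
      : searchBounds.lowerBound
    guard position == required,
          input[position..<searchBounds.upperBound].hasPrefix(literal)
    else { return nil }
    return position..<input.index(position, offsetBy: literal.count)
  }

  // Scans forward for the first match. Only the current position
  // moves; searchBounds stays fixed during the scan.
  func findFirst(_ literal: String, anchor: Anchor) -> Range<String.Index>? {
    var position = searchBounds.lowerBound
    while true {
      if let range = match(literal, anchor: anchor, at: position) {
        return range
      }
      if position == searchBounds.upperBound { return nil }
      position = input.index(after: position)
    }
  }

  // Collects all matches of a non-empty literal, narrowing
  // searchBounds after each match.
  mutating func findAll(_ literal: String, anchor: Anchor) -> [Range<String.Index>] {
    precondition(!literal.isEmpty)
    var results: [Range<String.Index>] = []
    while let range = findFirst(literal, anchor: anchor) {
      results.append(range)
      searchBounds = range.upperBound..<searchBounds.upperBound
    }
    return results
  }
}

let input = "abab"
let whole = input.startIndex..<input.endIndex
var subjectAnchored = ToySearch(input: input, subjectBounds: whole, searchBounds: whole)
var searchAnchored = ToySearch(input: input, subjectBounds: whole, searchBounds: whole)
let subjectAnchoredCount = subjectAnchored.findAll("ab", anchor: .startOfSubject).count
let searchAnchoredCount = searchAnchored.findAll("ab", anchor: .startOfSearch).count
print(subjectAnchoredCount, searchAnchoredCount)
```

In this model the subject-anchored search finds one match and the search-anchored one finds two, because only the latter follows `searchBounds` as it shrinks.
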

* firstMatch shouldn't update searchBounds on iteration

When we move forward while searching for the first match, the search
bounds should stay the same. Only the currentPosition needs to move
forward. This will allow us to implement the \G start of match anchor,
with which /\Gab/ matches "abab" twice, compared with /^ab/, which
only matches once.

* Make matches(of:) and ranges(of:) boundary-aware

With this change, RegexMatchesCollection keeps the subject bounds
and search bounds separately, modifying the search bounds with each
iteration. In addition, the replace methods that only operate on a
subrange can specify that specifically, getting the correct anchor
behavior while only matching within a portion of a string.
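
The boundary-aware behavior can be sketched with the public `ranges(of:)` and `matches(of:)` methods. This is an illustration only, assuming a Swift 5.7 or later toolchain that includes this change; the strings and variable names are invented for the example.

```swift
// Illustrative values; not taken from the PR's test suite.
let subject = "abab"
let anchoredAtStart = try Regex("^ab")
let unanchored = try Regex("ab")

// Each later search starts after the previous match, but ^ still
// refers to the start of the subject, so only the first "ab" qualifies.
let anchoredRanges = subject.ranges(of: anchoredAtStart)
let allRanges = subject.ranges(of: unanchored)

// The same holds when the callee is a Substring of a larger string:
// the subject bounds are the substring's bounds within its base.
let padded = "--abab"
let tail = padded.dropFirst(2)  // Substring "abab"
let tailMatches = tail.matches(of: anchoredAtStart)

print(anchoredRanges.count, allRanges.count, tailMatches.count)
```
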
milseman changed the title from "[5.7] De-genericize processor and add ranges collection" to "[5.7] Fix anchor bugs, de-genericize processor, add ranges collection" on Jun 30, 2022
@milseman
Member Author

@swift-ci please test

@@ -145,6 +145,10 @@ extension Regex where Output == AnyRegexOutput {
  public init(_ pattern: String) throws {
    self.init(ast: try parse(pattern, .semantic, .traditional))
  }

  internal init(_ pattern: String, syntax: SyntaxOptions) throws {
    self.init(ast: try parse(pattern, .semantic, syntax))
Contributor

FYI this will need changing to drop the .semantic now that #519 has landed

Member Author

Why didn't that pick up the change then?

Contributor

It's because of the ordering of the changes: the parser recovery PR was written after this change but cherry-picked before it. I resolved the conflict in my cherry-pick, as I wasn't sure whether this was going to be cherry-picked or not.

milseman merged commit 7f5bffd into swiftlang:swift/release/5.7 on Jul 1, 2022
milseman deleted the 5_7_degenericize branch on July 1, 2022, 01:10