ACP: String-splitting iterators with indices #222
Comments
This runs into the same issue as filling out the str API with things like This is useful however and could easily be built on top of my proposal for a string splitting builder API: #168
Is this the best/simplest API for this? What about e.g. a
This seems like a large number of additional APIs for something that could have a few different alternatives. This seems like it could be done by checking the offset of the
I like this idea.
I personally like @pitaj's idea of using a builder API for this. Honestly, despite the builder pattern being incredibly widespread across the Rust ecosystem, we don't see it too much in the standard library. I'm going to close this in favour of that proposal.
Proposal
Problem statement
When using the `split` and similar iterators for strings, only the string slices are returned, without any indication of where they can be found in the original string. This is useful for processing the lines individually, but prevents later processing of the entire string based upon these lines.
Motivation, use-cases
Sentinel lines
Some file formats use sentinel lines to delimit the specific sections, like pacman's "desc" format:
In these cases, it seems most useful to be able to iterate via `lines`, search for the sentinel sections, then later pass these individual sections to different parts of processing. While this could be accomplished by using a state machine inside the iterator, this would likely be more complicated to read and potentially less performant.
Isolating lines/spans
Imagine an iterator that parses the lines of a string into some other type, where the error returns the line that failed. If the type is parsed from a string slice, then this isn't a problem, since the error can have the same lifetime as the original slice. However, if the type being parsed is an owned string, then the specific line can't be returned directly, and instead has to be allocated into a separate string to be returned. If indices to the line were provided, then these could simply be saved and the original string buffer could be truncated to just the line, saving on allocations.
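To make the allocation argument concrete, here is a minimal sketch of shrinking an owned buffer down to one line once its byte range is known. The helper name `keep_only` is hypothetical and not part of the proposal; it assumes `start..end` came from splitting the same string, so both offsets lie on character boundaries.

```rust
/// Shrinks `buf` in place so that it holds only the bytes in `start..end`.
/// Hypothetical helper for illustration. Panics if either offset is out of
/// bounds or not on a UTF-8 character boundary.
fn keep_only(mut buf: String, start: usize, end: usize) -> String {
    // Drop everything after the line; this never reallocates.
    buf.truncate(end);
    // Drop everything before the line by shifting the remaining bytes down.
    buf.replace_range(..start, "");
    buf
}
```

Called as `keep_only(input, start, end)`, this hands the failing line back to the caller in the buffer that already existed, rather than copying the line into a new `String`.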
Additionally, beyond this (relatively contrived) example, indices are also useful for expanding failed lines into a multi-line span, like Rust does with its error messages. If the position of the line is obtained, then the string can be scanned backward and forward some number of lines, and the resulting indices can be used to slice a larger span. Without this, the caller would have to keep a buffer of some number of previous lines, then concatenate them into a larger string on error, which is considerably more work.
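The span expansion described above can be sketched as follows. The function name `expand_span` and its parameters are hypothetical. The sketch assumes that lines are terminated by `\n` only, that `start` is the byte offset where the line begins, and that `end` is the offset just past its last character (before the terminator).

```rust
/// Expands the byte range `start..end` of a single line so that it also
/// covers up to `context` lines before and after it, and returns that span.
/// Hypothetical helper; assumes `\n` line terminators.
fn expand_span(text: &str, start: usize, end: usize, context: usize) -> &str {
    let mut lo = start;
    let mut hi = end;
    for _ in 0..context {
        // Backward: ignore the '\n' that ends the previous line, then look
        // for the '\n' before it; the previous line starts right after that.
        let before = &text[..lo];
        let before = before.strip_suffix('\n').unwrap_or(before);
        lo = before.rfind('\n').map_or(0, |i| i + 1);

        // Forward: skip the terminator of the current last line, then stop
        // just before the next '\n' (or at the end of the text).
        let rest = &text[hi..];
        let rest = rest.strip_prefix('\n').unwrap_or(rest);
        let next_start = text.len() - rest.len();
        hi = rest.find('\n').map_or(text.len(), |i| next_start + i);
    }
    &text[lo..hi]
}
```

For `text = "a\nb\nc\nd\ne"` and the line `c` at bytes `4..5`, one line of context on each side yields `"b\nc\nd"`. No buffering of earlier lines is needed, because the surrounding lines are recovered from the original string.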
Note on workaround
It is possible to create this today by accumulating the total length of the lines passed to the iterator to ultimately create an equivalent position per line, although this only works with `split_inclusive`, since `split` delimiters can have variable length. Because `lines` uses `split_inclusive`, the `lines` case is possible to replicate, but it feels like the cases are common enough to allow a dedicated method.
There is also the alternative `match_indices`, which will provide the exact indices necessary for not replicating too much work on the user's end, but again, it feels like the benefit outweighs the cost here.
Solution sketches
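As a baseline, the two workarounds from the note above can be written in user code today. This is only a sketch under stated assumptions: the function names `line_indices` and `split_indices` are used here for illustration and are not the proposed standard-library implementation, and `split_indices` is restricted to a non-empty `&str` delimiter.

```rust
/// Workaround 1: accumulate the lengths of the pieces from `split_inclusive`.
/// Yields `(byte_offset, line)` pairs with the line terminator stripped.
fn line_indices<'a>(text: &'a str) -> impl Iterator<Item = (usize, &'a str)> + 'a {
    let mut offset = 0;
    text.split_inclusive('\n').map(move |piece| {
        let start = offset;
        offset += piece.len();
        // Strip "\n" or "\r\n"; a final piece without '\n' is kept as is.
        let line = match piece.strip_suffix('\n') {
            Some(l) => l.strip_suffix('\r').unwrap_or(l),
            None => piece,
        };
        (start, line)
    })
}

/// Workaround 2: derive the pieces of `split` from `match_indices`, which
/// reports where each delimiter starts and how long it is.
fn split_indices<'a>(
    text: &'a str,
    delim: &'a str,
) -> impl Iterator<Item = (usize, &'a str)> + 'a {
    let mut matches = text.match_indices(delim);
    let mut start = 0;
    let mut done = false;
    std::iter::from_fn(move || {
        if done {
            return None;
        }
        match matches.next() {
            Some((i, m)) => {
                let item = (start, &text[start..i]);
                start = i + m.len();
                Some(item)
            }
            None => {
                // No more delimiters: emit the tail once, then stop.
                done = true;
                Some((start, &text[start..]))
            }
        }
    })
}
```

Both functions work, but each caller has to re-derive the bookkeeping (offset accumulation, terminator stripping, tail handling) that the splitting iterators already track internally.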
While the main desire here is for a `line_indices` iterator, many splitting iterators all use an internal `SplitInternal` iterator, and it feels reasonable to create a with-indices version of this, then augment all of the related methods to have a with-indices version. This would look something like:
And all of the iterators would return `(usize, &str)` items instead of just `&str`, to mirror `char_indices`, which returns `(usize, char)`.
This could potentially be extended to the `splitn` variants, which have a separate `SplitNInternal` iterator:
Links and related work
N/A for now
What happens now?
This issue is part of the libs-api team API change proposal process. Once this issue is filed the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.