Skip to content

ACP: String-splitting iterators with indices #222

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
clarfonthey opened this issue May 15, 2023 · 5 comments
Closed

ACP: String-splitting iterators with indices #222

clarfonthey opened this issue May 15, 2023 · 5 comments
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api

Comments

@clarfonthey
Copy link

clarfonthey commented May 15, 2023

Proposal

Problem statement

When using the split and similar iterators for strings, only the string slices are returned, without any indication of where they can be found in the original string. This is useful for processing the lines individually, but prevents later processing of the entire string based upon these lines.

Motivation, use-cases

Sentinel lines

Some file formats use sentinel lines to delimit the specific sections, like pacman's "desc" format:

%NAME%
rustup

%VERSION%
1.25.2-1

%BASE%
rustup

%DESC%
The Rust toolchain installer

%URL%
https://github.com/rust-lang/rustup.rs

%ARCH%
x86_64

%BUILDDATE%
1675514879

%INSTALLDATE%
1676666348

%PACKAGER%
Orhun Parmaksız <[email protected]>

%SIZE%
8688993

%LICENSE%
MIT
Apache

%VALIDATION%
pgp

%REPLACES%
cargo-tree

%DEPENDS%
curl
xz
zstd

%OPTDEPENDS%
lldb: rust-lldb script
gdb: rust-gdb script

%CONFLICTS%
rust
cargo
rustfmt

%PROVIDES%
rust
cargo
rust-nightly
cargo-nightly
rustfmt
rust-src
lib32-rust-libs
rust-musl
rust-wasm

In these cases, it seems most useful to be able to iterate via lines, search for the sentinel sections, then later pass these individual sections to different parts of processing. While this could be accomplished by using a state machine inside the iterator, this would likely be more complicated to read and potentially less performant.

Isolating lines/spans

Imagine an iterator that parses the lines of a string into some other type, where the error returns the line that failed. If the type is parsed from a string slice, then this isn't a problem, since the error can have the same lifetime as the original slice. However, if the type being parsed is an owned string, then the specific line can't be returned directly, and instead has to be allocated into a separate string to be returned. If indices to the line were provided, then these could simply be saved and the original string buffer could be truncated to just the line, saving on allocations.

Additionally, beyond this (relatively contrived) example, indices are also useful for expanding failed lines into a multi-line span, like Rust does with its error messages. If the position of the line is obtained, then the string can be scanned backward and forward some number of lines back, and those indices can be used instead to slice a larger span. Without this, the caller would have to keep a buffer of some number of lines back, then later concatenate them into a larger string on error, which seems like way more work.

Note on workaround

It is possible to create this today by accumulating the total length of lines passed to the iterator to ultimately create an equivalent position per line, although this only works with split_inclusive since split delimiters can have variable length. While lines uses split_inclusive, this means that the lines case is possible to replicate, but it feels like the cases are common enough to allow a dedicated method.

There is also the alternative match_indices which will provide the exact indices necessary for not replicating too much work on the user's end, but again, it feels like the benefit outweighs the cost here.

Solution sketches

While the main desire here is for a line_indices iterator, many splitting iterators all use an internal SplitInternal iterator, and it feels reasonable to create a with-indices version of this, then augment all of the related methods to have a with-indices version. This would look something like:

impl str {
    pub fn line_indices(&self) -> LineIndices<'a>;
    pub fn split_indices<'a, P: Pattern<'a>>(&'a self, pat: P) -> SplitIndices<'a, P>;
    pub fn split_inclusive_indices<'a, P: Pattern<'a>>(&'a self, pat: P) -> SplitInclusiveIndices<'a, P>;
    pub fn rsplit_indices<'a, P: Pattern<'a, Searcher: ReverseSearcher<'a>>>(&'a self, pat: P) -> RSplitIndices<'a, P>;
    pub fn split_terminator_indices<'a, P: Pattern<'a>>(&'a self, pat: P) -> SplitTerminatorIndices<'a, P>;
    pub fn rsplit_terminator_indices<'a, P: Pattern<'a, Searcher: ReverseSearcher<'a>>>(&'a self, pat: P) -> RSplitTerminatorIndices<'a, P>;
}

And all of the iterators would return (usize, &str) items instead of just &str, to mirror char_indices which returns (usize, char).

This could potentially be extended to the splitn variants, which have a separate SplitNInternal iterator:

impl str {
    pub fn splitn_indices<'a, P: Pattern<'a>>(&'a self, n: usize, pat: P) -> SplitNIndices<'a, P>;
    pub fn rsplitn_indices<'a, P: Pattern<'a, Searcher: ReverseSearcher<'a>>>(&'a self, n: usize, pat: P) -> RSplitNIndices<'a, P>;
}

Links and related work

N/A for now

What happens now?

This issue is part of the libs-api team API change proposal process. Once this issue is filed the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

@clarfonthey clarfonthey added api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api labels May 15, 2023
@pitaj
Copy link

pitaj commented May 15, 2023

This runs into the same issue as filling out the str API with things like rsplit_inclusive: it adds a lot of methods that are pretty niche.

This is useful however and could easily be built on top of my proposal for a string splitting builder API: #168

@m-ou-se
Copy link
Member

m-ou-se commented May 16, 2023

Is this the best/simplest API for this? What about e.g. a .index() on the existing Split types? Are there other (simple) alternatives?

@joshtriplett
Copy link
Member

This seems like a large number of additional APIs for something that could have a few different alternatives.

This seems like it could be done by checking the offset of the &str from the original &str. We don't currently seem to have a clean API for that, but we could add one, and that seems like it'd solve various problems including this one.

@BurntSushi
Copy link
Member

We don't currently seem to have a clean API for that, but we could add one, and that seems like it'd solve various problems including this one.

I like this idea.

@clarfonthey
Copy link
Author

I personally like @pitaj's idea of using a builder API for this. Honestly, despite the builder pattern being incredibly widespread across the Rust ecosystem, we don't see it too much in the standard library. I'm going to close this in favour of that proposal.

@clarfonthey clarfonthey closed this as not planned Won't fix, can't repro, duplicate, stale Jun 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api
Projects
None yet
Development

No branches or pull requests

5 participants