Skip to content

Add substr_range, elem_offset, and subslice_range methods #382

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Tracked by #126769
wr7 opened this issue May 26, 2024 · 18 comments
Closed
Tracked by #126769

Add substr_range, elem_offset, and subslice_range methods #382

wr7 opened this issue May 26, 2024 · 18 comments
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api

Comments

@wr7
Copy link

wr7 commented May 26, 2024

Proposal

Add the following methods

impl str {
    fn substr_range(&self, item: &str) -> Option<Range<usize>>;
}
impl<T> [T] {
    fn subslice_range(&self, item: &[T]) -> Option<Range<usize>>;
    fn elem_offset(&self, item: &T) -> Option<usize>;
}

NOTE: These methods are completely different in both behavior and functionality from str::find and friends. They do not compare characters/items inside of the str/slices. Instead they utilize pointer arithmetic to find where a subslice/item is located in the str/slice.

Problem statement

This change attempts to fix two distinct issue. The first of which is the inflexibility of slice::split, str::split, str::matches, and related methods. The second is for parse errors.

Problem 1

str::split and its related functions are really convenient, but I find myself having to manually implement them if I want more complex logic.

For instance, lets say that I want to split up a string by commas. Specifically, I want to do something like str::split_inclusive, but I want to include the separators at the beginning instead of the end.

You cannot do this with str::split

let foo = "a,bc,d";
for sub in foo.split(",") {
    // Now I have the split up string, but it doesn't have the comma at the beginning :-(
}

Instead, you have to manually do it yourself

    let mut foo = "a,bc,d";

    while let Some(comma_idx) = foo.get(1..).map(|f| 1+f.find(",").unwrap_or(f.len())) {
        let sub = &foo[..comma_idx];
        foo = &foo[comma_idx..];

        // sub now has leading comma
    }

Problem 2

Say I have a function for parsing and a helper function. The Range denotes where in the string the error occurs. We can use that Range to tell the user where the error is if one occurs.

pub fn parse<'a>(in: &'a str) -> Result<usize, Range<usize>> {
    // omitted //
    let a = parse_helper(&in[x..y])?;
    let b = parse_helper(&in[y+1..])?;
   // omitted //
}
fn parse_helper<'a>(in: &'a str) -> Result<usize, Range<usize>> { /* omitted */ }

The issue with this is that the range error from parse_helper is relative to the substring passed to it. This means that we have to manually offset the Range inside of parse.

pub fn parse<'a>(in: &'a str) -> Result<usize, Range<usize>> {
    // omitted //
    let a = parse_helper(&in[x..y]).map_err(|e| e.start + x..e.end + x)?;
    let b = parse_helper(&in[y+1..]).map_err(|e| e.start + y+1..e.end + y+1)?;
   // omitted //
}
// ... //

We could also pass an offset to parse_helper and use that to adjust the returned Range, but that would just move the complexity to the parse_helper function.

This gets even worse if instead of a Range, we have an enum that contains Ranges. Say

enum ParseError {
   Error1(Range<usize>),
   Error2(Range<usize>, Range<usize>),
}

Motivating examples or use cases

Split and similar methods

subslice_offset would allow using indices to extend methods like str::split. We can implement the aforementioned inclusive str::split like so:

let foo = "a,bc,d,";
for sub in foo.split(",") {
    let idx = foo.substr_range(sub).unwrap();
    let sub = &foo[idx.start.saturating_sub(1)..idx.end];
    // Now `sub` is "a", ",bc", ",d", and ","
}

Error handling with string input data

We can also use this method to remove complexity from the code described in Problem 2.

pub fn parse<'a>(in: &'a str) -> Result<usize, &'a str> {
    // omitted //
    let a = parse_helper(&in[x..y])?;
    let b = parse_helper(&in[y+1..])?;
   // omitted //
}
fn parse_helper<'a>(in: &'a str) -> Result<usize, &'a str> { /* omitted */ }

Instead of returning Ranges, we can return &'a str. Then, if a caller of parse wants to find where the error occurred, they can do

let input = "9+10-21";
match parse(input) {
    Ok(val) => { /* omitted */ },
    Err(err) => {
        display_error(input.substr_range(err).unwrap());
    },

Solution sketch

impl str {
    fn substr_range(&self, substr: &str) -> Option<Range<usize>> {
        self.as_bytes().subslice_range(substr.as_bytes())
    }
}

impl<T> [T] {
    fn subslice_range(&self, subslice: &[T]) -> Option<Range<usize>> {
        if core::mem::size_of::<T>() == 0 {
            panic!();
        }

        let self_start = self.as_ptr() as usize;
        let subslice_start = subslice.as_ptr() as usize;

        let byte_start = subslice_start.wrapping_sub(self_start);
        let start = byte_start / core::mem::size_of::<T>();
        let end = start.wrapping_add(subslice.len());

        if byte_start % core::mem::size_of::<T>() != 0 {
            return None;
        }

        if start <= self.len() && end <= self.len() {
            Some(start..end)
        } else {
            None
        }
    }

    fn elem_offset(&self, element: &T) -> Option<usize> {
        if core::mem::size_of::<T>() == 0 {
            panic!();
        }

        let self_start = self.as_ptr() as usize;
        let elem_start = element as *const T as usize;

        let byte_offset = elem_start.wrapping_sub(self_start);
        let offset = byte_offset / core::mem::size_of::<T>();

        if byte_offset % core::mem::size_of::<T>() != 0 {
            return None;
        }

        if offset < self.len() {
            Some(offset)
        } else {
            None
        }
    }
}

The following code demonstrates the above behavior

    let foo = "mississippi".to_owned();
    let is = &foo[4..6];

    assert_eq!(is, "is");
    assert_eq!(foo.substr_offset(is), Some(4..6));

    assert_eq!(foo.substr_offset("is"), None); 

Alternatives

Links and related work

subslice_offset crate

Old, deprecated str::subslice_offset. (deprecated here).
This was deprecated with str::find being listed as its replacement, but as mentioned before, this has different functionality.

Original subslice_offset PR

rust-lang/rfcs#2796 (an RFC similar to this but only for slices). The PR for this was abandoned because the author did not have time to make suggested changes.

https://github.com/wr7/rfcs/blob/substr_range/text/3648-substr-range.md - An RFC that I wrote for this

https://stackoverflow.com/questions/50781561/how-to-find-the-starting-offset-of-a-string-slice-of-another-string

@wr7 wr7 added api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api labels May 26, 2024
@wr7 wr7 changed the title Method for finding index of a substring Bring back subslice_offset methods May 26, 2024
@joshtriplett
Copy link
Member

We reviewed this in today's @rust-lang/libs-api meeting. We felt that the concept was reasonable, but the names and signatures needed some adjustment.

  • These should return an Option, and return None if out of bounds.
  • Since these have different return types and semantics, they should have different naming: substr_range for the str method, and elem_offset for the [T] method.

(In theory we could also add a subslice_range method that takes &[T] and returns Option<Range<usize>>, but that wasn't part of this ACP.)

@scottmcm
Copy link
Member

subslice_range does seem quite reasonable, since one can always .start if needed, and the rest will optimize out easily.

@wr7
Copy link
Author

wr7 commented May 28, 2024

EDIT: It seems that I was confused about the RFC/ACP process (see rust-lang/rfcs#3648 (comment)). I've updated the ACP with your suggestions.

Thank you for reviewing my APC. I also like the elem_offset and substr_range names. The subslice_range method also makes sense.

I've made an RFC PR for this.

@wr7 wr7 closed this as completed May 28, 2024
@wr7 wr7 reopened this May 28, 2024
@wr7 wr7 changed the title Bring back subslice_offset methods Add substr_range, elem_offset, and subslice_range methods May 28, 2024
@wr7
Copy link
Author

wr7 commented May 28, 2024

I think that panicking instead of returning None might make sense though. This isn't a hill that I'm willing to die on, but returning an Option requires an unwrap which makes using this more verbose. I feel like in most cases (such as when using it with str::split or str::lines), substr_range should never return None.

@kennytm
Copy link
Member

kennytm commented May 29, 2024

These functions are essentially doing pointer::offset_from. However, these functions are marketed as "safe" and thus can't assume the input is really a subslice of self, and thus must use self.as_ptr() as usize - subslice.as_ptr() as usize and forego potential optimization opportunity.

Also note that the current implementation can give nonsense result without invoking unsafe code.

fn main() {
    let a = [1, 2, 3, 4];
    let b = [5, 6, 7, 8];
    dbg!(a.subslice_range(&a[4..4])); // good, Some(4..4)
    dbg!(b.subslice_range(&a[4..4])); // bad!, expected None got Some(0..0)
}

@scottmcm
Copy link
Member

scottmcm commented May 29, 2024

However, these functions are marketed as "safe" and thus can't assume the input is really a subslice of self

Hmm, and provenance causes a mess here, doesn't it. Checking that the pointers meet slice.begin <= item <= slice.end isn't enough to be able to sub_ptr, since that can be true even if item comes from a separate allocation that just so happens to be right after self in memory -- especially with item = &[] :/

@programmerjake
Copy link
Member

I think elem_offset can be implemented reliably since it doesn't have to handle the one-past-the-end edge-case which the others have to. it would be kinda a pain but you can still use it to retrieve string offsets:

fn str_offset_if_non_empty(a: &str, b: &str) -> Option<usize> {
    let b = b.as_bytes().first()?;
    a.as_bytes().elem_offset(b)
}

@programmerjake
Copy link
Member

programmerjake commented May 29, 2024

actually, elem_offset still has a gotcha: what if you pass it a pointer to an element that overlaps two elements?

let arr = [[0, 1], [2, 3]];
let weird_elm: &[_; 2] = &arr.flatten()[1..3].try_into().unwrap();
assert_eq!(weird_elm, [1, 2]);
let offset = arr.elem_offset(weird_elm);
assert_eq!(offset, None); // what I expect, since it isn't exactly at any one element

@the8472
Copy link
Member

the8472 commented May 29, 2024

However, these functions are marketed as "safe" and thus can't assume the input is really a subslice of self

Hmm, and provenance causes a mess here, doesn't it. Checking that the pointers meet slice.begin <= item <= slice.end isn't enough to be able to sub_ptr, since that can be true even if item comes from a separate allocation that just so happens to be right after self in memory -- especially with item = &[] :/

We could return an Option<RangeInclusive> which naturally excludes empty slices. Perhaps that's not always what people want, but then they're forced to think about where, logically, an empty slice should be located in a string, it could be practically anywhere. Sure, the reference might come with a valid address, but that might get clobbered anyway along the way by things that conjure up &[] from scratch so it's iffy to rely on that.

Alternatively we could always turn empty ranges into None

Or we could let the weird cases through and document that this probably isn't the method you'd want if you're doing anything unsafe with those references since it's equivalent to wrapping_sub, not offset_from.

@wr7
Copy link
Author

wr7 commented May 29, 2024

actually, elem_offset still has a gotcha: what if you pass it a pointer to an element that overlaps two elements?

let arr = [[0, 1], [2, 3]];
let weird_elm: &[_; 2] = &arr.flatten()[1..3].try_into().unwrap();
assert_eq!(weird_elm, [1, 2]);
let offset = arr.elem_offset(weird_elm);
assert_eq!(offset, None); // what I expect, since it isn't exactly at any one element

We can check the remainder. I was curious about the performance impact for cases where this arr.flatten thing isn't an issue, so I used Godbolt to see if this would change the generated assembly for [u32]'s elem_offset, and it didn't. I assume this is because both the element and the slice are aligned to four bytes, and the compiler knows this.

    fn elem_offset(&self, element: &T) -> Option<usize> {
        let self_start = self.as_ptr() as usize;
        let elem_start = element as *const T as usize;

        let byte_offset = elem_start.wrapping_sub(self_start);
        let offset = byte_offset / core::mem::size_of::<T>();

        if byte_offset % core::mem::size_of::<T>() != 0 {
            return None;
        }

        if offset < self.len() {
            Some(offset)
        } else {
            None
        }
    }

Godbolt

With this change, I tested the following code, and it appears to work:

    let arr = [[0, 1], [2, 3]];
    let flat_array: &[u32; 4] = unsafe { &*addr_of!(arr).cast::<[u32; 4]>() };

    let ok_elm: &[_; 2] = flat_array[0..2].try_into().unwrap();
    let weird_elm: &[_; 2] = flat_array[1..3].try_into().unwrap();

    assert_eq!(ok_elm, &[0, 1]);
    assert_eq!(weird_elm, &[1, 2]);

    assert_eq!(arr.elem_offset(ok_elm), Some(0)); // This still works
    assert_eq!(arr.elem_offset(weird_elm), None); // This correctly returns `None`

@wr7
Copy link
Author

wr7 commented May 29, 2024

The issue with subslice_range where start == end and start == self.len() is definitely tricky.

I think that cases with zero length ranges where start == self.len() and zero length ranges in general are very valuable. If you look at the example in this ACP, disallowing this case would cause the following to panic:

let foo = "a,bc,d,";
for sub in foo.split(",") {
    let idx = foo.substr_range(sub).unwrap();
    let sub = &foo[idx.start.saturating_sub(1)..idx.end];
    // `sub` would be "a", ",bc", ",d", and then, it would panic on the `unwrap`
}

However, it seems impossible to truly know if something was from the same allocation or not, and it's weird to have the behavior be "this probably will return None, but it may return Some(self.len()..self.len()) under very specific conditions".

I think it might make more sense to panic instead. I feel like, in most use cases, such as the ones presented in my ACP, if the user calls one of these methods, and the given slice is not a subslice, the user's code contains a bug somewhere. I can't really imagine too many use cases where a slice may point inside of another slice or it may point somewhere else.

I think this helps with this issue because it discourages people from writing code that checks if a subslice is contained within another slice (which we've ascertained to be impossible in certain circumstances). I think that it also makes more sense to have methods where "they probably will panic under certain circumstances, but depending on the behavior of the compiler, they may just return self.len()..self.len()". This is kinda how the integer arithmetic operations work where in Debug mode, they panic, but in release mode, they may wrap around.

This is really the most ergonomic and practical route for this in my opinion, but I can definitely see how an implementation like this could still be undesirable for the standard library.

EDIT: Nevermind, see #382 (comment)

@wr7
Copy link
Author

wr7 commented May 29, 2024

Requiring start to be less than self.len() may also fix this issue. We could do:

    fn subslice_range(&self, subslice: &[T]) -> Option<Range<usize>> {
        let self_start = self.as_ptr() as usize;
        let subslice_start = subslice.as_ptr() as usize;

        let start = subslice_start.wrapping_sub(self_start) / core::mem::sizeof::<T>();
        let end = start + subslice.len();

        if start < self.len() && end <= self.len() {
            Some(start..end)
        } else {
            None
        }
    }

This is definitely less powerful, but it does still have some use cases. If you look at the example from rust doc shown here, you could easily implement that using this watered-down version like so:

/// Separate any lines at the start of the file that begin with `%`.
fn extract_leading_metadata<'a>(s: &'a str) -> (Vec<&'a str>, &'a str) {
    let mut metadata = Vec::new();
    for line in s.lines() {
        if line.starts_with("%") {
            // remove %<whitespace>
            metadata.push(line[1..].trim_left());
        } else {
            let Some(line_idx) = s.substr_range(line) else { break };
            return (metadata, &s[line_idx.start..]);
        }
    }
    // if we're here, then all lines were metadata % lines.
    (metadata, "")
}

@wr7
Copy link
Author

wr7 commented May 29, 2024

However, these functions are marketed as "safe" and thus can't assume the input is really a subslice of self

Hmm, and provenance causes a mess here, doesn't it. Checking that the pointers meet slice.begin <= item <= slice.end isn't enough to be able to sub_ptr, since that can be true even if item comes from a separate allocation that just so happens to be right after self in memory -- especially with item = &[] :/

We could return an Option<RangeInclusive> which naturally excludes empty slices. Perhaps that's not always what people want, but then they're forced to think about where, logically, an empty slice should be located in a string, it could be practically anywhere. Sure, the reference might come with a valid address, but that might get clobbered anyway along the way by things that conjure up &[] from scratch so it's iffy to rely on that.

Alternatively we could always turn empty ranges into None

Or we could let the weird cases through and document that this probably isn't the method you'd want if you're doing anything unsafe with those references since it's equivalent to wrapping_sub, not offset_from.

It would definitely be worth documenting that.

This RangeInclusive version would definitely make the self.len()..self.len() edge case less surprising because self.len()..self.len() would be unrepresentable with RangeInclusive, but this would also take away a lot of power from these methods, so I don't think that it would be worth it.

I did think about the clobbering issue when coming up with the original ACP. So far though, I haven't had any issues with the standard library functions doing that. I can see how something like that could cause weird/unexpected behavior though with other crates. It might be worth mentioning this somewhere in the proposed method descriptions.

@wr7
Copy link
Author

wr7 commented May 29, 2024

The issue with subslice_range where start == end and start == self.len() is definitely tricky.

I think that cases with zero length ranges where start == self.len() and zero length ranges in general are very valuable. If you look at the example in this ACP, disallowing this case would cause the following to panic:

let foo = "a,bc,d,";
for sub in foo.split(",") {
    let idx = foo.substr_range(sub).unwrap();
    let sub = &foo[idx.start.saturating_sub(1)..idx.end];
    // `sub` would be "a", ",bc", ",d", and then, it would panic on the `unwrap`
}

However, it seems impossible to truly know if something was from the same allocation or not, and it's weird to have the behavior be "this probably will return None, but it may return Some(self.len()..self.len()) under very specific conditions".

I think it might make more sense to panic instead. I feel like, in most use cases, such as the ones presented in my ACP, if the user calls one of these methods, and the given slice is not a subslice, the user's code contains a bug somewhere. I can't really imagine too many use cases where a slice may point inside of another slice or it may point somewhere else.

I think this helps with this issue because it discourages people from writing code that checks if a subslice is contained within another slice (which we've ascertained to be impossible in certain circumstances). I think that it also makes more sense to have methods where "they probably will panic under certain circumstances, but depending on the behavior of the compiler, they may just return self.len()..self.len()". This is kinda how the integer arithmetic operations work where in Debug mode, they panic, but in release mode, they may wrap around.

This is really the most ergonomic and practical route for this in my opinion, but I can definitely see how an implementation like this could still be undesirable for the standard library.

I just realized that if we return None when start == self.len(), you can just do:

let foo = "a,bc,d,";
for sub in foo.split(",") {
    let idx = foo.substr_range(sub).unwrap_or(foo.len()..foo.len());
    let sub = &foo[idx.start.saturating_sub(1)..idx.end];
    // `sub` would be "a", ",bc", ",d", and ","
}

As mentioned in this comment, this may fix the issue. This change makes code that uses these methods slightly more verbose, but it definitely works!

I've updated the ACP with this change along with the remainder change.

@joshtriplett
Copy link
Member

@wr7 I think that's sufficiently tricky that it'd be preferable to not worry about the provenance case. You shouldn't be able to do anything broken with the return value in safe code even if you passed something from an adjacent allocation, so it's only unsafe code where you'd have to be careful about that.

@wr7
Copy link
Author

wr7 commented May 29, 2024

Another (less important) edge case would be slices of ZSTs. For ZSTs, these methods obviously do not work.

For those, I think that None should be returned.

@scottmcm
Copy link
Member

For ZSTs, these methods obviously do not work.

Well, this is where it's important to be extremely precise about what the postcondition for the method is. It's possible that Some(0) for same-address and None otherwise would be a legal implementation.

Also, for something type-specific, it might be reasonable to panic, since that panic is clearly either always hit or never hit by the type.

@wr7
Copy link
Author

wr7 commented May 30, 2024

For ZSTs, these methods obviously do not work.

Well, this is where it's important to be extremely precise about what the postcondition for the method is. It's possible that Some(0) for same-address and None otherwise would be a legal implementation.

Also, for something type-specific, it might be reasonable to panic, since that panic is clearly either always hit or never hit by the type.

Yeah. I think that panicking definitely makes sense. For non-zero-sized types, this would obviously have zero overhead.

In my opinion, panicking is also the least surprising behavior for this edge-case. When writing type-generic code, one may expect the following assertion to succeed if n < foo.len(), but with ZSTs, that behavior cannot be upheld. I think panicking instead of having this edge case would be more sane.

fn make_assertion<T>(foo: &[T], n: usize) {
    let elem = &foo[n];

    assert_eq!(foo.elem_offset(elem), Some(n)); // Cannot always be upheld if `T` is a ZST
}

Returning None or Some(0) would make bugs that are based off of this assumption harder to track down.

Additionally, I think returning Some(0) or any non-None value based on address comparisons is too unpredictable/unreliable.
Suppose that, for ZSTs, we do:

    fn elem_offset(&self, element: &T) -> Option<usize> {
        let self_start = self.as_ptr() as usize;
        let elem_start = element as *const T as usize;

        (self_start == elem_start).then_some(0)
    }

With this implementation elem_offset with zero sized types has exactly two behaviors:

  • If element points inside of self, Some(0) will be returned.
  • If element was obtained elsewhere, the return value is inconsistent. It may return None, but it could also return Some(0).

This means that elem_offset would only have one guaranteed return value.

The following code demonstrates this unreliability:

    let zero_slc = [(); 5];
    let val_ref = ();

    assert_eq!(zero_slc.elem_offset(&zero_slc[0]), Some(0)); // this should always succeed
    assert_eq!(zero_slc.elem_offset(&val_ref), None); // inconsistent: currently succeeds in debug mode but not release mode

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api
Projects
None yet
Development

No branches or pull requests

6 participants