Add substr_range, elem_offset, and subslice_range methods #382
Comments
We reviewed this in today's @rust-lang/libs-api meeting. We felt that the concept was reasonable, but the names and signatures needed some adjustment.
(In theory we could also add a |
|
EDIT: It seems that I was confused about the RFC/ACP process (see rust-lang/rfcs#3648 (comment)). I've updated the ACP with your suggestions. Thank you for reviewing my ACP. I also like the
|
Title changed from "subslice_offset methods" to "substr_range, elem_offset, and subslice_range methods"
I think that panicking instead of returning |
These functions are essentially doing

Also note that the current implementation can give a nonsensical result without invoking unsafe code.

```rust
fn main() {
    let a = [1, 2, 3, 4];
    let b = [5, 6, 7, 8];
    dbg!(a.subslice_range(&a[4..4])); // good, Some(4..4)
    dbg!(b.subslice_range(&a[4..4])); // bad! expected None, got Some(0..0)
}
```
|
Hmm, and provenance causes a mess here, doesn't it. Checking that the pointers meet |
I think

```rust
fn str_offset_if_non_empty(a: &str, b: &str) -> Option<usize> {
    let b = b.as_bytes().first()?;
    a.as_bytes().elem_offset(b)
}
```
|
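actually,

```rust
let arr = [[0, 1], [2, 3]];
// `as_flattened` was still called `flatten` when this was written.
// No leading `&` here: the reference must point into `arr`, not at a temporary copy.
let weird_elm: &[_; 2] = arr.as_flattened()[1..3].try_into().unwrap();
assert_eq!(weird_elm, &[1, 2]);
let offset = arr.elem_offset(weird_elm);
assert_eq!(offset, None); // what I expect, since it isn't exactly at any one element
```
|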
We could return an

Alternatively we could always turn empty ranges into

Or we could let the weird cases through and document that this probably isn't the method you'd want if you're doing anything |
We can check the remainder. I was curious about the performance impact for cases where this

```rust
fn elem_offset(&self, element: &T) -> Option<usize> {
    let self_start = self.as_ptr() as usize;
    let elem_start = element as *const T as usize;
    let byte_offset = elem_start.wrapping_sub(self_start);

    // A reference that straddles two elements leaves a nonzero remainder.
    if byte_offset % core::mem::size_of::<T>() != 0 {
        return None;
    }

    let offset = byte_offset / core::mem::size_of::<T>();
    if offset < self.len() {
        Some(offset)
    } else {
        None
    }
}
```

With this change, I tested the following code, and it appears to work:

```rust
use core::ptr::addr_of;

let arr = [[0, 1], [2, 3]];
// View the nested array as four contiguous `u32`s.
let flat_array: &[u32; 4] = unsafe { &*addr_of!(arr).cast::<[u32; 4]>() };
let ok_elm: &[_; 2] = flat_array[0..2].try_into().unwrap();
let weird_elm: &[_; 2] = flat_array[1..3].try_into().unwrap();
assert_eq!(ok_elm, &[0, 1]);
assert_eq!(weird_elm, &[1, 2]);
assert_eq!(arr.elem_offset(ok_elm), Some(0)); // This still works
assert_eq!(arr.elem_offset(weird_elm), None); // This correctly returns `None`
```
|
The issue with

I think that cases with zero length ranges where

```rust
let foo = "a,bc,d,";
for sub in foo.split(",") {
    let idx = foo.substr_range(sub).unwrap();
    let sub = &foo[idx.start.saturating_sub(1)..idx.end];
    // `sub` would be "a", ",bc", ",d", and then, it would panic on the `unwrap`
}
```

However, it seems impossible to truly know if something was from the same allocation or not, and it's weird to have the behavior be "this probably will return

EDIT: Never mind, see #382 (comment) |
Requiring

```rust
fn subslice_range(&self, subslice: &[T]) -> Option<Range<usize>> {
    let self_start = self.as_ptr() as usize;
    let subslice_start = subslice.as_ptr() as usize;
    let start = subslice_start.wrapping_sub(self_start) / core::mem::size_of::<T>();
    let end = start + subslice.len();
    if start < self.len() && end <= self.len() {
        Some(start..end)
    } else {
        None
    }
}
```

This is definitely less powerful, but it does still have some use cases. If you look at the example from the Rust docs shown here, you could easily implement that using this watered-down version like so:

```rust
/// Separate any lines at the start of the file that begin with `%`.
fn extract_leading_metadata<'a>(s: &'a str) -> (Vec<&'a str>, &'a str) {
    let mut metadata = Vec::new();
    for line in s.lines() {
        if line.starts_with("%") {
            // remove %<whitespace>
            metadata.push(line[1..].trim_start());
        } else {
            let Some(line_idx) = s.substr_range(line) else { break };
            return (metadata, &s[line_idx.start..]);
        }
    }
    // if we're here, then all lines were metadata % lines.
    (metadata, "")
}
```
|
It would definitely be worth documenting that. This I did think about the clobbering issue when coming up with the original ACP. So far though, I haven't had any issues with the standard library functions doing that. I can see how something like that could cause weird/unexpected behavior though with other crates. It might be worth mentioning this somewhere in the proposed method descriptions. |
I just realized that if we return

```rust
let foo = "a,bc,d,";
for sub in foo.split(",") {
    let idx = foo.substr_range(sub).unwrap_or(foo.len()..foo.len());
    let sub = &foo[idx.start.saturating_sub(1)..idx.end];
    // `sub` would be "a", ",bc", ",d", and ","
}
```

As mentioned in this comment, this may fix the issue. This change makes code that uses these methods slightly more verbose, but it definitely works! I've updated the ACP with this change along with the remainder change. |
@wr7 I think that's sufficiently tricky that it'd be preferable to not worry about the provenance case. You shouldn't be able to do anything broken with the return value in safe code even if you passed something from an adjacent allocation, so it's only unsafe code where you'd have to be careful about that. |
Another (less important) edge case would be slices of ZSTs. For ZSTs, these methods obviously do not work. For those, I think that |
Well, this is where it's important to be extremely precise about what the postcondition for the method is. It's possible that Also, for something type-specific, it might be reasonable to panic, since that panic is clearly either always hit or never hit by the type. |
Yeah. I think that panicking definitely makes sense. For non-zero-sized types, this would obviously have zero overhead. In my opinion, panicking is also the least surprising behavior for this edge case. When writing type-generic code, one may expect the following assertion to succeed if

```rust
fn make_assertion<T>(foo: &[T], n: usize) {
    let elem = &foo[n];
    assert_eq!(foo.elem_offset(elem), Some(n)); // Cannot always be upheld if `T` is a ZST
}
```

Returning

Additionally, I think returning

```rust
fn elem_offset(&self, element: &T) -> Option<usize> {
    let self_start = self.as_ptr() as usize;
    let elem_start = element as *const T as usize;
    (self_start == elem_start).then_some(0)
}
```

With this implementation

This means that

The following code demonstrates this unreliability:

```rust
let zero_slc = [(); 5];
let val_ref = ();
assert_eq!(zero_slc.elem_offset(&zero_slc[0]), Some(0)); // this should always succeed
assert_eq!(zero_slc.elem_offset(&val_ref), None); // inconsistent: currently succeeds in debug mode but not release mode
```
|
Proposal
Add the following methods
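For reference, here is a minimal sketch of the three methods written as free functions on stable Rust, following the signatures used in the comment thread. It assumes the remainder check and the panic on zero-sized types that came up in the discussion; the free-function form and the parameter names are illustrative and not necessarily the final API.

```rust
use std::mem::size_of;
use std::ops::Range;

/// Index of `element` within `slice`, found by pointer arithmetic alone.
/// Panics if `T` is zero-sized, because all elements then share one address.
pub fn elem_offset<T>(slice: &[T], element: &T) -> Option<usize> {
    let size = size_of::<T>();
    assert!(size != 0, "elements are zero-sized");

    let slice_start = slice.as_ptr() as usize;
    let elem_start = element as *const T as usize;
    let byte_offset = elem_start.wrapping_sub(slice_start);

    // A reference that straddles two elements is not "at" any index.
    if byte_offset % size != 0 {
        return None;
    }
    let offset = byte_offset / size;
    (offset < slice.len()).then_some(offset)
}

/// Range that `subslice` occupies inside `slice`, or `None` if it does not
/// lie entirely within `slice`. No element values are compared.
pub fn subslice_range<T>(slice: &[T], subslice: &[T]) -> Option<Range<usize>> {
    let size = size_of::<T>();
    assert!(size != 0, "elements are zero-sized");

    let slice_start = slice.as_ptr() as usize;
    let sub_start = subslice.as_ptr() as usize;
    let byte_offset = sub_start.wrapping_sub(slice_start);
    if byte_offset % size != 0 {
        return None;
    }
    let start = byte_offset / size;
    let end = start.checked_add(subslice.len())?;
    // `start == slice.len()` is allowed so an empty subslice at the very end is found.
    (start <= slice.len() && end <= slice.len()).then_some(start..end)
}

/// Byte range that `substr` occupies inside `s`; a thin wrapper over the byte slices.
pub fn substr_range(s: &str, substr: &str) -> Option<Range<usize>> {
    subslice_range(s.as_bytes(), substr.as_bytes())
}
```

Because the result is derived only from addresses and lengths, a string with equal contents but a different location in memory yields `None`.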
NOTE: These methods are completely different in both behavior and functionality from str::find and friends. They do not compare characters/items inside of the str/slices. Instead, they use pointer arithmetic to find where a subslice/item is located in the str/slice.

Problem statement
This change attempts to fix two distinct issues. The first is the inflexibility of slice::split, str::split, str::matches, and related methods. The second concerns reporting the location of parse errors.

Problem 1
str::split and its related functions are really convenient, but I find myself having to manually implement them if I want more complex logic.

For instance, let's say that I want to split up a string by commas. Specifically, I want to do something like str::split_inclusive, but I want to include the separators at the beginning instead of the end.

You cannot do this with str::split. Instead, you have to do it manually.
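As an illustration of the manual approach, the following sketch splits a string so that each separator starts the next piece. The function name `split_with_leading_separator` is hypothetical, and this is only one of several ways to write it by hand.

```rust
/// Splits `s` so that each separator begins the following piece:
/// "a,bc,d" becomes ["a", ",bc", ",d"].
pub fn split_with_leading_separator(s: &str, sep: char) -> Vec<&str> {
    let mut pieces = Vec::new();
    let mut piece_start = 0;
    for (idx, ch) in s.char_indices() {
        // A separator closes the current piece and opens the next one.
        if ch == sep {
            pieces.push(&s[piece_start..idx]);
            piece_start = idx;
        }
    }
    // Whatever remains after the last separator is the final piece.
    pieces.push(&s[piece_start..]);
    pieces
}
```

The index bookkeeping (`piece_start`, `char_indices`) is exactly the part that `str::split` already does internally but does not expose.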
Problem 2
Say I have a function for parsing and a helper function. The Range denotes where in the string the error occurs. We can use that Range to tell the user where the error is if one occurs.

The issue with this is that the error range returned by parse_helper is relative to the substring passed to it. This means that we have to manually offset the Range inside of parse.

We could also pass an offset to parse_helper and use that to adjust the returned Range, but that would just move the complexity to the parse_helper function.

This gets even worse if instead of a Range, we have an enum that contains Ranges. Say

Motivating examples or use cases
Split and similar methods
subslice_offset would allow using indices to extend methods like str::split. We can implement the aforementioned inclusive str::split like so:

Error handling with string input data
We can also use this method to remove complexity from the code described in Problem 2.

Instead of returning Ranges, we can return &'a str. Then, if a caller of parse wants to find where the error occurred, they can do

Solution sketch
The following code demonstrates the above behavior
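A minimal sketch of both use cases follows. Since the proposed method is not available on stable Rust, `substr_range` here is a hypothetical free-function stand-in for it; `split_keep_leading` and `parse_numbers` are likewise illustrative names and not part of the proposal.

```rust
use std::ops::Range;

/// Stand-in for the proposed `str::substr_range` (hypothetical free function).
/// Locates `substr` inside `s` purely by comparing addresses.
fn substr_range(s: &str, substr: &str) -> Option<Range<usize>> {
    let start = (substr.as_ptr() as usize).wrapping_sub(s.as_ptr() as usize);
    let end = start.checked_add(substr.len())?;
    (start <= s.len() && end <= s.len()).then_some(start..end)
}

/// Like `str::split`, but each separator is kept at the start of the next piece.
fn split_keep_leading<'a>(s: &'a str, sep: &str) -> Vec<&'a str> {
    s.split(sep)
        .map(|piece| {
            // Fall back to an empty range at the end, as suggested in the thread.
            let range = substr_range(s, piece).unwrap_or(s.len()..s.len());
            // Widen the piece to the left so it includes the separator before it.
            &s[range.start.saturating_sub(sep.len())..range.end]
        })
        .collect()
}

/// Hypothetical parser whose error value is the offending substring itself.
fn parse_numbers(input: &str) -> Result<Vec<u32>, &str> {
    input
        .split(',')
        .map(|field| {
            let field = field.trim();
            field.parse::<u32>().map_err(|_| field)
        })
        .collect()
}
```

A caller that wants a position can recover it afterwards with `substr_range(input, bad_field)`, so neither the parser nor its helpers need to track offsets.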
Alternatives
Links and related work
subslice_offset crate
Old, deprecated str::subslice_offset (deprecated here). This was deprecated with str::find being listed as its replacement, but as mentioned before, this has different functionality.
Original subslice_offset PR
rust-lang/rfcs#2796 (an RFC similar to this but only for slices). The PR for this was abandoned because the author did not have time to make suggested changes.
https://github.com/wr7/rfcs/blob/substr_range/text/3648-substr-range.md - An RFC that I wrote for this
https://stackoverflow.com/questions/50781561/how-to-find-the-starting-offset-of-a-string-slice-of-another-string