Added compile time optimization for bytewise slice comparison #39422

Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @aturon (or someone else) soon. If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes. Please see the contribution instructions for more information.
cc @rust-lang/libs @bluss: thoughts on using build scripts in core, and this optimization?

@saschagrunert Do you have any benchmark results for this change? Generally we like to see evidence of significant speedup before introducing optimizations like this, since they impose a maintenance burden.
I'm not sure how relevant this is, but as far as I can tell this will cause unaligned loads (which should generally be avoided as far as I know). I also think it would be nice if the …
Good point. Unaligned loads must be avoided, otherwise the code is not portable.
From a build perspective, using a build script seems fine to me. I'd be slightly worried about code bloat in libcore, but we can evaluate that later.
I don't have a problem per se with build scripts in libcore, but this is quite a lot of ceremony for one little optimization. I do think that bloat in core is a problem. I'm not clear on what this patch is doing, but if we add a bunch of code for optimization purposes, we will definitely be called upon to remove it again conditionally someday.
@aturon Yes, the benchmarks come from the fastcmp crate (including its source), for a 256 byte slice comparison on my local machine:
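A minimal sketch of how such a 256-byte comparison benchmark can be set up with the nightly `test` crate; the `Compare`/`feq` names are assumed from fastcmp's documentation rather than taken from this PR:

```rust
// Sketch of a 256-byte comparison benchmark on nightly Rust. The
// `fastcmp::Compare`/`feq` names are assumed from that crate's documentation.
#![feature(test)]
extern crate fastcmp;
extern crate test;

use fastcmp::Compare;
use test::{black_box, Bencher};

#[bench]
fn fast_compare_256(b: &mut Bencher) {
    let x = vec![42u8; 256];
    let y = vec![42u8; 256];
    b.iter(|| black_box(x.feq(&y)));
}

#[bench]
fn std_compare_256(b: &mut Bencher) {
    let x = vec![42u8; 256];
    let y = vec![42u8; 256];
    b.iter(|| black_box(x == y));
}
```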
I definitely see the code bloat point, since the compilation time increases drastically if I raise the supported slice length. The optimization will improve the usual `slice_compare`:

```rust
fn slice_compare(a: *const i8, b: *const i8, len: usize) -> bool {
    let mut bits = len * 8;
    let mut offset = 0;
    // Compare the remaining bits in the widest chunks still available.
    while bits > 0 {
        if bits >= 128 {
            if !cmp!(a, b, u128, offset) {
                return false;
            }
            bits -= 128;
            offset += 16;
        } else if bits >= 64 {
            if !cmp!(a, b, u64, offset) {
                return false;
            }
            bits -= 64;
            offset += 8;
        } else if bits >= 32 {
            if !cmp!(a, b, u32, offset) {
                return false;
            }
            bits -= 32;
            offset += 4;
        } else if bits >= 16 {
            if !cmp!(a, b, u16, offset) {
                return false;
            }
            bits -= 16;
            offset += 2;
        } else if bits >= 8 {
            if !cmp!(a, b, u8, offset) {
                return false;
            }
            bits -= 8;
            offset += 1;
        } else {
            unreachable!();
        }
    }
    true
}
```

But the thing is that with the generated source code an optimization for slices of known length is possible at compile time:

```rust
macro_rules! slice_compare (
    ($a:expr, $b:expr, $len:expr) => {{
        match $len {
            1 => cmp!($a, $b, u8, 0),
            2 => cmp!($a, $b, u16, 0),
            3 => cmp!($a, $b, u16, 0) && cmp!($a, $b, u8, 2),
            4 => cmp!($a, $b, u32, 0),
            ....
```
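For a length that is known at compile time, such a per-length match lets the comparison collapse into a few fixed-width loads. A minimal standalone sketch of what the 4-byte case boils down to (an illustration, not the PR's generated code; `read_unaligned` keeps the load portable on strict-alignment targets):

```rust
use std::ptr;

// Equality of two 4-byte arrays as a single u32 load per side instead of a
// call into memcmp. `read_unaligned` avoids assuming 4-byte alignment, since
// `[u8; 4]` is only guaranteed byte alignment.
#[inline]
fn eq_4(a: &[u8; 4], b: &[u8; 4]) -> bool {
    unsafe {
        ptr::read_unaligned(a.as_ptr() as *const u32)
            == ptr::read_unaligned(b.as_ptr() as *const u32)
    }
}

fn main() {
    assert!(eq_4(b"abcd", b"abcd"));
    assert!(!eq_4(b"abcd", b"abce"));
}
```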
How do we properly evaluate whether the code size expansion is worth it? I'd look at existing benchmarks in larger programs.
I am not sure how the regex crate will use slice comparisons, but here are the results of the benchmarks:
Around 20% is the performance improvement I got with the same technique in C projects.
```rust
// The compare macro for the bytewise slice comparison
macro_rules! cmp (
    ($left:expr, $right: expr, $var:ident, $offset:expr) => {{
        unsafe {*($left.offset($offset) as *const $var) == *($right.offset($offset) as *const $var)}
```
This is a possibly unaligned load (for example, a `*const u64` is used that is not necessarily well aligned for `u64`). We can't merge the code like this, because it is not portable and will crash on certain platforms.
Fixing this issue will likely have an impact on the performance. In order to get the best possible performance on platforms which allow unaligned memory access, we might need/want to perform this optimisation at a lower level (i.e. replace the implementation of `memcmp` and/or the strategy used in LLVM to replace `memcmp` with inline comparisons).
I am currently not sure how to fix these unaligned loads here. Is there something like `get_unaligned` in the compiler?
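A sketch of how the `cmp!` macro could be made portable, assuming `ptr::read_unaligned` (nowadays available in `core::ptr` and re-exported from `std::ptr`); this is an illustration rather than the PR's code:

```rust
// Same shape as the reviewed macro, but the loads go through
// `read_unaligned`, which has no alignment requirement and therefore does
// not crash on strict-alignment targets.
macro_rules! cmp (
    ($left:expr, $right:expr, $var:ident, $offset:expr) => {{
        unsafe {
            std::ptr::read_unaligned($left.offset($offset) as *const $var)
                == std::ptr::read_unaligned($right.offset($offset) as *const $var)
        }
    }}
);

fn main() {
    let a = [1u8, 2, 3, 4, 5, 6, 7, 8, 9];
    let b = [1u8, 2, 3, 4, 5, 6, 7, 8, 9];
    // A u64 load starting at offset 1 is deliberately misaligned.
    assert!(cmp!(a.as_ptr(), b.as_ptr(), u64, 1));
}
```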
Regarding the bloat, I want to cc #39492, which flags non-trivial bloat in libcore that affects embedded device development. The libs team will soon be considering whether to set up a general policy of optimizing core for code size. We should consider the impact of this PR in that light as well.
Nominating for libs team discussion. I feel like we need some clearer policies around these kinds of optimization tradeoffs.
It'd be interesting to graph code size vs `OPTLEN` and maybe speed. I suspect we can lower it to 64 and still reap most of the benefits while being more comfortable with the extra bloat.
There’s little to be done here other than providing an efficient implementation of `memcmp` (possibly with an alignment argument, diverging from C `memcmp` somewhat) and using that to do comparisons like these. Now that rustc supports things like …
Discussed during libs triage today; our conclusion was that we should probably at least fix the unaligned load business before evaluating performance, and then after that we can decide whether we need this in core or if it can live externally.
It may also be worth investigating performance across platforms; I'd expect `memcmp` to be much faster with glibc, for example, than on OSX.
I had the chance to think about the issue with this pull request and really think that the optimisation would fit better into LLVM. So we could add a `memcmp` intrinsic there...
I personally feel like it is a much better idea to just spend time and write our own `memcmp`. I feel this is one of the places @BurntSushi would have interesting things to say, as they spent a considerable amount of time implementing efficient string search (i.e. somewhat related) for their regex crate. Adding a `memcmp` intrinsic to LLVM would be very non-trivial, sadly.
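For reference, a word-at-a-time comparison of this kind can be written in plain Rust without intrinsics; a rough sketch under that assumption (not code from this PR):

```rust
use std::cmp::Ordering;
use std::convert::TryInto;

// Lexicographic comparison of byte slices done eight bytes at a time: the
// shared prefix is compared as big-endian u64 words (big-endian so word
// order agrees with byte order), and any leftover tail is compared bytewise
// by the slice `cmp`.
fn compare_words(a: &[u8], b: &[u8]) -> Ordering {
    let n = a.len().min(b.len()) / 8 * 8;
    for i in (0..n).step_by(8) {
        let wa = u64::from_be_bytes(a[i..i + 8].try_into().unwrap());
        let wb = u64::from_be_bytes(b[i..i + 8].try_into().unwrap());
        if wa != wb {
            return wa.cmp(&wb);
        }
    }
    a[n..].cmp(&b[n..])
}

fn main() {
    assert_eq!(compare_words(b"hello world", b"hello world"), Ordering::Equal);
    assert_eq!(compare_words(b"abcdefgh1", b"abcdefgh2"), Ordering::Less);
}
```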
I'm going to close this PR for the time being, pending the actions outlined here. Thanks!
Inspiration for this change is my crate fastcmp. The idea is to add an additional step before calling `memcmp` directly, which will optimize the `PartialEq` trait for slices up to 256 bytes. I think this will relate somehow to #16913.
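To make that concrete, a compact sketch of such an extra step (an illustration, not the PR's generated code): lengths up to the cutoff take a fixed-width path, anything longer keeps going through the ordinary slice `==`, which lowers to `memcmp`.

```rust
use std::convert::TryInto;

const CUTOFF: usize = 256; // mirrors the 256-byte limit from the description

fn bytes_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    if a.len() > CUTOFF {
        // Long slices: fall back to the regular comparison (memcmp).
        return a == b;
    }
    // Short slices: compare native-endian u64 words, then the tail, standing
    // in for the generated per-length match.
    let n = a.len() / 8 * 8;
    for i in (0..n).step_by(8) {
        let wa = u64::from_ne_bytes(a[i..i + 8].try_into().unwrap());
        let wb = u64::from_ne_bytes(b[i..i + 8].try_into().unwrap());
        if wa != wb {
            return false;
        }
    }
    a[n..] == b[n..]
}

fn main() {
    assert!(bytes_eq(b"compile time", b"compile time"));
    assert!(!bytes_eq(b"compile time", b"compile tame"));
}
```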