Perf regression with LLVM 16 for `slice::sort` #111559

Voultapher · 2023-05-14T13:19:32Z

Code

I tried this code, sorting 30 million random u64:

fn main() {
    use std::time::Instant;

    let mut v = sort_test_tools::patterns::random(30_000_000)
        .into_iter()
        .map(|e| e as u64)
        .collect::<Vec<u64>>();

    let start = Instant::now();
    sort_comp::stable::rust_std::sort(&mut v);

    let diff = start.elapsed().as_millis();

    if diff >= 1720 {
        panic!();
    }
}

On this machine:

Linux 6.3.1
AMD Ryzen 9 5900X 12-Core Processor (Zen3 micro-architecture)

Using the vendored version of slice::sort from mid 2022, see here: https://github.com/Voultapher/sort-research-rs/blob/d3bab202e18a2010e26674677e90ae3048721232/src/stable/rust_std.rs

I expected to see this happen: Runtime ~1.45s

Instead, this happened: Runtime ~2s

Version it worked on

rustc 1.70.0-nightly (8be3c2bda 2023-03-24)
binary: rustc
commit-hash: 8be3c2bda6b683f87b24714ba595e8b04faef54c
commit-date: 2023-03-24
host: x86_64-unknown-linux-gnu
release: 1.70.0-nightly
LLVM version: 15.0.7

Version with regression

rustc 1.70.0-nightly (0c61c7a97 2023-03-25)
binary: rustc
commit-hash: 0c61c7a978fe9f7b77a1d667c77d2202dadd1c10
commit-date: 2023-03-25
host: x86_64-unknown-linux-gnu
release: 1.70.0-nightly
LLVM version: 16.0.0

I bisected the issue down to 0c61c7a with cargo bisect-rustc --start=2022-12-07 --end=2023-05-13 --script=./test.sh.

The same happens if you do v.sort() instead of sort_comp::stable::rust_std::sort(&mut v). So this also affects the current implementation which is very similar to my vendored version.
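For readers without the `sort_test_tools` crate, the `v.sort()` variant can be approximated with the standard library alone. This is a minimal sketch, not the exact reproduction used above: `random_u64s` and `time_std_sort` are hypothetical names, and the xorshift generator is only an assumed stand-in for `sort_test_tools::patterns::random`.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for `sort_test_tools::patterns::random`:
/// a xorshift64 generator, so the sketch needs no external crates.
fn random_u64s(len: usize, mut seed: u64) -> Vec<u64> {
    assert!(seed != 0, "xorshift needs a non-zero seed");
    (0..len)
        .map(|_| {
            seed ^= seed << 13;
            seed ^= seed >> 7;
            seed ^= seed << 17;
            seed
        })
        .collect()
}

/// Sorts `len` pseudo-random values with the stable std sort and
/// returns the sorted vector together with the elapsed time.
fn time_std_sort(len: usize) -> (Vec<u64>, Duration) {
    let mut v = random_u64s(len, 0x9E37_79B9_7F4A_7C15);
    let start = Instant::now();
    v.sort();
    (v, start.elapsed())
}
```

Calling `time_std_sort(30_000_000)` from a release build on both nightlies should show whether the gap reproduces. The 1720 ms threshold in the original script is specific to the machine listed above, so any cutoff would need to be recalibrated elsewhere.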


Voultapher · 2023-05-14T15:41:53Z

Digging into it, I think the code responsible for this is in the merge function, which has two blocks of logic: one for when the left side is shorter and one for when the right side is shorter. Previously, both blocks compiled to code that is branchless with respect to the comparison result, meaning no jump depended on it. Now the block that handles the shorter left side generates a jump based on the comparison result.

LLVM 15: (code-gen listing not preserved)

LLVM 16: (code-gen listing not preserved)

This can be fixed in code by using 'proper' branchless code:

let to_copy = if is_less(&*right, &**left) {
    get_and_increment(&mut right)
} else {
    get_and_increment(left)
};
ptr::copy_nonoverlapping(to_copy, get_and_increment(out), 1);

->

let is_l = is_less(&*right, &**left);
let to_copy = if is_l { right } else { *left };
ptr::copy_nonoverlapping(to_copy, get_and_increment(out), 1);
right = right.add(is_l as usize);
*left = left.add(!is_l as usize);

This would fix the stdlib, but it would still mean a regression for any other code out there that relies on this.
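To see the pattern outside of the stdlib internals, here is a minimal self-contained sketch of a merge loop written in the same style. It is not the stdlib `merge` (which works in place with a scratch buffer); `merge_branchless` is a hypothetical name, and the function is restricted to `u64` and a freshly allocated output for simplicity. Whether the compiler actually emits branch-free code for it still depends on the LLVM version and target.

```rust
use std::ptr;

/// Merges two sorted slices into a new Vec. Inside the loop, the source
/// pointers advance by the comparison result cast to an integer instead
/// of being incremented in separate `if`/`else` arms.
fn merge_branchless(left: &[u64], right: &[u64]) -> Vec<u64> {
    let total = left.len() + right.len();
    let mut out: Vec<u64> = Vec::with_capacity(total);

    // SAFETY: `l` and `r` stay within (or one past the end of) their
    // slices, and exactly `total` elements are written into `out`,
    // which has capacity for `total` elements.
    unsafe {
        let mut l = left.as_ptr();
        let mut r = right.as_ptr();
        let l_end = l.add(left.len());
        let r_end = r.add(right.len());
        let mut dst = out.as_mut_ptr();

        while l < l_end && r < r_end {
            // Take from the right only when it is strictly smaller,
            // which keeps the merge stable.
            let take_right = *r < *l;
            let to_copy = if take_right { r } else { l };
            ptr::copy_nonoverlapping(to_copy, dst, 1);
            dst = dst.add(1);
            // Exactly one of the two pointers moves forward by one.
            r = r.add(take_right as usize);
            l = l.add(!take_right as usize);
        }

        // One side is exhausted; copy whatever remains of the other.
        let l_rest = l_end.offset_from(l) as usize;
        ptr::copy_nonoverlapping(l, dst, l_rest);
        dst = dst.add(l_rest);
        let r_rest = r_end.offset_from(r) as usize;
        ptr::copy_nonoverlapping(r, dst, r_rest);

        out.set_len(total);
    }
    out
}
```

For example, `merge_branchless(&[1, 3, 5], &[2, 3, 6])` yields `[1, 2, 3, 3, 5, 6]`. The point of the rewrite is that the only thing selected by the `if` is a pointer value, and both pointer updates are unconditional arithmetic.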

workingjubilee · 2023-05-14T22:57:41Z

I feel like I should attach the usual disclaimer that this kind of optimization can never be considered fully stable. Specifically, LLVM has various heuristics that affect whether it compiles a segment of code to a "branchless" CMOV-alike or to a jump. However, that doesn't mean there's nothing we can do here.
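As an illustration of the kind of code this disclaimer applies to, the sketch below contrasts a plain conditional select with a mask-based one. Both function names are hypothetical. The mask version expresses the selection as integer arithmetic, but the optimizer is still free to rewrite either form, so neither is a guarantee about the generated instructions.

```rust
/// Plain select: the optimizer may lower this to a conditional move
/// or to a jump, depending on its heuristics.
fn select_u64_plain(cond: bool, a: u64, b: u64) -> u64 {
    if cond { a } else { b }
}

/// Mask-based select: returns `a` when `cond` is true, otherwise `b`,
/// using only integer arithmetic in the source.
fn select_u64_masked(cond: bool, a: u64, b: u64) -> u64 {
    // true -> all bits set, false -> all bits clear
    let mask = (cond as u64).wrapping_neg();
    (a & mask) | (b & !mask)
}
```

Both functions return the same value for every input; the only way to know what a given compiler version emits is to inspect the generated assembly.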

@rustbot label: +A-LLVM +A-codegen +regression-from-stable-to-beta +I-slow

apiraino · 2023-05-16T13:59:37Z

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-low +T-compiler

Voultapher · 2023-05-16T16:46:51Z

@apiraino out of curiosity. What is the future for a perf regression marked as P-low? Is there any work planned on this? Personally I was surprised that it even generated branchless code in the first place.

workingjubilee · 2023-05-16T22:07:21Z

A P-high or P-critical compiler issue means that it will be reviewed regularly by T-compiler until it is resolved, where "regularly" is a much higher frequency for P-critical. So a P-low implicitly means "this isn't so important that we need to keep putting it on the agenda". Though I should note that periodically, all T-compiler issues do get reviewed and triaged, just on a more irregular basis.

Performance work is one of those "never done" things so it's rare for it to be high priority in general, which isn't the same as "no one will work on it". It just means it would be addressed in a somewhat irregular fashion.

Voultapher · 2023-05-17T17:56:04Z

I'm not sure I understand the difference between "never done" and "no one will work on it".

tavianator · 2023-05-17T18:24:58Z

@Voultapher "Never done" as in "never finished" not "never worked on"

Voultapher · 2023-05-17T19:00:53Z

Ohh, thanks for clarifying. Anyway, from my perspective, I and other people I've talked to were surprised this even generated branchless code in the first place. I'd be fine with closing this issue as won't fix.

apiraino · 2023-05-17T22:02:32Z

Since you, @Voultapher, have opened a PR (thanks a lot for that!), it totally makes sense to try fixing it, no matter the priority we assign to issues.

Voultapher · 2023-05-18T09:00:05Z

I think the issue can be split into two parts:

A) A significant regression in slice::sort.
B) LLVM no longer generating branchless code for a certain pattern.

I think A should definitely be fixed, and that's what my PR is for. B, however, I'd argue doesn't really need fixing, because the code in question was not obviously branchless.

Until LLVM 16 the code in the `slice::sort` `merge` function generated branchless code. With the upgrade to LLVM 16 in rustc 1.70 this property was broken. See rust-lang/rust#111559 for more info. It's annoying for comparison reasons to have the same code but with significantly different code-gen with newer versions. The updated merge function generates pretty much the same code as the previous one did with older rustc versions.

Voultapher added the C-bug and regression-untriaged labels May 14, 2023

rustbot added the I-prioritize label May 14, 2023

rustbot added the P-low and T-compiler labels and removed the I-prioritize label May 16, 2023

Voultapher mentioned this issue May 16, 2023

Use code with reliable branchless code-gen for slice::sort merge #111646

Merged

workingjubilee removed the regression-untriaged label May 16, 2023

bors closed this as completed in fe76e14 May 21, 2023
