Skip to content

Perf regression with LLVM 16 for slice::sort #111559

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Voultapher opened this issue May 14, 2023 · 10 comments · Fixed by #111646
Closed

Perf regression with LLVM 16 for slice::sort #111559

Voultapher opened this issue May 14, 2023 · 10 comments · Fixed by #111646
Labels
A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. P-low Low priority regression-from-stable-to-beta Performance or correctness regression from stable to beta. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@Voultapher
Copy link
Contributor

Voultapher commented May 14, 2023

Code

I tried this code, sorting 30 million random u64:

fn main() {
    use std::time::Instant;

    let mut v = sort_test_tools::patterns::random(30_000_000)
        .into_iter()
        .map(|e| e as u64)
        .collect::<Vec<u64>>();

    let start = Instant::now();
    sort_comp::stable::rust_std::sort(&mut v);

    let diff = start.elapsed().as_millis();

    if diff >= 1720 {
        panic!();
    }
}

On this machine:

Linux 6.3.1
AMD Ryzen 9 5900X 12-Core Processor (Zen3 micro-architecture)

Using the vendored version of slice::sort from mid 2022 see here https://github.com/Voultapher/sort-research-rs/blob/d3bab202e18a2010e26674677e90ae3048721232/src/stable/rust_std.rs

I expected to see this happen: Runtime ~1.45s

Instead, this happened: Runtime ~2s

Version it worked on

rustc 1.70.0-nightly (8be3c2bda 2023-03-24)
binary: rustc
commit-hash: 8be3c2bda6b683f87b24714ba595e8b04faef54c
commit-date: 2023-03-24
host: x86_64-unknown-linux-gnu
release: 1.70.0-nightly
LLVM version: 15.0.7

Version with regression

rustc 1.70.0-nightly (0c61c7a97 2023-03-25)
binary: rustc
commit-hash: 0c61c7a978fe9f7b77a1d667c77d2202dadd1c10
commit-date: 2023-03-25
host: x86_64-unknown-linux-gnu
release: 1.70.0-nightly
LLVM version: 16.0.0

I bisected the issue down to 0c61c7a with cargo bisect-rustc --start=2022-12-07 --end=2023-05-13 --script=./test.sh.


The same happens if you do v.sort() instead of sort_comp::stable::rust_std::sort(&mut v). So this also affects the current implementation which is very similar to my vendored version.

@Voultapher Voultapher added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels May 14, 2023
@rustbot rustbot added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label May 14, 2023
@Voultapher
Copy link
Contributor Author

Dinging into it, I think the relevant code that has this impact is in the merge function. Which has two blocks of logic one for if the left side is shorter and one if the right side is shorter. Previously both sides generated code that is branchless in terms of jumping based on the comparison result. Now the block handling left side is shorter generates a jump based on the comparison result.

LLVM 15:

image

LLVM 16:

image

This can be fixed in code by using 'proper' branchless code:

let to_copy = if is_less(&*right, &**left) {
    get_and_increment(&mut right)
} else {
    get_and_increment(left)
};
ptr::copy_nonoverlapping(to_copy, get_and_increment(out), 1);

->

let is_l = is_less(&*right, &**left);
let to_copy = if is_l { right } else { *left };
ptr::copy_nonoverlapping(to_copy, get_and_increment(out), 1);
right = right.add(is_l as usize);
*left = left.add(!is_l as usize);

This would fix the stdlib, but it would still mean a regression for any other code out there that relies on this.

@workingjubilee
Copy link
Member

workingjubilee commented May 14, 2023

I feel like I should attach the usual disclaimer that this kind of optimization can never be considered fully stable, specifically LLVM has various heuristics that affect whether it compiles a segment of code to use a "branchless" CMOV-alike or use a jump, however that doesn't mean there's nothing we can do here.

@rustbot label: +A-LLVM +A-codegen +regression-from-stable-to-beta +I-slow

@rustbot rustbot added A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. regression-from-stable-to-beta Performance or correctness regression from stable to beta. labels May 14, 2023
@apiraino
Copy link
Contributor

WG-prioritization assigning priority (Zulip discussion).

@rustbot label -I-prioritize +P-low +T-compiler

@rustbot rustbot added P-low Low priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. and removed I-prioritize Issue: Indicates that prioritization has been requested for this issue. labels May 16, 2023
@Voultapher
Copy link
Contributor Author

@apiraino out of curiosity. What is the future for a perf regression marked as P-low? Is there any work planned on this? Personally I was surprised that it even generated branchless code in the first place.

@workingjubilee workingjubilee removed the regression-untriaged Untriaged performance or correctness regression. label May 16, 2023
@workingjubilee
Copy link
Member

A P-high or P-critical compiler issue means that it will be reviewed regularly by T-compiler until it is resolved, where "regularly" is a much higher frequency for P-critical. So a P-low implicitly means "this isn't so important that we need to keep putting it on the agenda". Though I should note that periodically, all T-compiler issues do get reviewed and triaged, just on a more irregular basis.

Performance work is one of those "never done" things so it's rare for it to be high priority in general, which isn't the same as "no one will work on it". It just means it would be addressed in a somewhat irregular fashion.

@Voultapher
Copy link
Contributor Author

I'm not sure I understand the difference between "never done" and "no one will work on it".

@tavianator
Copy link
Contributor

@Voultapher "Never done" as in "never finished" not "never worked on"

@Voultapher
Copy link
Contributor Author

Ohh, thanks for clarifying. Anyway from my perspective me and other people I've talked to were surprised this even generated branchless code in the first place, I'd be fine with closing this issue as won't fix.

@apiraino
Copy link
Contributor

since you @Voultapher have opened a PR (thanks a lot for that!), it makes totally sense to try fixing it, no matter the priority we assign to issues.

@Voultapher
Copy link
Contributor Author

I think the issue can be split into two parts:

A) A significant regression in slice::sort.
B) LLVM no longer generating branchless code for a certain pattern.

I think A should definitely be fixed, and that's what my PR is for. B however I'd argue doesn't really need fixing, because the code in question was not obviously branchless.

@bors bors closed this as completed in fe76e14 May 21, 2023
Voultapher added a commit to Voultapher/sort-research-rs that referenced this issue Oct 5, 2023
Until LLVM 16 the code in the `slice::sort` `merge` function generated
branchless code. With the upgrade to LLVM 16 in rustc 1.70 this property was
broken. See rust-lang/rust#111559 for more info. It's
annoying for comparison reasons to have the same code but with significantly
different code-gen with newer versions. The updated merge function generates
pretty much the same code as the previous one did with older rustc versions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-codegen Area: Code generation A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. P-low Low priority regression-from-stable-to-beta Performance or correctness regression from stable to beta. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants