faster rand(1:n) by outlining unlikely branch #58089

Merged: merged 3 commits into master on Apr 24, 2025

Conversation

@rfourquet (Member) commented Apr 12, 2025

It's hard to measure the improvement with single calls, but this change substantially improves the situation in #50509, such that the new versions of randperm etc. are almost always faster (even for big n).

Here are some example benchmarks. Note that the biggest ranges, like UInt(0):UInt(2)^64-2, are the ones that exercise the "unlikely" branch the most:

julia> const xx = Xoshiro(); using Chairmarks

julia> rands(rng, ns) = for i=ns
    rand(rng, zero(i):i)
end

julia> rands(ns) = for i=ns
    rand(zero(i):i)
end

julia> @b rand(xx, 1:100), rand(xx, UInt(0):UInt(2)^63), rand(xx, UInt(0):UInt(2)^64-3), rand(xx, UInt(0):UInt(2)^64-2), rand(xx, UInt(0):UInt(2)^64-1)
(1.968 ns, 8.000 ns, 3.321 ns, 3.321 ns, 2.152 ns) # PR
(2.151 ns, 7.284 ns, 2.151 ns, 2.151 ns, 2.151 ns) # master

julia> @b  rand(1:100), rand(UInt(0):UInt(2)^63), rand(UInt(0):UInt(2)^64-3), rand(UInt(0):UInt(2)^64-2),rand(UInt(0):UInt(2)^64-1) # with TaskLocalRNG
(2.148 ns, 7.837 ns, 3.317 ns, 3.085 ns, 1.957 ns) # PR
(3.128 ns, 8.275 ns, 3.324 ns, 3.324 ns, 1.955 ns) # master

julia> @b rands(xx, 1:100), rands(xx, UInt(2)^62:UInt(2)^59:UInt(2)^64-1), rands(xx, UInt(2)^64-4:UInt(2)^64-2)
(95.315 ns, 132.144 ns, 7.486 ns) # PR
(217.169 ns, 143.519 ns, 8.065 ns) # master

julia> @b rands(1:100), rands(UInt(2)^62:UInt(2)^59:UInt(2)^64-1), rands(UInt(2)^64-4:UInt(2)^64-2)
(235.882 ns, 162.809 ns, 10.603 ns) # PR
(202.524 ns, 132.869 ns, 7.631 ns) # master

So it's a bit tricky: with an explicit RNG, rands(xx, 1:100) becomes much faster, but without one, rands(1:100) becomes slower.

Assuming #50509 was merged, shuffle is a good function to benchmark rand(1:n), and the changes here consistently improve performance, as shown by this graph (when TaskLocalRNG is mentioned, it means no RNG argument was passed to the function):
![new-rand-ndl](https://github.com/user-attachments/assets/b7a6229a-f5d9-408e-9102-4056b796d22c)
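
For context, a minimal Chairmarks sketch of the kind of comparison the graph shows (a hedged reconstruction, not the actual benchmark script; the array size and the `rng` name are arbitrary):

```julia
using Random, Chairmarks

const rng = Xoshiro()
a = collect(1:10_000)

@b shuffle(rng, a)   # explicit RNG argument
@b shuffle(a)        # no RNG argument, i.e. TaskLocalRNG
```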

So although there can be slowdowns, I think this change is overall a win.
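
For readers who want to see the shape of the change, here is a hypothetical, simplified sketch of outlining the unlikely branch of a Lemire-style nearly-division-less sampler (the helper name and exact structure are illustrative only, not the actual Random stdlib code):

```julia
using Random

# Slow path: taken only when the low 64 bits of the product fall below s,
# which is rare except for ranges whose length is close to 2^64.
@noinline function _reject_resample(rng, s::UInt64, m::UInt128)
    t = -s % s                          # rejection threshold: (2^64 - s) mod s
    while (m % UInt64) < t
        m = widemul(rand(rng, UInt64), s)
    end
    return m
end

# Fast path: one multiply and one compare in the common case; outlining the
# rejection loop keeps this function small enough for callers to inline.
function sample_lemire(rng::AbstractRNG, s::UInt64)   # unbiased value in 0:s-1, assumes s >= 1
    m = widemul(rand(rng, UInt64), s)
    if (m % UInt64) < s                 # the "unlikely" branch
        m = _reject_resample(rng, s, m)
    end
    return (m >> 64) % UInt64
end

sample_lemire(Xoshiro(), UInt64(100))
```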

@rfourquet added the performance (Must go faster) and randomness (Random number generation and the Random stdlib) labels on Apr 12, 2025
@oscardssmith (Member) commented

The cmdlineargs test failures seem likely unrelated, so I've updated the branch in the hope of getting a clean run.

@rfourquet (Member, Author) commented Apr 16, 2025

> The cmdlineargs test failures seem likely unrelated

So they actually fail consistently. For example, at line 497 of "test/cmdlineargs.jl", the @test occursin(expected, got) assertion fails, with the (expected, got) tuple equal to

 ("SF:/data/src/julia/test/testhelpers/coverage_file.jl\nDA:3,1\nDA:4,1\nDA:5,0\nDA:7,1\nDA:8,1\nDA:9,3\nDA:10,5\nDA:11,1\nDA:12,1\nDA:14,0\nDA:17,1\nDA:18,1\nDA:19,1\nDA:20,1\nDA:22,1\nLH:13\nLF:15\nend_of_record\n",

 "SF:/data/src/julia/test/testhelpers/coverage_file.jl\nDA:3,1\nDA:4,2\nDA:5,0\nDA:7,1\nDA:8,1\nDA:9,3\nDA:10,5\nDA:11,1\nDA:12,1\nDA:14,0\nDA:17,1\nDA:18,1\nDA:19,1\nDA:20,1\nDA:22,1\nLH:13\nLF:15\nend_of_record\nSF:/data/src/julia/usr/share/julia/stdlib/v1.13/Random/src/Random.jl\nDA:141,1\nDA:255,2\nDA:258,1\nDA:260,2\nLH:4\nLF:4\nend_of_record\nSF:/data/src/julia/usr/share/julia/stdlib/v1.13/Random/src/Xoshiro.jl\nDA:214,1\nDA:215,1\nDA:216,1\nDA:217,1\nDA:218,1\nDA:219,0\nDA:220,0\nDA:222,0\nDA:226,1\nDA:227,1\nDA:264,1\nDA:265,1\nDA:266,1\nDA:267,1\nDA:268,1\nDA:269,1\nDA:270,1\nDA:271,1\nDA:272,1\nDA:273,1\nDA:274,1\nDA:275,0\nLH:18\nLF:22\nend_of_record\nSF:/data/src/julia/usr/share/julia/stdlib/v1.13/Random/src/generation.jl\nDA:221,1\nDA:360,1\nDA:364,1\nDA:365,1\nDA:366,1\nDA:367,0\nDA:368,1\nDA:371,1\nDA:375,1\nDA:376,1\nDA:377,1\nDA:378,2\nDA:381,1\nLH:12\nLF:13\nend_of_record\n")

The 2nd "DA:" line does not match: it's "4,1" in expected, but "4,2" in got. I have no idea what this means; are these coverage numbers? In this instance, can I just regenerate the reference "coverage_file.info"?

@vtjnash (Member) commented Apr 16, 2025

DA:4,2 means that line 4 of the file named by SF got executed twice (instead of once). It is probably fine to just change the reference file to match: since you changed inlining, coverage may now see the line as being reached at different points and count it as being hit twice.
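
For reference, these are lcov tracefile records; the fields involved here, using values from the failing test above (the `#` annotations are added for readability and are not part of the format):

```
SF:/data/src/julia/test/testhelpers/coverage_file.jl
DA:3,1           # DA:<line>,<hit count>: line 3 executed once
DA:4,2           # line 4 executed twice; the reference file expects DA:4,1
DA:5,0           # line 5 instrumented but never executed
LH:13            # number of lines hit at least once
LF:15            # number of instrumented lines found
end_of_record
```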

@oscardssmith (Member) commented

@rfourquet do you need help fixing the test? I don't want this branch to be forgotten, as it is quite a nice perf win.

@rfourquet (Member, Author) commented

Thanks for the bump, it looks like (most of) the tests pass now, with the reference file changed.

@oscardssmith merged commit 6498664 into master on Apr 24, 2025
7 checks passed
@oscardssmith deleted the rf/faster-ndl branch on April 24, 2025 at 13:09
@oscardssmith (Member) commented

thanks!

LebedevRI pushed a commit to LebedevRI/julia that referenced this pull request May 2, 2025
It's hard to measure the improvement with single calls, but this change
substantially improves the situation in JuliaLang#50509, such that the new
versions of `randperm` etc. are almost always faster (even for big n).

Here are some example benchmarks. Note that the biggest ranges, like
`UInt(0):UInt(2)^64-2`, are the ones that exercise the "unlikely"
branch the most:
```julia
julia> const xx = Xoshiro(); using Chairmarks

julia> rands(rng, ns) = for i=ns
    rand(rng, zero(i):i)
end

julia> rands(ns) = for i=ns
    rand(zero(i):i)
end

julia> @b rand(xx, 1:100), rand(xx, UInt(0):UInt(2)^63), rand(xx, UInt(0):UInt(2)^64-3), rand(xx, UInt(0):UInt(2)^64-2), rand(xx, UInt(0):UInt(2)^64-1)
(1.968 ns, 8.000 ns, 3.321 ns, 3.321 ns, 2.152 ns) # PR
(2.151 ns, 7.284 ns, 2.151 ns, 2.151 ns, 2.151 ns) # master

julia> @b  rand(1:100), rand(UInt(0):UInt(2)^63), rand(UInt(0):UInt(2)^64-3), rand(UInt(0):UInt(2)^64-2),rand(UInt(0):UInt(2)^64-1) # with TaskLocalRNG
(2.148 ns, 7.837 ns, 3.317 ns, 3.085 ns, 1.957 ns) # PR
(3.128 ns, 8.275 ns, 3.324 ns, 3.324 ns, 1.955 ns) # master

julia> @b rands(xx, 1:100), rands(xx, UInt(2)^62:UInt(2)^59:UInt(2)^64-1), rands(xx, UInt(2)^64-4:UInt(2)^64-2)
(95.315 ns, 132.144 ns, 7.486 ns) # PR
(217.169 ns, 143.519 ns, 8.065 ns) # master

julia> @b rands(1:100), rands(UInt(2)^62:UInt(2)^59:UInt(2)^64-1), rands(UInt(2)^64-4:UInt(2)^64-2)
(235.882 ns, 162.809 ns, 10.603 ns) # PR
(202.524 ns, 132.869 ns, 7.631 ns) # master
```

So it's a bit tricky: with an explicit RNG, `rands(xx, 1:100)` becomes
much faster, but without one, `rands(1:100)` becomes slower.

Assuming JuliaLang#50509 was merged, `shuffle` is a good function to benchmark
`rand(1:n)`, and the changes here consistently improve performance, as
shown by this graph (when `TaskLocalRNG` is mentioned, it means *no* RNG
argument was passed to the function):

![new-rand-ndl](https://github.com/user-attachments/assets/b7a6229a-f5d9-408e-9102-4056b796d22c)

So although there can be slowdowns, I think this change is overall a
win.
rfourquet added a commit that referenced this pull request May 2, 2025
In #58089, this method took a small performance hit in some contexts.
It turns out that by outlining the unlikely branch which throws on empty
ranges, this hit can be recovered.

In #50509 (comment),
a graph of the performance improvement from the "speed-up randperm by using
our current rand(1:n)" change was posted, but I realized it was only true when
calls to `rand(1:n)` were prefixed by `@inline`; without `@inline` it was
overall slower for `TaskLocalRNG()` with very big arrays (but still faster
otherwise).

An alternative to these `@inline` annotations is to outline `throw`, as done here,
with benefits equivalent to `@inline` in that `randperm` PR.

Assuming that PR is merged, this PR improves performance by roughly 2x for
`TaskLocalRNG()` (no change for other RNGs).
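
For concreteness, here is a hypothetical sketch of the call-site alternative mentioned above (not code from the `randperm` PR): Julia 1.8+ supports `@inline` at the call site, so each hot caller would need an annotation like this, whereas outlining the `throw` lets a plain `rand(rng, 1:i)` call inline on its own.

```julia
using Random

# Hypothetical Fisher-Yates loop with a call-site @inline annotation,
# standing in for the approach that required annotating each caller.
function shuffle_inline!(rng::AbstractRNG, a::Vector)
    for i in length(a):-1:2
        j = @inline rand(rng, 1:i)   # force inlining of this particular call
        a[i], a[j] = a[j], a[i]
    end
    return a
end

shuffle_inline!(Xoshiro(), collect(1:10))
```
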
rfourquet added a commit that referenced this pull request May 9, 2025
In #58089, this method took a small performance hit in some contexts. It
turns out that by outlining the unlikely branch which throws on empty
ranges, this hit can be recovered.

In
#50509 (comment), a
graph of the performance improvement from the "speed-up randperm by using
our current rand(1:n)" change was posted, but I realized it was only true
when calls to `rand(1:n)` were prefixed by `@inline`; without `@inline` it
was overall slower for `TaskLocalRNG()` with very big arrays (but still
faster otherwise).

An alternative to these `@inline` annotations is to outline `throw`, as done
here, with benefits equivalent to `@inline` in that `randperm` PR.

Assuming that PR is merged, this PR improves performance by roughly 2x
for `TaskLocalRNG()` (no change for other RNGs):


![new-shuffle-outlinethrow](https://github.com/user-attachments/assets/8c0d4740-3bb4-4bcf-a49d-9af09426bec7)

While at it, I outlined a bunch of other unlikely throwing branches.

After that, #50509 can probably be merged, finally!
charleskawczynski pushed a commit to charleskawczynski/julia that referenced this pull request May 12, 2025

In JuliaLang#58089, this method took a small performance hit in some contexts. It
turns out that by outlining the unlikely branch which throws on empty
ranges, this hit can be recovered.

In
JuliaLang#50509 (comment), a
graph of the performance improvement from the "speed-up randperm by using
our current rand(1:n)" change was posted, but I realized it was only true
when calls to `rand(1:n)` were prefixed by `@inline`; without `@inline` it
was overall slower for `TaskLocalRNG()` with very big arrays (but still
faster otherwise).

An alternative to these `@inline` annotations is to outline `throw`, as done
here, with benefits equivalent to `@inline` in that `randperm` PR.

Assuming that PR is merged, this PR improves performance by roughly 2x
for `TaskLocalRNG()` (no change for other RNGs):


![new-shuffle-outlinethrow](https://github.com/user-attachments/assets/8c0d4740-3bb4-4bcf-a49d-9af09426bec7)

While at it, I outlined a bunch of other unlikely throwing branches.

After that, JuliaLang#50509 can probably be merged, finally!