faster rand(1:n) by outlining unlikely branch #58089

Merged: merged 3 commits into master on Apr 24, 2025

Conversation

@rfourquet (Member) commented Apr 12, 2025

It's hard to measure the improvement with single calls, but this change substantially improves the situation in #50509, such that the new versions of randperm etc. are almost always faster (even for big n).

Here are some example benchmarks. Note that the biggest ranges, like UInt(0):UInt(2)^64-2, are the ones that exercise the "unlikely" branch the most:

julia> const xx = Xoshiro(); using Chairmarks

julia> rands(rng, ns) = for i=ns
    rand(rng, zero(i):i)
end

julia> rands(ns) = for i=ns
    rand(zero(i):i)
end

julia> @b rand(xx, 1:100), rand(xx, UInt(0):UInt(2)^63), rand(xx, UInt(0):UInt(2)^64-3), rand(xx, UInt(0):UInt(2)^64-2), rand(xx, UInt(0):UInt(2)^64-1)
(1.968 ns, 8.000 ns, 3.321 ns, 3.321 ns, 2.152 ns) # PR
(2.151 ns, 7.284 ns, 2.151 ns, 2.151 ns, 2.151 ns) # master

julia> @b  rand(1:100), rand(UInt(0):UInt(2)^63), rand(UInt(0):UInt(2)^64-3), rand(UInt(0):UInt(2)^64-2),rand(UInt(0):UInt(2)^64-1) # with TaskLocalRNG
(2.148 ns, 7.837 ns, 3.317 ns, 3.085 ns, 1.957 ns) # PR
(3.128 ns, 8.275 ns, 3.324 ns, 3.324 ns, 1.955 ns) # master

julia> @b rands(xx, 1:100), rands(xx, UInt(2)^62:UInt(2)^59:UInt(2)^64-1), rands(xx, UInt(2)^64-4:UInt(2)^64-2)
(95.315 ns, 132.144 ns, 7.486 ns) # PR
(217.169 ns, 143.519 ns, 8.065 ns) # master

julia> @b rands(1:100), rands(UInt(2)^62:UInt(2)^59:UInt(2)^64-1), rands(UInt(2)^64-4:UInt(2)^64-2)
(235.882 ns, 162.809 ns, 10.603 ns) # PR
(202.524 ns, 132.869 ns, 7.631 ns) # master

So it's a bit tricky: with an explicit RNG, rands(xx, 1:100) becomes much faster, but without one, rands(1:100) becomes slower.

Assuming #50509 was merged, shuffle is a good function to benchmark rand(1:n), and the changes here consistently improve performance, as shown by this graph (when TaskLocalRNG is mentioned, it means no RNG argument was passed to the function):
![new-rand-ndl](https://github.com/user-attachments/assets/b7a6229a-f5d9-408e-9102-4056b796d22c)
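
For context, a minimal Chairmarks sketch of the kind of comparison the graph shows (a hedged reconstruction, not the actual benchmark script; the array size and the `rng` name are arbitrary):

```julia
using Random, Chairmarks

const rng = Xoshiro()
a = collect(1:10_000)

@b shuffle(rng, a)   # explicit RNG argument
@b shuffle(a)        # no RNG argument, i.e. TaskLocalRNG
```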

So although there can be slowdowns, I think this change is overall a win.
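
For readers who want to see the shape of the change, here is a hypothetical, simplified sketch of outlining the unlikely branch of a Lemire-style nearly-division-less sampler (the helper name and exact structure are illustrative only, not the actual Random stdlib code):

```julia
using Random

# Slow path: taken only when the low 64 bits of the product fall below s,
# which is rare except for ranges whose length is close to 2^64.
@noinline function _reject_resample(rng, s::UInt64, m::UInt128)
    t = -s % s                          # rejection threshold: (2^64 - s) mod s
    while (m % UInt64) < t
        m = widemul(rand(rng, UInt64), s)
    end
    return m
end

# Fast path: one multiply and one compare in the common case; outlining the
# rejection loop keeps this function small enough for callers to inline.
function sample_lemire(rng::AbstractRNG, s::UInt64)   # unbiased value in 0:s-1, assumes s >= 1
    m = widemul(rand(rng, UInt64), s)
    if (m % UInt64) < s                 # the "unlikely" branch
        m = _reject_resample(rng, s, m)
    end
    return (m >> 64) % UInt64
end

sample_lemire(Xoshiro(), UInt64(100))
```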

@rfourquet added the performance (Must go faster) and randomness (Random number generation and the Random stdlib) labels on Apr 12, 2025
@oscardssmith (Member) commented

The cmdlineargs test failures seem likely unrelated, so I've updated the branch in the hope of getting a clean run.

@rfourquet (Member, Author) commented Apr 16, 2025

> The cmdlineargs test failures seem likely unrelated

So they actually fail consistently. For example, at line 497 of "test/cmdlineargs.jl", the @test occursin(expected, got) assertion fails, with the (expected, got) tuple equal to

 ("SF:/data/src/julia/test/testhelpers/coverage_file.jl\nDA:3,1\nDA:4,1\nDA:5,0\nDA:7,1\nDA:8,1\nDA:9,3\nDA:10,5\nDA:11,1\nDA:12,1\nDA:14,0\nDA:17,1\nDA:18,1\nDA:19,1\nDA:20,1\nDA:22,1\nLH:13\nLF:15\nend_of_record\n",

 "SF:/data/src/julia/test/testhelpers/coverage_file.jl\nDA:3,1\nDA:4,2\nDA:5,0\nDA:7,1\nDA:8,1\nDA:9,3\nDA:10,5\nDA:11,1\nDA:12,1\nDA:14,0\nDA:17,1\nDA:18,1\nDA:19,1\nDA:20,1\nDA:22,1\nLH:13\nLF:15\nend_of_record\nSF:/data/src/julia/usr/share/julia/stdlib/v1.13/Random/src/Random.jl\nDA:141,1\nDA:255,2\nDA:258,1\nDA:260,2\nLH:4\nLF:4\nend_of_record\nSF:/data/src/julia/usr/share/julia/stdlib/v1.13/Random/src/Xoshiro.jl\nDA:214,1\nDA:215,1\nDA:216,1\nDA:217,1\nDA:218,1\nDA:219,0\nDA:220,0\nDA:222,0\nDA:226,1\nDA:227,1\nDA:264,1\nDA:265,1\nDA:266,1\nDA:267,1\nDA:268,1\nDA:269,1\nDA:270,1\nDA:271,1\nDA:272,1\nDA:273,1\nDA:274,1\nDA:275,0\nLH:18\nLF:22\nend_of_record\nSF:/data/src/julia/usr/share/julia/stdlib/v1.13/Random/src/generation.jl\nDA:221,1\nDA:360,1\nDA:364,1\nDA:365,1\nDA:366,1\nDA:367,0\nDA:368,1\nDA:371,1\nDA:375,1\nDA:376,1\nDA:377,1\nDA:378,2\nDA:381,1\nLH:12\nLF:13\nend_of_record\n")

The 2nd "DA:" line does not match: it's "4,1" in expected, but "4,2" in got. I have no idea what this means; are these coverage numbers? In this instance, can I just regenerate the reference "coverage_file.info"?

@vtjnash (Member) commented Apr 16, 2025

DA:4,2 means that line 4 of the file named by SF got executed twice (instead of once). It is probably fine to just change the reference file to match: since you changed inlining, coverage may now see the line as being reached at different points and count it as being hit twice.
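
For reference, these are lcov tracefile records; the fields involved here, using values from the failing test above (the `#` annotations are added for readability and are not part of the format):

```
SF:/data/src/julia/test/testhelpers/coverage_file.jl
DA:3,1           # DA:<line>,<hit count>: line 3 executed once
DA:4,2           # line 4 executed twice; the reference file expects DA:4,1
DA:5,0           # line 5 instrumented but never executed
LH:13            # number of lines hit at least once
LF:15            # number of instrumented lines found
end_of_record
```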

@oscardssmith (Member) commented

@rfourquet do you need help fixing the test? I don't want this branch to be forgotten, as it is quite a nice perf win.

@rfourquet (Member, Author) commented

Thanks for the bump, it looks like (most of) the tests pass now, with the reference file changed.

@oscardssmith merged commit 6498664 into master on Apr 24, 2025
7 checks passed
@oscardssmith deleted the rf/faster-ndl branch on April 24, 2025 at 13:09
@oscardssmith (Member) commented

thanks!

LebedevRI pushed a commit to LebedevRI/julia that referenced this pull request May 2, 2025
It's hard to measure the improvement with single calls, but this change
substantially improves the situation in JuliaLang#50509, such that the new
versions of `randperm` etc. are almost always faster (even for big n).

Here are some example benchmarks. Note that the biggest ranges, like
`UInt(0):UInt(2)^64-2`, are the ones that exercise the "unlikely"
branch the most:
```julia
julia> const xx = Xoshiro(); using Chairmarks

julia> rands(rng, ns) = for i=ns
    rand(rng, zero(i):i)
end

julia> rands(ns) = for i=ns
    rand(zero(i):i)
end

julia> @b rand(xx, 1:100), rand(xx, UInt(0):UInt(2)^63), rand(xx, UInt(0):UInt(2)^64-3), rand(xx, UInt(0):UInt(2)^64-2), rand(xx, UInt(0):UInt(2)^64-1)
(1.968 ns, 8.000 ns, 3.321 ns, 3.321 ns, 2.152 ns) # PR
(2.151 ns, 7.284 ns, 2.151 ns, 2.151 ns, 2.151 ns) # master

julia> @b  rand(1:100), rand(UInt(0):UInt(2)^63), rand(UInt(0):UInt(2)^64-3), rand(UInt(0):UInt(2)^64-2),rand(UInt(0):UInt(2)^64-1) # with TaskLocalRNG
(2.148 ns, 7.837 ns, 3.317 ns, 3.085 ns, 1.957 ns) # PR
(3.128 ns, 8.275 ns, 3.324 ns, 3.324 ns, 1.955 ns) # master

julia> @b rands(xx, 1:100), rands(xx, UInt(2)^62:UInt(2)^59:UInt(2)^64-1), rands(xx, UInt(2)^64-4:UInt(2)^64-2)
(95.315 ns, 132.144 ns, 7.486 ns) # PR
(217.169 ns, 143.519 ns, 8.065 ns) # master

julia> @b rands(1:100), rands(UInt(2)^62:UInt(2)^59:UInt(2)^64-1), rands(UInt(2)^64-4:UInt(2)^64-2)
(235.882 ns, 162.809 ns, 10.603 ns) # PR
(202.524 ns, 132.869 ns, 7.631 ns) # master
```

So it's a bit tricky: with an explicit RNG, `rands(xx, 1:100)` becomes
much faster, but without one, `rands(1:100)` becomes slower.

Assuming JuliaLang#50509 was merged, `shuffle` is a good function to benchmark
`rand(1:n)`, and the changes here consistently improve performance, as
shown by this graph (when `TaskLocalRNG` is mentioned, it means *no* RNG
argument was passed to the function):

![new-rand-ndl](https://github.com/user-attachments/assets/b7a6229a-f5d9-408e-9102-4056b796d22c)

So although there can be slowdowns, I think this change is overall a
win.
rfourquet added a commit that referenced this pull request May 2, 2025
In #58089, this method took a small performance hit in some contexts.
It turns out that by outlining the unlikely branch which throws on empty
ranges, this hit can be recovered.

In #50509 (comment),
a graph of the performance improvement from the "speed-up randperm by using
our current rand(1:n)" change was posted, but I realized it was only true when
calls to `rand(1:n)` were prefixed by `@inline`; without `@inline` it was
overall slower for `TaskLocalRNG()` with very big arrays (but still faster
otherwise).

An alternative to these `@inline` annotations is to outline `throw`, as done here,
with benefits equivalent to `@inline` in that `randperm` PR.

Assuming that PR is merged, this PR improves performance by roughly 2x for
`TaskLocalRNG()` (no change for other RNGs).
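
For concreteness, here is a hypothetical sketch of the call-site alternative mentioned above (not code from the `randperm` PR): Julia 1.8+ supports `@inline` at the call site, so each hot caller would need an annotation like this, whereas outlining the `throw` lets a plain `rand(rng, 1:i)` call inline on its own.

```julia
using Random

# Hypothetical Fisher-Yates loop with a call-site @inline annotation,
# standing in for the approach that required annotating each caller.
function shuffle_inline!(rng::AbstractRNG, a::Vector)
    for i in length(a):-1:2
        j = @inline rand(rng, 1:i)   # force inlining of this particular call
        a[i], a[j] = a[j], a[i]
    end
    return a
end

shuffle_inline!(Xoshiro(), collect(1:10))
```
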
rfourquet added a commit that referenced this pull request May 9, 2025
In #58089, this method took a small performance hit in some contexts. It
turns out that by outlining the unlikely branch which throws on empty
ranges, this hit can be recovered.

In
#50509 (comment), a
graph of the performance improvement from the "speed-up randperm by using
our current rand(1:n)" change was posted, but I realized it was only true
when calls to `rand(1:n)` were prefixed by `@inline`; without `@inline` it
was overall slower for `TaskLocalRNG()` with very big arrays (but still
faster otherwise).

An alternative to these `@inline` annotations is to outline `throw`, as done
here, with benefits equivalent to `@inline` in that `randperm` PR.

Assuming that PR is merged, this PR improves performance by roughly 2x
for `TaskLocalRNG()` (no change for other RNGs):


![new-shuffle-outlinethrow](https://github.com/user-attachments/assets/8c0d4740-3bb4-4bcf-a49d-9af09426bec7)

While at it, I outlined a bunch of other unlikely throwing branches.

After that, #50509 can probably be merged, finally!
charleskawczynski pushed a commit to charleskawczynski/julia that referenced this pull request May 12, 2025

In JuliaLang#58089, this method took a small performance hit in some contexts. It
turns out that by outlining the unlikely branch which throws on empty
ranges, this hit can be recovered.

In
JuliaLang#50509 (comment), a
graph of the performance improvement from the "speed-up randperm by using
our current rand(1:n)" change was posted, but I realized it was only true
when calls to `rand(1:n)` were prefixed by `@inline`; without `@inline` it
was overall slower for `TaskLocalRNG()` with very big arrays (but still
faster otherwise).

An alternative to these `@inline` annotations is to outline `throw`, as done
here, with benefits equivalent to `@inline` in that `randperm` PR.

Assuming that PR is merged, this PR improves performance by roughly 2x
for `TaskLocalRNG()` (no change for other RNGs):


![new-shuffle-outlinethrow](https://github.com/user-attachments/assets/8c0d4740-3bb4-4bcf-a49d-9af09426bec7)

While at it, I outlined a bunch of other unlikely throwing branches.

After that, JuliaLang#50509 can probably be merged, finally!