"Free" dynamic dispatch for core SIMD? #178

emmatyping · 2021-10-29T22:12:06Z

Hi! I'm very excited by the work on portable/core simd. Thank you for all of your work :)

I was looking over the API, and I was thinking it should be possible to get "free" runtime dispatch with a proc macro on top of portable SIMD, which would make portability really easy!

I'm thinking something like (more of a sketch than concrete proposal):

#[runtime_dispatch(sse,avx,avx2)]
fn some_simd_fn(...) {
    // uses portable simd API
}

This would generate code much like the example in the core::arch docs here: https://doc.rust-lang.org/core/arch/index.html#dynamic-cpu-feature-detection, where it would generate versions of the function for each target_feature, then make the actual main function dynamically dispatch to each implementation.
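A minimal sketch of what such an expansion might look like, written by hand on stable Rust. The attribute `#[runtime_dispatch]` does not exist; the names `scale`, `scale_body`, `scale_sse`, `scale_avx`, and `scale_avx2` are hypothetical, and a plain loop stands in for the portable SIMD API (which is nightly-only). The pattern follows the `core::arch` docs: one `#[target_feature]` clone per feature set, plus a dispatcher that tries the most capable one first.

```rust
// Shared body. `#[inline(always)]` lets it be inlined into each clone below,
// where it is compiled with that clone's target features enabled.
#[inline(always)]
fn scale_body(xs: &mut [f32], k: f32) {
    for x in xs.iter_mut() {
        *x *= k;
    }
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn scale_avx2(xs: &mut [f32], k: f32) {
    scale_body(xs, k)
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx")]
unsafe fn scale_avx(xs: &mut [f32], k: f32) {
    scale_body(xs, k)
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "sse")]
unsafe fn scale_sse(xs: &mut [f32], k: f32) {
    scale_body(xs, k)
}

/// The "main" function: picks the best available clone at runtime.
pub fn scale(xs: &mut [f32], k: f32) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        // SAFETY (all three calls): the feature was detected just before the call.
        if is_x86_feature_detected!("avx2") {
            return unsafe { scale_avx2(xs, k) };
        }
        if is_x86_feature_detected!("avx") {
            return unsafe { scale_avx(xs, k) };
        }
        if is_x86_feature_detected!("sse") {
            return unsafe { scale_sse(xs, k) };
        }
    }
    // Fallback for other architectures or CPUs without these features.
    scale_body(xs, k)
}
```

Calling `scale(&mut v, 2.0)` doubles every element of `v` regardless of which clone ends up running; only the generated instructions differ.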

Anyway, I thought I'd open this and get your thoughts.

thomcc · 2021-10-29T22:22:46Z

https://crates.io/crates/multiversion exists for this currently, which is maintained by (our very own) @calebzulawski.

I'd like to see some mechanism of this in the stdlib, but there's a lot of... trickiness to it, which I think makes it better off in an external crate.

A big issue is that technically you can't do feature checking without libstd (i.e. you can't do it from no_std), since on many targets it can require OS-dependent machinery. On x86, though, you can just use cpuid (after a check to ensure it's available), which can be done in an OS-independent manner, e.g. from no_std.

This means that either (ignoring the bikeshed about the name):

1. #[runtime_dispatch(...)] wouldn't work from no_std at all.
2. #[runtime_dispatch(...)] would work from no_std on x86 and other targets that can do this in an OS-independent manner, and would require std on other targets.

Number 2 goes against the general way std::arch's feature detection is currently designed, so I... don't think it would happen. That said, these restrictions don't apply to a third party crate.

thomcc · 2021-10-29T22:41:37Z

An argument in favor of doing it in the compiler (e.g. having something like #[runtime_dispatch] be a builtin attribute macro) is that it knows all the compile flags, which can change whether or not it's better to implement this using conditionals vs an indirect function call (probably¹).

This also means it could avoid going through any indirection in certain cases (for example, if it already knows the target_feature set), and more generally, the compiler could understand and "see through" the abstraction if it were builtin, whereas (in practice, if not in theory) a library implementation of this will tend to function as an optimization barrier.

That said, I suspect proper usage of this kind of thing would be on larger functions in order to minimize the cost of the dispatch, and on these sorts of functions this kind of thing won't matter as much.

¹ At least in theory this can matter, especially for Spectre mitigations like retpolines.

That said, in practice I personally haven't seen this, despite trying to find evidence of it. I've benchmarked a fair amount (trying to determine whether it was worth sending a PR changing the implementation of the ifunc emulation in https://github.com/BurntSushi/memchr/blob/8e1da98fee06d66c13e66c330e3a3dd6ccf0e3a0/src/memchr/x86/mod.rs), and from what I've seen it was always decidedly faster to go through an indirect function call, regardless of what flags I enabled.

That said, there was a thread on Zulip in the past where @joshtriplett and @calebzulawski indicated that for a certain set of flags, it's faster to implement this as a branch, and so probably there's some hardware where this is true, at least for some functions.
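To make the "conditionals vs an indirect function call" distinction concrete, here is a rough sketch of both strategies side by side. All names (`sum_branch`, `sum_indirect`, `select_impl`, and so on) are hypothetical, and caching the pointer in a `OnceLock` is an assumption chosen for brevity; it is not how memchr's ifunc emulation is written.

```rust
use std::sync::OnceLock;

// Pointer type shared by all variants. It is `unsafe` because the AVX2
// variant must only be called on CPUs that support AVX2.
type SumFn = unsafe fn(&[u32]) -> u32;

#[inline(always)]
fn sum_body(xs: &[u32]) -> u32 {
    xs.iter().fold(0u32, |acc, &x| acc.wrapping_add(x))
}

// Baseline variant, compiled with the default target features.
fn sum_baseline(xs: &[u32]) -> u32 {
    sum_body(xs)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[u32]) -> u32 {
    sum_body(xs)
}

/// Strategy A: a conditional branch on (cached) feature detection per call.
pub fn sum_branch(xs: &[u32]) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 was detected at runtime.
            return unsafe { sum_avx2(xs) };
        }
    }
    sum_baseline(xs)
}

// Runs once: chooses the variant that later calls will jump to.
fn select_impl() -> SumFn {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return sum_avx2;
        }
    }
    sum_baseline
}

/// Strategy B: resolve a function pointer once, then call through it.
pub fn sum_indirect(xs: &[u32]) -> u32 {
    static IMPL: OnceLock<SumFn> = OnceLock::new();
    let f = *IMPL.get_or_init(select_impl);
    // SAFETY: `select_impl` only returns variants whose required features
    // were detected on this CPU.
    unsafe { f(xs) }
}
```

Both functions return the same result for the same input; which one is faster would have to be measured for a given function, set of flags, and hardware.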

emmatyping · 2021-10-30T00:02:29Z

Oh huh, I should definitely try multiversion! Looks like exactly what I want :)

> A big issue is that technically you can't do feature checking without libstd

Yeah I figured the runtime detection might be trickier than I was expecting...

Is there much of a reason to have no_std if it is x86 only? I'm not sure if I see the use case of no_std on x86 only...

> That said, I suspect proper usage of this kind of thing would be on larger functions in order to minimize the cost of the dispatch, and on these sorts of functions this kind of thing won't matter as much.

I could also see it being useful for small functions that are kernels for larger functions, but maybe that'd get inlined anyway.

thomcc · 2021-10-30T00:07:46Z

> Is there much of a reason to have no_std if it is x86 only

You might only support the runtime dispatch on x86, and use the statically known feature for other targets (for example, thumbv7neon-linux-androideabi exists basically entirely so that target_feature="neon" (and probably some others) are known statically).

... But perhaps you're right, in general, and this isn't as much of an issue as it seems like it might be to me. I don't really love having a language feature that doesn't work for no_std though, especially when there are cases where it could.

> I could also see it being useful for small functions that are kernels for larger functions, but maybe that'd get inlined anyway.

It probably wouldn't with the multiversion crate (or any library implementation), but this is what I meant by the compiler being able to avoid the dispatch "if it already knows the target_feature set" (and yeah, I agree it's somewhat desirable).

calebzulawski · 2021-10-30T00:09:07Z

> I could also see it being useful for small functions that are kernels for larger functions, but maybe that'd get inlined anyway.

For small functions you shouldn't really be dispatching them at runtime; it's probably better to dispatch once at a larger scope.

calebzulawski · 2021-10-30T00:13:16Z

> It probably wouldn't with the multiversion crate (or any library implementation), but this is what I meant by the compiler being able to avoid the dispatch "if it already knows the target_feature set" (and yeah, I agree it's somewhat desirable).

With multiversion, LLVM will remove the branch when possible. I've thought about adding in an explicit check that if the first variant is present at compile time, the entire dispatch mechanism is skipped.
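As a sketch of what such an explicit check could look like when written by hand (this is not multiversion's actual implementation, and the names `count_zeros`, `count_zeros_body`, and `count_zeros_avx2` are hypothetical): when the build already enables `avx2` statically, for example via `-C target-feature=+avx2`, the dispatcher is compiled out and the public function is just the body.

```rust
#[inline(always)]
fn count_zeros_body(xs: &[u8]) -> usize {
    xs.iter().filter(|&&b| b == 0).count()
}

// The AVX2 clone is only needed when AVX2 is not already enabled statically.
#[cfg(all(
    any(target_arch = "x86", target_arch = "x86_64"),
    not(target_feature = "avx2")
))]
#[target_feature(enable = "avx2")]
unsafe fn count_zeros_avx2(xs: &[u8]) -> usize {
    count_zeros_body(xs)
}

// Case 1: AVX2 is known at compile time, so there is no dispatch at all.
#[cfg(target_feature = "avx2")]
pub fn count_zeros(xs: &[u8]) -> usize {
    count_zeros_body(xs)
}

// Case 2: AVX2 is not known statically, so fall back to runtime detection.
#[cfg(not(target_feature = "avx2"))]
pub fn count_zeros(xs: &[u8]) -> usize {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 was detected at runtime.
            return unsafe { count_zeros_avx2(xs) };
        }
    }
    count_zeros_body(xs)
}
```

In case 1 the call is an ordinary direct call that can be inlined into its callers, which is what makes the small-kernel use case workable.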

workingjubilee · 2021-11-01T16:24:15Z

Properly speaking, you should not just check cpuid, but also xgetbv. Architectural state requires OS support to preserve it on x86 (this is correctly handled by Rust's own feature detection code).
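A minimal sketch of that two-step check for AVX on x86_64, using only `core::arch` (so it would also work from no_std). The function name `avx_usable` is hypothetical, and this is a simplified illustration rather than the code std uses. The bit positions are the architectural ones: CPUID leaf 1 ECX bit 27 is OSXSAVE and bit 28 is AVX, and XCR0 bits 1 and 2 cover the XMM and YMM state.

```rust
/// Returns true when the CPU supports AVX *and* the OS has enabled saving
/// and restoring the YMM register state.
#[cfg(target_arch = "x86_64")]
#[allow(unused_unsafe)] // some toolchains may expose these intrinsics as safe
pub fn avx_usable() -> bool {
    use core::arch::x86_64::{__cpuid, _xgetbv};

    // CPUID leaf 1: ECX bit 27 = OSXSAVE, ECX bit 28 = AVX.
    let leaf1 = unsafe { __cpuid(1) };
    let osxsave = (leaf1.ecx & (1 << 27)) != 0;
    let avx = (leaf1.ecx & (1 << 28)) != 0;
    if !(osxsave && avx) {
        return false;
    }

    // XGETBV is only executed after OSXSAVE has been confirmed above.
    // XCR0 bit 1 = SSE (XMM) state, bit 2 = AVX (YMM) state.
    let xcr0 = unsafe { _xgetbv(0) };
    (xcr0 & 0b110) == 0b110
}

/// On other architectures AVX is never available.
#[cfg(not(target_arch = "x86_64"))]
pub fn avx_usable() -> bool {
    false
}
```

On an x86_64 machine with std available, the result would be expected to agree with `is_x86_feature_detected!("avx")`.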

emmatyping added the C-feature-request label (Category: a feature request, i.e. not implemented / a PR) on Oct 29, 2021

yjhn mentioned this issue on Jun 8, 2022, in "Implement a faster stable sort algorithm" (rust-lang/rust#90545, closed)
