"Free" dynamic dispatch for core SIMD? #178

emmatyping opened this issue Oct 29, 2021 · 7 comments
Labels
C-feature-request Category: a feature request, i.e. not implemented / a PR

Comments

@emmatyping

Hi! I'm very excited by the work on portable/core simd. Thank you for all of your work :)

I was looking over the API, and I was thinking it should be possible to get "free" runtime dispatch with a proc macro on top of portable SIMD, which would make portability really easy!

I'm thinking something like (more of a sketch than concrete proposal):

#[runtime_dispatch(sse,avx,avx2)]
fn some_simd_fn(...) {
    // uses portable simd API
}

This would generate code much like the example in the core::arch docs here: https://doc.rust-lang.org/core/arch/index.html#dynamic-cpu-feature-detection, where it would generate versions of the function for each target_feature, then make the actual main function dynamically dispatch to each implementation.
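As a rough illustration of what such a macro might expand to, here is a hand-written sketch for a single feature (AVX2). The function names `sum`, `sum_avx2`, and `sum_fallback` are hypothetical, and the bodies use plain scalar code as a stand-in for the portable SIMD API, which is still nightly-only. This is only one possible shape of the expansion, not what any existing macro actually emits.

```rust
// Portable fallback, compiled with the crate's baseline target features.
fn sum_fallback(xs: &[u32]) -> u32 {
    let mut acc = 0u32;
    for &x in xs {
        acc = acc.wrapping_add(x);
    }
    acc
}

// Same body, but compiled with AVX2 enabled, so the optimizer may use
// AVX2 instructions when it vectorizes the loop.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[u32]) -> u32 {
    let mut acc = 0u32;
    for &x in xs {
        acc = acc.wrapping_add(x);
    }
    acc
}

// The "main" function: checks the CPU at runtime and branches to the
// best available version.
pub fn sum(xs: &[u32]) -> u32 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just confirmed at runtime.
            return unsafe { sum_avx2(xs) };
        }
    }
    sum_fallback(xs)
}
```

Callers only ever see `sum`; on non-x86 targets the AVX2 version is compiled out and `sum` always takes the fallback.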

Anyway, I thought I'd open this and get your thoughts.

@emmatyping emmatyping added the C-feature-request Category: a feature request, i.e. not implemented / a PR label Oct 29, 2021
@thomcc
Member

thomcc commented Oct 29, 2021

https://crates.io/crates/multiversion exists for this currently, which is maintained by (our very own) @calebzulawski.

I'd like to see some mechanism of this in the stdlib, but there's a lot of... trickiness to it, which I think makes it better off in an external crate.

A big issue is that technically you can't do feature checking without libstd (i.e. you can't do it from no_std), since on many targets it can require OS-dependent machinery. On x86 though, you can just use cpuid (after a check to ensure it's available), which can be done in an OS-independent manner, e.g. from no_std.

This means that either (ignoring the bikeshed about the name):

  1. #[runtime_dispatch(...)] wouldn't work from no_std at all.
  2. #[runtime_dispatch(...)] would work from no_std on x86 and other targets that can do this in an OS-independent manner, and would require std on other targets.

Number 2 goes against the general way std::arch's feature detection is currently designed, so I... don't think it would happen. That said, these restrictions don't apply to a third party crate.

@thomcc
Member

thomcc commented Oct 29, 2021

An argument in favor of doing it in the compiler (e.g. having something like #[runtime_dispatch] be a builtin attribute macro) is that it knows all the compile flags, which can change whether or not it's better to implement this using conditionals vs an indirect function call (probably [1]).

This also means it could avoid going through any indirection in certain cases (for example, if it already knows the target_feature set). More generally, the compiler could understand and "see through" the abstraction if it were builtin, whereas (in practice, if not in theory) a library implementation of this will tend to function as an optimization barrier.

That said, I suspect proper usage of this kind of thing would be on larger functions in order to minimize the cost of the dispatch, and on these sorts of functions this kind of thing won't matter as much.

Footnotes

  1. At least in theory this can matter, especially for Spectre mitigations like retpolines.

    That said, in practice I personally haven't seen this, despite trying to find evidence of it. I've benchmarked a fair amount (trying to determine whether it was worth sending a PR changing the implementation of the ifunc emulation in https://github.com/BurntSushi/memchr/blob/8e1da98fee06d66c13e66c330e3a3dd6ccf0e3a0/src/memchr/x86/mod.rs), and from what I've seen it was always decidedly faster to go through an indirect function call, regardless of what flags I enabled.

    That said, there was a thread on zulip in the past where @joshtriplett and @calebzulawski indicated that for a certain set of flags, it's faster to implement this as a branch, and so probably there's some hardware where this is true, at least for some functions.
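For readers unfamiliar with the "indirect function call" alternative, the following is a minimal sketch of ifunc-style emulation: a global function pointer starts out pointing at a resolver, which detects features on the first call and then overwrites the pointer. It is loosely modeled on the general pattern used by the memchr code linked above, but all names here (`SUM_IMPL`, `resolve_then_call`, `sum_avx2_entry`, and so on) are hypothetical and the details are assumptions, not a copy of that implementation.

```rust
use std::mem;
use std::sync::atomic::{AtomicPtr, Ordering};

type SumFn = fn(&[u32]) -> u32;

// Starts out pointing at the resolver; replaced after the first call.
static SUM_IMPL: AtomicPtr<()> = AtomicPtr::new(resolve_then_call as *mut ());

fn sum_scalar(xs: &[u32]) -> u32 {
    xs.iter().fold(0u32, |acc, &x| acc.wrapping_add(x))
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[u32]) -> u32 {
    let mut acc = 0u32;
    for &x in xs {
        acc = acc.wrapping_add(x);
    }
    acc
}

// Safe wrapper so the AVX2 version fits the `SumFn` pointer type.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn sum_avx2_entry(xs: &[u32]) -> u32 {
    // SAFETY: this function is only stored in SUM_IMPL after AVX2 was detected.
    unsafe { sum_avx2(xs) }
}

// First-call path: pick an implementation, cache it, then run it.
fn resolve_then_call(xs: &[u32]) -> u32 {
    #[allow(unused_mut)]
    let mut chosen: SumFn = sum_scalar;
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            chosen = sum_avx2_entry;
        }
    }
    SUM_IMPL.store(chosen as *mut (), Ordering::Relaxed);
    chosen(xs)
}

// Public entry point: one atomic load plus one indirect call, no branch.
pub fn sum(xs: &[u32]) -> u32 {
    let raw = SUM_IMPL.load(Ordering::Relaxed);
    // SAFETY: only valid `SumFn` pointers are ever stored in SUM_IMPL.
    let f: SumFn = unsafe { mem::transmute::<*mut (), SumFn>(raw) };
    f(xs)
}
```

The branch-based alternative would instead test a cached feature flag on every call; which of the two is faster is exactly the open question discussed in the footnote.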

@emmatyping
Author

Oh huh, I should definitely try multiversion! Looks like exactly what I want :)

> A big issue is that technically you can't do feature checking without libstd

Yeah I figured the runtime detection might be trickier than I was expecting...

Is there much of a reason to have no_std if it is x86 only? I'm not sure if I see the use case of no_std on x86 only...

> That said, I suspect proper usage of this kind of thing would be on larger functions in order to minimize the cost of the dispatch, and on these sorts of functions this kind of thing won't matter as much.

I could also see it being useful for small functions that are kernels for larger functions, but maybe that'd get inlined anyway.

@thomcc
Member

thomcc commented Oct 30, 2021

> Is there much of a reason to have no_std if it is x86 only

You might only support the runtime dispatch on x86, and use the statically known feature for other targets (for example, thumbv7neon-linux-androideabi exists basically entirely so that target_feature="neon" (and probably some others) are known statically).

... But perhaps you're right, in general, and this isn't as much of an issue as it seems like it might be to me. I don't really love having a language feature that doesn't work for no_std though, especially when there are cases where it could.
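To make the "runtime dispatch on x86, statically known features elsewhere" idea concrete, here is a small sketch. The `Backend` enum and `select_backend` function are hypothetical names used only for illustration. The x86 branch uses std's runtime detection, while the other branch relies purely on `cfg!(target_feature = ...)`, which is resolved at compile time and therefore also works without std.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Avx2,
    Neon,
    Scalar,
}

// On x86, ask the CPU at runtime.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn select_backend() -> Backend {
    if is_x86_feature_detected!("avx2") {
        Backend::Avx2
    } else {
        Backend::Scalar
    }
}

// Elsewhere, use only what the target guarantees at compile time.
// `cfg!` expands to a constant, so no detection code is emitted.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn select_backend() -> Backend {
    if cfg!(target_feature = "neon") {
        Backend::Neon
    } else {
        Backend::Scalar
    }
}
```

On a target where NEON is part of the baseline, the second version always returns `Backend::Neon` without any runtime check.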

> I could also see it being useful for small functions that are kernels for larger functions, but maybe that'd get inlined anyway.

It probably wouldn't with the multiversion crate (or any library implementation), but this is what I meant by the compiler being able to avoid the dispatch "if it already knows the target_feature set" (and yeah, I agree it's somewhat desirable).

@calebzulawski
Member

> I could also see it being useful for small functions that are kernels for larger functions, but maybe that'd get inlined anyway.

For small functions you shouldn't really be dispatching them at runtime; it's probably better to dispatch once at a larger scope.
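One way to read "dispatch once at a larger scope" is sketched below: the small kernel is an ordinary `#[inline(always)]` function with no dispatch of its own, and only the outer loop has a feature-specific version. All names (`axpy`, `axpy_body`, `axpy_kernel`, `axpy_avx2`) are hypothetical, and this is just one possible arrangement under those assumptions.

```rust
// Small kernel: no dispatch here, it is meant to be inlined into its caller.
#[inline(always)]
fn axpy_kernel(a: f32, x: f32, y: f32) -> f32 {
    a * x + y
}

// Shared loop body, also inlined, so it is compiled with whatever
// target features the calling function enables.
#[inline(always)]
fn axpy_body(a: f32, xs: &[f32], ys: &mut [f32]) {
    for (y, &x) in ys.iter_mut().zip(xs) {
        *y = axpy_kernel(a, x, *y);
    }
}

// AVX2 version of the whole loop, not of the kernel.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn axpy_avx2(a: f32, xs: &[f32], ys: &mut [f32]) {
    axpy_body(a, xs, ys)
}

// The only place where a runtime check happens: once per call, not per element.
pub fn axpy(a: f32, xs: &[f32], ys: &mut [f32]) {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just confirmed at runtime.
            unsafe { axpy_avx2(a, xs, ys) };
            return;
        }
    }
    axpy_body(a, xs, ys)
}
```

The cost of the feature check is then amortized over the whole slice instead of being paid on every kernel invocation.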

@calebzulawski
Member

> It probably wouldn't with the multiversion crate (or any library implementation), but this is what I meant by the compiler being able to avoid the dispatch "if it already knows the target_feature set" (and yeah, I agree it's somewhat desirable).

With multiversion, LLVM will remove the branch when possible. I've thought about adding in an explicit check that if the first variant is present at compile time, the entire dispatch mechanism is skipped.

@workingjubilee
Member

workingjubilee commented Nov 1, 2021

Properly speaking, you should not just check cpuid, but also xgetbv. Architectural state requires OS support to preserve it on x86 (this is correctly handled by Rust's own feature detection code).
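A rough sketch of what checking both cpuid and xgetbv could look like for AVX2, using only `core::arch` intrinsics so that it would also work from no_std. The function name `avx2_usable` is hypothetical, the sketch is limited to x86_64 (where cpuid is always present, so the availability check needed on 32-bit x86 is skipped), and it is a simplified illustration rather than a copy of Rust's own detection code.

```rust
#[cfg(target_arch = "x86_64")]
pub fn avx2_usable() -> bool {
    use core::arch::x86_64::{__cpuid, __cpuid_count, _xgetbv};

    // SAFETY: the cpuid instruction is always available on x86_64.
    let (max_leaf, leaf1) = unsafe { (__cpuid(0).eax, __cpuid(1)) };

    // CPUID.1:ECX bit 27 = OSXSAVE (OS has enabled xgetbv/XSAVE),
    // CPUID.1:ECX bit 28 = AVX.
    let has_osxsave = leaf1.ecx & (1 << 27) != 0;
    let has_avx = leaf1.ecx & (1 << 28) != 0;
    if !(has_osxsave && has_avx) || max_leaf < 7 {
        return false;
    }

    // SAFETY: OSXSAVE is set, so executing xgetbv will not fault.
    let xcr0 = unsafe { _xgetbv(0) };
    // XCR0 bits 1 and 2: the OS saves and restores XMM and YMM state.
    let os_saves_ymm = xcr0 & 0b110 == 0b110;

    // CPUID.(EAX=7,ECX=0):EBX bit 5 = AVX2.
    // SAFETY: leaf 7 is supported because max_leaf >= 7.
    let leaf7 = unsafe { __cpuid_count(7, 0) };
    os_saves_ymm && leaf7.ebx & (1 << 5) != 0
}

// On other architectures this sketch simply reports "not usable".
#[cfg(not(target_arch = "x86_64"))]
pub fn avx2_usable() -> bool {
    false
}
```

Without the XCR0 check, a CPU could report AVX2 in cpuid while the OS does not preserve the wider registers across context switches, which is the situation this comment warns about.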


4 participants