Vectorize optimized_portable_ops versions of portable ops? #9241

Comments
Sketch of more detailed plan:
This is trickier than I remembered. We outline our loads in elementwise_util.h in the name of build time and code size; getting to a point where vectorization would be the next item on the list would take some time. Accordingly, I am going to skip elementwise_util.h for now and move on to unary_ufunc; I will come back to elementwise afterwards.
This isn't right: the unary_ufunc_* ops in pattern.h seem to be somewhat redundant with elementwise_util.h.
Plan above updated to reflect resolution of my confusion about whether to start with elementwise_util or unary_ufunc_*. (In short, unary_ufunc_* will call through to elementwise_util.)
Do you have an example of this, say for the add refactor that you described in the summary of the RFC?
Do you have an example of this?
Can we not have an implementation of Vectorized that is just scalar? This is already the case in PyTorch core and ExecuTorch, no?
I don't particularly want to sign up to ensure that we generate code for that case that is just as good as writing scalar code directly.
I'm running into trouble with my intended implementation here. Apparently, SFINAE can't be used together with generic lambdas to detect whether they will actually compile when passed an argument of a particular type. See #9432; I expect to resolve this tomorrow after sleeping on the problem.
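For illustration, a self-contained sketch of the problem; the trait name `can_call` is made up here, but it is the standard `std::void_t` detection idiom:

```cpp
#include <cmath>
#include <type_traits>
#include <utility>

// Hypothetical detector: is F invocable with Arg?
template <typename F, typename Arg, typename = void>
struct can_call : std::false_type {};

template <typename F, typename Arg>
struct can_call<
    F,
    Arg,
    std::void_t<decltype(std::declval<F>()(std::declval<Arg>()))>>
    : std::true_type {};

struct NotANumber {};  // no exp() overload exists for this type

int main() {
  // Generic lambda with a deduced return type.
  auto f = [](auto x) { return std::exp(x); };

  // Fine: instantiating the body for float succeeds.
  static_assert(can_call<decltype(f), float>::value, "");

  // NOT a SFINAE-friendly 'false': forming decltype(f(NotANumber{}))
  // requires deducing the lambda's return type, which instantiates the
  // body, and the failure of std::exp(NotANumber) there is a hard error
  // outside the immediate context. Uncommenting this breaks the build:
  // static_assert(!can_call<decltype(f), NotANumber>::value, "");
}
```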
Updated #9432 with documentation about the SFINAE + generic lambda issue. I suspect I only ran into this because of the way I sequenced my stack; if I move the following steps in my plan before vectorizing elementwise_util, I expect better results.
OK, I get that, but I don't get whether there will be other functions that will call into this lambda using Vectorized.
In particular, it's important to note that if you wanted at::vec::Vectorized to work nicely in this mode, you would presumably need a default Vectorized::size() of 1, so that the Vectorized code path degenerates to ordinary scalar code.
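For illustration, a minimal sketch (not the actual at::vec API) of a scalar Vectorized-workalike whose size() is 1. By contrast, if I recall correctly, PyTorch's non-SIMD fallback Vectorized<T> stores VECTOR_WIDTH / sizeof(T) elements and loops over them, which is the codegen concern above:

```cpp
#include <cmath>

// Sketch only: a scalar "Vectorized" whose size() is 1, so a kernel
// written against this interface compiles down to plain scalar code.
template <typename T>
struct ScalarVectorized {
  T val;
  static constexpr int size() { return 1; }
  static ScalarVectorized loadu(const T* p) { return {*p}; }
  void store(T* p) const { *p = val; }
  friend ScalarVectorized operator+(ScalarVectorized a, ScalarVectorized b) {
    return {a.val + b.val};
  }
  friend ScalarVectorized exp(ScalarVectorized a) {  // ADL-findable exp
    return {std::exp(a.val)};
  }
};
```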
Yeah, that's a bit unfortunate. OK, so separately: can it still be called a portable fallback, though? I didn't look through the header to verify that it doesn't make platform-specific assumptions, but I would have guessed it could be considered portable, in the sense that it compiles on any platform that has a C/C++ compiler with C++17 or later support.
Just curious: did you consider copying a portable op into an optimized op and SIMD-ifying it there? The distinction is that instead of vector-with-scalar-fallback living in the portable op .cpp file, it would live in optimized.yaml with a portable.yaml fallback during selective build.
That would be worse, because we would then have copy-pasted code.
Mixed dtype should be uncommon. Here is how we can specialize for the common case. Prepares us to tackle #9241.

Test Plan: automated tests on this PR verify we didn't break the now-deprecated runtime_out_dtypes mode; tests on the next PR will verify that everything works after migration. Also included migration for exactly one operator, op_mul, to verify that the new code compiles.

To check performance, I edited examples/models/toy_model/model.py so that MulModule used inputs of size 3000×2000 instead of 3×2. I exported it with `python3 -m examples.portable.scripts.export --model_name mul` and saved the resulting `mul.pte`. Then I built in release mode with optimized kernels on, but with mul.out removed from kernels/optimized/optimized.yaml, so that we would use the optimized_portable_kernels build of kernels/portable/op_mul.cpp. Finally, I ran 3 trials on my M1 MacBook Pro using `cmake-out/executor_runner --model_path mul3kby2k.pte --num_executions 1000 --cpu_threads 2`.

Resulting times for 1000 iterations, in ms:

- Previous diff: 8295, 8187, 8139
- This diff: 2953, 2806, 2861

(For comparison, the actual optimized mul kernel took around 1000 ms to run 1000 iterations, and #9432 later in the stack arrived at similar numbers.)
🚀 The feature, motivation and pitch
Similarly to #8932, we should be able to conditionally compile portable ops to do some vectorization. I imagine that this would look like either passing a second lambda to our util functions, or perhaps passing template lambdas that we could then use for both some scalar T and also Vectorized<T>. The second option would require us to get an std-workalike interface to Vectorized operations so that things like exp would work seamlessly, which probably would have a similar solution to pytorch/pytorch#144495.

RFC
As a concrete example, op_add currently calls a util workhorse function with a lambda:
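A sketch of the current call shape, assuming the apply_bitensor_elementwise_fn helper from elementwise_util.h (the exact signature is an assumption, and the trailing arguments are abbreviated):

```cpp
// Current style: a plain lambda over the compute type.
utils::apply_bitensor_elementwise_fn<CTYPE_COMPUTE, op_name>(
    [val_alpha](const CTYPE_COMPUTE val_a, const CTYPE_COMPUTE val_b) {
      return val_a + val_alpha * val_b;
    },
    ctx, a, /* ... */ b, /* ... */ out /* ... */);
```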
We could imagine instead making the call look like this, with a template lambda, so that we could seamlessly use the lambda with Vectorized:
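A minimal sketch of that version, with the same caveats as above:

```cpp
// Possible future style: a generic lambda, so the same body can be
// instantiated with CTYPE_COMPUTE or at::vec::Vectorized<CTYPE_COMPUTE>.
// (The scalar val_alpha may need wrapping as a broadcast Vectorized for
// the vector instantiation, depending on the available operator overloads.)
utils::apply_bitensor_elementwise_fn<CTYPE_COMPUTE, op_name>(
    [val_alpha](const auto val_a, const auto val_b) {
      return val_a + val_alpha * val_b;
    },
    ctx, a, /* ... */ b, /* ... */ out /* ... */);
```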
A second, harder example is op_exp.
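For reference, a rough sketch of its current shape, which delegates std::exp to one of the unary_ufunc_* helpers in pattern.h (the specific helper name and signature here are assumptions):

```cpp
// Sketch of the current op_exp: a scalar function pointer goes to a
// pattern.h helper that handles dtype dispatch and the elementwise loop.
Tensor& exp_out(KernelRuntimeContext& ctx, const Tensor& in, Tensor& out) {
  return internal::unary_ufunc_realhbbf16_to_floathbf16(std::exp, ctx, in, out);
}
```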
I think ideally we would find a solution to the above-mentioned PyTorch issue and then write this using a template lambda that could be instantiated with either a scalar or Vectorized, as outlined above.
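A minimal sketch of that shape; the helper name, and its ability to accept a generic callable, are assumptions:

```cpp
// Desired style: one generic lambda usable with both float and
// at::vec::Vectorized<float>, given an ADL-findable exp() for Vectorized.
Tensor& exp_out(KernelRuntimeContext& ctx, const Tensor& in, Tensor& out) {
  return internal::unary_ufunc_realhbbf16_to_floathbf16(
      [](auto x) {
        using std::exp;  // scalars use std::exp; Vectorized uses ADL exp()
        return exp(x);
      },
      ctx, in, out);
}
```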
cc @larryliu0820 @manuelcandales