Vectorize optimized_portable_ops versions of portable ops? #9241

Comments
Sketch of more detailed plan:
This is trickier than I remembered. We outline our loads in elementwise_util.h in the name of build time and code size; getting to a point where vectorization would be the next item on the list would take some time. Accordingly, I am going to skip elementwise_util.h for now and move on to unary_ufunc; I will come back to elementwise afterwards.
This isn't right: the unary_ufunc_* ops in pattern.h seem to be somewhat redundant with elementwise_util.h.
Plan above updated to reflect resolution of my confusion about whether to start with elementwise_util or unary_ufunc_*. (In short, unary_ufunc_* will call through to elementwise_util.)
Do you have an example of this, say for the add refactor that you described in the summary of the RFC?
Do you have an example of this?
Can we not have an implementation of Vectorized that is just scalar? This is already the case in PyTorch core and ExecuTorch, no?
I don't particularly want to sign up to ensure that we generate code for that case that is just as good as writing scalar code directly.
I'm running into trouble with my intended implementation here. Apparently, SFINAE can't be used together with generic lambdas to detect whether they will actually compile when passed an argument of a particular type. See #9432; I expect to resolve this tomorrow after sleeping on the problem.
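For illustration, a self-contained sketch of the problem; the trait name `can_call` is made up here, but it is the standard `std::void_t` detection idiom:

```cpp
#include <cmath>
#include <type_traits>
#include <utility>

// Hypothetical detector: is F invocable with Arg?
template <typename F, typename Arg, typename = void>
struct can_call : std::false_type {};

template <typename F, typename Arg>
struct can_call<
    F,
    Arg,
    std::void_t<decltype(std::declval<F>()(std::declval<Arg>()))>>
    : std::true_type {};

struct NotANumber {};  // no exp() overload exists for this type

int main() {
  // Generic lambda with a deduced return type.
  auto f = [](auto x) { return std::exp(x); };

  // Fine: instantiating the body for float succeeds.
  static_assert(can_call<decltype(f), float>::value, "");

  // NOT a SFINAE-friendly 'false': forming decltype(f(NotANumber{}))
  // requires deducing the lambda's return type, which instantiates the
  // body, and the failure of std::exp(NotANumber) there is a hard error
  // outside the immediate context. Uncommenting this breaks the build:
  // static_assert(!can_call<decltype(f), NotANumber>::value, "");
}
```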
Updated #9432 with documentation about the SFINAE + generic lambda issue. I suspect I only ran into this because of the way I sequenced my stack; if I move the following steps in my plan before vectorizing elementwise_util, I expect better results.
OK, I get that, but I don't get whether there will be other functions that will call into this lambda using Vectorized.
In particular, it's important to note that if you wanted at::vec::Vectorized to work nicely in this mode, you would presumably need a default Vectorized::size() of 1, so that the Vectorized code path degenerates to ordinary scalar code.
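For illustration, a minimal sketch (not the actual at::vec API) of a scalar Vectorized-workalike whose size() is 1. By contrast, if I recall correctly, PyTorch's non-SIMD fallback Vectorized<T> stores VECTOR_WIDTH / sizeof(T) elements and loops over them, which is the codegen concern above:

```cpp
#include <cmath>

// Sketch only: a scalar "Vectorized" whose size() is 1, so a kernel
// written against this interface compiles down to plain scalar code.
template <typename T>
struct ScalarVectorized {
  T val;
  static constexpr int size() { return 1; }
  static ScalarVectorized loadu(const T* p) { return {*p}; }
  void store(T* p) const { *p = val; }
  friend ScalarVectorized operator+(ScalarVectorized a, ScalarVectorized b) {
    return {a.val + b.val};
  }
  friend ScalarVectorized exp(ScalarVectorized a) {  // ADL-findable exp
    return {std::exp(a.val)};
  }
};
```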
Yeah, that's a bit unfortunate. OK, so separately: can it still be called a portable fallback, though? I didn't look through the header to verify that it doesn't make platform-specific assumptions, but I would have guessed it could be considered portable, in the sense that it compiles on any platform that has a C/C++ compiler with C++17 or later support.
Just curious: did you consider copying a portable op into an optimized op and SIMD-ifying it there? The distinction is that instead of vector-with-scalar-fallback living in the portable op .cpp file, it would live in optimized.yaml with a portable.yaml fallback during selective build.
That would be worse, because we would then have copy-pasted code.
Mixed dtype should be uncommon. Here is how we can specialize for the common case. Prepares us to tackle #9241.

Test Plan: automated tests on this PR verify we didn't break the now-deprecated runtime_out_dtypes mode; tests on the next PR will verify that everything works after migration. Also included migration for exactly one operator, op_mul, to verify that the new code compiles.

To check performance, I edited examples/models/toy_model/model.py so that MulModule used inputs of size 3000×2000 instead of 3×2. I exported it with `python3 -m examples.portable.scripts.export --model_name mul` and saved the resulting `mul.pte`. Then I built in release mode with optimized kernels on, but with mul.out removed from kernels/optimized/optimized.yaml, so that we would use the optimized_portable_kernels build of kernels/portable/op_mul.cpp. Finally, I ran 3 trials on my M1 MacBook Pro using `cmake-out/executor_runner --model_path mul3kby2k.pte --num_executions 1000 --cpu_threads 2`.

Resulting times for 1000 iterations, in ms:

- Previous diff: 8295, 8187, 8139
- This diff: 2953, 2806, 2861

(For comparison, the actual optimized mul kernel took around 1000 ms to run 1000 iterations, and #9432 later in the stack arrived at similar numbers.)
🚀 The feature, motivation and pitch
Similarly to #8932, we should be able to conditionally compile portable ops to do some vectorization. I imagine that this would look like either passing a second lambda to our util functions, or perhaps passing template lambdas that we could then use for both some scalar T and also Vectorized<T>. The second option would require us to get an std-workalike interface to Vectorized operations so that things like exp would work seamlessly, which probably would have a similar solution to pytorch/pytorch#144495.

RFC
As a concrete example, op_add currently calls a util workhorse function with a lambda:
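A sketch of the current call shape, assuming the apply_bitensor_elementwise_fn helper from elementwise_util.h (the exact signature is an assumption, and the trailing arguments are abbreviated):

```cpp
// Current style: a plain lambda over the compute type.
utils::apply_bitensor_elementwise_fn<CTYPE_COMPUTE, op_name>(
    [val_alpha](const CTYPE_COMPUTE val_a, const CTYPE_COMPUTE val_b) {
      return val_a + val_alpha * val_b;
    },
    ctx, a, /* ... */ b, /* ... */ out /* ... */);
```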
We could imagine instead making the call look like this, with a template lambda, so that we could seamlessly use the lambda with Vectorized:
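A minimal sketch of that version, with the same caveats as above:

```cpp
// Possible future style: a generic lambda, so the same body can be
// instantiated with CTYPE_COMPUTE or at::vec::Vectorized<CTYPE_COMPUTE>.
// (The scalar val_alpha may need wrapping as a broadcast Vectorized for
// the vector instantiation, depending on the available operator overloads.)
utils::apply_bitensor_elementwise_fn<CTYPE_COMPUTE, op_name>(
    [val_alpha](const auto val_a, const auto val_b) {
      return val_a + val_alpha * val_b;
    },
    ctx, a, /* ... */ b, /* ... */ out /* ... */);
```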
A second, harder example is op_exp.
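For reference, a rough sketch of its current shape, which delegates std::exp to one of the unary_ufunc_* helpers in pattern.h (the specific helper name and signature here are assumptions):

```cpp
// Sketch of the current op_exp: a scalar function pointer goes to a
// pattern.h helper that handles dtype dispatch and the elementwise loop.
Tensor& exp_out(KernelRuntimeContext& ctx, const Tensor& in, Tensor& out) {
  return internal::unary_ufunc_realhbbf16_to_floathbf16(std::exp, ctx, in, out);
}
```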
I think ideally we would find a solution to the above-mentioned PyTorch issue and then write this using a template lambda that could be instantiated with either a scalar or Vectorized, as outlined above.
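A minimal sketch of that shape; the helper name, and its ability to accept a generic callable, are assumptions:

```cpp
// Desired style: one generic lambda usable with both float and
// at::vec::Vectorized<float>, given an ADL-findable exp() for Vectorized.
Tensor& exp_out(KernelRuntimeContext& ctx, const Tensor& in, Tensor& out) {
  return internal::unary_ufunc_realhbbf16_to_floathbf16(
      [](auto x) {
        using std::exp;  // scalars use std::exp; Vectorized uses ADL exp()
        return exp(x);
      },
      ctx, in, out);
}
```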
cc @larryliu0820 @manuelcandales