Proposal to add fixed-point multiplication instructions #221
Thanks for filing this issue. As we're at Phase 3 of the current SIMD proposal, we've put a soft freeze on the addition of new operations, as discussed in #203. Some interesting questions to answer: the issue description highlights the usage of these instructions in neural network applications; are there other applications that would benefit from this addition? And what would the corresponding codegen look like for the 32x32=32 case on Intel platforms?
One use case is to use it with SSE4 (or is it AVX?) to perform a logical shift per SIMD lane. With SSE, logical shifts either take an immediate value, shifting all lanes by the same amount, or they take the shift amount as a u64 value, again shifting all lanes by the same amount. It is thus not possible to shift each lane by a different amount. One way to achieve this is with an integer multiplication, but it is only worth it when the intrinsic is available: it avoids the need to swizzle each lane, shift it, and reconstruct the vector. I'm not sure the multiplication is actually faster, but it uses far fewer instructions and registers, and it inlines better. I intend to use this trick in my decompression code path, where a vector3 is packed into a variable number of bits (each lane having the same number of bits). Due to bit alignment, the value needs to be shifted once it is loaded; using the bit offset, a lookup table can provide a shift value per lane.
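A minimal sketch of the left-shift case, assuming SSE4.1 is available (`_mm_mullo_epi32`); the helper name and the lookup-table layout are illustrative, not taken from the code described above:

```c
#include <smmintrin.h>  // SSE4.1

// Shifting lane i left by k[i] is the same as multiplying lane i by 1 << k[i].
// _mm_mullo_epi32 (SSE4.1) does all four 32-bit multiplies at once, so no
// per-lane extract/shift/reinsert sequence is needed.
static inline __m128i shift_left_per_lane(__m128i v, __m128i pow2_of_shift) {
    return _mm_mullo_epi32(v, pow2_of_shift);
}

// Example: a lookup-table entry encoding per-lane shift amounts {3, 1, 4, 0}.
// __m128i entry = _mm_setr_epi32(1 << 3, 1 << 1, 1 << 4, 1 << 0);
```

Variable right shifts are less direct with this trick, since a multiplication only gives the left-shifted or high-half result; that is where a widening or high-half multiply instruction earns its keep.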
Real-world application developers like me are only going to start looking at the proposal once it's far enough into implementation. To say that the instruction set is soft-frozen at this stage is to say that it will only be marginally informed by real-world usage.
Multiple CPU architectures support it; some are listed here. Intel's neon_2_sse compatibility header implements it as well.
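The header's actual code is not quoted in this thread, but to give a sense of what an SSE fallback involves, here is a hedged sketch of a 4-lane 32-bit SQRDMULH using SSE4.1 intrinsics; the function name and structure are illustrative, not the neon_2_sse source:

```c
#include <smmintrin.h>  // SSE4.1
#include <stdint.h>

// Each lane computes saturate((2*a*b + 2^31) >> 32), the SQRDMULH definition.
static inline __m128i sqrdmulh_epi32_sse41(__m128i a, __m128i b) {
    const __m128i rnd = _mm_set1_epi64x(1LL << 31);
    // Signed 32x32=64 products of lanes {0,2} and of lanes {1,3}.
    __m128i p02 = _mm_mul_epi32(a, b);
    __m128i p13 = _mm_mul_epi32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));
    // 2*a*b + 2^31 in 64-bit lanes (wraps only for INT32_MIN * INT32_MIN).
    p02 = _mm_add_epi64(_mm_add_epi64(p02, p02), rnd);
    p13 = _mm_add_epi64(_mm_add_epi64(p13, p13), rnd);
    // Keep the high 32 bits of each 64-bit lane and re-interleave the results.
    __m128i hi02 = _mm_srli_epi64(p02, 32);                        // lanes 0, 2
    __m128i hi13 = _mm_and_si128(p13, _mm_set_epi32(-1, 0, -1, 0)); // lanes 1, 3
    __m128i res  = _mm_or_si128(hi02, hi13);
    // The overflow case surfaces as INT32_MIN, which no valid product can
    // produce; XOR-ing those lanes with all-ones saturates them to INT32_MAX.
    return _mm_xor_si128(res, _mm_cmpeq_epi32(res, _mm_set1_epi32(INT32_MIN)));
}
```

Even this sketch takes roughly ten instructions per vector, which is part of the argument for a dedicated instruction.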
While we've only progressed through the official phases, the Chrome implementation has been tracking the latest version of this proposal for more than a year now. Moving this proposal specifically to the implementation phase required demonstrating that it was useful and performant for multiple real-world use cases. So I would argue that this proposal is informed by real-world usage, though I sympathize that it may not be optimal for your particular use case.

The reason for a soft freeze on the opcodes is to give implementations and tools some room to catch up to the current proposal. Also, given the nature of the SIMD proposal, there is potentially a long tail of operations, so unfortunately we do have to draw a line in the sand in the interest of forward progress. That said, we did discuss in #203 that if there are very compelling reasons to consider adding new operations (that were not already filed at the time), we should evaluate them on a case-by-case basis. If this is something that you would like to push for, please submit a PR with the proposed semantics.
This is covered by #365.
(Branched from Issue #175)
It would be useful to have fixed-point multiplication instructions, e.g. 32x32=32 and 16x16=16, similar to ARM SQRDMULH.
Some may think that the availability of a 32x32=64 integer multiplication (Issue #175) would remove the need for this, but that would be sub-optimal: staying within 32 bits means doing 4 scalar operations per 128-bit vector operation, and most applications want to use the rounding flavor (SQRDMULH, not SQDMULH), which would require a few more instructions to emulate if the instruction is missing. In practice, that would result in applications making compromises between accuracy and performance.
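For reference, a scalar sketch of the 32-bit SQRDMULH semantics, following the ARM definition (saturating, rounding, doubling multiply returning the high half); the function name is illustrative:

```c
#include <stdint.h>

// saturate((2*a*b + 2^31) >> 32); written as (a*b + 2^30) >> 31 so the
// doubling cannot overflow the 64-bit intermediate.
static inline int32_t sqrdmulh_s32(int32_t a, int32_t b) {
    int64_t ab = (int64_t)a * (int64_t)b;      // the 32x32=64 product of #175
    int64_t r  = (ab + (1LL << 30)) >> 31;     // double, round, take high half
    return r > INT32_MAX ? INT32_MAX : (int32_t)r;  // only INT32_MIN*INT32_MIN saturates
}
```

Note the rounding add and the saturation check: these are the extra steps an application would have to emulate on top of a plain 32x32=64 multiply.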
(This is critical to integer-quantized neural network applications as performed by TensorFlow Lite using the ruy matrix multiplication library; see e.g. the usage of these instructions here: https://github.com/google/ruy/blob/57e64b4c8f32e813ce46eb495d13ee301826e498/ruy/kernel_arm64.cc#L517 )