Proposal to add fixed-point multiplication instructions #221
Thanks for filing this issue. As we're at Phase 3 of the current SIMD proposal, we've put a soft freeze on the addition of new operations, as discussed in #203. Some interesting questions to answer: the issue description highlights the usage of these instructions in neural network applications; are there other applications that would benefit from this addition? And what would the corresponding codegen look like for the 32x32=32 case on Intel platforms?
One use case is to use it with SSE4 (or is it AVX?) to perform a logical shift per SIMD lane. With SSE, logical shifts either take an immediate value, shifting all lanes by the same amount, or they take the shift amount as a u64 value, again shifting all lanes by the same amount. It is thus not possible to shift each lane by a different amount. One way to achieve this is with an integer multiplication, but it is only worth it when the intrinsic is available: it avoids the need to swizzle each lane, shift it, and reconstruct the vector. I'm not sure the multiplication is actually faster, but it uses far fewer instructions and registers, and it inlines better. I intend to use this trick in my decompression code path, where a vector3 is packed into a variable number of bits (each lane having the same number of bits). Due to bit alignment, the value needs to be shifted once it is loaded; using the bit offset, a lookup table can provide a shift value per lane.
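A minimal sketch of the left-shift case, assuming SSE4.1 is available (`_mm_mullo_epi32`); the helper name and the lookup-table layout are illustrative, not taken from the code described above:

```c
#include <smmintrin.h>  // SSE4.1

// Shifting lane i left by k[i] is the same as multiplying lane i by 1 << k[i].
// _mm_mullo_epi32 (SSE4.1) does all four 32-bit multiplies at once, so no
// per-lane extract/shift/reinsert sequence is needed.
static inline __m128i shift_left_per_lane(__m128i v, __m128i pow2_of_shift) {
    return _mm_mullo_epi32(v, pow2_of_shift);
}

// Example: a lookup-table entry encoding per-lane shift amounts {3, 1, 4, 0}.
// __m128i entry = _mm_setr_epi32(1 << 3, 1 << 1, 1 << 4, 1 << 0);
```

Variable right shifts are less direct with this trick, since a multiplication only gives the left-shifted or high-half result; that is where a widening or high-half multiply instruction earns its keep.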
Real-world application developers like me are only going to start looking at the proposal once it's far enough into implementation. To say that the instruction set is soft-frozen at this stage is to say that it will only be marginally informed by real-world usage.
Multiple CPU architectures support it; some are listed here. Intel's neon_2_sse compatibility header implements it as well.
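The header's actual code is not quoted in this thread, but to give a sense of what an SSE fallback involves, here is a hedged sketch of a 4-lane 32-bit SQRDMULH using SSE4.1 intrinsics; the function name and structure are illustrative, not the neon_2_sse source:

```c
#include <smmintrin.h>  // SSE4.1
#include <stdint.h>

// Each lane computes saturate((2*a*b + 2^31) >> 32), the SQRDMULH definition.
static inline __m128i sqrdmulh_epi32_sse41(__m128i a, __m128i b) {
    const __m128i rnd = _mm_set1_epi64x(1LL << 31);
    // Signed 32x32=64 products of lanes {0,2} and of lanes {1,3}.
    __m128i p02 = _mm_mul_epi32(a, b);
    __m128i p13 = _mm_mul_epi32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));
    // 2*a*b + 2^31 in 64-bit lanes (wraps only for INT32_MIN * INT32_MIN).
    p02 = _mm_add_epi64(_mm_add_epi64(p02, p02), rnd);
    p13 = _mm_add_epi64(_mm_add_epi64(p13, p13), rnd);
    // Keep the high 32 bits of each 64-bit lane and re-interleave the results.
    __m128i hi02 = _mm_srli_epi64(p02, 32);                        // lanes 0, 2
    __m128i hi13 = _mm_and_si128(p13, _mm_set_epi32(-1, 0, -1, 0)); // lanes 1, 3
    __m128i res  = _mm_or_si128(hi02, hi13);
    // The overflow case surfaces as INT32_MIN, which no valid product can
    // produce; XOR-ing those lanes with all-ones saturates them to INT32_MAX.
    return _mm_xor_si128(res, _mm_cmpeq_epi32(res, _mm_set1_epi32(INT32_MIN)));
}
```

Even this sketch takes roughly ten instructions per vector, which is part of the argument for a dedicated instruction.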
While we've only progressed through the official phases, the Chrome implementation has been tracking the latest version of this proposal for more than a year now. Moving this proposal specifically to the implementation phase required demonstrating that it was useful and performant for multiple real-world use cases. So I would argue that this proposal is informed by real-world usage, though I sympathize that it may not be optimal for your particular use case.

The reason for a soft freeze on the opcodes is to give implementations and tools some room to catch up to the current proposal. Also, given the nature of the SIMD proposal, there is potentially a long tail of operations, so unfortunately we do have to draw a line in the sand in the interest of forward progress. That said, we did discuss in #203 that if there are very compelling reasons to consider adding new operations (that were not already filed at the time), we should evaluate them on a case-by-case basis. If this is something that you would like to push for, please submit a PR with the proposed semantics.
This is covered by #365.
(Branched from Issue #175)
It would be useful to have fixed-point multiplication instructions, e.g. 32x32=32 and 16x16=16, similar to ARM SQRDMULH.
Some may think that the availability of a 32x32=64 integer multiplication (Issue #175) would remove the need for this, but that would be sub-optimal: staying within 32 bits means doing 4 scalar operations per 128-bit vector operation, and most applications want to use the rounding flavor (SQRDMULH, not SQDMULH), which would require a few more instructions to emulate if the instruction is missing. In practice, that would result in applications making compromises between accuracy and performance.
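For reference, a scalar sketch of the 32-bit SQRDMULH semantics, following the ARM definition (saturating, rounding, doubling multiply returning the high half); the function name is illustrative:

```c
#include <stdint.h>

// saturate((2*a*b + 2^31) >> 32); written as (a*b + 2^30) >> 31 so the
// doubling cannot overflow the 64-bit intermediate.
static inline int32_t sqrdmulh_s32(int32_t a, int32_t b) {
    int64_t ab = (int64_t)a * (int64_t)b;      // the 32x32=64 product of #175
    int64_t r  = (ab + (1LL << 30)) >> 31;     // double, round, take high half
    return r > INT32_MAX ? INT32_MAX : (int32_t)r;  // only INT32_MIN*INT32_MIN saturates
}
```

Note the rounding add and the saturation check: these are the extra steps an application would have to emulate on top of a plain 32x32=64 multiply.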
(This is critical to integer-quantized neural network applications as performed by TensorFlow Lite using the ruy matrix multiplication library; see e.g. the usage of these instructions here: https://github.com/google/ruy/blob/57e64b4c8f32e813ce46eb495d13ee301826e498/ruy/kernel_arm64.cc#L517 )