This repository was archived by the owner on Dec 22, 2021. It is now read-only.
Rounding Average instructions #126
Merged
Introduction
Rounding Average of two integer inputs, defined as avg(a, b) := (a + b + 1) >> 1, is a common operation in fixed-point numerical algorithms, such as video and audio codecs and image filtering. Implementing Rounding Average directly in SIMD instruction sets by following the formula (a + b + 1) >> 1 is tricky: the intermediate sum a + b + 1 can overflow the datatype of the inputs, even though the final result always fits into that datatype. To avoid the expensive workaround of computing a + b + 1 in higher precision (e.g. extending inputs from 8-bit elements to 16-bit elements for the computation), all common SIMD instruction sets provide some form of Rounding Average instruction.

This PR introduces two new WebAssembly instructions for Rounding Average operations, i8x16.avgr_u and i16x8.avgr_u, which operate on vectors of unsigned 8-bit and unsigned 16-bit integers, respectively. These instructions match the form of the Rounding Average operation that is universally supported across x86, ARM, and POWER.

[October 31 update] Applications
Below are examples of optimized libraries using close equivalents of the proposed i8x16.avgr_u and i16x8.avgr_u instructions:

Mapping to Common Instruction Sets
This section illustrates how the new WebAssembly instructions can be lowered on common instruction sets. These patterns are provided only for convenience; compliant WebAssembly implementations do not have to follow the same code generation patterns.
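As a point of reference for the lowerings that follow, the lane-wise behaviour can be sketched in scalar C. This is an illustrative sketch of the formula (a + b + 1) >> 1 computed in a wider type, not text from the specification; the function names (avgr_u8, avgr_u16, avgr_u8_naive, i8x16_avgr_u_ref) are hypothetical and chosen only for this example. The naive variant keeps the sum in 8 bits to show the wrap-around that the dedicated instructions avoid.

```c
#include <assert.h>   /* only needed by callers that want to check results */
#include <stdint.h>

/* One lane of i8x16.avgr_u: widen to 16 bits so a + b + 1 cannot overflow. */
static uint8_t avgr_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(((uint16_t)a + (uint16_t)b + 1u) >> 1);
}

/* One lane of i16x8.avgr_u: widen to 32 bits for the same reason. */
static uint16_t avgr_u16(uint16_t a, uint16_t b) {
    return (uint16_t)(((uint32_t)a + (uint32_t)b + 1u) >> 1);
}

/* Naive variant: the sum is truncated to 8 bits before the shift,
 * so large inputs wrap around and give a wrong result. */
static uint8_t avgr_u8_naive(uint8_t a, uint8_t b) {
    uint8_t sum = (uint8_t)(a + b + 1);
    return (uint8_t)(sum >> 1);
}

/* Whole-vector reference: apply the lane operation to all 16 lanes. */
static void i8x16_avgr_u_ref(const uint8_t a[16], const uint8_t b[16],
                             uint8_t y[16]) {
    for (int i = 0; i < 16; i++) {
        y[i] = avgr_u8(a[i], b[i]);
    }
}
```

For inputs a = b = 255 the widened version returns 255, while the naive 8-bit version returns 127, because the sum 511 is truncated to 255 before the shift.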
x86/x86-64 processors with AVX instruction set
y = i8x16.avgr_u(a, b) is lowered to VPAVGB xmm_y, xmm_a, xmm_b
y = i16x8.avgr_u(a, b) is lowered to VPAVGW xmm_y, xmm_a, xmm_b
x86/x86-64 processors with SSE2 instruction set
a = i8x16.avgr_u(a, b) is lowered to PAVGB xmm_a, xmm_b
y = i8x16.avgr_u(a, b) is lowered to MOVDQA xmm_y, xmm_a + PAVGB xmm_y, xmm_b
a = i16x8.avgr_u(a, b) is lowered to PAVGW xmm_a, xmm_b
y = i16x8.avgr_u(a, b) is lowered to MOVDQA xmm_y, xmm_a + PAVGW xmm_y, xmm_b
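For readers who want to try the SSE2 mapping from C, a minimal sketch using the standard intrinsics _mm_avg_epu8 and _mm_avg_epu16 (which correspond to PAVGB and PAVGW) might look like the following. The wrapper names avgr_u8x16 and avgr_u16x8 and the HAVE_SSE2 macro are assumptions made for this example; a scalar fallback is included so that the code also compiles on targets without SSE2.

```c
#include <assert.h>   /* only needed by callers that want to check results */
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>   /* SSE2 intrinsics */
#define HAVE_SSE2 1      /* hypothetical feature macro for this sketch */
#endif

/* 16 unsigned 8-bit lanes: y[i] = (a[i] + b[i] + 1) >> 1 */
static void avgr_u8x16(const uint8_t a[16], const uint8_t b[16], uint8_t y[16]) {
#ifdef HAVE_SSE2
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)y, _mm_avg_epu8(va, vb));   /* PAVGB */
#else
    /* Scalar fallback: integer promotion to int keeps the sum exact. */
    for (int i = 0; i < 16; i++) {
        y[i] = (uint8_t)((a[i] + b[i] + 1) >> 1);
    }
#endif
}

/* 8 unsigned 16-bit lanes: y[i] = (a[i] + b[i] + 1) >> 1 */
static void avgr_u16x8(const uint16_t a[8], const uint16_t b[8], uint16_t y[8]) {
#ifdef HAVE_SSE2
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)y, _mm_avg_epu16(va, vb));  /* PAVGW */
#else
    /* Scalar fallback: widen to 32 bits so the sum cannot overflow. */
    for (int i = 0; i < 8; i++) {
        y[i] = (uint16_t)(((uint32_t)a[i] + (uint32_t)b[i] + 1u) >> 1);
    }
#endif
}
```

The unaligned load and store intrinsics are used here only to keep the sketch free of alignment requirements; a code generator working on registers would emit just the PAVGB or PAVGW instruction, plus a MOVDQA when the destination differs from the first operand.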
ARM64 processors
y = i8x16.avgr_u(a, b) is lowered to URHADD Vy.16B, Va.16B, Vb.16B
y = i16x8.avgr_u(a, b) is lowered to URHADD Vy.8H, Va.8H, Vb.8H
ARMv7 processors with NEON instruction set
y = i8x16.avgr_u(a, b) is lowered to VRHADD.U8 Qy, Qa, Qb
y = i16x8.avgr_u(a, b) is lowered to VRHADD.U16 Qy, Qa, Qb
POWER processors with VMX (Altivec) instruction set
y = i8x16.avgr_u(a, b) is lowered to VAVGUB VRy, VRa, VRb
y = i16x8.avgr_u(a, b) is lowered to VAVGUH VRy, VRa, VRb
MIPS processors with MSA instruction set
y = i8x16.avgr_u(a, b) is lowered to AVER_U.B Wy, Wa, Wb
y = i16x8.avgr_u(a, b) is lowered to AVER_U.H Wy, Wa, Wb