From 2a62b90411f1276c88eb4d8e1942509f3637491e Mon Sep 17 00:00:00 2001
From: Florian Lemaitre
Date: Thu, 19 Apr 2018 16:24:25 +0200
Subject: [PATCH 1/9] Added permutation and shuffling primitives

---
 proposals/simd/SIMD.md | 32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/proposals/simd/SIMD.md b/proposals/simd/SIMD.md
index afcdc9c09..77302974c 100644
--- a/proposals/simd/SIMD.md
+++ b/proposals/simd/SIMD.md
@@ -211,10 +211,28 @@ The input lane value, `x`, is interpreted the same way as for the splat
 instructions. For the `i8` and `i16` lanes, the high bits of `x` are ignored.
 
 ### Shuffle lanes
+
+#### Immediate permutation rule
+* `v8x16.permute(a: v128, s: LaneIdx16[16]) -> v128`
+* `v16x8.permute(a: v128, s: LaneIdx8[8]) -> v128`
+* `v32x4.permute(a: v128, s: LaneIdx4[4]) -> v128`
+* `v64x2.permute(a: v128, s: LaneIdx2[2]) -> v128`
 * `v8x16.shuffle(a: v128, b: v128, s: LaneIdx32[16]) -> v128`
+* `v16x8.shuffle(a: v128, b: v128, s: LaneIdx16[8]) -> v128`
+* `v32x4.shuffle(a: v128, b: v128, s: LaneIdx8[4]) -> v128`
+* `v64x2.shuffle(a: v128, b: v128, s: LaneIdx4[2]) -> v128`
 
-Create vector with lanes selected from the lanes of two input vectors:
+Create vector with lanes selected from the lanes of the input vector:
+
+```python
+def S.permute(a, s):
+    result = S.New()
+    for i in range(S.Lanes):
+        result[i] = a[s[i]]
+    return result
+```
 
+Create vector with lanes selected from the lanes of two input vectors:
 ```python
 def S.shuffle(a, b, s):
     result = S.New()
@@ -226,6 +244,18 @@ def S.shuffle(a, b, s):
     return result
 ```
 
+#### Variable permutation rule
+* `v8x16.permuteVar(a: v128, s: v128) -> v128`
+* `v16x8.permuteVar(a: v128, s: v128) -> v128`
+* `v32x4.permuteVar(a: v128, s: v128) -> v128`
+* `v64x2.permuteVar(a: v128, s: v128) -> v128`
+* `v8x16.shuffleVar(a: v128, b: v128, s: v128) -> v128`
+* `v16x8.shuffleVar(a: v128, b: v128, s: v128) -> v128`
+* `v32x4.shuffleVar(a: v128, b: v128, s: v128) -> v128`
+* `v64x2.shuffleVar(a: v128, b: v128, s: v128) -> v128`
+
+Same as non-`Var`, but where indices are runtime values.
+
 ## Integer arithmetic
 
 Wrapping integer arithmetic discards the high bits of the result.

From 219cc12509d7238ae89f72f1574c0d7cfba513f4 Mon Sep 17 00:00:00 2001
From: Florian Lemaitre
Date: Thu, 19 Apr 2018 16:43:23 +0200
Subject: [PATCH 2/9] Add reduction paragraph

reductions are computed with permutes
---
 proposals/simd/SIMD.md | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/proposals/simd/SIMD.md b/proposals/simd/SIMD.md
index 77302974c..98ad29ab8 100644
--- a/proposals/simd/SIMD.md
+++ b/proposals/simd/SIMD.md
@@ -705,3 +705,28 @@ Lane-wise saturating conversion from floating point to integer using the IEEE
 resulting lane is 0. If the rounded integer value of a lane is outside the range
 of the destination type, the result is saturated to the nearest representable
 integer value.
+
+
+## Reductions
+
+There is no instruction for reductions.
+Instead, one can use permutations to reduce lane-wise operations like `add`, `min`, `max`, `and`, `or`...
+
+Here is an example to reduce add on f32x4:
+```
+get_local 0
+v32x4.permute 2 3 0 1 ;; swap the lower part with the higher part of the vector
+f32x4.add
+get_local 0
+v32x4.permute 1 0 3 2 ;; swap the 2 first elements together, and the 2 last elements together
+f32x4.add
+f32x4.extract_lane 0 ;; extract the first element
+```
+
+Here is an example to reduce add on f64x2:
+```
+get_local 0
+v64x2.permute 1 0 ;; swap the lower part with the higher part of the vector
+f64x2.add
+f64x2.extract_lane 0 ;; extract the first element
+```

From f25b996d33456d5df457939f3878415dbcaddc5f Mon Sep 17 00:00:00 2001
From: Florian Lemaitre
Date: Thu, 19 Apr 2018 23:57:19 +0200
Subject: [PATCH 3/9] Shorter encoding for reduce add on f32x4

---
 proposals/simd/SIMD.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/proposals/simd/SIMD.md b/proposals/simd/SIMD.md
index 98ad29ab8..e194c01c3 100644
--- a/proposals/simd/SIMD.md
+++ b/proposals/simd/SIMD.md
@@ -715,7 +715,7 @@ Instead, one can use permutations to reduce lane-wise operations like `add`, `mi
 Here is an example to reduce add on f32x4:
 ```
 get_local 0
-v32x4.permute 2 3 0 1 ;; swap the lower part with the higher part of the vector
+v64x2.permute 1 0 ;; swap the lower part with the higher part of the vector
 f32x4.add
 get_local 0
 v32x4.permute 1 0 3 2 ;; swap the 2 first elements together, and the 2 last elements together

From 9146aeb2aaa77faa62fda7fd9edb76170eccc6ed Mon Sep 17 00:00:00 2001
From: Florian Lemaitre
Date: Wed, 8 Aug 2018 23:30:52 +0200
Subject: [PATCH 4/9] Removed polemical shuffle instructions

---
 proposals/simd/SIMD.md | 21 ---------------------
 1 file changed, 21 deletions(-)

diff --git a/proposals/simd/SIMD.md b/proposals/simd/SIMD.md
index e194c01c3..c6d6b0536 100644
--- a/proposals/simd/SIMD.md
+++ b/proposals/simd/SIMD.md
@@ -213,25 +213,11 @@ instructions. For the `i8` and `i16` lanes, the high bits of `x` are ignored.
 ### Shuffle lanes
 
 #### Immediate permutation rule
-* `v8x16.permute(a: v128, s: LaneIdx16[16]) -> v128`
-* `v16x8.permute(a: v128, s: LaneIdx8[8]) -> v128`
-* `v32x4.permute(a: v128, s: LaneIdx4[4]) -> v128`
-* `v64x2.permute(a: v128, s: LaneIdx2[2]) -> v128`
 * `v8x16.shuffle(a: v128, b: v128, s: LaneIdx32[16]) -> v128`
 * `v16x8.shuffle(a: v128, b: v128, s: LaneIdx16[8]) -> v128`
 * `v32x4.shuffle(a: v128, b: v128, s: LaneIdx8[4]) -> v128`
 * `v64x2.shuffle(a: v128, b: v128, s: LaneIdx4[2]) -> v128`
 
-Create vector with lanes selected from the lanes of the input vector:
-
-```python
-def S.permute(a, s):
-    result = S.New()
-    for i in range(S.Lanes):
-        result[i] = a[s[i]]
-    return result
-```
-
 Create vector with lanes selected from the lanes of two input vectors:
 ```python
 def S.shuffle(a, b, s):
@@ -245,14 +231,7 @@ def S.shuffle(a, b, s):
 ```
 
 #### Variable permutation rule
-* `v8x16.permuteVar(a: v128, s: v128) -> v128`
-* `v16x8.permuteVar(a: v128, s: v128) -> v128`
-* `v32x4.permuteVar(a: v128, s: v128) -> v128`
-* `v64x2.permuteVar(a: v128, s: v128) -> v128`
 * `v8x16.shuffleVar(a: v128, b: v128, s: v128) -> v128`
-* `v16x8.shuffleVar(a: v128, b: v128, s: v128) -> v128`
-* `v32x4.shuffleVar(a: v128, b: v128, s: v128) -> v128`
-* `v64x2.shuffleVar(a: v128, b: v128, s: v128) -> v128`
 
 Same as non-`Var`, but where indices are runtime values.
 
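
Review note on patches 2 and 3: the reduction idea can be checked with a small lane-level model. The sketch below is only an illustration, not part of the proposal. It models a vector as a Python list, reuses the `permute` rule from patch 1, and assumes that each permutation is applied to the running partial sum (the listing in patch 2 reloads local 0 instead). The helpers `as_64x2` and `as_32x4` are hypothetical names introduced here to show why `v64x2.permute 1 0` moves the same lanes as `v32x4.permute 2 3 0 1`.

```python
def permute(a, s):
    """Single-input permutation from patch 1: result[i] = a[s[i]]."""
    return [a[i] for i in s]


def add(a, b):
    """Lane-wise addition of two equally sized vectors."""
    return [x + y for x, y in zip(a, b)]


def as_64x2(v):
    """View four 32-bit lanes as two 64-bit lanes (modelled as pairs)."""
    return [(v[0], v[1]), (v[2], v[3])]


def as_32x4(p):
    """Inverse of as_64x2: flatten two pairs back into four lanes."""
    return [p[0][0], p[0][1], p[1][0], p[1][1]]


def reduce_add_f32x4(v):
    # Step 1: add the vector to a copy with its two halves swapped.
    t = add(v, permute(v, [2, 3, 0, 1]))
    # Step 2: swap neighbours inside each half of the partial sum, add again.
    r = add(t, permute(t, [1, 0, 3, 2]))
    # Every lane now holds the full sum, so extracting lane 0 is enough.
    return r[0]


v = [1.0, 2.0, 3.0, 4.0]
total = reduce_add_f32x4(v)
# The shorter encoding from patch 3 swaps the same lanes as the 32x4 form.
same_swap = as_32x4(permute(as_64x2(v), [1, 0])) == permute(v, [2, 3, 0, 1])
```

The same two-step pattern would apply to `min`, `max`, `and` and `or` by replacing `add` with the corresponding lane-wise operation.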
From 89b72454310b4bae13f51ae547970b1962c5baff Mon Sep 17 00:00:00 2001
From: Florian Lemaitre
Date: Sat, 11 Aug 2018 15:31:47 +0200
Subject: [PATCH 5/9] snake case for `shuffle_var`

---
 proposals/simd/SIMD.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/proposals/simd/SIMD.md b/proposals/simd/SIMD.md
index c6d6b0536..6b2fa5a43 100644
--- a/proposals/simd/SIMD.md
+++ b/proposals/simd/SIMD.md
@@ -231,9 +231,9 @@ def S.shuffle(a, b, s):
 ```
 
 #### Variable permutation rule
-* `v8x16.shuffleVar(a: v128, b: v128, s: v128) -> v128`
+* `v8x16.shuffle_var(a: v128, b: v128, s: v128) -> v128`
 
-Same as non-`Var`, but where indices are runtime values.
+Same as non-`var`, but where indices are runtime values.
 
 ## Integer arithmetic
 

From 64dc7ae8d3b19c4166756163fc166564d43675e0 Mon Sep 17 00:00:00 2001
From: Florian Lemaitre
Date: Thu, 13 Dec 2018 11:29:30 +0100
Subject: [PATCH 6/9] Update SIMD.md

---
 proposals/simd/SIMD.md | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/proposals/simd/SIMD.md b/proposals/simd/SIMD.md
index 6b2fa5a43..e215d24ea 100644
--- a/proposals/simd/SIMD.md
+++ b/proposals/simd/SIMD.md
@@ -694,10 +694,12 @@ Instead, one can use permutations to reduce lane-wise operations like `add`, `mi
 Here is an example to reduce add on f32x4:
 ```
 get_local 0
-v64x2.permute 1 0 ;; swap the lower part with the higher part of the vector
+get_local 0
+v64x2.shuffle 1 0 ;; swap the lower part with the higher part of the vector
 f32x4.add
 get_local 0
-v32x4.permute 1 0 3 2 ;; swap the 2 first elements together, and the 2 last elements together
+get_local 0
+v32x4.shuffle 1 0 3 2 ;; swap the 2 first elements together, and the 2 last elements together
 f32x4.add
 f32x4.extract_lane 0 ;; extract the first element
 ```
@@ -705,7 +707,8 @@ f32x4.extract_lane 0 ;; extract the first element
 Here is an example to reduce add on f64x2:
 ```
 get_local 0
-v64x2.permute 1 0 ;; swap the lower part with the higher part of the vector
+get_local 0
+v64x2.shuffle 1 0 ;; swap the lower part with the higher part of the vector
 f64x2.add
 f64x2.extract_lane 0 ;; extract the first element
 ```

From 12aed66888bdca14c2caee3c217e672458580c15 Mon Sep 17 00:00:00 2001
From: Florian Lemaitre
Date: Sat, 2 Mar 2019 12:38:11 +0100
Subject: [PATCH 7/9] Fixed paragraph positionning

---
 proposals/simd/SIMD.md | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/proposals/simd/SIMD.md b/proposals/simd/SIMD.md
index 68d3d1d0d..d129be149 100644
--- a/proposals/simd/SIMD.md
+++ b/proposals/simd/SIMD.md
@@ -288,6 +288,7 @@ instructions. For the `i8` and `i16` lanes, the high bits of `x` are ignored.
 
 #### Immediate permutation rule
 * `v8x16.shuffle(a: v128, b: v128, imm: ImmLaneIdx32[16]) -> v128`
+
 Returns a new vector with lanes selected from the lanes of the two input vectors
 `a` and `b` specified in the 12 byte wide immediate mode operand `imm`. This
 instruction is encoded with 12 bytes providing the indices of the elements to
@@ -295,6 +296,7 @@ return. The indices `i` in range `[0, 15]` select the `i`-th element of `a`. The
 indices in range `[16, 31]` select the `i - 16`-th element of `b`.
 
 * `v16x8.shuffle(a: v128, b: v128, imm: ImmLaneIdx16[8]) -> v128`
+
 Returns a new vector with lanes selected from the lanes of the two input vectors
 `a` and `b` specified in the 3 byte wide immediate mode operand `imm`. This
 instruction is encoded with 3 bytes providing the indices of the elements to
@@ -302,6 +304,7 @@ return. The indices `i` in range `[0, 7]` select the `i`-th element of `a`. The
 indices in range `[8, 15]` select the `i - 8`-th element of `b`.
 
 * `v32x4.shuffle(a: v128, b: v128, imm: ImmLaneIdx8[4]) -> v128`
+
 Returns a new vector with lanes selected from the lanes of the two input vectors
 `a` and `b` specified in the 2 byte wide immediate mode operand `imm`. This
 instruction is encoded with 2 bytes providing the indices of the elements to
@@ -309,6 +312,7 @@ return. The indices `i` in range `[0, 3]` select the `i`-th element of `a`. The
 indices in range `[4, 7]` select the `i - 4`-th element of `b`.
 
 * `v64x2.shuffle(a: v128, b: v128, imm: ImmLaneIdx4[2]) -> v128`
+
 Returns a new vector with lanes selected from the lanes of the two input vectors
 `a` and `b` specified in the 1 byte wide immediate mode operand `imm`. This
 instruction is encoded with 1 bytes providing the indices of the elements to
@@ -328,8 +332,10 @@ def S.shuffle(a, b, s):
 
 #### Variable permutation rule
 * `v8x16.permute_dyn(a: v128, s: v128) -> v128`
+
 Returns a new vector with lanes selected from the lanes of the first input vector
-`a` and specified in the second input vector `s`.
+`a` and specified in the second input vector `s`. The indices from `s` are first
+fit into the range `[0, 15]` via a modulo.
 
 ```python
 def S.permute_dyn(a, s):

From c3ce95aa10fe78c62d9487e9a5884b95a3ce5c44 Mon Sep 17 00:00:00 2001
From: Florian Lemaitre
Date: Sat, 2 Mar 2019 14:58:52 +0100
Subject: [PATCH 8/9] Fixed length of v8x16.shuffle immediate

---
 proposals/simd/SIMD.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/proposals/simd/SIMD.md b/proposals/simd/SIMD.md
index d129be149..526ec72f1 100644
--- a/proposals/simd/SIMD.md
+++ b/proposals/simd/SIMD.md
@@ -290,8 +290,8 @@ instructions. For the `i8` and `i16` lanes, the high bits of `x` are ignored.
 * `v8x16.shuffle(a: v128, b: v128, imm: ImmLaneIdx32[16]) -> v128`
 
 Returns a new vector with lanes selected from the lanes of the two input vectors
-`a` and `b` specified in the 12 byte wide immediate mode operand `imm`. This
-instruction is encoded with 12 bytes providing the indices of the elements to
+`a` and `b` specified in the 10 byte wide immediate mode operand `imm`. This
+instruction is encoded with 10 bytes providing the indices of the elements to
 return. The indices `i` in range `[0, 15]` select the `i`-th element of `a`. The
 indices in range `[16, 31]` select the `i - 16`-th element of `b`.
 
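
Review note on patches 7 and 8: the index rules described in the prose can be written down as an executable model. This is a minimal sketch under stated assumptions, not the proposal's reference code. The bodies of `S.shuffle` and `S.permute_dyn` are cut off by the hunk boundaries above, so the functions below are reconstructed from the prose only. `packed_imm_bytes` is a hypothetical helper that assumes the indices are bit-packed with `log2(2 * lanes)` bits each, which is what the 10-byte figure in patch 8 suggests.

```python
def shuffle(a, b, imm):
    """Two-input shuffle: index i < lanes selects a[i], otherwise b[i - lanes]."""
    lanes = len(a)
    if len(b) != lanes or len(imm) != lanes:
        raise ValueError("a, b and imm must all have one entry per lane")
    result = []
    for i in imm:
        if not 0 <= i < 2 * lanes:
            raise ValueError(f"lane index {i} out of range [0, {2 * lanes - 1}]")
        result.append(a[i] if i < lanes else b[i - lanes])
    return result


def permute_dyn(a, s):
    """Single-input permutation; runtime indices are reduced modulo the lane count."""
    lanes = len(a)
    return [a[i % lanes] for i in s]


def packed_imm_bytes(lanes):
    """Immediate size in bytes, ASSUMING bit-packed indices of log2(2 * lanes) bits."""
    bits_per_index = (2 * lanes - 1).bit_length()
    return (lanes * bits_per_index + 7) // 8


a, b = [0, 1, 2, 3], [4, 5, 6, 7]
mixed = shuffle(a, b, [1, 0, 7, 6])        # two lanes from a, two from b
swapped = shuffle(a, a, [1, 0, 3, 2])      # both inputs equal: acts like a permute
wrapped = permute_dyn(list(range(16)), [17] * 16)  # 17 % 16 selects lane 1
sizes = [packed_imm_bytes(n) for n in (16, 8, 4, 2)]
```

Under the bit-packing assumption the sizes come out as 10, 4, 2 and 1 bytes. That agrees with the figures stated for `v8x16`, `v32x4` and `v64x2`, but gives 4 rather than the 3 bytes stated for `v16x8`, so that number may deserve a second look.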
From 3152128de8d39ceaefcce7a08cc719b2d89a6f62 Mon Sep 17 00:00:00 2001
From: Lemaitre
Date: Sun, 31 Mar 2019 17:38:04 +0200
Subject: [PATCH 9/9] Updated Binary and text encoding

---
 proposals/simd/BinarySIMD.md | 5 ++++-
 proposals/simd/TextSIMD.md   | 5 ++++-
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/proposals/simd/BinarySIMD.md b/proposals/simd/BinarySIMD.md
index ff51a7f0f..0b6151fc1 100644
--- a/proposals/simd/BinarySIMD.md
+++ b/proposals/simd/BinarySIMD.md
@@ -167,4 +167,7 @@ The `v8x16.shuffle2_imm` instruction has 16 bytes after `simdop`.
 | `f64x2.convert_s/i64x2` | `0xb1`| - |
 | `f64x2.convert_u/i64x2` | `0xb2`| - |
 | `v8x16.shuffle1` | `0xc0`| - |
-| `v8x16.shuffle2_imm` | `0xc1`| s:LaneIdx32[16] |
\ No newline at end of file
+| `v8x16.shuffle2_imm` | `0xcc`| s:LaneIdx32[16] |
+| `v16x8.shuffle2_imm` | `0xcd`| s:LaneIdx16[8] |
+| `v32x4.shuffle2_imm` | `0xce`| s:LaneIdx8[4] |
+| `v64x2.shuffle2_imm` | `0xcf`| s:LaneIdx4[2] |
diff --git a/proposals/simd/TextSIMD.md b/proposals/simd/TextSIMD.md
index fc3a7e7d2..6fda0692f 100644
--- a/proposals/simd/TextSIMD.md
+++ b/proposals/simd/TextSIMD.md
@@ -20,8 +20,11 @@ The canonical text format used for printing `v128.const` instructions is
 v128.const i32x4 0xNNNNNNNN 0xNNNNNNNN 0xNNNNNNNN 0xNNNNNNNN
 ```
 
-### v8x16.shuffle2_imm
+### Shuffling using immediate indices
 
 ```
 v8x16.shuffle2_imm i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5 i5
+v16x8.shuffle2_imm i4 i4 i4 i4 i4 i4 i4 i4
+v32x4.shuffle2_imm i3 i3 i3 i3
+v64x2.shuffle2_imm i2 i2
 ```
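
Review note on patch 9: the text forms added to `TextSIMD.md` can be exercised with a small formatter. The sketch below is illustrative only. It reads `i5`, `i4`, `i3` and `i2` as index operands of 5, 4, 3 and 2 bits, which is an interpretation rather than something the patch states, and takes the opcodes from the `BinarySIMD.md` table above. The function names `index_bits` and `format_shuffle2_imm` are hypothetical.

```python
# Opcodes copied from the BinarySIMD.md table in patch 9, keyed by lane count.
OPCODES = {16: 0xCC, 8: 0xCD, 4: 0xCE, 2: 0xCF}


def index_bits(lanes):
    """Bits needed to address 2 * lanes input lanes (5, 4, 3, 2 for 16, 8, 4, 2)."""
    return (2 * lanes - 1).bit_length()


def format_shuffle2_imm(lanes, indices):
    """Render the text form, e.g. 'v32x4.shuffle2_imm 1 0 7 6', after validation."""
    if lanes not in OPCODES:
        raise ValueError(f"unsupported lane count: {lanes}")
    if len(indices) != lanes:
        raise ValueError(f"expected {lanes} indices, got {len(indices)}")
    limit = 1 << index_bits(lanes)  # assumed meaning of the iN operand width
    for i in indices:
        if not 0 <= i < limit:
            raise ValueError(f"index {i} does not fit in {index_bits(lanes)} bits")
    mnemonic = f"v{128 // lanes}x{lanes}.shuffle2_imm"
    return " ".join([mnemonic] + [str(i) for i in indices])


text = format_shuffle2_imm(4, [1, 0, 7, 6])
widths = [index_bits(n) for n in (16, 8, 4, 2)]
```

The widths computed this way line up with the `i5`, `i4`, `i3`, `i2` operands in the text format, which is consistent with each index addressing the lanes of both input vectors.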